Python Pandas and Spreadsheets #6

Published Nov. 3, 2020, 10:54 p.m.

Last time we stored a spreadsheet for a single team.  This time we will create a 'set' of team names and then create a spreadsheet for all the american baseball teams that existed from 1954 to 2019 (according to our source).

First we will start by just moving the variables 'col_name' and 'col_val' so that they are above the line that creates our 'new_df'.  Since we want all the teams that existed from 1954 to 2019 we need a way to figure out how to get that!  I propose that we use our python to read the spreadsheets and complie all the team names into one big list.  Then we will make that list into a 'set'.  A 'set' in python (like in mathematics)  does not contain duplicates.  So the set function is a convenient way to get rid of duplicate entries in a list.

Create an empty list called "Team_Names".  We want to fill this list with all team names from all lists, so we will use a 'for' loop:

Team_Names = []
for year in years:
    Team_Names += stats[year][col_name].to_list()

This should make a complete list, but will contain many duplicates!  To remedy the duplicates we'll use the python *class* 'set()':

Team_Names = list(set(Team_Names))

We encase the 'set' in a 'list' because a python 'set' is a different data type than a 'list' and we want to use the data as a list.

Now that we have the Team_Names we can do a little pre-emptive cleanup.  If you inspect the spreadsheets you'll find a few entries in the 'Tm' column like "LgAvg" and "Tm" and even an empty cell.  We should remove these from our list so we don't cause problems later.  

For "LgAvg" and "Tm" we can simply remove them from the list:

Team_Names.remove('Tm')
Team_Names.remove('LgAvg')

To remove the empty cell we'll have to import numpy as np near the top of our script:

import numpy as np

Then back at Team_Names we can now write:

Team_Names.remove(np.nan)

With those entries removed from our list of Team_Names we can start to modify our script.  Let's start a loop over the team names and let's be a little clever about our loop iterating variable:

for col_val in Team_Names:
    new_df = pd.DataFrame()
    for year in years:
        df = stats[year]
        row = df.loc[df[col_name]==col_val].copy()
        row.loc[:,'year'] = year
        new_df = new_df.append(row, ignore_index = True)

So by using 'col_val' as our variable name and shifting our code from the fourth tutorial over we make a loop that we might think would make a spreadsheet for all the teams!  However, we're not there yet.  It turns out that there will be several teams that don't span all 66 years that we are looking at.  Our code is not yet ready for handling exceptions (like looking for rows that don't exist!) so we can try to build it in.

Here we can use the 'try / except' functionality that python offers.  First we need to populate our empty "new_df" with all the rows that we want to fill.  To do that we use an 'if' statement that only works when the year in question is the first year in our list of years.  When that is the case we loop through the 'keys' of the dataframe and create an empty list.  We must remember that we want to add the 'year' to the standard list so we add that key after the loop through the existing keys.

Now we attempt to find the row like we did in Tutorial #4.  However if we don't find the row we'll throw an error. With the 'try' function if there is an error or exception, then we default to what comes under the 'except' syntax.  In that case we append an empty row.  We can do this by appending 'pd.Series()' with 'ignore_index' = True.  But we're not done yet!  Since we know what team and what year we are in, we might as well fill those values in!  We can use the [-1] indexing to fill in the last value of the list.  The final loop should look like the following:

for col_val in Team_Names:
    new_df = pd.DataFrame()
    for year in years:
        df = stats[year]
        if year == years[0]:  ## only do it the first time!
            for key in df.keys():
                new_df[key] = []
            new_df['year'] = []
        try:
            row = df.loc[df[col_name]==col_val].copy()
            row.loc[:,'year'] = year
            new_df = new_df.append(row, ignore_index = True)
        except:
            new_df = new_df.append(pd.Series(), ignore_index = True)
            new_df[col_name].iloc[-1] = col_val
            new_df['year'].iloc[-1] = year
    
    write_team_spreadsheet(new_df, col_val, f'{col_val}_1954_2019')

Don't forget to indent the call to 'write_team_spreadsheet' at the end (Otherwise you'll only write a spreadsheet for the last team in the list).

Now if we execute the script and wait...  Our computer should turn out some number of spreadsheets (equal to the number of teams that existed in American Baseball from 1954 to 2019) and each one should be populated with the data for that team in the years that data exists. 

One final note about those of you using OpenOffice.  There appears to be an artifact from two things:

  1. Using OpenOffice
  2. Writing a dataframe with index = False

That causes some values in the spreadsheet to be rendered with a leading single quote " ' ".  If that happens then the way to get rid of that annoying single quote (only in OpenOffice) is to:

  1. Open Search and Replace
  2. Enter up carrot followed by a dot "^." as the search critera
  3. Enter an ampersand '&' as the replacement value
  4. Open the dialog for "More Options"
  5. click the box for 'regular expressions'
  6. click 'replace all'

This should get rid of all the single quotes and leave you with a nice spreadsheet.  Next time we'll probably throw up some more plots.  And eventually we will show how to add formatting to cells in our spreadsheet.  (Very useful for mass producing tables).  See you in the next tutorial! 

  • Python Pandas and Spreadsheets #1

  • Python Pandas and Spreadsheets #2

  • Python Pandas and Spreadsheets #3

  • Python Pandas and Spreadsheets #4

  • Python Pandas and Spreadsheets #5

  • Python Pandas and Spreadsheets #6
    (currently viewing)