PythonResource.net

Python Pandas and Spreadsheets #4

Published Oct. 22, 2020, 1:13 p.m.

Welcome to the fourth tutorial in this series. Last time we figured out how to read our data from the spreadsheet and this is a very useful thing to know. We organized our code into a bunch of definitions and then ran a simple code at the end of the script that used only the definition that reads all 66 tabs in our spreadsheet. Now we want to do some data analysis. Perhaps we want to know how a teams batting average has changed over the years. That is what we will figure out how to do today.

But first! A friend of mine informed me that when you use f-strings you do not need to declare the variable as a string inside the curly braces. Thus, we should change our f-string from the no-longer used variable 'website' to become:

website = f"https://www.baseball-reference.com/leagues/MLB/{year}.shtml"

I wanted to show this because its both a time saver and a space saver (spacetime saver?) not to have to render a string inside the curly braces. Ok. Moving on...

Now when we read our spreadsheet as we figured out how to do last time, we give a very minimal amount of information to pandas and it delivered everything in the spreadsheet while ignoring empty cells. However, when we wrote the file into the spreadsheet, we wrote all of the data AND an 'index' for each piece of data. The first column of 'data' in the spreadsheet has no header and is just a zero-indexed run of numbers-- one for each row. Now when we read the data from the spreadsheet we read in that index column as a column of data and pandas gives it the heading 'unnamed'. This is a good time to learn about some of the customizations in pandas so that we can ignore columns that we don't want.

It is always advisable to 'read the docs' and they are only a google search away. What we want in this case is the keyword 'usecols' which can accept a variety of inputs including some that are excel-friendly. Columns in excel start with A and go to Z and then AA, AB, etc. Pandas recognizes this format as well as the excel method of putting a colon in-between two column names to indicate that you want the bookends and everything in-between. So then, we want to modify our 'read_excel_sheet_into_DB' definition to look like the following:

def read_excel_sheet_into_DB(excelfilename, sheetname):
    return pd.read_excel(excelfilename+".xlsx", sheet_name = sheetname, usecols="B:AE")

We have added the 'usecols' keyword to the pd.read_excel function and told excel to start at column "B" and end at column "AE". I think it turns out that column "AE" is empty and so excel will ignore it .

Now that we have a clean version of our data stored into our 'stats' dictionary we want to build a new dataframe. I decided to choose some variable names so that when we use some pandas magic we get a clear idea of what we are trying to do.

First we will establish a new dataframe. Now we must decide what team we want to track. I chose the Yankees because I was sure that they would be listed in every dataframe and I didn't want to have to deal with exceptions. The column with the team name is called "Tm" and the column value that represents the Yankees is 'NYY'. I will call them 'col_name' and 'col_val' respectively.

new_df = pd.DataFrame()
col_val = 'NYY'
col_name = 'Tm'

To make a new dataframe with just one team for all years in it we will make a loop that runs through the years (as before). We will define our dataframe (df) as the stats[year]. The row we want can be isolated by using an attribute called 'loc'. Using df.loc[df[col_name]==col_val] returns any rows where the column value = 'col_val'. In this case that is only one row.

After we find our row, we would like to add to it the year from which the data came. There are multiple ways to do this however, the way that pandas 'wants' you to do it is to provide the index for the new column and then the name for the new column. I suppose this would be a good generic way of replacing columns that exist. Here we want to add to the row so that for the index we simply put a colon ":" and for the column name I put 'year'. Finally, to build up our new_df we will append to it. Pandas would try to preserve the index from which the data originally came. This would cause problems and screw up our goals here so we must use the keyword "ignore_index = True" when appending the row. Also, for reasons I don't yet understand, you cannot just use "new_df.append(row, ignore_index = True)" in the style you would use to append to a list. You seem to need to redefine the "new_df" every time you append. The resulting loop looks like this:

for year in years:
    df = stats[year]
    row = df.loc[df[col_name]==col_val]
    row.loc[:,'year'] = year
    new_df = new_df.append(row, ignore_index = True)

One final note: Python / Pandas will give you some warnings that you are messing with column slices because it wants to faithfully let you know that you are altering the dataframe slice that you started out with. To avoid this warning you can simply add ".copy()" when you define the row:

for year in years:
    df = stats[year]
    row = df.loc[df[col_name]==col_val].copy()
    row.loc[:,'year'] = year
    new_df = new_df.append(row, ignore_index = True)

Now we have a new dataframe (new_df) which is only the stats for a single team. Lets go ahead and make a quick plot! I decided to look at batting average or "BatAge" as the column header is called. In the 'interactive' window we need to use the following commands:

In [2]: import matplotlib.pyplot as plt

then:

In [3]: plt.plot(new_df.year, new_df.BatAge)

The 'plt.plot' is looking for lists of x and y data to make a scatter plot. Pandas makes the columns of a dataframe accessible using the 'dot' operator so that "new_df.year" is the column of year values and "new_df.BatAge" is the column of corresponding batting averages. Normally you would have a following command that says "plt.show()" but the IDE does not require that. Ok! Thank you for watching this tutorial. See you in the next one!

Python Pandas and Spreadsheets #5

Python Pandas and Spreadsheets #1
Python Pandas and Spreadsheets #2
Python Pandas and Spreadsheets #3
Python Pandas and Spreadsheets #4
(currently viewing)
Python Pandas and Spreadsheets #5
Python Pandas and Spreadsheets #6