PythonResource.net

Python Pandas and Spreadsheets #3

Published Oct. 14, 2020, 12:17 a.m.

In the last tutorial we figured out how to store data into a single spreadsheet with 66 tabs. This time we will figure out how to read from an Excel Spreadsheet with 66 tabs. Spoiler Alert: It will end up being similar to how we wrote a spreadsheet. We will also preserve some of our code by creating definitions.

Let's start there. Since we will be reading from the spreadsheet that we stored and not from the website, we no longer need the code that does the reading from the website. However, we may want to use something like that later. We could just "comment out" the code by inserting a "#" before the text, however, in this case, I prefer the option of preserving our code by turning it into a definition. Chances are good that we will want to read from a website again sometime and it will be handy not to start from scratch.

To create the first definition, I will insert the definition name at the top of the code. I will elect here to pass in the three variables that I want to return at the end of the definition.

def get_data_from_website(stats, years, dfs):
    for year in years:
        website = f"https://www.baseball-reference.com/leagues/MLB/{str(year)}.shtml"
    
        baseball_stats = pd.read_html(website)
        stats[year]=baseball_stats
        if year > 1968:
            dfs.append(baseball_stats[0])
        else:
            dfs.append(baseball_stats[1])

    return stats, years, dfs

Next I will create a definition for reading an excel spreadsheet. It will be very similar to what we used to write one.

def read_excel_sheet_into_DB(excelfilename, sheetname):
    return pd.read_excel(excelfilename+".xlsx", sheet_name = sheetname)

It probably seems silly to create a definition that doesn't do very much. This is my personal style choice at the moment. It is not strictly necessary.

Now onto the main point of this particular tutorial. We want to read data from the spreadsheet! In this case, we want to read data from 66 particular tabs of a single spreadsheet. Since each tab is named for the year of the data we will use a simple 'for' loop to read the sheet:

for year in years:
    baseball_stats = read_excel_sheet_into_DB(excelfilename, str(year))
    stats[year] = baseball_stats
    dfs.append(baseball_stats)

We've already been going crazy with creating custom definitions, so why stop now?

def read_spreadsheet(excelfilename, stats, years, dfs):
    for year in years:
        baseball_stats = read_excel_sheet_into_DB(excelfilename, str(year))
        stats[year] = baseball_stats
        dfs.append(baseball_stats)

    return stats, years, dfs

Notice that we pass in the filename, and three variables (stats, years dfs). We must also return the three variables (stats, years, dfs) or else the data stays in the definition (in most cases. It turns out that when using the Spyder IDE that the command line retains the variables that we wanted to store!).

And while we're at it, we don't need to write a spreadsheet everytime either!

def write_spreadsheet(excelfilename, years, dfs):
    writer = pd.ExcelWriter(excelfilename+".xlsx", engine = "xlsxwriter")
    
    for df, sheetname in zip(dfs, years):
        df.to_excel(writer, sheet_name = str(sheetname))
    
    writer.save()
    writer.close()

Here we only need to pass in the filename and the 'years' and 'dfs' variables. No return is needed here because the point of this definition is to write an external spreadsheet.

Well if you made it this far, we now have a bunch of definitions and a code that doesn't do anything! That is because we must call the definitions for them to be used. To organize the code a little better, I take all my remaining variables and move them to the bottom of the code:

stats = {}
years = [y for y in range(1954,2020)]
dfs = []
excelfilename = "another_baseball_stats_file"
stats, years, dfs = read_spreadsheet(excelfilename, stats, years, dfs)

Then I simply call the definition and make sure to assign the variable names that I want returned.

I feel like this seemed like a lot of effort to achieve what we had already done, but don't let that discourage you. It was only that way because we had already gotten the information from the website. There will be lots of cases where a spreadsheet is the original source data that you need to read. In those cases, you will need to know how to do what we just did.

In the next tutorial we may finally get around to sorting through the data and trying to make sense of it. Thanks for watching and I will see you in the next tutorial!

Python Pandas and Spreadsheets #4

Python Pandas and Spreadsheets #1
Python Pandas and Spreadsheets #2
Python Pandas and Spreadsheets #3
(currently viewing)
Python Pandas and Spreadsheets #4
Python Pandas and Spreadsheets #5
Python Pandas and Spreadsheets #6