Published Sept. 21, 2020, 6:21 p.m.
Welcome to the python, pandas and spreadsheets tutorial series. In scientific computing spreadsheets are ubiquitous. They are just too useful to give up. I am not advocating that we give them up. However, it is always advisable to get programs to do the boring stuff. Enter python and the pandas data package. We won't even begin to probe the depths of what pandas can do, but we will be able to take advantage of some ready made usefulness to create, save and work with spreadsheet data.
Pandas works with a data type called a "Data Frame" which is essentially a spreadsheet. Before we get there, first we need to get some data. I wanted to find baseball stats to accumulate and to ultimately make plots. A google search led me to this site:
https://www.baseball-reference.com/leagues/MLB/2019.shtml
Since it had some table data I thought it was a good place to get started.
For this tutorial series I am using the Anaconda version of python and the Spyder IDE. IDE stands for Integrated Developer Environment. I never became accustomed to using one, however, I cannot deny that they are handy for developing code. They have a basic three pane system with one pane for scripting, one with an "interactive" session and a third for plots and other info.
The Anaconda version of python comes with the pandas package already loaded. If you are trying with a version of python that does not have pandas, then you will need to install it with pip.
pip install pandas
To get started we will go ahead and import pandas and give it a shorter name to reference:
import pandas as pd
Next we will create a variable to store our website url:
website = "https://www.baseball-reference.com/leagues/MLB/2019.shtml"
To read the table data from this website we very simply use the pandas extension 'read_html':
baseball_stats = pd.read_html(website)
The 'read_html' function attempts to store table data from html as a list of dataframes. If we check on the type of variable that we created:
type(baseball_stats)
Python tells us that it is a list. If we inquire about the length of the list, in this case it is '1'.
Inquiring further about the first element of the list reveals that it is indeed a python dataframe.
Now that we know where the dataframe is, we can store it as a spreadsheet. Let's define a filename for the excel file we want to create:
excelfilename = "baseball_stats"
Now we will create a 'writer' for the spreadsheet and a sheetname for the data:
writer = pd.ExcelWriter(excelfilename+".xlsx", engine = "xlsxwriter")
sheetname = "Baseball"
I like to add the ".xlsx" extension to the filename as shown so that if and when I want to change the file name, I don't have to worry about forgeting the extension. As far as the sheetname, pandas has a default sheetname of "Sheet1", however, we will be making a multi-tabbed spreadsheet with useful sheetnames in a moment so I want to preview that. Now we call a pandas function 'to_excel' in which we pass the writer and the sheetname:
baseball_stats.to_excel(writer, sheet_name = sheetname)
Finally, we call on the writer to save and close:
writer.save()
writer.close()
Python has now created 'baseball_stats.xlsx' in the folder in which we are running this script! Checking it we see that it does indeed contain the data from the table we previewed from the website.
You might say "Big deal! I could have copied and pasted that into excel in 10 seconds." and you would be mostly correct. However, As we will see, programming makes this action easy to do over and over again. What if we actually wanted the last 66 years of baseball stats?
In that case we might start by created a list of years and an emtpy list to store some dataframes:
years = [y for y in range(1954, 2020)]
dfs = []
And then make a for loop:
for year in years:
website = f"https://www.baseball-reference.com/leagues/MLB/{str(year)}.shtml"
baseball_stats = pd.read_html(website)
dfs.append(baseball_stats[0])
Notice here that I use an "f-string" to substitue the year from my for-loop into the website. I assumed that the website followed a basic pattern in their naming convention and discoverd that I was not wrong. This loop goes to 66 websites, reads the html, stores some dataframes and then appends the first dataframe from that website to the list we created.
Let's rename our excel file:
excelfilename = "another_baseball_stats_file"
Because we used a for-loop to read the data, we probably ought to use a for-loop to write the data too. We replace our previous 'to_excel' function call with:
for df, sheetname in zip(dfs, years):
df.to_excel(writer, sheet_name = str(sheetname))
Here we use the "zip" function in python. This function takes any number of lists (> 1) and creates a link between the values in the list with the same index. The python syntax shows that for each iteration of the loop, I name the linked elements from the list as "df" and "sheetname" corresponding to the dataframe and sheetname respectively. Then I use the 'to_excel' function and provide the unique dataframe and sheetname. I must re-cast the variable 'sheetname' as a string because it exists as an integer and will cause an error if not cast correctly.
The call to save and close the writer remain the same as before and are not part of the last loop, but come after it.
When excuted, this program produces the file "another_baseball_stats_file.xlsx" with 66 tabs; Each tab is named for the year of the data extracted. As usual there are new problems to address, however I save them for the next tutorial. Thanks for watching and see you next time!
skip_nextPython Pandas and Spreadsheets #2