Published Sept. 21, 2020, 8:10 p.m.
In the last tutorial we collected some data from a website (or 66 webpages to be precise), and stored it into a spreadsheet. However, there was a problem. The data from 1954 up to 1968 don't have the same column headings as the post-1968 data. To investigate this we need to do some combing through our data.
In this tutorial we will construct a python dictionary to do just that. The dictionary and the interactive terminal will allow us to see why we have a mismatch in our data set and see what we can do to fix it. Once the data are verfied to be same kind for each year we will re-save the spreadsheet and move on.
To get started we will define a new variable up among our other declared variables. We'll call this one 'stats':
stats = {}
'stats' is an empty dictionary. Now in our loop we will fill our dictionary on each iteration in the following way:
stats[year] = baseball_stats
This uses the 'year' as the dictionary key and the full list of dataframes as the value.
After running the program we can use the interactive session to investigate what exists in entry for 'stats[1968]'.
stats[1968]
Out[3]:
[ Rk Tm Lg G W L ... ExInn 1Run vRHP vLHP ≥.500 <.500
0 1.0 DET AL 164.0 103 59 ... 10-5 35-23 69-44 34-15 57-33 46-26
1 2.0 STL NL 162.0 97 65 ... 15-8 34-28 60-49 37-16 41-31 56-34
2 3.0 BAL AL 162.0 91 71 ... 9-8 23-22 65-51 26-20 46-44 45-27
3 4.0 SFG NL 163.0 88 74 ... 5-8 27-22 64-52 24-22 38-34 50-40
4 5.0 CLE AL 162.0 86 75 ... 12-4 25-20 64-53 22-22 41-49 45-26
5 6.0 CHC NL 163.0 84 78 ... 7-4 30-22 58-62 26-16 35-37 49-41
6 7.0 BOS AL 162.0 86 76 ... 5-7 26-28 68-59 18-17 43-47 43-29
7 8.0 CIN NL 163.0 83 79 ... 12-10 23-27 59-58 24-21 34-38 49-41
8 9.0 NYY AL 164.0 83 79 ... 4-7 36-28 62-54 21-25 39-51 44-28
9 10.0 ATL NL 163.0 81 81 ... 7-13 27-30 61-53 20-28 32-40 49-41
10 11.0 OAK AL 163.0 82 80 ... 7-12 25-30 56-41 26-39 44-46 38-34
11 12.0 PIT NL 163.0 80 82 ... 9-12 27-36 54-59 26-23 41-49 39-33
12 13.0 MIN AL 162.0 79 83 ... 7-8 35-33 60-65 19-18 49-59 30-24
13 14.0 PHI NL 162.0 76 86 ... 9-2 37-29 50-58 26-28 40-50 36-36
14 15.0 LAD NL 162.0 76 86 ... 9-4 35-33 48-58 28-28 42-48 34-38
15 16.0 CAL AL 162.0 67 95 ... 9-5 25-28 46-69 21-26 40-68 27-27
16 17.0 CHW AL 162.0 67 95 ... 9-15 30-44 51-67 16-28 37-71 30-24
17 18.0 NYM NL 163.0 73 89 ... 2-13 26-37 44-62 29-27 37-53 36-36
18 19.0 HOU NL 162.0 72 90 ... 6-7 28-30 48-62 24-28 37-53 35-37
19 20.0 WSA AL 161.0 65 96 ... 7-8 21-25 45-68 20-28 44-63 21-33
20 NaN Avg NaN NaN 80 80 ... 8-8 28-28 56-57 24-23 40-48 40-32
[21 rows x 22 columns],
Tm #Bat BatAge R/G G PA ... GDP HBP SH SF IBB LOB
0 ATL 38 28.7 3.15 163 6119 ... 145 36 86 30 85 1175
1 BAL 33 26.9 3.57 162 6023 ... 102 48 80 49 63 1119
2 BOS 36 26.5 3.79 162 6037 ... 121 33 77 41 58 1108
3 CAL 43 27.3 3.07 162 5944 ... 103 52 75 37 41 1092
4 CHC 42 28.4 3.75 163 6020 ... 103 36 74 37 58 1063
5 CHW 39 28.3 2.86 162 5965 ... 108 40 90 33 47 1059
6 CIN 34 26.5 4.23 163 6287 ... 142 37 64 39 95 1133
7 CLE 42 27.4 3.19 162 5990 ... 96 39 69 37 51 1097
8 DET 33 28.1 4.09 164 6179 ... 124 61 73 34 53 1090
9 HOU 38 25.6 3.15 162 5994 ... 115 48 97 34 81 1114
10 LAD 35 27.9 2.90 162 5935 ... 89 18 79 45 61 1093
11 MIN 33 28.2 3.47 162 5975 ... 106 49 69 38 61 1104
12 NYM 34 26.0 2.90 163 6027 ... 104 43 75 27 50 1089
13 NYY 34 28.1 3.27 164 5993 ... 100 30 56 30 52 1057
14 OAK 34 24.9 3.49 163 6027 ... 87 35 78 35 53 1091
15 PHI 32 29.2 3.35 162 5968 ... 106 22 64 46 60 1059
16 PIT 33 29.7 3.58 163 6163 ... 133 33 96 43 93 1125
17 SFG 31 27.8 3.67 163 6133 ... 117 47 92 44 55 1145
18 STL 32 28.7 3.60 162 6082 ... 91 32 67 42 56 1113
19 WSA 39 26.9 3.25 161 5972 ... 125 39 46 32 50 1088
20 LgAvg 34 27.5 3.42 163 6042 ... 111 39 75 38 61 1101
21 Tm #Bat BatAge R/G G PA ... GDP HBP SH SF IBB LOB
22 NaN 676 27.5 3.42 3250 120833 ... 2217 778 1507 753 1223 22014
[23 rows x 29 columns]]
Ah ha! We can see that this entry contains 2 dataframes in the list of dataframes and that the second one is the one that we actually want to store in our spreadsheet. Now we can just insert a bit of logic into our loop so that we grab the correct dataframes before and after 1968. Our logic should look something like:
for year in years:
website = f"https://www.baseball-reference.com/leagues/MLB/{str(year)}.shtml"
baseball_stats = pd.read_html(website)
stats[year]=baseball_stats
if year > 1968:
dfs.append(baseball_stats[0])
else:
dfs.append(baseball_stats[1])
Now when the program is excecuted, the resulting xlsx file contains consistent headers for all columns in all 66 tabs in our spreadsheet!
Next time we will figure out how to read from our spreadsheet (instead of reading from the website), and start on some data gathering so that we can make a plot for one of the teams over the last 66 years. Thanks for watching and see you in the next tutorial!
skip_nextPython Pandas and Spreadsheets #3