Python Pandas and Spreadsheets #2

Published Sept. 21, 2020, 8:10 p.m.

In the last tutorial we collected some data from a website (or 66 webpages to be precise), and stored it into a spreadsheet.  However, there was a problem.  The data from 1954 up to 1968 don't have the same column headings as the post-1968 data.  To investigate this we need to do some combing through our data. 

In this tutorial we will construct a python dictionary to do just that.  The dictionary and the interactive terminal will allow us to see why we have a mismatch in our data set and see what we can do to fix it.  Once the data are verfied to be same kind for each year we will re-save the spreadsheet and move on.

To get started we will define a new variable up among our other declared variables.  We'll call this one 'stats':

stats = {}

'stats' is an empty dictionary.  Now in our loop we will fill our dictionary on each iteration in the following way:

    stats[year] = baseball_stats

This uses the 'year' as the dictionary key and the full list of dataframes as the value.

After running the program we can use the interactive session to investigate what exists in entry for 'stats[1968]'.

stats[1968]
Out[3]: 
[      Rk   Tm   Lg      G    W   L  ...  ExInn   1Run   vRHP   vLHP  ≥.500  <.500
 0    1.0  DET   AL  164.0  103  59  ...   10-5  35-23  69-44  34-15  57-33  46-26
 1    2.0  STL   NL  162.0   97  65  ...   15-8  34-28  60-49  37-16  41-31  56-34
 2    3.0  BAL   AL  162.0   91  71  ...    9-8  23-22  65-51  26-20  46-44  45-27
 3    4.0  SFG   NL  163.0   88  74  ...    5-8  27-22  64-52  24-22  38-34  50-40
 4    5.0  CLE   AL  162.0   86  75  ...   12-4  25-20  64-53  22-22  41-49  45-26
 5    6.0  CHC   NL  163.0   84  78  ...    7-4  30-22  58-62  26-16  35-37  49-41
 6    7.0  BOS   AL  162.0   86  76  ...    5-7  26-28  68-59  18-17  43-47  43-29
 7    8.0  CIN   NL  163.0   83  79  ...  12-10  23-27  59-58  24-21  34-38  49-41
 8    9.0  NYY   AL  164.0   83  79  ...    4-7  36-28  62-54  21-25  39-51  44-28
 9   10.0  ATL   NL  163.0   81  81  ...   7-13  27-30  61-53  20-28  32-40  49-41
 10  11.0  OAK   AL  163.0   82  80  ...   7-12  25-30  56-41  26-39  44-46  38-34
 11  12.0  PIT   NL  163.0   80  82  ...   9-12  27-36  54-59  26-23  41-49  39-33
 12  13.0  MIN   AL  162.0   79  83  ...    7-8  35-33  60-65  19-18  49-59  30-24
 13  14.0  PHI   NL  162.0   76  86  ...    9-2  37-29  50-58  26-28  40-50  36-36
 14  15.0  LAD   NL  162.0   76  86  ...    9-4  35-33  48-58  28-28  42-48  34-38
 15  16.0  CAL   AL  162.0   67  95  ...    9-5  25-28  46-69  21-26  40-68  27-27
 16  17.0  CHW   AL  162.0   67  95  ...   9-15  30-44  51-67  16-28  37-71  30-24
 17  18.0  NYM   NL  163.0   73  89  ...   2-13  26-37  44-62  29-27  37-53  36-36
 18  19.0  HOU   NL  162.0   72  90  ...    6-7  28-30  48-62  24-28  37-53  35-37
 19  20.0  WSA   AL  161.0   65  96  ...    7-8  21-25  45-68  20-28  44-63  21-33
 20   NaN  Avg  NaN    NaN   80  80  ...    8-8  28-28  56-57  24-23  40-48  40-32
 
 [21 rows x 22 columns],
        Tm  #Bat  BatAge   R/G     G      PA  ...   GDP  HBP    SH   SF   IBB    LOB
 0     ATL    38    28.7  3.15   163    6119  ...   145   36    86   30    85   1175
 1     BAL    33    26.9  3.57   162    6023  ...   102   48    80   49    63   1119
 2     BOS    36    26.5  3.79   162    6037  ...   121   33    77   41    58   1108
 3     CAL    43    27.3  3.07   162    5944  ...   103   52    75   37    41   1092
 4     CHC    42    28.4  3.75   163    6020  ...   103   36    74   37    58   1063
 5     CHW    39    28.3  2.86   162    5965  ...   108   40    90   33    47   1059
 6     CIN    34    26.5  4.23   163    6287  ...   142   37    64   39    95   1133
 7     CLE    42    27.4  3.19   162    5990  ...    96   39    69   37    51   1097
 8     DET    33    28.1  4.09   164    6179  ...   124   61    73   34    53   1090
 9     HOU    38    25.6  3.15   162    5994  ...   115   48    97   34    81   1114
 10    LAD    35    27.9  2.90   162    5935  ...    89   18    79   45    61   1093
 11    MIN    33    28.2  3.47   162    5975  ...   106   49    69   38    61   1104
 12    NYM    34    26.0  2.90   163    6027  ...   104   43    75   27    50   1089
 13    NYY    34    28.1  3.27   164    5993  ...   100   30    56   30    52   1057
 14    OAK    34    24.9  3.49   163    6027  ...    87   35    78   35    53   1091
 15    PHI    32    29.2  3.35   162    5968  ...   106   22    64   46    60   1059
 16    PIT    33    29.7  3.58   163    6163  ...   133   33    96   43    93   1125
 17    SFG    31    27.8  3.67   163    6133  ...   117   47    92   44    55   1145
 18    STL    32    28.7  3.60   162    6082  ...    91   32    67   42    56   1113
 19    WSA    39    26.9  3.25   161    5972  ...   125   39    46   32    50   1088
 20  LgAvg    34    27.5  3.42   163    6042  ...   111   39    75   38    61   1101
 21     Tm  #Bat  BatAge   R/G     G      PA  ...   GDP  HBP    SH   SF   IBB    LOB
 22    NaN   676    27.5  3.42  3250  120833  ...  2217  778  1507  753  1223  22014
 
 [23 rows x 29 columns]]

Ah ha!  We can see that this entry contains 2 dataframes in the list of dataframes and that the second one is the one that we actually want to store in our spreadsheet.  Now we can just insert a bit of logic into our loop so that we grab the correct dataframes before and after 1968.  Our logic should look something like:

for year in years:
    website = f"https://www.baseball-reference.com/leagues/MLB/{str(year)}.shtml"

    baseball_stats = pd.read_html(website)
    stats[year]=baseball_stats
    if year > 1968:
        dfs.append(baseball_stats[0])
    else:
        dfs.append(baseball_stats[1])
        

Now when the program is excecuted, the resulting xlsx file contains consistent headers for all columns in all 66 tabs in our spreadsheet!

Next time we will figure out how to read from our spreadsheet (instead of reading from the website), and start on some data gathering so that we can make a plot for one of the teams over the last 66 years.  Thanks for watching and see you in the next tutorial!

 

skip_nextPython Pandas and Spreadsheets #3
  • Python Pandas and Spreadsheets #1

  • Python Pandas and Spreadsheets #2
    (currently viewing)
  • Python Pandas and Spreadsheets #3

  • Python Pandas and Spreadsheets #4

  • Python Pandas and Spreadsheets #5

  • Python Pandas and Spreadsheets #6