Retirement Simulator #5: Fit the Market Data with scipy curve_fit

Published Dec. 31, 2021, 9:29 p.m.

Highlights:

  • Plot a histogram of US Market Data
  • Fit the plot using a "Gaussian" Function
  • Use our fit create a "model" distribution

Welcome to the 5th tutorial in this series.  I switched from spyder to the notepad++ editor this time.  This should make no difference in the result.  You may continue as you were.  I just needed the horizontal space afforded by the text editor. 

Last time we established historical data.  The trouble is that we don't want to only choose from historical data, but rather we want to use the historical data to create a realistic model.  Then we want to randomly sample from our model.  In other words, we don't want to be foreced to pick the exact rates from previous years.  We want a function or a distribution that looks like the data.

First we will want to transplant our "import matplotlib.pyplot as plt" line from our test section to the top of the code.  Also we will need to import a curve fitting module as well.

## move me to the top with the other imports!
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

These are the only extra imports that we need at the moment.  Recall what we did last time.  We read a file and created a list of lines and filled two lists: one with years and one with .  Now we woratesuld like to take the rates data and fit it with a function. 

There is a bit of art and science to creating histograms.  If the bins are either too wide or too narrow we will not be able to discern the shape of our function.  We will try bins with a width of 10%.  This means we need to create our own distribution.  Python has a method that makes this easy (multiple methods really).  We will use a numpy function called 'arange':

rate_bins = np.arange(-0.55, 0.65, 0.1)
bin_centers = np.arange(-0.5, 0.65, 0.1)

The np.arange function uses a (start, stop, step-size) architecture.  Python generally includes the start and excludes the stop.  The rate_bins will start at -.55 and step with a step size of 0.1 with the last entry being 0.55.  These are the bin edges.  The bin_centers start at -0.5 and end at 0.5.  There should always be one more bin edge than bin center.

Now we plot the histogram using our custom made bins:

the_weights, thebins, zzz = plt.hist(rates, bins=rate_bins)

The plt.hist function returns three items:

  1.  Weights: the y value of the histogram plot
  2.  Bins:  the bins used in the plot
  3.  zzz: A representation of the plot:  The user can do something to remove this plot from being shown.

Try to forget about zzz for now.  It is beyond the scope of this tutorial.  We need to accept three returned items from the plt.hist function because we need to use one of them: the_weights.

Remember all of this is aimed at turning data into a model and fitting our model with a mathematical function.  The curve_fit function that we imported earlier is about to come to use.  Before we use it we should figure out what kind of input it needs.  The input required by curve_fit is three-fold:  the function to be fit, x-values and y-values.  The function can be any type we can define.  We are not limited to linear, or polynomial fits (although we can do those as well!).  So before we use curve fit we need to define a Gaussian function.

def gaus(x, a0, x0, sigma):
    return a0*np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

Our function is a traditional Gaussian function.  It takes 'x' and three parameters and returns 'y' (like a function would).  This Gaussian is centered at x0, has amplitude a0 and width corresponding to sigma.  One nice feature of this definition of a function is that without needing to specify anything, we can give this function 'x' as a value OR 'x' as a list of values.  It will return values in the form given to it.

So far we do not know what are good parameters for a0, x0 and sigma.  That is the information that we hope to get from the curve_fit function.  By esablishing our own bins (and bin centers) and plotting the histogram so that it returned the_weights we now have the three things we need:  the function, x-values and y-values.

One final thing:  curve_fit returns two sets of values the first set is often refered to as 'popt' short for parameters-optimized or optimized parameters.  IN THIS CASE that will be a set of values coresponding to a0, x0 and sigma-- in that order because that is the order established by our definition.  The second set of values returned by curve_fit is called 'pcov' short for parameter covarience.  Covariance is a measure of the joint variability of two random variables.  For more on covarience, take a college statistics course or some such thing.  The main thing here is that since the curve_fit function returns these two sets of parameters, we must accept them both (even if we will only discard the covariences).

In python we would write:

popt, pcov =curve_fit(gaus, bin_centers, the_weights)
print(popt)
sys.exit()

Here I 'print(popt)' in order to see the optimize parameters that we have found.  Using sys.exit() just quits the program so that we can ignore anything following this statement for the moment. 

One final note:  In our case curve_fit was able to generate optimized parameters without us needing to provide any initial parameters.  My experience tells me that we were extremely lucky.  If you try to fit a defined curve and curve_fit barfs at you, you may need to give curve_fit some initial guesses for the parameters.  When no initial parameters are offered, curve_fit assumes 1 for each parameters and starts from there. (Pop quiz hot shot:  how might you estimate some initial parameters?)

For more info on curve_fit, open your terminal, import curve_fit as shown and then type:

>>help(curve_fit)

Now we can generate x and y values using our gaus function and optimized parameters to see how well our function fits the histogram.  To generate x values we can use the np.arange function again:

x_vals = np.arange(-0.55, .75, .01)

and generate values spaced 1% apart. (instead of 10% as we did earlier).

To generate y values we need to use our gaus function and optimized parameters with our new list of x values:

y_vals = gaus(x_vals, *potp)

Using the asterix before the popt tells python to use the values from the container without the container (i.e. give me the values without the parentheses).

Now we can plot our x and y values (thereby plotting our fit) with the histogram that we are fitting:

plt.plot(x_vals, y_vals)
plt.show()

If this all worked then you should have a nice smooth line that traces over the chunky blocks of our histogram pretty well.  Remember that an optimized fit should show the best comparison of our model to the data.  The fit is what we think the data would look like if we had more of it. 

In the next tutorial we will make a way to sample from our newly created distribution.  Thanks for watching!  See you next time.

skip_nextRetirement Simulator #6: Sampling from our custom market distribution
  • Retirement Simulator Introduction and Overview

  • Retirement Simulator #1: March through time with datetime and put money under the mattress!

  • Retirement Simulator #2: Apply an interest rate and verify it!

  • Retirement Simulator #3: Use a Gaussian distribution of rates!

  • Retirement Simulator #3+: Verify our Gaussian distribution!

  • Retirement Simulator #4: Make a US market rates distribution

  • Retirement Simulator #5: Fit the Market Data with scipy curve_fit
    (currently viewing)
  • Retirement Simulator #6: Sampling from our custom market distribution

  • Retirement Simulator #7: Organize multiple plots with subplots! and learn about plt.pause()

  • Retirement Simulator #8: Multiverse Investing: Simulate 1000 random investors

  • Retirement Simulator #9: Track averages and final outcomes

  • Retirement Simulator #10: Test and Compare Different Retirement dates (And make it pretty!)