
Prediction (out of sample)ΒΆ


In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1059.
Date:                Sun, 22 Mar 2015   Prob (F-statistic):           1.94e-42
Time:                        14:58:40   Log-Likelihood:                 4.1650
No. Observations:                  50   AIC:                           -0.3299
Df Residuals:                      46   BIC:                             7.318
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          4.9818      0.079     62.970      0.000         4.823     5.141
x1             0.4867      0.012     39.889      0.000         0.462     0.511
x2             0.4739      0.048      9.880      0.000         0.377     0.570
x3            -0.0183      0.001    -17.044      0.000        -0.020    -0.016
==============================================================================
Omnibus:                        0.796   Durbin-Watson:                   2.483
Prob(Omnibus):                  0.672   Jarque-Bera (JB):                0.293
Skew:                          -0.157   Prob(JB):                        0.864
Kurtosis:                       3.205   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.52528468   4.98350894   5.40474642   5.76317138   6.04227847
   6.23759447   6.35741327   6.4214332    6.45752062   6.4971316
   6.57014375   6.6999474    6.89960269   7.16969375   7.4982328
   7.86262966   8.23340326   8.57902534   8.87110038   9.08903097
   9.22340188   9.27752693   9.26690458   9.21667158   9.15747301
   9.12042747   9.1320143    9.20972315   9.35918048   9.5732251
   9.8330848   10.11146052  10.37701153  10.59950462  10.75478133
  10.82872712  10.81959035  10.73827005  10.60652527  10.45340012
  10.31045187  10.20656625  10.16321155  10.19091144  10.28751601
  10.43855494  10.61961496  10.80035087  10.9494712   11.03987926]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[ 11.04452589  10.9243899   10.69805486  10.40787036  10.10958335
   9.85868939   9.69684531   9.64167027   9.6824319    9.78267398]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we don't want any formula expansion magic from using **2.

In [8]:
res.params
Out[8]:
Intercept           4.981756
x1                  0.486687
np.sin(x1)          0.473874
I((x1 - 5) ** 2)   -0.018259
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 11.04452589,  10.9243899 ,  10.69805486,  10.40787036,
        10.10958335,   9.85868939,   9.69684531,   9.64167027,
         9.6824319 ,   9.78267398])
