
Prediction (out of sample)


In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.981
Method:                 Least Squares   F-statistic:                     867.5
Date:                Tue, 25 Aug 2015   Prob (F-statistic):           1.78e-40
Time:                        00:45:27   Log-Likelihood:                 1.0243
No. Observations:                  50   AIC:                             5.951
Df Residuals:                      46   BIC:                             13.60
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          5.0830      0.084     60.339      0.000         4.913     5.253
x1             0.4953      0.013     38.125      0.000         0.469     0.521
x2             0.4219      0.051      8.260      0.000         0.319     0.525
x3            -0.0208      0.001    -18.221      0.000        -0.023    -0.018
==============================================================================
Omnibus:                        1.746   Durbin-Watson:                   1.889
Prob(Omnibus):                  0.418   Jarque-Bera (JB):                0.934
Skew:                          -0.255   Prob(JB):                        0.627
Kurtosis:                       3.435   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.56341717   5.0144075    5.43096156   5.79008837   6.0770943
   6.28799715   6.43018041   6.52118022   6.58580416   6.6520555
   6.74653231   6.89005748   7.09425756   7.3596526    7.67557064
   8.02190114   8.37239905   8.69899681   8.97641578   9.18631944
   9.320326     9.38138557   9.3832954    9.34843307   9.3040799
   9.27793872   9.29358236   9.36658069   9.50194257   9.69329289
   9.92391996  10.16952124  10.40219613  10.59502999  10.72651615
  10.78408934  10.7661902   10.68252142  10.55245366  10.40184296
  10.2587827   10.14898794  10.09157047  10.09589886  10.16005915
  10.27116937  10.40749601  10.54202474  10.64689879  10.69799578]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.66407583  10.5168051   10.27272731   9.96954351   9.65688155
   9.38414547   9.18841962   9.08538917   9.06549979   9.09629691]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform; i.e., we don't want any formula expansion magic from using **2.
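As an illustrative check (assuming patsy, the formula engine behind statsmodels formulas, is importable directly), dmatrix shows that the I() term produces literally the squared values, with no term expansion:

```python
import numpy as np
from patsy import dmatrix

x1 = np.linspace(0, 20, 5)

# I(...) protects the expression: it is evaluated as ordinary
# NumPy arithmetic instead of being parsed as formula operators
dm = dmatrix("I((x1 - 5) ** 2)", {"x1": x1}, return_type="dataframe")
print(dm)

# Column 0 is the intercept; column 1 holds the literal squared values
col = np.asarray(dm)[:, 1]
print(np.allclose(col, (x1 - 5) ** 2))  # True
```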

In [8]:
res.params
Out[8]:
Intercept           5.083036
x1                  0.495320
np.sin(x1)          0.421859
I((x1 - 5) ** 2)   -0.020785
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 10.66407583,  10.5168051 ,  10.27272731,   9.96954351,
         9.65688155,   9.38414547,   9.18841962,   9.08538917,
         9.06549979,   9.09629691])
