Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
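A note on reproducibility: y is drawn with np.random.normal and no seed is set, so the numbers in the output cells below will differ from run to run. If you want identical draws across runs, you could seed the generator before sampling (a minimal sketch; the seed value is arbitrary and not part of the original example):

np.random.seed(12345)  # arbitrary seed, only so repeated runs match
y = y_true + sig * np.random.normal(size=nsample)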

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.979
Model:                            OLS   Adj. R-squared:                  0.978
Method:                 Least Squares   F-statistic:                     712.7
Date:                Sun, 16 Aug 2020   Prob (F-statistic):           1.50e-38
Time:                        18:00:44   Log-Likelihood:                -5.2958
No. Observations:                  50   AIC:                             18.59
Df Residuals:                      46   BIC:                             26.24
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0725      0.096     53.065      0.000       4.880       5.265
x1             0.4818      0.015     32.681      0.000       0.452       0.511
x2             0.4161      0.058      7.179      0.000       0.299       0.533
x3            -0.0180      0.001    -13.936      0.000      -0.021      -0.015
==============================================================================
Omnibus:                        3.299   Durbin-Watson:                   2.185
Prob(Omnibus):                  0.192   Jarque-Bera (JB):                2.324
Skew:                          -0.345   Prob(JB):                        0.313
Kurtosis:                       3.801   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
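Everything in the summary table is also available programmatically as attributes of the results object; for example, the coefficient estimates, their standard errors, and the R-squared:

print(olsres.params)    # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared of the fit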

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.62157415  5.05400303  5.45328752  5.79675163  6.06990303  6.26881405
  6.40076706  6.48305793  6.54015441  6.59967611  6.68785671  6.82523383
  7.02327476  7.28249239  7.59236099  7.93304553  8.2786605   8.60152287
  8.87670031  9.08610749  9.22147749  9.28571996  9.29244317  9.2637181
  9.22645234  9.20796918  9.23151851  9.3124569   9.4557246   9.65503382
  9.89390183 10.14835901 10.39088702 10.59493994 10.73930572 10.81159125
 10.81025853 10.74487732 10.6345529  10.50478709 10.38328841 10.29541961
 10.26003067 10.28636179 10.37252525 10.50581556 10.66479693 10.82282514
 10.95242507 11.02980478]

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.03052107 10.92067015 10.71656909 10.45540246 10.18611822  9.95744356
  9.80595481  9.74712222  9.77152238  9.84714534]
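predict returns point predictions only. If you also want uncertainty around the out-of-sample values, the results object provides get_prediction, whose summary_frame includes standard errors plus confidence and prediction intervals (a short sketch at the default 95% level):

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)
print(frame)  # columns: mean, mean_se, mean_ci_lower/upper, obs_ci_lower/upper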

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
[Figure: observed data (points), true curve (blue), and OLS prediction (red) over both the in-sample and out-of-sample range]
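Building on the interval sketch above, the same figure can be extended to shade the 95% prediction interval over the forecast range (this assumes the frame DataFrame from get_prediction is available):

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best");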

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we do not want any formula expansion magic from using **2; the expression inside I() is evaluated as ordinary Python.
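To see what I() is protecting against, you can build design matrices directly with patsy, the formula engine behind statsmodels.formula.api (a quick check; the column names in the comments are what patsy should report):

from patsy import dmatrix

# Without I(), ** is the formula power operator: x1**2 expands to
# interactions of x1 with itself, which collapses back to plain x1.
print(dmatrix("x1**2", data).design_info.column_names)     # ['Intercept', 'x1']

# With I(), the expression is evaluated as ordinary Python first,
# so a genuinely squared column enters the design matrix.
print(dmatrix("I(x1**2)", data).design_info.column_names)  # ['Intercept', 'I(x1 ** 2)']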

[9]:
res.params
[9]:
Intercept           5.072539
x1                  0.481805
np.sin(x1)          0.416080
I((x1 - 5) ** 2)   -0.018039
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.030521
1    10.920670
2    10.716569
3    10.455402
4    10.186118
5     9.957444
6     9.805955
7     9.747122
8     9.771522
9     9.847145
dtype: float64
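The formula interface supports the same uncertainty machinery as the array interface: get_prediction also accepts the exog dict and applies the formula transformations before computing intervals (a sketch mirroring the array-based example above):

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))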