Regression diagnostics
======================


.. _regression_diagnostics_notebook:

`Link to Notebook GitHub <https://github.com/statsmodels/statsmodels/blob/master/examples/notebooks/regression_diagnostics.ipynb>`_

.. raw:: html

   
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <p>This example file shows how to use a few of the <code>statsmodels</code> regression diagnostic tests in a real-life context. You can learn about more tests and find out more information abou the tests here on the <a href="http://statsmodels.sourceforge.net/stable/diagnostic.html">Regression Diagnostics page.</a> </p>
   <p>Note that most of the tests described here only return a tuple of numbers, without any annotation. A full description of outputs is always included in the docstring and in the online <code>statsmodels</code> documentation. For presentation purposes, we use the <code>zip(name,test)</code> construct to pretty-print short descriptions in the examples below.</p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <h2 id="estimate-a-regression-model">Estimate a regression model</h2>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing code_cell rendered">
   <div class="input">
   <div class="prompt input_prompt">
   In&nbsp;[1]:
   </div>
   <div class="inner_cell">
       <div class="input_area">
   <div class="highlight"><pre><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">print_function</span>
   <span class="kn">from</span> <span class="nn">statsmodels.compat</span> <span class="kn">import</span> <span class="n">lzip</span>
   <span class="kn">import</span> <span class="nn">statsmodels</span>
   <span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
   <span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
   <span class="kn">import</span> <span class="nn">statsmodels.formula.api</span> <span class="kn">as</span> <span class="nn">smf</span>
   <span class="kn">import</span> <span class="nn">statsmodels.stats.api</span> <span class="kn">as</span> <span class="nn">sms</span>
   <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
   
   <span class="c"># Load data</span>
   <span class="n">url</span> <span class="o">=</span> <span class="s">&#39;http://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv&#39;</span>
   <span class="n">dat</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
   
   <span class="c"># Fit regression model (using the natural log of one of the regressaors)</span>
   <span class="n">results</span> <span class="o">=</span> <span class="n">smf</span><span class="o">.</span><span class="n">ols</span><span class="p">(</span><span class="s">&#39;Lottery ~ Literacy + np.log(Pop1831)&#39;</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">dat</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
   
   <span class="c"># Inspect the results</span>
   <span class="k">print</span><span class="p">(</span><span class="n">results</span><span class="o">.</span><span class="n">summary</span><span class="p">())</span>
   </pre></div>
   
   </div>
   </div>
   </div>
   
   <div class="output_wrapper">
   <div class="output">
   
   
   <div class="output_area"><div class="prompt"></div>
   <div class="output_subarea output_stream output_stdout output_text">
   <pre>
                               OLS Regression Results                            
   ==============================================================================
   Dep. Variable:                Lottery   R-squared:                       0.348
   Model:                            OLS   Adj. R-squared:                  0.333
   Method:                 Least Squares   F-statistic:                     22.20
   Date:                Tue, 25 Aug 2015   Prob (F-statistic):           1.90e-08
   Time:                        00:46:09   Log-Likelihood:                -379.82
   No. Observations:                  86   AIC:                             765.6
   Df Residuals:                      83   BIC:                             773.0
   Df Model:                           2                                         
   Covariance Type:            nonrobust                                         
   ===================================================================================
                         coef    std err          t      P&gt;|t|      [95.0% Conf. Int.]
   -----------------------------------------------------------------------------------
   Intercept         246.4341     35.233      6.995      0.000       176.358   316.510
   Literacy           -0.4889      0.128     -3.832      0.000        -0.743    -0.235
   np.log(Pop1831)   -31.3114      5.977     -5.239      0.000       -43.199   -19.424
   ==============================================================================
   Omnibus:                        3.713   Durbin-Watson:                   2.019
   Prob(Omnibus):                  0.156   Jarque-Bera (JB):                3.394
   Skew:                          -0.487   Prob(JB):                        0.183
   Kurtosis:                       3.003   Cond. No.                         702.
   ==============================================================================
   
   Warnings:
   [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
   
   </pre>
   </div>
   </div>
   
   </div>
   </div>
   
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <h2 id="normality-of-the-residuals">Normality of the residuals</h2>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <p>Jarque-Bera test:</p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing code_cell rendered">
   <div class="input">
   <div class="prompt input_prompt">
   In&nbsp;[2]:
   </div>
   <div class="inner_cell">
       <div class="input_area">
   <div class="highlight"><pre><span class="n">name</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;Jarque-Bera&#39;</span><span class="p">,</span> <span class="s">&#39;Chi^2 two-tail prob.&#39;</span><span class="p">,</span> <span class="s">&#39;Skew&#39;</span><span class="p">,</span> <span class="s">&#39;Kurtosis&#39;</span><span class="p">]</span>
   <span class="n">test</span> <span class="o">=</span> <span class="n">sms</span><span class="o">.</span><span class="n">jarque_bera</span><span class="p">(</span><span class="n">results</span><span class="o">.</span><span class="n">resid</span><span class="p">)</span>
   <span class="n">lzip</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">test</span><span class="p">)</span>
   </pre></div>
   
   </div>
   </div>
   </div>
   
   <div class="output_wrapper">
   <div class="output">
   
   
   <div class="output_area"><div class="prompt output_prompt">
       Out[2]:</div>
   
   
   <div class="output_text output_subarea output_pyout">
   <pre>
   [(&apos;Jarque-Bera&apos;, 3.3936080248431755),
    (&apos;Chi^2 two-tail prob.&apos;, 0.18326831231663288),
    (&apos;Skew&apos;, -0.48658034311223436),
    (&apos;Kurtosis&apos;, 3.0034177578816346)]
   </pre>
   </div>
   
   </div>
   
   </div>
   </div>
   
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <p>Omni test:</p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing code_cell rendered">
   <div class="input">
   <div class="prompt input_prompt">
   In&nbsp;[3]:
   </div>
   <div class="inner_cell">
       <div class="input_area">
   <div class="highlight"><pre><span class="n">name</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;Chi^2&#39;</span><span class="p">,</span> <span class="s">&#39;Two-tail probability&#39;</span><span class="p">]</span>
   <span class="n">test</span> <span class="o">=</span> <span class="n">sms</span><span class="o">.</span><span class="n">omni_normtest</span><span class="p">(</span><span class="n">results</span><span class="o">.</span><span class="n">resid</span><span class="p">)</span>
   <span class="n">lzip</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">test</span><span class="p">)</span>
   </pre></div>
   
   </div>
   </div>
   </div>
   
   <div class="output_wrapper">
   <div class="output">
   
   
   <div class="output_area"><div class="prompt output_prompt">
       Out[3]:</div>
   
   
   <div class="output_text output_subarea output_pyout">
   <pre>
   [(&apos;Chi^2&apos;, 3.7134378115971938), (&apos;Two-tail probability&apos;, 0.15618424580304724)]
   </pre>
   </div>
   
   </div>
   
   </div>
   </div>
   
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <h2 id="influence-tests">Influence tests</h2>
   <p>Once created, an object of class <code>OLSInfluence</code> holds attributes and methods that allow users to assess the influence of each observation. For example, we can compute and extract the first few rows of DFbetas by:</p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing code_cell rendered">
   <div class="input">
   <div class="prompt input_prompt">
   In&nbsp;[4]:
   </div>
   <div class="inner_cell">
       <div class="input_area">
   <div class="highlight"><pre><span class="kn">from</span> <span class="nn">statsmodels.stats.outliers_influence</span> <span class="kn">import</span> <span class="n">OLSInfluence</span>
   <span class="n">test_class</span> <span class="o">=</span> <span class="n">OLSInfluence</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
   <span class="n">test_class</span><span class="o">.</span><span class="n">dfbetas</span><span class="p">[:</span><span class="mi">5</span><span class="p">,:]</span>
   </pre></div>
   
   </div>
   </div>
   </div>
   
   <div class="output_wrapper">
   <div class="output">
   
   
   <div class="output_area"><div class="prompt output_prompt">
       Out[4]:</div>
   
   
   <div class="output_text output_subarea output_pyout">
   <pre>
   array([[-0.00301154,  0.00290872,  0.00118179],
          [-0.06425662,  0.04043093,  0.06281609],
          [ 0.01554894, -0.03556038, -0.00905336],
          [ 0.17899858,  0.04098207, -0.18062352],
          [ 0.29679073,  0.21249207, -0.3213655 ]])
   </pre>
   </div>
   
   </div>
   
   </div>
   </div>
   
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <p>Explore other options by typing <code>dir(influence_test)</code></p>
   <p>Useful information on leverage can also be plotted:</p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing code_cell rendered">
   <div class="input">
   <div class="prompt input_prompt">
   In&nbsp;[5]:
   </div>
   <div class="inner_cell">
       <div class="input_area">
   <div class="highlight"><pre><span class="kn">from</span> <span class="nn">statsmodels.graphics.regressionplots</span> <span class="kn">import</span> <span class="n">plot_leverage_resid2</span>
   <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">))</span>
   <span class="n">fig</span> <span class="o">=</span> <span class="n">plot_leverage_resid2</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">ax</span><span class="p">)</span>
   </pre></div>
   
   </div>
   </div>
   </div>
   
   <div class="output_wrapper">
   <div class="output">
   
   
   <div class="output_area"><div class="prompt"></div>
   
   
   <div class="output_text output_subarea ">
   <pre>
   &lt;matplotlib.figure.Figure at 0x7f90c603fc50&gt;
   </pre>
   </div>
   
   </div>
   
   </div>
   </div>
   
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <p>Other plotting options can be found on the <a href="http://statsmodels.sourceforge.net/stable/graphics.html">Graphics page.</a></p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <h2 id="multicollinearity">Multicollinearity</h2>
   <p>Condition number:</p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing code_cell rendered">
   <div class="input">
   <div class="prompt input_prompt">
   In&nbsp;[6]:
   </div>
   <div class="inner_cell">
       <div class="input_area">
   <div class="highlight"><pre><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">cond</span><span class="p">(</span><span class="n">results</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">exog</span><span class="p">)</span>
   </pre></div>
   
   </div>
   </div>
   </div>
   
   <div class="output_wrapper">
   <div class="output">
   
   
   <div class="output_area"><div class="prompt output_prompt">
       Out[6]:</div>
   
   
   <div class="output_text output_subarea output_pyout">
   <pre>
   702.17921454900602
   </pre>
   </div>
   
   </div>
   
   </div>
   </div>
   
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <h2 id="heteroskedasticity-tests">Heteroskedasticity tests</h2>
   <p>Breush-Pagan test:</p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing code_cell rendered">
   <div class="input">
   <div class="prompt input_prompt">
   In&nbsp;[7]:
   </div>
   <div class="inner_cell">
       <div class="input_area">
   <div class="highlight"><pre><span class="n">name</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;Lagrange multiplier statistic&#39;</span><span class="p">,</span> <span class="s">&#39;p-value&#39;</span><span class="p">,</span> 
           <span class="s">&#39;f-value&#39;</span><span class="p">,</span> <span class="s">&#39;f p-value&#39;</span><span class="p">]</span>
   <span class="n">test</span> <span class="o">=</span> <span class="n">sms</span><span class="o">.</span><span class="n">het_breushpagan</span><span class="p">(</span><span class="n">results</span><span class="o">.</span><span class="n">resid</span><span class="p">,</span> <span class="n">results</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">exog</span><span class="p">)</span>
   <span class="n">lzip</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">test</span><span class="p">)</span>
   </pre></div>
   
   </div>
   </div>
   </div>
   
   <div class="output_wrapper">
   <div class="output">
   
   
   <div class="output_area"><div class="prompt output_prompt">
       Out[7]:</div>
   
   
   <div class="output_text output_subarea output_pyout">
   <pre>
   [(&apos;Lagrange multiplier statistic&apos;, 4.8932133740939667),
    (&apos;p-value&apos;, 0.086586905023521704),
    (&apos;f-value&apos;, 2.50371594625644),
    (&apos;f p-value&apos;, 0.087940287826729857)]
   </pre>
   </div>
   
   </div>
   
   </div>
   </div>
   
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <p>Goldfeld-Quandt test</p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing code_cell rendered">
   <div class="input">
   <div class="prompt input_prompt">
   In&nbsp;[8]:
   </div>
   <div class="inner_cell">
       <div class="input_area">
   <div class="highlight"><pre><span class="n">name</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;F statistic&#39;</span><span class="p">,</span> <span class="s">&#39;p-value&#39;</span><span class="p">]</span>
   <span class="n">test</span> <span class="o">=</span> <span class="n">sms</span><span class="o">.</span><span class="n">het_goldfeldquandt</span><span class="p">(</span><span class="n">results</span><span class="o">.</span><span class="n">resid</span><span class="p">,</span> <span class="n">results</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">exog</span><span class="p">)</span>
   <span class="n">lzip</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">test</span><span class="p">)</span>
   </pre></div>
   
   </div>
   </div>
   </div>
   
   <div class="output_wrapper">
   <div class="output">
   
   
   <div class="output_area"><div class="prompt output_prompt">
       Out[8]:</div>
   
   
   <div class="output_text output_subarea output_pyout">
   <pre>
   [(&apos;F statistic&apos;, 1.1002422436378148), (&apos;p-value&apos;, 0.3820295068692508)]
   </pre>
   </div>
   
   </div>
   
   </div>
   </div>
   
   </div>
   <div class="cell border-box-sizing text_cell rendered">
   <div class="prompt input_prompt">
   </div>
   <div class="inner_cell">
   <div class="text_cell_render border-box-sizing rendered_html">
   <h2 id="linearity">Linearity</h2>
   <p>Harvey-Collier multiplier test for Null hypothesis that the linear specification is correct:</p>
   </div>
   </div>
   </div>
   <div class="cell border-box-sizing code_cell rendered">
   <div class="input">
   <div class="prompt input_prompt">
   In&nbsp;[9]:
   </div>
   <div class="inner_cell">
       <div class="input_area">
   <div class="highlight"><pre><span class="n">name</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;t value&#39;</span><span class="p">,</span> <span class="s">&#39;p value&#39;</span><span class="p">]</span>
   <span class="n">test</span> <span class="o">=</span> <span class="n">sms</span><span class="o">.</span><span class="n">linear_harvey_collier</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
   <span class="n">lzip</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">test</span><span class="p">)</span>
   </pre></div>
   
   </div>
   </div>
   </div>
   
   <div class="output_wrapper">
   <div class="output">
   
   
   <div class="output_area"><div class="prompt output_prompt">
       Out[9]:</div>
   
   
   <div class="output_text output_subarea output_pyout">
   <pre>
   [(&apos;t value&apos;, array(-1.0796490077783818)), (&apos;p value&apos;, 0.28346392475585869)]
   </pre>
   </div>
   
   </div>
   
   </div>
   </div>
   
   </div>

   <script src="https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"type="text/javascript"></script>
   <script type="text/javascript">
   init_mathjax = function() {
       if (window.MathJax) {
           // MathJax loaded
           MathJax.Hub.Config({
               tex2jax: {
               // I'm not sure about the \( and \[ below. It messes with the
               // prompt, and I think it's an issue with the template. -SS
                   inlineMath: [ ['$','$'], ["\\(","\\)"] ],
                   displayMath: [ ['$$','$$'], ["\\[","\\]"] ]
               },
               displayAlign: 'left', // Change this to 'center' to center equations.
               "HTML-CSS": {
                   styles: {'.MathJax_Display': {"margin": 0}}
               }
           });
           MathJax.Hub.Queue(["Typeset",MathJax.Hub]);
       }
   }
   init_mathjax();

   // since we have to load this in a ..raw:: directive we will add the css
   // after the fact
   function loadcssfile(filename){
       var fileref=document.createElement("link")
       fileref.setAttribute("rel", "stylesheet")
       fileref.setAttribute("type", "text/css")
       fileref.setAttribute("href", filename)

       document.getElementsByTagName("head")[0].appendChild(fileref)
   }
   // loadcssfile({{pathto("_static/nbviewer.pygments.css", 1) }})
   // loadcssfile({{pathto("_static/nbviewer.min.css", 1) }})
   loadcssfile("../../../_static/nbviewer.pygments.css")
   loadcssfile("../../../_static/ipython.min.css")
   </script>