Johansen test - lagged difference terms

Hi,

In the course Mean Reversion Strategies, we've learned about the Johansen test. The examples from the course always use a single lagged difference term and a constant trend term.

Could someone explain a good way of determining the number of lagged difference terms to use in the Johansen test? Or is 1 simply the typical number used in practice (since more lags introduce more parameters, higher variance in the estimates, etc.)? Would a test with 0 lagged difference terms also give a reasonable cointegrating hedge ratio (that would reduce the number of fitted parameters even further)?



Thanks,

Razvan

Hi Razvan



That's a very interesting question. You can use information criteria such as AIC and BIC to find the optimal lag length. You are spot on: 1 is widely used as the lag length to reduce overfitting. With 0 it will not work, as you need to check against at least 1 previous value to determine whether the time series are cointegrated.



You might find the discussions below interesting, but in the end I would guard against overfitting and go with a lag length of 1.

1. https://www.researchgate.net/post/How_do_you_choose_the_optimal_laglength_in_a_time_series

2. https://www.researchgate.net/post/How_can_I_decide_the_lag_lenth_in_Johansen_Test



I hope this helps.



Thanks!

Thank you for your reply. I think, though, that when k_ar_diff=0 (using the statsmodels parameter name) the model still has a lagged term; it just doesn't have a lagged difference term. So it is similar to a VAR(1) model and can still model a mean-reverting process. In principle, then, it seems to me that a model with k_ar_diff=0 could still fit a mean-reverting process, with fewer parameters (and smaller variance in the fitted parameters) than a model with k_ar_diff=1.
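A minimal numpy sketch of this point, on synthetic data and using plain least squares rather than the Johansen procedure itself: with no lagged difference terms the error-correction form is dy_t = c + Pi y_{t-1} + e_t, which is the VAR(1) y_t = c + (I + Pi) y_{t-1} + e_t written differently. The data-generating process and all variable names here are assumptions made for illustration.

```python
import numpy as np

# Synthetic cointegrated pair (illustrative assumption): 2*x - y is stationary.
rng = np.random.default_rng(1)
n = 1000
common = np.cumsum(rng.normal(size=n))
x = common + rng.normal(scale=0.5, size=n)
y = 2.0 * common + rng.normal(scale=0.5, size=n)
levels = np.column_stack([x, y])

lagged = levels[:-1]                           # y_{t-1}
diffs = np.diff(levels, axis=0)                # dy_t = y_t - y_{t-1}
X = np.column_stack([np.ones(n - 1), lagged])  # constant + lagged levels

# Error-correction form with zero lagged differences:
#   dy_t = c + Pi @ y_{t-1} + e_t
coef_vecm, *_ = np.linalg.lstsq(X, diffs, rcond=None)
Pi = coef_vecm[1:].T

# VAR(1) in levels: y_t = c + A @ y_{t-1} + e_t
coef_var, *_ = np.linalg.lstsq(X, levels[1:], rcond=None)
A = coef_var[1:].T

# Both fits describe the same model: A equals I + Pi up to rounding.
print("A == I + Pi:", np.allclose(A, np.eye(2) + Pi))

# Pi is close to rank one here; its leading right singular vector
# approximates the cointegrating direction (normalised so x has weight 1).
_, s, vt = np.linalg.svd(Pi)
hedge = vt[0] / vt[0, 0]
print("singular values of Pi:", s)
print("approximate hedge ratio:", hedge)
```

This only illustrates that a model without lagged differences can still capture the mean-reverting combination; whether that is enough for real price data depends on how much serial correlation is left in the residuals.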



Another related question, when looking at the johansen.py function provided in section 3, it seems to me there might be a typo on line 133: dx = detrend(lx, f)

Shouldn't this be: lx = detrend(lx, f) instead of dx = …?



Thanks,

Razvan

In the code's notation, dx indicates the detrended x series and lx the x series lagged by k. That is why the variable name dx is used. Thank you.