Passing the 'optimal lag value' to my Johansen cointegration test

Strangely, when I pass this 'optimal lag value' to my Johansen cointegration test it fails, but if I use lag 1 it works. Is something wrong? How should I interpret this?

Hello Jenny,



It might be because you have too few observations. If you increase the number of lags to test, you also need more observations to run the test.



Please verify that you have a sufficient number of observations relative to the total number of parameters of the VAR used for the Johansen test.
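
As a rough sketch of that check: a VAR(p) on k series with a constant estimates k*p + 1 coefficients per equation, and p observations are lost to the lags. The values below are placeholders; substitute your own:

    k = 12        # number of series in the VAR (placeholder)
    p = 5         # candidate lag order (placeholder)
    n_obs = 100   # observations available after differencing (placeholder)

    params_per_eq = k * p + 1   # k*p lag coefficients plus a constant
    usable_obs = n_obs - p      # p observations are consumed by the lags
    print(f"parameters per equation: {params_per_eq}")
    print(f"usable observations:     {usable_obs}")
    if usable_obs <= params_per_eq:
        print("Too few observations: reduce the lag order or the number of series.")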



Thanks



José Carlos

I used 100. I'll try 252 now.

252 didn't work, btw.

I forgot to round the lag value!

Hello Jenny,



Great to know you found the solution. It's really good to face these issues, since working through them is the best way to learn to code.



Thanks and regards,



José Carlos

The performance is lower using the optimal lag value than using a 1-period lag. Is this normal?

Hello Jenny,



Could you please provide your code for checking?



Thanks



José Carlos

What's your email? I'll send it to you there!

 

Run it and see for yourself. Let me know if you confirm what I'm saying, that the code does worse with the optimised lag. I can't see a reason why, or anything that's wrong here.

    # Context assumed from the rest of the strategy (not shown in this post):
    # `data` is a DataFrame of prices; `spred`, `ind`, `adf_filter`,
    # `adf_filter_sort`, `get_datetime`, `order_target` and `order_target_percent`
    # are defined elsewhere. Required imports: math, numpy as np, pandas as pd,
    # statsmodels.tsa.api as tsa, and coint_johansen from
    # statsmodels.tsa.vector_ar.vecm.

    # Drop columns with missing values, then first-difference the prices
    data.dropna(inplace=True, axis='columns')
    stock_data = data.diff()

    # Keep only the assets whose differenced series pass the ADF stationarity filter
    adf = stock_data.apply(adf_filter, axis=0)
    adf = adf[adf == True]
    print("adf", adf)
    print(f'{get_datetime()}: total assets found in adf {len(adf.index)}')
    stock_data = data[adf.index].diff()


    # Rank the surviving assets by the ADF sort statistic and keep the last 12
    adfsort = stock_data.apply(adf_filter_sort, axis=0)
    adfsort = adfsort.sort_values(axis=0, ascending=False)
    adfsorts = adfsort.rank().sort_values()
    adfsortss = adfsort.index[-12:]
    stock_data = data[adfsorts.index[-12:]]
    
    IC = "aic" # or "bic", "fpe", "hqic"
    stock_data.dropna(inplace=True, axis='columns')
    #print("stock_data",len(stock_data.index))
    #print("stock_data",stock_data)


    mod = tsa.VAR(stock_data)
    res = mod.select_order(trend='c')
    print(f"{IC} selects {res.ics[IC][res.selected_orders[IC]]}")

    
    if len(stock_data.columns) > 2:

        #stock_data = np.log(stock_data).dropna(axis='columns')
        try:
            # 'Optimal lag' variant: pass the selected lag order, not the rounded
            # criterion value. coint_johansen's third argument is k_ar_diff, the
            # number of lagged differences, so a levels VAR(p) maps to p - 1.
            #result = coint_johansen(stock_data, 0, res.selected_orders[IC] - 1)# <----Jose Carlos Gonzales Tanaka
            result = coint_johansen(stock_data, 0, 1)
        except Exception:
            print("No cointegration")
            return
        # Store the results of the Johansen test.
        # The eigenvectors define the cointegrating spreads; the first row of the
        # transposed matrix forms the strongest cointegrating spread
        ev = result.evec.T[0]

        # Normalise the eigenvector by dividing through by its first element
        ev = ev / ev[0]

        # Pair each asset with the negative of its eigenvector weight
        evv = []
        df = pd.DataFrame(stock_data.columns.values.tolist())
        for i in range(len(stock_data.columns)):
            evv.append(-ev[i])
        df[1] = evv
        print("df", df)

        # Total absolute weight (available for scaling position sizes)
        Total = df[1].abs().sum()


        # Largest eigenvalue and the half-life of mean reversion it implies
        # (the *24 presumably converts to the strategy's bar frequency)
        theta = result.eig[0]
        half_life = math.log(2)/theta*24
        #print('Half-life :', half_life.round(2))




        # Append the latest spread value: last observation dotted with the eigenvector
        spred.append(np.dot(stock_data.iloc[-1], ev))
        ind.append(stock_data.index[-1])
        sped = pd.Series(spred, index=ind)
        

        
        #lookback = (math.ceil(half_life/24))*3
        lookback = 5

        # Moving average and moving standard deviation of the spread
        stock_data['moving_average'] = sped.rolling(lookback).mean()
        stock_data['moving_std_dev'] = sped.rolling(lookback).std()

        # Upper and lower bands around the moving average
        stock_data['upper_band'] = stock_data.moving_average + 0.5*stock_data.moving_std_dev
        stock_data['lower_band'] = stock_data.moving_average - 0.5*stock_data.moving_std_dev

        # Entry/exit signals from the latest spread value
        stock_data['long_entry'] = sped.iloc[-1] < stock_data.lower_band.iloc[-1]
        stock_data['long_exit'] = sped.iloc[-1] >= stock_data.moving_average.iloc[-1]
        stock_data['short_entry'] = sped.iloc[-1] > stock_data.upper_band.iloc[-1]
        stock_data['short_exit'] = sped.iloc[-1] <= stock_data.moving_average.iloc[-1]
        

        

        # Trade only if the maximum-eigenvalue statistic exceeds its 95% critical value
        if round(result.lr2[0], 4) > round(result.cvm[0, 1], 4):
            for index, row in df.iterrows():
                i, j = row[[0, 1]]
                print("i+", i)
                print("j+", j)
                # Long entry when the spread is below the lower band
                if stock_data.long_entry.iloc[-1]:
                    print("{} Long entry".format(get_datetime()))
                    #order_target_percent(i, (j/Total))
                    order_target(i, j)
                # Exit longs when the spread is at or below the moving average
                if stock_data.short_exit.iloc[-1]:
                    print("{} Long exit".format(get_datetime()))
                    order_target_percent(i, 0)
                # Short entry when the spread is above the upper band
                if stock_data.short_entry.iloc[-1]:
                    print("{} Short entry".format(get_datetime()))
                    #order_target_percent(i, (j/Total))
                    order_target(i, j)
Hello Jenny,



Our apologies for the late response.



Thanks for sending us your code. Now I see what you meant by 'optimal lag value'.



Let me explain.



You asked: the performance is lower using the optimal lag value than using a 1-period lag; is this normal?



Answer: A VAR with 1 lag is a different model from a VAR with 2 or more lags. In terms of the forecasts it produces, a VAR(1) is as different from a VAR(2) as it is from an entirely different model such as a random forest. The point I want you to take away is that each VAR, with its own number of lags, will produce different forecasts of the time series. Consequently, you cannot expect the same performance from two models, even if they differ only in their number of lags.
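
To see this concretely, here is a small sketch on simulated data (not your strategy's data) showing that a VAR(1) and a VAR(2) fitted to the same series already disagree on the one-step forecast:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.api import VAR

    # Two simulated stationary series
    rng = np.random.default_rng(0)
    data = pd.DataFrame(rng.standard_normal((500, 2)), columns=["x", "y"])

    # Same data, two lag orders, two different one-step forecasts
    for p in (1, 2):
        res = VAR(data).fit(p)
        fc = res.forecast(data.values[-p:], steps=1)
        print(f"VAR({p}) one-step forecast: {fc.round(4)}")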



Besides, in econometrics there is a concept called overfitting. Overfitting occurs when you fit the model too closely to the sample data. One possible negative consequence of overfitting is that the model makes poorer forecasts. In this case, when you increase the number of lags, you might overfit the model to the data, and thus you might see poorer performance.
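
One way to check this in your own backtest is to compare one-step-ahead forecast errors across lag orders on a holdout sample. Here is a sketch on simulated data; whether the higher-lag models actually forecast worse depends on your data:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.api import VAR

    rng = np.random.default_rng(1)
    data = pd.DataFrame(rng.standard_normal((300, 2)), columns=["x", "y"])
    train, test = data.iloc[:250], data.iloc[250:]

    for p in (1, 2, 5, 10):
        res = VAR(train).fit(p)
        history = train.values.tolist()
        sq_errors = []
        # Rolling one-step-ahead forecasts over the holdout sample
        for t in range(len(test)):
            fc = res.forecast(np.asarray(history[-p:]), steps=1)[0]
            sq_errors.append(((test.iloc[t].values - fc) ** 2).mean())
            history.append(test.iloc[t].values.tolist())
        print(f"p={p}: holdout one-step MSE = {np.mean(sq_errors):.4f}")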



Please don't forget the following: when you find the optimal lag with information criteria, always check for autocorrelation in the residuals of the VAR chosen by the criterion you selected. One of the most important requirements when estimating a VAR model is that its errors be free of autocorrelation. Once you find there is no autocorrelation in the VAR errors, you can proceed with the cointegration test.
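
Putting the workflow together, here is a sketch on simulated data (the nlags=12 and the 0.05 threshold are illustrative choices): select the lag order, test the fitted VAR's residuals for autocorrelation with a Portmanteau (whiteness) test, and only then run the Johansen test. Keep in mind that coint_johansen's third argument, k_ar_diff, counts lagged differences, so a levels VAR(p) corresponds to k_ar_diff = p - 1:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.api import VAR
    from statsmodels.tsa.vector_ar.vecm import coint_johansen

    # Simulated price levels for three assets
    rng = np.random.default_rng(2)
    prices = pd.DataFrame(rng.standard_normal((500, 3)).cumsum(axis=0),
                          columns=["a", "b", "c"])

    # 1. Select the VAR lag order on the levels
    sel = VAR(prices).select_order(maxlags=10, trend='c')
    p = sel.selected_orders['aic']   # the chosen lag order (an integer)

    # 2. Check the fitted VAR's residuals for autocorrelation
    fitted = VAR(prices).fit(p)
    white = fitted.test_whiteness(nlags=12)
    print(f"selected p = {p}, whiteness test p-value = {white.pvalue:.3f}")

    # 3. Proceed to Johansen only if the residuals look uncorrelated
    if white.pvalue > 0.05:
        result = coint_johansen(prices, 0, max(p - 1, 1))  # k_ar_diff = p - 1
        print("trace statistics:", result.lr1.round(2))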



I hope this helps,



Thanks and regards,



José Carlos