From the blog (https://blog.quantinsti.com/kalman-filter-techniques-statistical-arbitrage-china-futures-market-python/)
You can see how to use the Kalman filter for pairs (two assets), but I'm wondering how to apply it to multiple assets using the Johansen cointegration test.
# x: the price series of contract one
# y: the price series of contract two
# Run regression to find hedge ratio and then create spread series
df1 = pd.DataFrame({'y':y,'x':x})
state_means = KalmanFilterRegression(KalmanFilterAverage(x),KalmanFilterAverage(y))
df1['hr'] = - state_means[:,0]
df1['spread'] = df1.y + (df1.x * df1.hr)
# Store the results of Johansen test
result = coint_johansen(df[:90], 0, 1)
# Store the value of eigenvector. Using this eigenvector, you can create the spread
ev = result.evec
# Take the transpose and use the first row of eigenvectors as it forms strongest cointegrating spread
ev = result.evec.T[0]
# Normalise the eigenvectors by dividing the row with the first value of eigenvector
ev = ev/ev[0]
# Print the mean reverting spread
print("\nSpread = {}.GLD + ({}).GDX + ({}).USO".format(ev[0], ev[1], ev[2]))
I'm not sure exactly what you are trying to do here. Are you trying to estimate a spread from more than two assets using a Kalman filter and then run a cointegration test on that spread? That is the wrong way to do it. For two assets, the approach should be to first test for a cointegrating relationship (Johansen or other methods). Once you are sure the pair is cointegrated, you can (optionally) use a Kalman filter to model a time-varying hedge ratio (rather than the fixed hedge ratio coming out of the cointegration test) or z-score. The Kalman filter itself is not a test for cointegration (or spread stability). This is because it is designed to make the residual stationary in the transition equation (and transfer the nonstationarity to the time-varying hedge ratio). In other words, a cointegration test run on a Kalman-filtered spread gives a spurious outcome.
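To make the "spurious outcome" point concrete, here is a minimal sketch using only the standard library. It is not the blog's `KalmanFilterRegression` or pykalman; `random_walk`, `kalman_spread`, `ols_spread` and the variances `q` and `r` are names and values I made up for illustration. It builds two independent random walks (not cointegrated by construction) and compares the spread from a fixed hedge ratio with the spread from a Kalman-filtered one.

```python
import random
import statistics

random.seed(0)

def random_walk(n, start=100.0):
    """Random walk of length n with unit-variance Gaussian steps."""
    path, level = [], start
    for _ in range(n):
        level += random.gauss(0.0, 1.0)
        path.append(level)
    return path

def kalman_spread(y, x, q=1e-3, r=1.0):
    """Scalar Kalman regression y_t = beta_t * x_t + e_t, with beta_t a random walk.

    q (state variance) and r (observation variance) are assumed, illustrative values.
    Returns the spread y_t - beta_t * x_t computed with the filtered beta_t.
    """
    beta, p = 0.0, 1.0                      # assumed initial state mean and variance
    spread = []
    for yt, xt in zip(y, x):
        p_pred = p + q                      # predict: beta follows a random walk
        innovation = yt - beta * xt         # one-step forecast error
        s = xt * xt * p_pred + r            # innovation variance
        k = p_pred * xt / s                 # Kalman gain
        beta += k * innovation              # updated hedge ratio
        p = (1.0 - k * xt) * p_pred         # updated state variance
        spread.append(yt - beta * xt)
    return spread

def ols_spread(y, x):
    """Spread from a single fixed hedge ratio (least squares through the origin)."""
    b = sum(a * c for a, c in zip(y, x)) / sum(c * c for c in x)
    return [a - b * c for a, c in zip(y, x)]

# Two independent random walks: no cointegrating relationship exists.
x = random_walk(1000)
y = random_walk(1000)

burn = 50  # skip the filter's warm-up period
kalman_std = statistics.pstdev(kalman_spread(y, x)[burn:])
ols_std = statistics.pstdev(ols_spread(y, x)[burn:])
print(f"fixed-ratio spread std: {ols_std:.3f}")
print(f"Kalman spread std:      {kalman_std:.3f}")
```

The Kalman spread comes out far tighter than the fixed-ratio spread even though the two series are unrelated, because the drifting hedge ratio soaks up the nonstationarity. That is why a tidy-looking filtered spread says nothing about cointegration.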
Keeping the above in mind, the approach for more than two assets is to first test whether they are cointegrated. Then, theoretically, you can use the Kalman filter to estimate the time-varying hedge ratios as before (there is now more than one hedge ratio to estimate) with some initial values. The pykalman filter methods support n×n matrices where n >= 2. However, you run a big risk of model misspecification, and that plays havoc, especially when your transition equation has integrated variables. With a pair, your chance of misspecification is significantly lower, but it can be damaging with more than two. For more information on this, see here (opens PDF). My advice is don't do it. I have not seen anyone doing it. I won't do it myself. To convince yourself, try with some simulated data.
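If you want to experiment on simulated data, the sketch below shows what "more than one hedge ratio" looks like mechanically. It is a hand-written two-state filter, not pykalman, and everything in it is an assumption for illustration: the helper names, the noise variances `q` and `r`, and the constant "true" ratios 0.5 and 1.5. The simulated data match the model exactly here, so this is the best case; it does not show the misspecification problem, which you would probe by changing how `y` is generated.

```python
import random

random.seed(1)

def random_walk(n, start):
    """Random walk of length n with unit-variance Gaussian steps."""
    path, level = [], start
    for _ in range(n):
        level += random.gauss(0.0, 1.0)
        path.append(level)
    return path

n = 1000
x1 = random_walk(n, 100.0)
x2 = random_walk(n, 50.0)
true_b1, true_b2 = 0.5, 1.5  # hypothetical constant hedge ratios
y = [true_b1 * a + true_b2 * b + random.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

def kalman_two_hedge_ratios(y, x1, x2, q=1e-6, r=1.0):
    """Filter y_t = b1_t * x1_t + b2_t * x2_t + e_t with random-walk b1_t, b2_t.

    q and r are assumed state and observation variances (illustrative values).
    Returns the list of filtered (b1, b2) pairs.
    """
    b = [0.0, 0.0]                          # assumed initial hedge ratios
    P = [[1.0, 0.0], [0.0, 1.0]]            # assumed initial state covariance
    history = []
    for yt, h0, h1 in zip(y, x1, x2):
        P[0][0] += q                        # predict: add state noise
        P[1][1] += q
        e = yt - (b[0] * h0 + b[1] * h1)    # innovation
        ph0 = P[0][0] * h0 + P[0][1] * h1   # first element of P @ h
        ph1 = P[1][0] * h0 + P[1][1] * h1   # second element of P @ h
        s = h0 * ph0 + h1 * ph1 + r         # innovation variance
        k0, k1 = ph0 / s, ph1 / s           # Kalman gain
        b[0] += k0 * e
        b[1] += k1 * e
        # P <- P - K (P h)^T, valid because P is symmetric
        P = [[P[0][0] - k0 * ph0, P[0][1] - k0 * ph1],
             [P[1][0] - k1 * ph0, P[1][1] - k1 * ph1]]
        history.append((b[0], b[1]))
    return history

b1_last, b2_last = kalman_two_hedge_ratios(y, x1, x2)[-1]
print(f"filtered hedge ratios at the end: {b1_last:.3f}, {b2_last:.3f}")
```

With correctly specified data the filter settles near the true ratios. The interesting experiment is to break the assumptions (for example, add a third driver to `y` that the filter does not know about) and watch how the estimated ratios behave.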
Also, more fundamentally, scanning a large universe for stationary baskets of more than two assets (with or without the Kalman step above) is itself dangerous. For any real-world application, these baskets must make economic sense. In certain markets (e.g. interest rates or volatility surfaces), they sometimes do. But I cannot think of a single good case in equities. A pair driven by the same economic factors will show a stationary spread. But for three assets driven by the same economic factors, the individual spreads will be more stable to trade than their combination. Also, cointegration has its limitations; see here (opens PDF). If you scan n assets at a time from a universe of N assets (where n and N are large), at some level you will coax your data into finding you some cointegration.
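The data-mining risk is easy to see on simulated data. The sketch below is a simplified stand-in, not the Johansen test: it scans pairs rather than larger baskets, and uses a hand-rolled Engle-Granger style check (OLS residuals followed by a Dickey-Fuller t-statistic with no lags). The helper names and the universe size are made up, and the critical value of -3.34 is the approximate asymptotic 5% value for two series with a constant. Every series is an independent random walk, so any "cointegrated" pair it reports is a false positive.

```python
import itertools
import random

random.seed(2)

def random_walk(n, start=100.0):
    """Random walk of length n with unit-variance Gaussian steps."""
    path, level = [], start
    for _ in range(n):
        level += random.gauss(0.0, 1.0)
        path.append(level)
    return path

def ols_residuals(y, x):
    """Residuals from regressing y on x with an intercept."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [b - intercept - slope * a for a, b in zip(x, y)]

def df_tstat(u):
    """Dickey-Fuller t-statistic (no lags, no constant) for the series u."""
    lagged = u[:-1]
    diffs = [b - a for a, b in zip(u[:-1], u[1:])]
    sxx = sum(a * a for a in lagged)
    rho = sum(a * d for a, d in zip(lagged, diffs)) / sxx
    resid = [d - rho * a for a, d in zip(lagged, diffs)]
    sigma2 = sum(e * e for e in resid) / (len(diffs) - 1)
    return rho / (sigma2 / sxx) ** 0.5

CRIT_5PCT = -3.34  # approximate asymptotic 5% critical value (assumption, see text)

# A universe of unrelated random walks: nothing here is truly cointegrated.
universe = [random_walk(250) for _ in range(30)]
pairs = list(itertools.combinations(range(len(universe)), 2))
hits = [(i, j) for i, j in pairs
        if df_tstat(ols_residuals(universe[i], universe[j])) < CRIT_5PCT]
print(f"{len(hits)} of {len(pairs)} unrelated pairs pass the 5% test")
```

Roughly one pair in twenty should pass by chance alone, and the number of candidate baskets grows very quickly once you allow three or more assets per basket, so a large scan will always hand you something that looks tradable.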
Throwing a lot of quant tools at a real problem is usually not a good approach. Rather, we should understand the real problem and pick the right quant tools. Do not run cointegration tests for more than two assets in equities unless you know what you are doing. Also, do not use a Kalman filter for more than two assets at all.