Query on regression analysis

While doing a regression analysis, suppose some independent variables are not applicable to certain observations of the dependent variable. How do we adjust for this? For example, if we have the returns of 100 stocks and we regress those returns on various factors such as earnings growth and valuation, how do we account for the fact that some stocks are valued on P/E, some on P/B and some on EV/EBITDA? For stocks that are valued on EV/EBITDA, we want to ignore P/E and P/B as valuation factors. One way is to use separate subsets of stocks, but if there are many factors the stock universe could end up subdivided into many small subsets, which would defeat the purpose of a factor model.

Hi Deepak,



Splitting the stock basket according to whether a stock is valued on EV/EBITDA, P/E or P/B seems reasonable.

Yes, you are correct in saying that we cannot subdivide on the basis of every classification. The approach should be to handle the major differentiating factors in different baskets; the smaller factors can then be assigned weights by whichever ML method you are applying.



So if EV/EBITDA, P/E and P/B are the major differentiators, they should be dealt with separately. Otherwise the regression will mostly fit the most prominent of the three, which might not give you a correct inference.



Hope this helps!

This is a very interesting question. Let's assume the general regression model is Y(i,t) = a1 + a2(i) + b1*X1(i,t) + b2(i)*X2(i,t). Here the subscript i identifies individual stocks, t is the time subscript and the X are the explanatory variables. This is a typical panel regression model. The intercept a1 is a constant ('fixed') term for all stocks, while a2 can vary by stock. Similarly, the slopes b1 for one set of factors (X1) are common across stocks, while the slopes b2 for the other factors (X2) can vary among stocks.



One way to estimate this is to segregate the equations by i, as you mentioned. That is problematic not only for the reasons you gave, but also because of certain statistical issues with that approach. For such cases, a statistically more robust approach is the so-called random-effect (or, in this case, mixed-effect) model. The estimation is more complicated than the usual least squares. For Python, fortunately, statsmodels has direct support for this. For R, a popular package for the same purpose is lme4.
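To make the statsmodels route concrete, here is a minimal sketch on simulated data. The column names (`stock`, `y`, `x1`, `x2`), the panel size and the "true" parameter values are all made up for illustration. It fits the model above with a random intercept per stock (playing the role of a2(i)) and a random slope on `x2` (b2(i), which statsmodels expresses as a common mean slope plus a stock-specific deviation).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated panel (hypothetical data): 30 stocks observed over 40 periods.
rng = np.random.default_rng(0)
n_stocks, n_periods = 30, 40
stock = np.repeat(np.arange(n_stocks), n_periods)

# "True" parameters, chosen only for this illustration.
a1, b1, b2_mean = 1.0, 0.5, 0.8
a2 = rng.normal(0.0, 0.3, n_stocks)       # stock-specific intercept shifts a2(i)
b2_dev = rng.normal(0.0, 0.4, n_stocks)   # stock-specific deviations of the x2 slope

x1 = rng.normal(size=stock.size)
x2 = rng.normal(size=stock.size)
noise = rng.normal(0.0, 0.5, stock.size)
y = a1 + a2[stock] + b1 * x1 + (b2_mean + b2_dev[stock]) * x2 + noise

panel = pd.DataFrame({"stock": stock, "y": y, "x1": x1, "x2": x2})

# Fixed part: common intercept and common slopes on x1 and x2.
# Random part (re_formula): per-stock intercept and per-stock slope on x2.
model = smf.mixedlm("y ~ x1 + x2", panel, groups=panel["stock"], re_formula="~x2")
result = model.fit()

print(result.fe_params)  # estimates of a1, b1 and the mean x2 slope
print(result.cov_re)     # estimated covariance of the random intercept and slope
```

The per-stock deviations are available as `result.random_effects`, a dictionary keyed by stock. On real return data the fit may need more care (convergence warnings, scaling of the factors), so treat this as a starting point rather than a recipe.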

Thanks a lot for the solution. If possible, can you also please explain the logic behind the random-effect or mixed-effect model that you recommend (I mean, how does it take care of our problem)? Is there any solution available for this in MATLAB?



I had another query on regression analysis. Taking forward the same example of stock returns as the dependent variable, suppose some independent variables are binary, such as whether a company is profit-making or loss-making (represented by 1 and 0), and some are ordinary continuous variables like P/E. How should these be handled together in the regression?

The kind of data you are analyzing (multiple stocks over multiple periods) is known as panel data. Estimating a linear model for each stock individually, while possible, is a very different model from a regression based on all of them. Separate estimation assumes the error terms are independent across stocks, whereas combined estimation can allow them to be correlated. Also, if some of the factors are common (fixed) across stocks and some are varying (random), you cannot capture that with separate regressions.
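As a rough illustration of why these are different models, the sketch below fits the two extremes on simulated data using plain NumPy: one regression per stock, and one pooled regression over all stocks. The data, the sizes and the helper name `ols` are hypothetical. Neither extreme can express "some coefficients common, some varying", which is the gap the mixed-effect model fills.

```python
import numpy as np

# Simulated panel (hypothetical): 5 stocks, 60 periods, one factor x.
rng = np.random.default_rng(1)
n_stocks, n_periods = 5, 60
true_slope = 0.5  # the slope is common to all stocks in this simulation

x = rng.normal(size=(n_stocks, n_periods))
intercepts = rng.normal(0.0, 0.2, n_stocks)  # intercepts differ by stock
noise = rng.normal(0.0, 0.5, (n_stocks, n_periods))
y = intercepts[:, None] + true_slope * x + noise


def ols(x_vec, y_vec):
    """Return [intercept, slope] from a least-squares fit of y on x."""
    design = np.column_stack([np.ones_like(x_vec), x_vec])
    coef, *_ = np.linalg.lstsq(design, y_vec, rcond=None)
    return coef


# Extreme 1: a separate regression per stock (everything varies by stock).
separate = np.array([ols(x[i], y[i]) for i in range(n_stocks)])

# Extreme 2: one pooled regression (everything is common to all stocks).
pooled = ols(x.ravel(), y.ravel())

print("per-stock [intercept, slope]:")
print(separate)
print("pooled [intercept, slope]:", pooled)
```

The per-stock fits each use only 60 observations, so their slope estimates scatter around the common value, while the pooled fit uses all 300 observations but forces a single intercept on every stock.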



To tackle the first case (correlated errors) you can go for either a mixed-effect model or seemingly unrelated regression (SUR). If you also need to capture the second effect (a mix of fixed and random parameters), then a mixed-effect model is pretty much the only choice. If you have a quantitative background, I suggest you google a bit; there are quite a few papers on this and on the general estimation method. Otherwise the wiki links above give the gist. You do not necessarily need to capture the mixed effect by stock. You can also group the stocks (either based on your knowledge, or by some sort of clustering) and model the random effects on that grouping variable.
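Written out, the grouped version of the earlier model might look like the following. The group map g(i), the error term and the normality assumptions are the usual textbook ones and are my additions; they were not stated in the posts above.

```latex
\[
Y_{i,t} = a_1 + a_{2,g(i)} + b_1 X_{1,i,t} + b_{2,g(i)} X_{2,i,t} + \varepsilon_{i,t},
\]
\[
a_{2,g} \sim N(0, \sigma_a^2), \qquad
b_{2,g} \sim N(\bar{b}_2, \sigma_b^2), \qquad
\varepsilon_{i,t} \sim N(0, \sigma^2).
\]
```

Here g(i) is the group that stock i belongs to, for example its valuation basis. The logic is that the model does not estimate a free coefficient for every group. It estimates the mean and variance of the distribution the group coefficients are drawn from, so small groups borrow strength from the rest of the universe instead of being fitted in isolation.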



For the last part (binary explanatory variables), this is called dummy variable regression; there are quite a few examples if you google this as well. Almost all serious stats software (including MATLAB) has support for all of these (SUR, mixed effects, dummy variables).
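A dummy variable simply enters the design matrix as a column of 0s and 1s next to the continuous factors. Here is a minimal sketch with NumPy on simulated data; the variable names and the "true" coefficients are invented for illustration.

```python
import numpy as np

# Simulated cross-section (hypothetical): 500 stock observations.
rng = np.random.default_rng(2)
n = 500
profitable = rng.integers(0, 2, n)  # dummy: 1 = profit-making, 0 = loss-making
pe = rng.uniform(5.0, 40.0, n)      # continuous factor, e.g. P/E

# "True" coefficients, chosen only for this illustration.
ret = 0.02 + 0.03 * profitable - 0.001 * pe + rng.normal(0.0, 0.02, n)

# Design matrix: constant, dummy column, continuous column.
design = np.column_stack([np.ones(n), profitable, pe])
coef, *_ = np.linalg.lstsq(design, ret, rcond=None)
intercept, dummy_effect, pe_slope = coef

print("intercept:", intercept)
print("dummy effect:", dummy_effect)  # return shift for profit-making companies
print("P/E slope:", pe_slope)
```

The coefficient on the dummy is read as the average difference in return between profit-making and loss-making companies, holding P/E fixed.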