Section 15, Unit 15

Slicing out the data using the respective window lengths

feature_window_data = data['Adj Close'][:feature_window+1]

label_window_data = data['Adj Close'][feature_window:feature_window+label_window+1]



I have a question on how to slice the entire dataset. Let's say we have a dataset that covers a much longer time horizon than the feature and label windows we're considering. How do you split it? Do you make sure that the feature windows do not overlap, or do you use some other approach?



Many thanks

Hello Alex,



That's a very interesting question. According to López de Prado, it's NOT recommended to require that each label window finish before the next feature window begins. Why so? It can be demonstrated mathematically that non-overlapping windows produce sparse, coarse models that lack continuity and have low predictability. In fact, we should allow the windows to overlap.
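As a rough sketch of what overlapping windows could look like in code, the slicing from the unit can be repeated from every start position. The helper name make_windows and the step parameter are assumptions made for illustration, and the synthetic price series only stands in for data['Adj Close']:

```python
import numpy as np
import pandas as pd

def make_windows(prices, feature_window, label_window, step=1):
    """Return (feature_slice, label_slice) pairs cut from a price series.

    Hypothetical helper: step=1 slides one observation at a time, so
    consecutive windows overlap heavily; a larger step reduces the overlap.
    """
    pairs = []
    # Each sample spans feature_window + label_window + 1 prices,
    # mirroring the two slices shown in the unit.
    last_start = len(prices) - (feature_window + label_window + 1)
    for start in range(0, last_start + 1, step):
        split = start + feature_window
        features = prices.iloc[start:split + 1]
        labels = prices.iloc[split:split + label_window + 1]
        pairs.append((features, labels))
    return pairs

# Synthetic prices used in place of data['Adj Close'].
prices = pd.Series(np.linspace(100.0, 129.0, 30))
pairs = make_windows(prices, feature_window=5, label_window=3, step=1)
```

With step=1 every price after the first few ends up in several feature and label windows, which is the overlap discussed next.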



But what issues does overlap cause? 



If two label windows overlap, the labels we get out of them cease to be IID. Why so? Because many of the data points, or returns, in the paths of these label windows are common to both. That makes them hardly independent. Now how do we get around this?
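Before turning to the fix, the dependence itself can be checked numerically. The sketch below is only an illustration on synthetic IID daily returns, with each label taken to be the sum of the next label_window returns (an assumption, since your actual labels may be defined differently). For such sums, adjacent overlapping labels share label_window - 1 returns, so their theoretical correlation is (label_window - 1) / label_window, while labels built from disjoint windows stay uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
daily = rng.normal(0.0, 0.01, size=5000)  # synthetic IID daily returns
label_window = 5

# Overlapping labels: a new label starts on every day.
overlapping = np.convolve(daily, np.ones(label_window), mode="valid")
# Disjoint labels: keep only every label_window-th label.
disjoint = overlapping[::label_window]

def lag1_corr(x):
    """Correlation between each label and the one that follows it."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

overlap_corr = lag1_corr(overlapping)   # should be close to 4/5 here
disjoint_corr = lag1_corr(disjoint)     # should be close to 0
```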



We look at the level of overlap each label window has with the others. This is measured by how many returns each label shares with other labels, which also allows for label windows of unequal length. Based on this overlap, a uniqueness score is calculated for each label. The more unique a label, the higher the weight of the sample that label corresponds to. So apart from the features (X) and labels (y), there is also a weight (w) for each sample. The samples with higher weight are sampled for learning more often than the others.
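One way this could be computed is sketched below. It is not the book's code: the function name average_uniqueness, the (start, end) span format, and the example spans are all assumptions made for illustration. For every return we count how many labels use it, and a label's uniqueness is the average of 1 / count over the returns it spans:

```python
import numpy as np

def average_uniqueness(spans, n_returns):
    """spans: one (start, end) pair of return indices per label, end exclusive."""
    # How many labels use each return.
    concurrency = np.zeros(n_returns)
    for start, end in spans:
        concurrency[start:end] += 1
    # A label's uniqueness is the mean of 1 / concurrency over its own span.
    return np.array([np.mean(1.0 / concurrency[start:end])
                     for start, end in spans])

# Hypothetical label windows of unequal length; the last one overlaps nothing.
spans = [(0, 3), (1, 4), (2, 6), (8, 10)]
uniqueness = average_uniqueness(spans, n_returns=10)

# Normalise into sampling probabilities and draw samples accordingly.
weights = uniqueness / uniqueness.sum()
rng = np.random.default_rng(0)
draws = rng.choice(len(spans), size=1000, p=weights)
```

Here the isolated label gets a uniqueness of 1, while the label at (1, 4), which shares every one of its returns with a neighbour, gets the lowest score and is therefore drawn least often.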



Sample weighting is covered in Chapter 4 of Advances in Financial Machine Learning. It addresses the IID problem caused by overlapping windows.