I am trying to build a machine learning model. The features are economic data (such as GDP, which is quarterly lagging data, and jobless claims, which is weekly lagging data), while the target value is the daily stock price. I want to create a dataframe containing these features and the target value for every trading day.
There are two questions here.
- How should I deal with the difference in frequencies?
- Since GDP is lagging data, it seems unfair to fill in the same value for the whole quarter. For example, on 3/25/2022 we learn that GDP for this quarter is 100K. But if you fill in GDP as 100K on 1/1/2022, you are assuming that the value was known before it was announced, which seems wrong. What is the common way to fill in the data on 1/1/2022 in this example?
These are two very interesting questions.
On the first point, how you handle different frequencies depends on your model. If the target variable of the model has a lower frequency than the X variables (which can be a mix of frequencies), the usual approach is mixed data sampling (MIDAS) or other filtering methods used in nowcasting. In your case, the target has the higher frequency, so you can treat the low-frequency X variables simply as states that do not change between successive y observations, and use standard methods.
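As a rough illustration of treating a low-frequency feature as a state that stays constant between observations, here is a minimal pandas sketch. The prices and claims values are made up, and the claims series is assumed to be indexed by the date each value became known, not by the week it describes.

```python
import pandas as pd

# Hypothetical daily target: made-up closing prices on business days.
trading_days = pd.bdate_range("2022-01-03", "2022-01-14")
prices = pd.Series(
    range(100, 100 + len(trading_days)), index=trading_days, name="close"
)

# Hypothetical weekly feature (made-up values), indexed by the date each
# value became known: two consecutive Thursdays.
claims = pd.Series(
    [210.0, 230.0],
    index=pd.date_range("2022-01-06", periods=2, freq="W-THU"),
    name="jobless_claims",
)

# Hold each value constant until the next one arrives (forward fill).
# Days before the first known value stay NaN instead of being back-filled.
claims_daily = claims.reindex(trading_days, method="ffill")

# One row per trading day: target plus the aligned feature.
df = pd.concat([prices, claims_daily], axis=1)
print(df)
```

Leaving the early rows as NaN is deliberate: back-filling them would put a value on days when it was not yet known.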
On the second point, your best bet is to use a properly curated dataset. Standard practice is to use lagged data. US GDP, for example, is usually reported with a delay of about one month, so 1 Feb to 30 Apr should use Q4 GDP (the quarter ending in December), 1 May to 31 Jul should use Q1 GDP, and so on. Note, however, that this may still introduce look-ahead bias: these statistics often undergo second, third, or even further revisions, and most sources store only the latest revised figure. Look for a source that not only publishes the data and the reporting period (as most do) but also reports the publication date and stores each revision separately.
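If you do have a source that records the publication date and keeps each revision, one way to build the daily feature is an as-of join on the publication date. The sketch below uses `pandas.merge_asof`; the table layout, column names, dates, and GDP values are all illustrative assumptions, not real data.

```python
import pandas as pd

# Hypothetical point-in-time GDP table: one row per release (vintage).
# The same period can appear more than once when it is revised.
gdp = pd.DataFrame({
    "period": ["2021Q3", "2021Q3", "2021Q4"],
    "release_date": pd.to_datetime(["2021-10-28", "2021-11-24", "2022-01-27"]),
    "gdp": [98.0, 98.5, 100.0],  # made-up values
})

# Trading days for which we want a feature value.
daily = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-01-24", "2022-01-25", "2022-01-26",
        "2022-01-27", "2022-01-28", "2022-01-31",
    ])
})

# For each trading day, take the most recent release published on or
# before that day, so no value is used before it was announced.
pit = pd.merge_asof(
    daily,
    gdp.sort_values("release_date"),
    left_on="date",
    right_on="release_date",
    direction="backward",
)
print(pit[["date", "period", "gdp"]])
```

With these made-up rows, days before the assumed 2022-01-27 release still see the revised Q3 figure, and the Q4 value only appears from its release date onward. If the release becomes public after the market closes, you may want to shift it to the next trading day.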