In the ML track, in "Section 15 Unit 4", the dynamic threshold for data labeling is calculated with the following code:
Calculate the threshold dynamically from the daily returns on a rolling basis over the feature_window
The threshold is a function of the rolling standard deviation
data['return_threshold'] = 0.125 * np.sqrt(feature_window) * data.daily_returns.rolling(feature_window).std()
Question: What is 'np.sqrt(feature_window)', and why is the standard deviation multiplied by it?
Thanks in advance
Hello Mohammad,
feature_window is the number of bars in the feature window. The rolling calculation gives the standard deviation of the daily returns, so it still represents the volatility of a single day. To express it in the same "unit" or timescale/barscale as the feature window, we multiply the variance by feature_window or, equivalently, the standard deviation (volatility) by the square root of feature_window. By doing so, we get the volatility of returns over the whole feature_window. A function of this volatility is used to calculate the threshold, as shown in the code.
Generally, n_step_volatility = sqrt(n) * 1_step_volatility. This formula has its roots in random walks (Brownian motion).
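As a rough, self-contained sketch of the same calculation: the price series below is synthetic, and feature_window = 10 is a hypothetical value chosen only for illustration, not necessarily what the course uses. The intermediate names daily_vol and window_vol are also my own, added to separate the two steps.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed so the sketch is reproducible
feature_window = 10               # hypothetical window length, in bars

# Synthetic close prices from a random walk in log space (illustrative data only)
close = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=500)))
data = pd.DataFrame({'close': close})
data['daily_returns'] = data['close'].pct_change()

# Step 1: volatility of a single day, estimated from the last feature_window bars
daily_vol = data['daily_returns'].rolling(feature_window).std()

# Step 2: rescale the one-day volatility to the feature_window horizon
window_vol = np.sqrt(feature_window) * daily_vol

# Step 3: the threshold is a fixed fraction of the window-horizon volatility
data['return_threshold'] = 0.125 * window_vol
```

Splitting the one-liner into steps makes it visible that the rolling window only decides which bars are used for the estimate, while the square-root factor changes the horizon the estimate refers to.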
Basically, market prices take random walks. For each increment of a price’s random walk, the variance is proportional to the time taken.
For example:
- Let's say the variance over one day is 3. As a proportion, that's 3 * 1.
- Over 2 days, time is doubled, so the variance is doubled as well, giving a variance of 3 * 2 = 6.
- Over 3 days, time is tripled, so the variance is 3 * 3 = 9.
Since the standard deviation is the square root of the variance, we take the square root. So, in the last example, where the variance is 3 * the single-day variance, the standard deviation will be sqrt(3) * the single-day standard deviation.
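The square-root rule can be checked numerically with a small simulation. This is only a sketch under the random-walk assumption: the daily returns are drawn as independent normal values, and daily_sigma and n are hypothetical values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
daily_sigma = 0.01               # assumed one-day volatility (hypothetical value)
n = 9                            # horizon in days (hypothetical value)

# 100,000 independent paths, each made of n daily returns
daily_returns = rng.normal(0.0, daily_sigma, size=(100_000, n))

# Volatility of a single day, measured across paths
one_day_std = daily_returns[:, 0].std()

# Volatility of the n-day return (sum of the n daily returns), measured across paths
n_day_std = daily_returns.sum(axis=1).std()

# Under the random-walk assumption this ratio should be close to sqrt(n)
scaling_ratio = n_day_std / one_day_std
print(scaling_ratio, np.sqrt(n))
```

If the daily returns were strongly autocorrelated, the ratio would drift away from sqrt(n), which is why the rule is tied to the random-walk assumption.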
Hi and thank you.
What about the rolling calculation itself? Isn't time already applied in that calculation? If it isn't, then, for example, 'x.rolling(30).std()' must be equal to 'x.rolling(10).std()' …!
I don't get why the threshold can't just be 'std()' itself.
This kind of multiplication looks like annualizing the 'Sharpe Ratio'. I get the point there, but not here.
Can you point me to some links to articles, web pages, etc.?
Thanks in advance
Hello Mohammad,
"Isn't time already applied in that calculation? If it isn't, then, for example, 'x.rolling(30).std()' must be equal to 'x.rolling(10).std()' …!" Both of them represent the daily standard deviation only; one is estimated from the last 10 days and the other from the last 30.
The threshold could definitely have used the daily std() directly, since the n-bar std() is only a constant multiple of it, but to put across the point that the threshold is a function of the volatility (std()) of the given feature window period, we chose to spell that out in the code.
You can check out the book "Advances in Financial Machine Learning" by Dr Marcos Lopez de Prado.
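A quick way to see this is to run both windows on the same series. The sketch below uses synthetic, independent daily returns with a known daily standard deviation of 0.01 (an assumption made purely for illustration); the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)   # fixed seed so the sketch is reproducible

# Synthetic daily returns with a true one-day standard deviation of 0.01
x = pd.Series(rng.normal(0.0, 0.01, size=5000))

# Both are estimates of the SAME one-day volatility, from different sample sizes
std_10 = x.rolling(10).std()
std_30 = x.rolling(30).std()

mean_std_10 = std_10.mean()
mean_std_30 = std_30.mean()
print(mean_std_10, mean_std_30)   # both land near 0.01, not near 0.01 * sqrt(window)

# Only after multiplying by sqrt(window) do they refer to 10-day and 30-day horizons
scaled_10 = np.sqrt(10) * std_10
scaled_30 = np.sqrt(30) * std_30
print(scaled_10.mean(), scaled_30.mean())
```

The two rolling estimates differ only in how noisy they are (the 30-bar estimate is smoother), not in the timescale they describe.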
Thanks Akshay, I think it's getting clearer for me, but not completely yet. There must be a specific difference between time series data and non-temporal data in the concept and usage of STD, which I don't clearly understand yet …
And thanks for the reference, too.
Hello Mohammad,
You're welcome. In a temporal dataset, values such as std(), mean(), etc. can be generalised to longer time periods using data from a smaller time period. This is unlike, say, the standard deviation of the diameter of the footballs a factory produces, which doesn't have any time component.