Short Selling in Trading: Look-ahead bias

Francesco_Parrella_3Mcx1 · January 15, 2020, 9:00am

Hi all,

I've started reviewing the "Short Selling in Trading" course. The signals are based on "swings", which are computed using all the data. A strategy is then built on these signals without accounting for the lookahead information. To me, it seems that there is a lookahead bias in the evaluation of this strategy. I am not sure if I am missing something here.

Any clarification is appreciated.

Thank you.

Laurent_Bernut_Laurent_Bernut_6ZdT · January 17, 2020, 5:13am

Ciao Francesco,

This is a valid concern. Thank You very much. I should have elaborated more on this and I am sorry for the confusion.

We use the function argrelextrema to calculate swings. We use a window of 20 periods (feel free to experiment with shorter windows). argrelextrema does not reset automatically when there is a higher high or lower low. This process has to be done manually as in the function swings. As a result, when testing regime_fc on its own, we use a lag of a similar duration. We could use a shorter window of half to a third of that duration.

On the other hand, moving averages and breakouts are delayed by one day. Signal happens on bar [n]. Trade happens on [n+1], standard stuff.
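The one-bar delay described above can be sketched as follows. This is only an illustration: the prices and the 2/4-bar windows are made up, and the column names are hypothetical rather than taken from the course code.

```python
import numpy as np
import pandas as pd

# Made-up closing prices, for illustration only
close = pd.Series([10, 11, 12, 11, 10, 9, 10, 12, 14, 15], dtype=float)

# Hypothetical short and long windows (not the course parameters)
fast = close.rolling(2).mean()
slow = close.rolling(4).mean()

# Signal computed with data up to and including bar n
signal = np.sign(fast - slow)

# Position actually held: the signal from bar n is acted on at bar n+1
position = signal.shift(1)

print(pd.DataFrame({"close": close, "signal": signal, "position": position}))
```

Any return calculation would then use `position`, not `signal`, so that each bar's P&L depends only on information available before that bar.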

When using regime_fc in combination with moving average crossover, 2 conditions have to be met:

regime reversal, function of swings
moving average crossover

Swings are usually registered 1 to 2 periods after they occur. That is just how argrelextrema works. Unless extremely short durations are used, swing discovery will systematically precede moving average crossover.

The necessary lag is therefore a function of the slowest moving component, i.e. the moving average.

Moving average will confirm regime reversal. Trade can take place 1 day after moving average crossover signal.

I understand it can be a bit confusing at first glance, and may be interpreted as "peeking bias". I invite you to verify for yourself to be absolutely sure. One way to do this is to run a for loop over a small data sample where a regime change occurs. Swing discovery happens first, which leads to regime reversal, which is then confirmed by moving average crossover.
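One possible shape for such a loop, as a minimal sketch: feed argrelextrema only the bars available so far and record the bar at which each swing high is first reported. The price array and order=3 are made up for illustration and are not the course data or parameters.

```python
import numpy as np
from scipy.signal import argrelextrema

# Made-up prices with two clear swing highs (at bars 3 and 10)
prices = np.array([1, 2, 3, 5, 4, 3, 2, 3, 4, 6, 7, 6, 5, 4, 3, 4, 5], dtype=float)
order = 3  # hypothetical small window

first_seen = {}  # swing bar -> bar at which it was first reported
for t in range(1, len(prices) + 1):
    visible = prices[:t]  # only data up to bar t-1 is known
    highs = argrelextrema(visible, np.greater, order=order)[0]
    for i in highs:
        first_seen.setdefault(int(i), t - 1)

for swing_bar, seen_bar in first_seen.items():
    print(f"swing high at bar {swing_bar} first reported at bar {seen_bar}")
```

On this sample each swing high is first reported one bar after it occurs. A swing reported this early can still be invalidated by later bars, so the loop measures discovery, not confirmation.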

Now, if you want to use regime on its own, you may want to play with the lag window. 20 periods lag is roughly a month. People rarely wait for an entire month for confirmation. Somewhere between 7 to 10 days should be fine, but please bear in mind that the number of false positives will rise as you shorten the lag. This will adversely affect your gain expectancy.

I hope it clarified the matter. Once again, that was a very valid concern. I should have elaborated more in the course. Thank You very much for picking this up!

Francesco_Parrella_3Mcx1 · January 17, 2020, 1:24pm

Thanks Laurent,

Could you please provide Python code for the strategy that does not use any lookahead?

Thanks again,

Alexander_Suvorov_3N8Qd · January 22, 2020, 1:58pm

Laurent,

Still have a question on lookahead bias and some parts of your comment.

“Unless extremely short durations are used, swing discovery will systematically precede moving average crossover. The necessary lag is therefore a function of the slowest moving component, i.e. moving average. Moving average will confirm regime reversal. Trade can take place 1 day after moving average crossover signal”

The backtest uses a 20-day window for the argrelextrema function. Does this mean that, on average, the moving average crossover signal lags swing discovery by 20 days (or 7-10 days, which you mentioned should be OK)? And does this effectively prevent lookahead bias?

“Swings are usually registered 1 to 2 periods after they occur. That is just how argrelextrema works.”

Are you referring to the specific argrelextrema behavior where, if the latest point is the highest or lowest, it is not detected as a local extremum until other lower / higher points are added at the end? I.e.

argrelextrema(np.array([1, 2, 3]), np.greater, order=2) will return (array([], dtype=int64),) whereas argrelextrema(np.array([1, 2, 3, 2]), np.greater, order=2) will return (array([2], dtype=int64),).

Thank you,

Alex

Ishan_Shah · January 30, 2020, 11:27am

I believe that, since argrelextrema uses 20 data points forward and backward, a simple shift of the dataframe should help to avoid any kind of look-ahead bias.

high_low['swing_high'] = high_low['swing_high'].shift(argrel_window)

high_low['swing_low'] = high_low['swing_low'].shift(argrel_window)

But as Laurent says, moving average cross over is a lagging indicator so some future data points are removed due to it. But the challenge is to figure out the exact days it will lag. I feel this is a very good discussion which we are having on this forum.
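A small worked example of the shift above, as a sketch only. It assumes a `high_low` frame with a `high` column and uses a window of 2 instead of 20 so the effect is visible on a few bars. The data is made up.

```python
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

argrel_window = 2  # hypothetical; the thread discusses 20
high_low = pd.DataFrame(
    {"high": [2, 1, 2, 3, 2, 1, 1, 2, 4, 5, 6, 5, 5, 7]}, dtype=float
)

# Swing highs computed on the full history (this is where lookahead enters)
peaks = argrelextrema(high_low["high"].to_numpy(), np.greater, order=argrel_window)[0]
high_low["swing_high"] = np.nan
high_low.loc[peaks, "swing_high"] = high_low.loc[peaks, "high"]
unshifted_bars = high_low["swing_high"].dropna().index.tolist()

# Push each swing forward by the window, to the bar where it is confirmed
high_low["swing_high"] = high_low["swing_high"].shift(argrel_window)
shifted_bars = high_low["swing_high"].dropna().index.tolist()

print(unshifted_bars, shifted_bars)
```

The swing high at bar 10 needs bars 11 and 12 to be confirmed with a window of 2. After the shift, its value first appears at bar 12, when that data exists.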

Laurent_Bernut_Laurent_Bernut_6ZdT · January 30, 2020, 1:52pm

x = np.array([2, 1, 2, 3, 2, 1,1,2,4,5,6,5,5,7])

argrelextrema(x, np.greater,order= 2)

(array([ 3, 10]),)

x = np.array([2, 1, 2, 3, 2, 1,1,2,4,5,6,5,5,7])

argrelextrema(x, np.greater,order= 3)

(array([3]),)

Order sets the number of values surrounding the peak. In the first example order = 2 identifies the second peak. Order = 3 requires 3 values both left and right of the peak to be lower.

With order = 20, argrelextrema will wait 20 bars before disqualifying a swing. The current swing function resets to NaN if values greater than the peak are subsequently found. It however cannot eliminate false positives: swings found then subsequently reset.
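The false positive case can be reproduced on a toy array. This is a sketch with made-up numbers: the same series is evaluated once with only the first five bars and once with all the data.

```python
import numpy as np
from scipy.signal import argrelextrema

# Made-up prices: a local high at bar 3 that is later exceeded
x = np.array([1, 2, 3, 5, 4, 6, 7, 5], dtype=float)

# What is reported with data up to bar 4 only
early = argrelextrema(x[:5], np.greater, order=3)[0]

# What is reported once all the data is available
late = argrelextrema(x, np.greater, order=3)[0]

print("with 5 bars:", early.tolist())
print("with all bars:", late.tolist())
```

With five bars the high at bar 3 is reported. With the full series it disappears and only the high at bar 6 remains, which is the difference between real-time and historical output discussed in this thread.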

This happens only at the last swing. For historical swings, there is enough data on both sides to identify meaningful swings without fail. This function is a compromise between historical and real time swings. It will work well for historical swings, but needs some adjustments for real time trading. There are three ways to adjust for the last swing:

1. Moving averages: since swings are identified one bar after the peak, moving averages are de facto slower. They act as a confirmation filter. Personally, I am not an advocate of moving averages, but I recognize their usefulness in this case.
2. Time: the longer the lag after a swing has been discovered, the more likely it will not be invalidated. Currently, order = 20 is robust enough for historical swings. It is however lagging too much for real life trading. 20 days is a month, an eternity for most traders. In practice, half that span would be enough to confirm the validity of a swing.
3. Hybrid: Time + Distance: one way to reduce noise and shorten the lag is to incorporate distance. The longer price will have traveled, the more likely a swing will reflect an exhaustive move. Example: if price has traveled 2.5 stdev away from the previous swing, a swing is likely to indicate a reversal. Conversely, blips occurring at 0.5-2 stdev are more likely to be just noise.

This can be incorporated directly in the alternation loop with 1 simple line of code:

# removes noisy swings: distance test
        hilo.loc[(hilo[s_hilo]*hilo[s_hilo].shift(1)<0)& # hi/lo succession
                 (np.abs(hilo[s_hilo]+hilo[s_hilo].shift(1)).div(hilo['std'].values) < 2.5),s_hilo] = np.nan

Note that hilo['std'] has not been instantiated in the function.

This hybrid method time + distance is probably closer to the reality of trading. The farther price has traveled, the more valid the signals. Instead of waiting 10 days, you may elect to shorten the waiting period to 5 bars or less. Again, please do not take my word for it. Test everything.
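A minimal, self-contained sketch of the distance test on toy data. The assumptions here are not from the course code: swing lows are stored as negative prices and swing highs as positive prices (which is what the sign test in the line above suggests), and the 'std' column is filled with a constant. In practice it might be a rolling standard deviation of the close.

```python
import numpy as np
import pandas as pd

s_hilo = "hilo"  # hypothetical column name

# Toy alternating swings: lows negative, highs positive (assumed convention)
hilo = pd.DataFrame({
    s_hilo: [-100.0, 110.0, -108.0, 125.0],
    "std": [3.0, 3.0, 3.0, 3.0],  # placeholder volatility estimate
})

prev = hilo[s_hilo].shift(1)
alternates = hilo[s_hilo] * prev < 0                      # hi/lo succession
distance = (hilo[s_hilo] + prev).abs().div(hilo["std"])   # move in stdev units

# Remove swings that alternate but have traveled less than 2.5 stdev
hilo.loc[alternates & (distance < 2.5), s_hilo] = np.nan

print(hilo)
```

Here the low at 108 sits only 2 points (about 0.67 stdev) below the preceding high at 110, so it is treated as noise and removed. The other swings travel more than 2.5 stdev and are kept.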

Woody_Yan_2ZxmV · May 25, 2020, 2:56pm

"This happens only at the last swing. For historical swings, there is enough data on both sides to identify meaningful swings without fail. This function is a compromise between historical and real time swings. It will work well for historical swings, but need some adjustments for real time trading."

When I first read this, I thought it described our core trading problem: in real time we cannot trade the way we do on historical data, because the latest signal keeps shifting while history is settled.

However, after reading on, I see that Laurent Bernut gives us a better way to confirm the signal with high probability. For just $199, we get a lot of knowledge from Laurent Bernut.

This also means our backtest is the best result we can get. Reality comes at a discount.

I'll apply this to 1-minute bars as a test.

Thanks. Laurent Bernut

Woody_Yan_2ZxmV · May 25, 2020, 6:46pm

I've run into an issue: the stop loss changes from time to time, and that causes trouble when I use it to calculate position size. I'd appreciate some help here.

I am testing on 1-minute bars. Here are some parameters:

(I compute the swings first, and then the MA cross)

signal_lag = 30

st = 90

mt = 120

argrel_window=60

t_dev=180

threshold=2

result:

                       close  floor  cross  signal  sl90120  eqty_risk
time                                                                  
2020-05-25 08:51:00  9512.25    1.0   -1.0     0.0  9482.25   3.160750
2020-05-25 08:52:00  9512.75    1.0   -1.0     0.0  9482.25   3.108934
2020-05-25 08:53:00  9513.50    1.0   -1.0     0.0  9482.25   3.034320
2020-05-25 08:54:00  9516.25    1.0   -1.0     0.0  9482.25   2.788897
2020-05-25 08:55:00  9517.75    1.0   -1.0     0.0  9482.25   2.671056
2020-05-25 08:56:00  9517.00    1.0   -1.0     0.0  9482.25   2.728705
2020-05-25 08:57:00  9517.75    1.0   -1.0     0.0  9482.25   2.671056
2020-05-25 08:58:00  9516.75    1.0   -1.0     0.0  9482.25   2.748478
2020-05-25 08:59:00  9516.25    1.0   -1.0     0.0  9482.25   2.788897
2020-05-25 09:00:00  9516.00    1.0    1.0     0.0  9520.50 -21.156667

                       close  floor  cross  signal  sl90120   eqty_risk
time                                                                   
2020-05-25 08:52:00  9512.75    1.0   -1.0     0.0   9517.5  -20.036842
2020-05-25 08:53:00  9513.50    1.0   -1.0     0.0   9517.5  -23.793750
2020-05-25 08:54:00  9516.25    1.0   -1.0     0.0   9517.5  -76.140000
2020-05-25 08:55:00  9517.75    1.0   -1.0     0.0   9517.5  380.700000
2020-05-25 08:56:00  9517.00    1.0   -1.0     0.0   9517.5 -190.350000
2020-05-25 08:57:00  9517.75    1.0   -1.0     0.0   9517.5  380.700000
2020-05-25 08:58:00  9516.75    1.0   -1.0     0.0   9517.5 -126.900000
2020-05-25 08:59:00  9516.25    1.0   -1.0     0.0   9517.5  -76.140000
2020-05-25 09:00:00  9516.00    1.0    1.0     0.0   9517.5  -63.450000
2020-05-25 09:01:00  9520.50    1.0    1.0     0.0   9517.5   31.725000

20200525_143630 -> calc...
--------------------
                       close  floor  cross  signal  sl90120  eqty_risk
time                                                                  
2020-05-25 08:53:00  9513.50    1.0   -1.0     0.0  9482.25   3.034320
2020-05-25 08:54:00  9516.25    1.0   -1.0     0.0  9482.25   2.788897
2020-05-25 08:55:00  9517.75    1.0   -1.0     0.0  9482.25   2.671056
2020-05-25 08:56:00  9517.00    1.0   -1.0     0.0  9482.25   2.728705
2020-05-25 08:57:00  9517.75    1.0   -1.0     0.0  9482.25   2.671056
2020-05-25 08:58:00  9516.75    1.0   -1.0     0.0  9482.25   2.748478
2020-05-25 08:59:00  9516.25    1.0   -1.0     0.0  9482.25   2.788897
2020-05-25 09:00:00  9516.00    1.0    1.0     1.0  9482.25   2.809556
2020-05-25 09:01:00  9520.50    1.0    1.0     1.0  9482.25   2.479020
2020-05-25 09:02:00  9518.25    1.0    1.0     1.0  9482.25   2.633958
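The eqty_risk figures in the tables above appear consistent with 0.01 * stop / (close - stop). This formula is inferred from the printed numbers only and is not confirmed by the code behind them. Under that assumption, a short sketch shows why the value flips sign and explodes when the stop loss moves close to the price:

```python
import pandas as pd

risk_fraction = 0.01  # inferred from the tables, not stated in the post

# Three (close, stop) pairs copied from the tables above
df = pd.DataFrame({
    "close": [9512.25, 9517.75, 9516.00],
    "stop": [9482.25, 9517.50, 9520.50],
})

# Assumed formula: the denominator is the distance between price and stop
df["eqty_risk"] = risk_fraction * df["stop"] / (df["close"] - df["stop"])

print(df)
```

When the stop sits 30 points below the close the result is about 3.16. When it is only 0.25 points away the result jumps to about 380.7, and when the stop is above the close the sign turns negative.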

Laurent_Bernut_Laurent_Bernut_6ZdT · May 27, 2020, 6:44am

Hello Woody,

Thank You for your comment. I am honored and grateful you applied the code.

Your point about the lag is a valid concern. I have sent Ishan Shah 2 functions that completely eliminate lag.

I was originally reluctant to send those functions. I was worried those functions would be too complicated. It turns out I have vastly underestimated the audience.

They have 3 distinctive features:

1. Retest: This is a pullback. Retest high: price marks a low, rebounds, drops, fails to take the low and crosses above the rebound. This suggests price may continue to travel higher. Retest high delays recognition of swing low. Vice versa for a retest low and swing high.
2. Adaptive range: When price rebounds off a low and prints a retest high, it may trade sideways before finally crossing that point. Adaptive range narrows the range for the test to either the first or later retest. You may find this feature useful in intraday trading where some orders have market impact.
3. Distance test: Retest has no statistical validity per se. Retests happen in narrow ranges. When price has traveled some distance however, retest may indicate trend exhaustion. Distance test is a measure of sensitivity. It is all a trade-off between distance and hit rate. Try with 2 or 3 std.

For more on the topic, please read the following post on Quora: https://www.quora.com/How-do-I-find-out-if-it-is-a-reversal-or-retracement-in-day-trading/answer/Laurent-Bernut

I use a variation of swings_fp in my own trading. It is statistically accurate enough. For example, it timed the March 2020 low on March 25th. Bear in mind however that this is not an exact science. Sometimes the market pushes through.

For the source code, please ask Quantra. All functions are fully explained in detail.

I hope you will find the lag problem solved.

Kind regards,

Laurent Bernut

Ishan_Shah · May 27, 2020, 1:16pm

Thanks, Laurent for adding these two functions.

The code and explanation for these two new functions and implementation can be accessed from here.

Woody_Yan_2ZxmV · May 27, 2020, 2:35pm

Hi Ishan Shah:

Thanks for uploading so quickly.

I've downloaded the file. I'm going to study it in depth.

Woody_Yan_2ZxmV · May 27, 2020, 2:38pm

Hi Laurent:

Thanks very much. It's awesome. Today I broke down the code to solve the real-time trading problem (printing everything out to compare). The reply came just in time.

One thing I found is that argrelextrema cannot use the last value in its calculation. A simple fix:

argrelextrema( mode='wrap')
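For reference, a small sketch of what mode='wrap' changes, on made-up arrays. With the default mode='clip', indices past the edge are clipped to the edge value itself. With mode='wrap', they wrap around, so the last bars are compared against the first bars of the array.

```python
import numpy as np
from scipy.signal import argrelextrema

rising = np.array([1, 2, 3])

# Default mode='clip': the last value is compared with itself and never qualifies
clip = argrelextrema(rising, np.greater, order=2)[0]

# mode='wrap': the last value is compared with the first values of the array
wrap = argrelextrema(rising, np.greater, order=2, mode="wrap")[0]

# Same ending, but the array now starts with a higher value
high_start = np.array([5, 2, 3])
wrap_high_start = argrelextrema(high_start, np.greater, order=2, mode="wrap")[0]

print(clip.tolist(), wrap.tolist(), wrap_high_start.tolist())
```

With 'wrap' the last value can be flagged, but whether it is flagged depends on the oldest values in the window passed in, so it is worth testing on your own data.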

Thanks a lot.

Woody

Woody_Yan_2ZxmV · June 11, 2020, 9:08am

Hi Laurent Bernut:

I've finished the concept and code parts, but I think there is still a gap to reality. Could I get your email, please? I have also changed some of the code and would like to discuss it. Just send me something and I will reply. aprilsnowyou@gmail.com

Thanks

Woody

Nikolay_Zuykov · August 9, 2024, 9:46am

Hello Laurent.

Unfortunately, both of your lagless swing detection functions, swings_argrelan and swing_fp, use argrelextrema and find_peaks. These functions introduce look-ahead bias, which causes them to produce significantly different results on real-time data compared to historical data. Do you have any thoughts on how to address this issue?

Please review and provide a fix.

Kind Regards, Nikolay

_Rushda_Ansari · August 9, 2024, 2:50pm

Hi Nikolay,

We have forwarded your query to the author of the course and will keep you updated on this

Laurent_Bernut_Laurent_Bernut_6ZdT · August 13, 2024, 1:33am

Hello Nikolay,

Thank You very much for your question. The latest version of the swing detection does not use find_peaks or argrelextrema. It is much faster and more accurate.

It calculates fractals at all levels. It seamlessly works across time frames. It can find the 1-minute bar that triggered the bear market avalanche on the 1-day bar.

I have sent a Jupyter notebook to Quantra with the latest version of the floor ceiling. This will take care of the problem immediately.

Besides, I will revise the code in the course. Thank You very much for pointing the issue BTW. I will also work on an update to the course with position sizing libraries and several changes.

Once again,

Nikolay_Zuykov · August 19, 2024, 8:26am

Thank you very much, Laurent, for your prompt response! It would be great to see the updates.

Dear Quantra, could we please get access to the updated notebooks mentioned by Laurent?

Kind Regards, Nikolay

_Rushda_Ansari · August 20, 2024, 5:37am

Hey Nikolay!

The notebook has been shared with you over email

Nikolay_Zuykov · August 20, 2024, 6:48am

Hi Rushda,

Thank you!

Regards, Nikolay

Nikolay_Zuykov · August 20, 2024, 3:41pm

Hi Laurent,

It seems like the fractal-based calculations might still have a look-ahead bias. For example, consider the following code:

def fractal(px, lvl):
    max_lvl = np.minimum(2,lvl)
    fractal = px[(px<= px.shift(-1)) & (px < px.shift(+1)) &
                 (px<= px.shift(- max_lvl)) & (px < px.shift(+ max_lvl))]
    return fractal
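A quick way to check this, as a sketch on made-up prices: run the same function on the full series and on series truncated at earlier bars, then compare when the low at bar 2 becomes visible. The function body is copied from above, with int() added defensively around np.minimum.

```python
import numpy as np
import pandas as pd

def fractal(px, lvl):
    # Same logic as above; shift(-k) reads the value k bars in the future
    max_lvl = int(np.minimum(2, lvl))
    return px[(px <= px.shift(-1)) & (px < px.shift(+1)) &
              (px <= px.shift(-max_lvl)) & (px < px.shift(+max_lvl))]

# Made-up prices with a single low at bar 2
px = pd.Series([5.0, 4.0, 3.0, 4.0, 5.0, 6.0])

full = fractal(px, 2)                  # all six bars known
as_of_bar_2 = fractal(px.iloc[:3], 2)  # only bars 0..2 known
as_of_bar_4 = fractal(px.iloc[:5], 2)  # bars 0..4 known

print(full.index.tolist(), as_of_bar_2.index.tolist(), as_of_bar_4.index.tolist())
```

On this sample the low is stamped at bar 2 but only becomes detectable once bar 4 exists, that is, max_lvl bars later. Unless the output is shifted forward by that many bars before it feeds a signal, a backtest would see it two bars early.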

Am I missing something?

Regards,

Nikolay