Normalising data before splitting into train/test

E.g. 3. Trading with Machine Learning Regression:


3.22 Cross Validation, Test and Train (and in a previous example)


There are some examples where you normalise the whole data sample before splitting it into train and test datasets. Isn't that a mistake? You are using information from the test set to normalise the train set.

Step 1: Scale the data
# First we put scaling and then linear regression in the pipeline.
steps = [('scaler', StandardScaler()),
         ('linear', LinearRegression())]

# Define pipeline
pipeline = Pipeline(steps)

Later step: splitting the data
# We are using 80%-20% split, therefore splitting ratio will be 0.80
splitting_ratio = .80

# Split the data into two parts
# Use int to ensure that result is of integer data type.
split = int(splitting_ratio*len(gold_prices))

# Define train dataset
X_train = X[:split]
yU_train = yU[:split]
yD_train = yD[:split]

# Define test data
X_test = X[split:]
yU_test = yU[split:]
yD_test = yD[split:]
 

Hello Henrik, thanks for pointing this out. It looks like there is look-ahead bias in a couple of instances. This will be rectified.
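
One way to avoid this kind of leak, as a minimal sketch rather than the updated notebook code: estimate the mean and standard deviation on the training rows only, then reuse those same parameters for the test rows. The toy `values` series and the `scale` helper are hypothetical and only for illustration; the split follows the 80%-20% ratio from the question. With scikit-learn, the same effect comes from calling `fit` on the training rows only (for example `pipeline.fit(X_train, yU_train)`), since a `Pipeline` fits its scaler on whatever data it is given.

```python
from statistics import mean, pstdev

# Hypothetical toy series standing in for a single feature column.
values = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0, 30.0, 32.0]

# Chronological 80%-20% split, as in the question.
splitting_ratio = 0.80
split = int(splitting_ratio * len(values))
train, test = values[:split], values[split:]

# Estimate the scaling parameters from the training rows ONLY.
# pstdev is the population standard deviation (the convention
# StandardScaler also uses).
train_mean = mean(train)
train_std = pstdev(train)


def scale(rows, mu, sigma):
    """Standardise rows with externally supplied parameters (hypothetical helper)."""
    return [(r - mu) / sigma for r in rows]


# Apply the SAME training parameters to both sets.
train_scaled = scale(train, train_mean, train_std)
test_scaled = scale(test, train_mean, train_std)

# For comparison only: the mean a full-sample normalisation would use.
leaky_mean = mean(values)

print("train mean:", train_mean, "full-sample mean:", leaky_mean)
```

Here the last two values sit far outside the training range, so the full-sample mean (16.2) differs noticeably from the training mean (12.5); scaling with the former would let test-period information shape the training features.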



Thank you

Hello Henrik, the notebooks have been updated. 



Thanks