Normalising data before splitting into train/test

Henrik_Gjerning_5KRV5 · December 7, 2022, 12:13pm

E.g. 3. Trading with Machine Learning Regression:

3.22 Cross Validation, Test and Train (and in a previous example)

Some examples normalise the whole data sample before splitting it into train/test datasets. Isn't that a mistake, as you are then using information from the test set to normalise the train set?

Step 1: Scale the data
# First we put scaling and then linear regression in the pipeline.
steps = [('scaler', StandardScaler()),
         ('linear', LinearRegression())]

# Define pipeline
pipeline = Pipeline(steps)

Later step: Splitting the data
# We are using 80%-20% split, therefore splitting ratio will be 0.80
splitting_ratio = .80

# Split the data into two parts
# Use int to ensure that result is of integer data type.
split = int(splitting_ratio*len(gold_prices))

# Define train dataset
X_train = X[:split]
yU_train = yU[:split]
yD_train = yD[:split]

# Define test data
X_test = X[split:]
yU_test = yU[split:]
yD_test = yD[split:]
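
For comparison, here is a minimal sketch of the ordering I would expect: split first, then fit the pipeline on the train part only. The data is synthetic and the names `X`, `y` and `y_pred` are my own placeholders, not the course's gold price data. When `fit` is called on the train set only, `StandardScaler` learns its mean and standard deviation from the train rows and reuses them for the test rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): a trending feature, so the
# train and test statistics differ, as they often do for price series.
rng = np.random.default_rng(0)
n = 500
X = (np.linspace(100, 200, n) + rng.normal(0, 1, n)).reshape(-1, 1)
y = 0.5 * X[:, 0] + rng.normal(0, 1, n)

# Split chronologically BEFORE any scaling is fitted.
split = int(0.80 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('linear', LinearRegression())])

# fit() learns the scaler's mean and standard deviation from X_train only.
pipeline.fit(X_train, y_train)

# predict() transforms X_test with the training statistics, then predicts.
y_pred = pipeline.predict(X_test)

# The fitted scaler's mean matches the train mean, not the full-sample mean.
scaler = pipeline.named_steps['scaler']
print(scaler.mean_, X_train.mean(axis=0), X.mean(axis=0))
```

If the scaler were instead fitted on all of `X` before the split, its mean would be pulled towards the later (test) values, which is the leak I am asking about.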

varun_kumar_pothula · December 12, 2022, 4:04am

Hello Henrik, thanks for pointing this out. It looks like there is a look-ahead bias in a couple of instances. This will be rectified.
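
Since the section in question is about cross-validation, the same principle applies within each fold. The sketch below is only an illustration under assumptions, not the updated notebook code: the data is synthetic, and `X_demo`, `y_demo` and the choice of `TimeSeriesSplit` with five splits are placeholders. Passing the whole pipeline to `cross_val_score` means the scaler is refitted on each fold's training rows, so no validation rows influence the scaling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), ordered in time.
rng = np.random.default_rng(1)
n = 300
X_demo = (np.linspace(100, 200, n) + rng.normal(0, 1, n)).reshape(-1, 1)
y_demo = 0.5 * X_demo[:, 0] + rng.normal(0, 1, n)

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('linear', LinearRegression())])

# TimeSeriesSplit keeps every validation fold after its training fold.
cv = TimeSeriesSplit(n_splits=5)

# cross_val_score clones and refits the whole pipeline for each fold,
# so the scaler only ever sees that fold's training rows.
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv,
                         scoring='neg_mean_squared_error')
print(scores)
```

Scaling `X_demo` once up front and then cross-validating only the regression would reintroduce the look-ahead bias, because every fold would share statistics computed from the full sample.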

Thank you

varun_kumar_pothula · December 16, 2022, 7:54am

Hello Henrik, the notebooks have been updated.

Thanks