Why fit the StandardScaler only on train data?

"Section 1, Unit 29, Neural Networks course, ML track"



In the following code, the standard scaler is fitted on the train data, and the fitted scaler is then used to transform both the train and the test data:


# Scale the data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Create the scaler model using train data only
scaler.fit(X_train)

# Apply the same (train-fitted) scaling to both sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)



Question: Why not fit the scaler on all the data (both train and test, i.e. the variable X)? What is the reason for fitting only on X_train?



Thanks in advance

Hello Mohammad, 



The reason is simple: we don't want any information from the test set (which is meant to stand in for unseen, out-of-sample data) to leak into the training procedure. If the scaler were fitted on the full dataset, the rows of the test set would contribute to the mean and standard deviation used to standardise the training data, so the model would be trained with out-of-sample information. That leakage can make metrics such as accuracy and precision look misleadingly good at test time, because the test set is no longer truly unseen.



To summarise: we don't want to use information from the test set during training and then evaluate this "contaminated" model on that same test set.
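A minimal sketch of the leakage-free workflow, using synthetic data (the array shapes and random seed here are illustrative assumptions, not from the course code). Note that after transforming, the training split is standardised exactly (mean 0, std 1), while the test split is only approximately so, precisely because its rows did not contribute to the fitted statistics:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data just for illustration
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit on the training rows only ...
scaler = StandardScaler()
scaler.fit(X_train)

# ... then reuse the train-set mean/std for both splits
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train split: exactly standardised by construction
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_train_scaled.std(axis=0), 1.0))   # True
```

Had we called `scaler.fit(X)` instead, the test rows would have shifted the mean and standard deviation, and the "unseen" evaluation would no longer be honest.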