I would like to cluster before training and testing. What data should I use? Usually, preprocessing uses all the data, which means I'll have lookahead bias or data leakage.
Hi Emma,
To perform clustering before training and testing without introducing lookahead bias or data leakage, you can follow these steps:
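- Data Splitting: Split your dataset into at least two parts: a training set and a testing (or validation) set. If your data is time-ordered, which is where lookahead bias comes from, split chronologically so that the testing set comes strictly after the training set. The training set will be used for clustering, while the testing set will be kept separate for later evaluation.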
- Clustering: Apply your chosen clustering algorithm on the training set only. This clustering step should not involve any information from the testing set.
- Cluster Assignment: Assign each data point in the training set to one of the clusters based on the results of the clustering algorithm. Be sure not to use any information from the testing set during this assignment.
- Feature Engineering and Modeling: For each cluster identified in the training set, you can perform feature engineering and modeling separately. This means that any transformations, feature selection, or modeling techniques are fitted independently for each cluster, using only the training points that belong to it.
- Testing or Validation: Once you have trained a separate model or completed a separate analysis for each cluster, assign each point in the testing set to a cluster using the clustering model that was fitted on the training set (for k-means, this means the nearest training centroid). Then apply that cluster's transformations and model and evaluate the results. Importantly, the testing set should only ever be passed through objects that are already fitted; it should never be used to fit the clustering, the feature engineering, or the models.
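The steps above could look roughly like the sketch below. It is only an illustration under my own assumptions, not something taken from your setup: the data is synthetic, the rows are assumed to be in time order, and I picked scikit-learn's `KMeans` with three clusters and a `LinearRegression` per cluster just to have something concrete. Swap in whatever clustering algorithm and model you actually use.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative. Rows are assumed to be in time order.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# 1. Split first, without shuffling: the test set is the most recent 20%.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# 2. Fit the scaler and the clustering on the training set only.
scaler = StandardScaler().fit(X_train)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
train_labels = kmeans.fit_predict(scaler.transform(X_train))

# 3. Train one model per cluster, using only that cluster's training rows.
models = {}
for c in range(kmeans.n_clusters):
    mask = train_labels == c
    models[c] = LinearRegression().fit(X_train[mask], y_train[mask])

# 4. Assign test rows to clusters with the already-fitted scaler and k-means
#    (predict only, no refitting), then use the matching cluster's model.
test_labels = kmeans.predict(scaler.transform(X_test))
y_pred = np.empty(len(X_test))
for c, model in models.items():
    mask = test_labels == c
    if mask.any():
        y_pred[mask] = model.predict(X_test[mask])

mse = float(np.mean((y_test - y_pred) ** 2))
print(f"test MSE: {mse:.4f}")
```

The key point is that `fit` and `fit_predict` are only ever called on the training data, while the testing data only goes through `transform` and `predict`. If you use cross-validation, the same rule applies inside every fold: refit the scaler and the clustering on that fold's training portion.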
Thanks,
Akshay