When to perform split of data into training and test sets (classifier is returning below chance accuracies)

You are right that the training and test data should not be processed together, but it is also a problem to estimate separate processing parameters (e.g. scaling mean and sd, PCA solution, CSP filters) on each split: the parameters estimated on the test data can differ a lot from those estimated on the training data, which will interfere with classification. Instead, learn the parameters on the training split only and apply them unchanged to the test data.
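To make the "learn on train, apply to test" idea concrete, here is a minimal sketch for the scaling step alone. It uses sklearn's StandardScaler on synthetic data as a stand-in; the array shapes, the random data and the variable names are assumptions for illustration, not your actual epochs.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # synthetic features, illustration only
y = rng.integers(0, 2, size=100)                   # synthetic binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
# Mean and sd are estimated from the training split only
X_train_scaled = scaler.fit_transform(X_train)
# The training mean and sd are reused here; nothing is re-estimated on the test split
X_test_scaled = scaler.transform(X_test)
```

The test data will generally not end up with exactly zero mean and unit sd after this, and that is expected: it is on the same scale as the training data, which is what the classifier needs.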

You can do this easily by wrapping all your steps in a sklearn pipeline.

Something like:

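from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from mne.decoding import CSP, Scaler, UnsupervisedSpatialFilter

# Every step is fit on the training split only, then reused unchanged on the test split
clf = Pipeline([('scaler', Scaler(epochs.info)),
                ('pca', UnsupervisedSpatialFilter(PCA(.99))),
                ('csp', CSP(n_components=6, reg=0.1, log=True))])

X_train = clf.fit_transform(X_train, y_train)  # CSP is supervised, so it needs the training labels
X_test = clf.transform(X_test)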
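
If you also put a classifier at the end of the pipeline, you can hand the whole thing to cross-validation and every step gets refit on the training part of each fold. A rough sketch of that, assuming plain sklearn transformers (StandardScaler, PCA) and LogisticRegression as stand-ins for the MNE steps and your classifier, on synthetic 2D data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 60)          # synthetic labels, 60 samples per class
X = rng.normal(size=(120, 10))     # synthetic features, illustration only
X[y == 1, :3] += 1.5               # shift three features so the classes are separable

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(0.99)),            # keep components explaining 99% of the variance
    ('clf', LogisticRegression()),
])

# For each fold, all three steps are fit on the training part and applied to the held-out part
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print(scores.mean())
```

This way there is no manual bookkeeping of which parameters came from which split, which is the usual place where leakage creeps in.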