When to perform split of data into training and test sets (classifier is returning below chance accuracies)

mpcoll · September 3, 2021, 8:07pm

You are right that the training and test data should not be processed together, but it is also a problem to estimate the processing parameters (e.g. scaling mean and SD, PCA solution, CSP filters) separately for each split: the parameters estimated on the test data can differ a lot from those estimated on the training data, which will interfere with classification. You should learn the parameters on the training split and apply them to the test data.
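To make the idea concrete, here is a minimal sketch using only the Python standard library. The toy values and the helper names `fit_scaler` and `apply_scaler` are made up for illustration; they are not part of MNE or sklearn.

```python
from statistics import mean, stdev


def fit_scaler(values):
    """Learn the scaling parameters (mean and sd) from one split."""
    return mean(values), stdev(values)


def apply_scaler(values, params):
    """Standardize values with previously learned parameters."""
    mu, sd = params
    return [(v - mu) / sd for v in values]


# Hypothetical toy data: the test split has a different offset.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [11.0, 12.0, 13.0]

params = fit_scaler(train)                  # learned on the training split only
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)    # same parameters reused on the test split

# For comparison: scaling the test split with its own parameters
# puts it on a different scale than the one used for training.
test_scaled_own = apply_scaler(test, fit_scaler(test))
```

With the training parameters, the test values stay on the scale the later steps were fit on. With their own parameters, they are re-centred around zero, so the same raw value maps to a different scaled value in each split.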

You can do this easily by wrapping all your steps in a sklearn pipeline.

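Something like:

from mne.decoding import CSP, Scaler, UnsupervisedSpatialFilter
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

clf = Pipeline([('scaler', Scaler(epochs.info)), ('pca', UnsupervisedSpatialFilter(PCA(.99))), ('csp', CSP(n_components=6, reg=0.1, log=True))])

# Fit every step on the training split only (CSP is supervised, so it needs the labels, assumed here to be y_train)
X_train = clf.fit_transform(X_train, y_train)
X_test = clf.transform(X_test)  # apply the learned parameters to the test split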
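If you put a classifier at the end of the pipeline, sklearn's cross-validation helpers refit all steps on the training part of each fold, so you don't have to handle the split by hand. Below is a rough sketch of that pattern with random 2D data and plain sklearn steps (`StandardScaler`, `PCA`, `LogisticRegression`) standing in for the MNE transformers; the data, shapes and choice of classifier are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 trials x 20 features, two balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.repeat([0, 1], 50)
X[y == 1, :3] += 1.0  # add a class difference to the first three features

# Preprocessing and classifier in one estimator.
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.99), LogisticRegression())

# In each fold, the scaler, PCA and classifier are fit on the training part
# and only applied to the held-out part.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The same idea should carry over to a pipeline built from the MNE steps above, as long as the last step is a classifier.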
