Cluster-based permutation test in MNE, and how to deal with differences in sample sizes?

Hi good people of MNE discourse! I’m running some decoding models, which return a series of AUCs, one for each time point, reflecting the classifier’s ability to discriminate between trial types at that time point. I need to know, for each participant and within each condition, where the significant clusters of high AUCs start and end. To do this I wrote a cluster-based permutation test that works on the probabilities the classifier assigns to each trial/time point (pasted below). It takes an n_trials x n_timepoints array and returns the significant clusters of probabilities. It first finds candidate time points for the clusters by selecting those where the p-value is less than .05, then runs the permutation test on the t-values, and finally returns the AUCs for the significant clusters.
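For context, here is roughly how I get the per-time-point AUCs; a minimal sketch, assuming epochs is the usual mne.Epochs object with event codes 1001/1002 (the per-trial probabilities that my function below consumes come from predict_proba of the same sliding estimator):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from mne.decoding import SlidingEstimator, cross_val_multiscore

X = epochs.get_data()            # n_trials x n_channels x n_timepoints
y = epochs.events[:, 2]          # event codes, 1001 vs 1002
y_bin = (y == 1002).astype(int)  # sklearn's roc_auc scorer wants 0/1 labels

# one classifier fitted independently at every time point
clf = make_pipeline(StandardScaler(), LogisticRegression())
sliding = SlidingEstimator(clf, scoring='roc_auc')

# one AUC per time point, averaged over cross-validation folds
mean_aucs = cross_val_multiscore(sliding, X, y_bin, cv=5).mean(axis=0)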

My question has two parts. First, there’s probably a pre-built MNE function that does this, right? My brain just isn’t powerful enough to figure out how to do that :confused:

And second, how do you compare the CBPT results for subsets of trials with differing sample sizes? The reason this is not clear to me is that the initial step, which decides whether a time point enters the permutation procedure, is based on the p-value, which is of course affected by the sample size. So if I run the same procedure on a subset of samples within a participant, the probability of a time point being considered significant will of course be lower, and this introduces a bias in my analysis in favour of conditions with a larger N. Is there an approach to this that doesn’t involve p-values which I can implement in MNE? Many thanks! Below is my function (after a quick demo of the issue), but I don’t really expect anyone to look at it :slight_smile:
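To make the sample-size dependence concrete, here is a toy check (simulated numbers, not my data): with the same underlying effect size, the t-test typically clears the p < .05 gate only in the larger sample.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d = 0.4  # the same true effect size in both runs
for n in (20, 200):  # trials per condition
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d, 1.0, n)
    tval, pval = stats.ttest_ind(a, b)
    print(f"n={n}: t={tval:.2f}, p={pval:.4f}")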

import numpy as np
import pandas as pd
from scipy import stats


def clusterBasedPermutationTest(predictions, y, mean_aucs):

    # calculate t and p values for each time point
    res = pd.DataFrame()
    for j in range(predictions.shape[1]):
        tval, pval = stats.ttest_ind(predictions[y == 1001, j],
                                     predictions[y == 1002, j])

        res.loc[j, 'sample_point'] = j
        res.loc[j, 'time_point'] = (j * 4) - 100  # sample index to ms (4 ms per sample, 100 ms baseline)
        res.loc[j, 'tval'] = tval
        res.loc[j, 'pval'] = pval

    # find significant time points and create a grouping variable in order to
    # assign a label to each run of contiguous significant points (a cluster)
    res['significant'] = res['pval'] <= .05
    m = res['significant'].ne(res['significant'].shift()).cumsum()
    res['cluster_number'] = (m[res['significant']].groupby(m).ngroup().add(1)
                             .reindex(res.index, fill_value=False))

    # select those clusters that span more than 10 ms (at least 3 samples)
    res['cluster_size'] = res.groupby('cluster_number')['cluster_number'].transform('count')
    res = res[(res['cluster_size'] >= 3) & res['significant']]

    # tidy
    res['sample_point'] = res['sample_point'].astype(int)
    res['time_point'] = res['time_point'].astype(int)
    res = res.drop(['significant', 'cluster_size'], axis=1)

    # cluster-level statistic: the absolute sum of t-values within each cluster
    res['t_sum'] = np.abs(res.groupby('cluster_number')['tval'].transform('sum'))

    cluster_labels = list(res['cluster_number'].unique())

    # iterate over clusters, building a permutation null distribution for each
    cluster_perm_results = []
    for k in cluster_labels:

        cluster = res[res['cluster_number'] == k]

        # predictions at just this cluster's time points (n_trials x n_cluster_points)
        cluster_preds = predictions[:, cluster['sample_point']]

        # iterate over label permutations: under the null hypothesis the trial
        # labels are exchangeable, so shuffling them gives the null distribution
        t_sum = []
        for _ in range(1000):

            y_perm = y.copy()
            np.random.shuffle(y_perm)

            # iterate over sample_points within a cluster
            tval_cluster = []
            for col in range(cluster_preds.shape[1]):
                tval, pval = stats.ttest_ind(cluster_preds[y_perm == 1001, col],
                                             cluster_preds[y_perm == 1002, col])

                tval_cluster.append(tval)
            # use the same statistic as the observed one: |sum of t-values|
            t_sum.append(np.abs(np.sum(tval_cluster)))
        cluster_perm_results.append(t_sum)

    # observed cluster statistic, one value per cluster
    actual_cluster_results = res.groupby('cluster_number')['t_sum'].first().reset_index()

    # a cluster is significant if its observed statistic exceeds the 97.5th
    # percentile of its permutation distribution
    sig_clusters = []
    for i in range(len(cluster_perm_results)):
        sig_clusters.append(actual_cluster_results['t_sum'][i] >
                            np.percentile(cluster_perm_results[i], 97.5))

    sig_clusters = [i for (i, v) in zip(cluster_labels, sig_clusters) if v]

    sig_timepoints = np.array(res.loc[res['cluster_number'].isin(sig_clusters), 'sample_point'])

    # keep the mean AUC at sampling points inside significant clusters; mark
    # all other sampling points False
    final_clusters = []
    for k in range(mean_aucs.shape[0]):

        if np.any(sig_timepoints == k):
            final_clusters.append(mean_aucs[k])
        else:
            final_clusters.append(False)

    return final_clusters

I managed to figure out the first part of my question: the CBPT can be run with from mne.stats import permutation_cluster_test, and it is somewhat encouraging that it returns the same result as the function above. However, the fact that the two agree suggests that permutation_cluster_test also selects the candidate time points for the clusters based on their p-values, so comparing results across analyses with different sample sizes would still be a problem. Does anyone know how to circumvent this?
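Update: from the docstring it looks like the cluster-forming step can be decoupled from p-values via the threshold argument, either by passing a fixed statistic value or a TFCE dict. A minimal sketch, assuming the same predictions array and y labels as in my function above:

from mne.stats import permutation_cluster_test

# one array per condition, each n_trials x n_timepoints
X = [predictions[y == 1001], predictions[y == 1002]]

# option 1: set the cluster-forming threshold on the statistic itself (an F
# value here, since the default stat_fun is an F-test) instead of deriving
# it from p = .05 and the current sample size
F_obs, clusters, cluster_pv, H0 = permutation_cluster_test(
    X, threshold=6.0, n_permutations=1000, tail=1)

# option 2: threshold-free cluster enhancement (TFCE), which integrates over
# a whole range of thresholds so no single cutoff is involved
F_obs, clusters, cluster_pv, H0 = permutation_cluster_test(
    X, threshold=dict(start=0, step=0.2), n_permutations=1000, tail=1)

That said, I’m not sure a fixed F cutoff fully solves the comparison across different N either, since the same effect size yields a larger F with more trials; TFCE looks like the more promising option, but corrections welcome.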