Linear regression on source estimates gives everything as significant

  • MNE version: 0.24.1
  • operating system: Windows 10

Hey all,

First time poster here, sorry if my question has already been dealt with, I had a search but didn’t see anything.

I am essentially trying to run a linear regression on a series of source estimates calculated using EEG and the template MRI. The problem I am having is that the regression marks every voxel at every time point as significant. This happens both with data that should have some correlations and with data that should have no correlations. Here is what I did:

I have a lot of epochs, so I recreated the BEM and source space to reduce the number of ‘voxels’:

subject = 'fsaverage'
trans = 'fsaverage'
bem_model = mne.make_bem_model(subject=subject, ico=4, subjects_dir=subjects_dir)
src = mne.setup_source_space(subject, spacing='ico4', surface='white',
                             subjects_dir=subjects_dir, add_dist=True, n_jobs=1)
bem = mne.make_bem_solution(bem_model)

I then imported the data from FIFF (it had already been cleaned etc. in FieldTrip and exported over), and ran through the basic forward and inverse modelling:

# Add montage from built-in file
montage = mne.channels.make_standard_montage('GSN-HydroCel-256')
ch_names = montage.ch_names
ch_types = ['eeg'] * 256
sfreq = 250
info = mne.create_info(ch_names=ch_names, sfreq=sfreq, ch_types=ch_types)
# Create epochs (baseline applied at construction)
epochs = mne.EpochsArray(epochs_data, info=info, events=events, tmin=-0.096,
                         event_id={'arbitrary': 1}, baseline=(-0.096, 0))
epochs.set_montage(montage)

# Pick EEG channels
picks = mne.pick_types(info, meg=False, eeg=True, misc=False)
# Downsample and crop
epochs.resample(50, npad='auto')  # downsample because of the large amount of data
epochs.crop(None, 0.3)

# Calculate forward model
fwd = mne.make_forward_solution(epochs.info, trans=trans, src=src, bem=bem,
                                meg=False, eeg=True, mindist=5.0, n_jobs=1)

# Inverse operator
from mne.minimum_norm import make_inverse_operator, apply_inverse_epochs
inverse_operator = make_inverse_operator(epochs.info, fwd, noise_cov,
                                         loose='auto', depth=5.0)
# (note: depth is usually a value between 0 and 1, e.g. 0.8)
# Perform inversion
method = "MNE"  # also tried dSPM, sLORETA
snr = 3.        # standard
lambda2 = 1. / snr ** 2
stc = apply_inverse_epochs(epochs, inverse_operator, lambda2, method=method,
                           pick_ori=None, verbose=True)

Then I create a design variable, and ‘add’ the STCs together so that I have a list of around 15k STCs. The design is basically three column vectors (1st tracks subject, 2nd and 3rd are predictive error measures).

if sub_counter == 1:
    design_store = design
    stc_store = stc
else:
    design_store = np.append(design_store, design, axis=0)
    stc_store = stc_store + stc
sub_counter += 1
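For what it's worth, this accumulation pattern can be sanity-checked on dummy data to make sure the shapes line up; here is a minimal numpy sketch (the subject counts, epoch counts, and the string stand-ins for STCs are all made up):

```python
import numpy as np

# Dummy per-subject data: each subject contributes a (n_epochs, 3) design
# block and a list of STC-like objects (represented here by strings).
rng = np.random.default_rng(0)
design_blocks = []
stc_store = []
for sub in range(3):                    # three hypothetical subjects
    n_epochs = 5
    design = rng.random((n_epochs, 3))  # subject, predictor 1, predictor 2
    stcs = [f"sub{sub}_epoch{e}" for e in range(n_epochs)]
    design_blocks.append(design)
    stc_store += stcs                   # list concatenation, like stc_store + stc

design_store = np.concatenate(design_blocks, axis=0)
print(design_store.shape)  # (15, 3)
print(len(stc_store))      # 15
```

Collecting the per-subject blocks in a Python list and concatenating once at the end also avoids the repeated `np.append` copies.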

And finally I run the regression:

from mne.stats import linear_regression
lr_results = linear_regression(stc_store, design_store)

But as I said at the top, this results in all voxels (5124 of them, as I said, downsampled), at all time points (21, again downsampled) being significant, whether the input is ‘real’ or basically random data.

I guess I am wondering if I am doing anything spectacularly wrong here? I have run the analysis on individuals, got their beta values, and then run a one-sample t-test on them, and I get some ‘reasonable’ results. In that case, if I use cluster-based correction (using mne.stats.spatio_temporal_cluster_1samp_test) I get around 5 clusters which are significant throughout the time frame (which doesn’t seem plausible), but if I instead use FDR correction, the results look more reasonable. But I read somewhere that unless you are actually comparing two conditions, i.e. subtracting one condition from another per subject, this method has some issues…

Anyway, sorry for the long post. Hopefully there is the right information in there, and thanks to anyone who can shed some light on this for me!


Hi Jordan, thanks for the question! It sounds like you’re running into a multiple comparisons issue, which is very typical for this kind of data. There are several references suggesting that cluster permutations are the way to go. If you have multiple conditions that are not amenable to subtraction and a one-sample cluster permutation test, repeated measures ANOVA is probably the way to go (see the tutorial Repeated measures ANOVA on source data with spatio-temporal clustering in the MNE documentation). I would follow the tutorial and make some plots of the significant points, which would make it a lot easier to give advice. There are many things that can go wrong in the preprocessing of the data, so my #1 tip, without digging into your analysis too much, is to visualize your output at every step. Hope that helps!


Hey Alex,

Thanks for the response! Yeah, I would have thought cluster correction is the way to go as well, but yeah, was worried about non-subtractability (if that’s a word) haha. Having said that, I should have specified that, firstly, even with Bonferroni correction (using mne.stats.bonferroni_correction), everything is still significant in this linear regression. Also, the variable of interest (i.e. the measure of predictive error I am using) is continuous, so I don’t think an ANOVA is appropriate.
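As a side note, the two corrections being compared here can be sanity-checked on simulated p-values; a minimal numpy sketch of Bonferroni vs. Benjamini-Hochberg FDR (the counts and effect sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
# 1000 null p-values (uniform) plus 50 true effects (very small p-values)
pvals = np.concatenate([rng.random(1000), rng.random(50) * 1e-4])
n = len(pvals)
alpha = 0.05

# Bonferroni: reject if p < alpha / n
bonf_reject = pvals < alpha / n

# Benjamini-Hochberg step-up: largest k with p_(k) <= (k / n) * alpha
order = np.argsort(pvals)
thresh = alpha * np.arange(1, n + 1) / n
below = pvals[order] <= thresh
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
fdr_reject = np.zeros(n, dtype=bool)
fdr_reject[order[:k]] = True

print(bonf_reject.sum(), fdr_reject.sum())  # FDR rejects at least as many
```

With everything significant even under Bonferroni (the strictest correction), the problem is very unlikely to be the correction method itself.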

As for plotting things at different stages, I have run the sensor space analysis in Fieldtrip and things looked alright. In MNE, I have plotted the source reconstructions at the individual level and they looked alright to me. The alignment of the electrodes over the head and the whitened GFP also look good (i.e. reasonably close to zero prior to the onset of the trigger). The covariance matrix looked a little funny to me (i.e. sort of shimmery), but not sure how bad that is. Would it be helpful if I added some images to this?


Adding images definitely helps so that I or someone else can give you advice. As I’ve heard Matti quote (he wrote MNE-C, which MNE-Python was originally based on, and is a great scientist):

“Statistics are really not necessary. I only conduct experiments in which the result is clear.”
Ragnar Granit

It’s in this talk, if you’re interested. The point being that you should hopefully have some plot where you can see what you’re running the linear regression on; the statistics will just give you a p-value, but the effect should be visually clear. That’s what I would try to look for/visualize.

Using a linear regression for a continuous variable sounds reasonable.

Hope that helps!

Hey Alex,

Thanks again for the reply. I definitely feel that sentiment haha, I have always thought the same: you should be able to ‘see’ an effect and just verify it with stats. But then, journals feel differently haha, so I want to make sure the stats are solid and follow some form of standard.

I have attached a bunch of images. Firstly, here is the alignment on the head (sorry, static image, but you can kind of see they seem alright):

Then there is the whitened data (of a single subject) and the covariance matrix:


I am only allowed three images, so will continue this in a followup post…

I mentioned that one way I had done this is to get the beta values for each individual, and then use a one-sample t-test to look at whether the beta values are significantly different from zero at each voxel. Here is the FDR-corrected version (at 150 ms). It looks semi-reasonable (there should be a type of mismatch negativity response, if that helps at all):

But then, when I use cluster correction, I get something like this, which is just wild to me:

For both those images, the data is sampled at 250 Hz, and I applied a ‘standard’ adjacency matrix (either just spatial or spatio-temporal, I’ve done both), as per the documentation, using the fsaverage template. I would have thought cluster correction is the way to go (especially as it gives ‘clusters’ which are spatially and temporally limited, and therefore much easier to report), but the results look pretty rubbish…

Any thoughts? Again, thanks for your help! I wonder if it is something to do with running mne.linear_regression on STCs produced from inverse_epochs which is the issue?


Okay, now I’m following your approach much better, thanks for posting the images!

Probably my main piece of advice would be not to run statistics on statistics (i.e. a t-test on beta values). Correlation coefficients have their own distribution and associated p-value, which is what I would recommend you use directly (corrected for multiple comparisons).

I am not aware of a way to use cluster corrections for GLMs, which is a bit of an issue because with brain-wide associations you usually end up with no significant correlations after correction. This leads a lot of people to define a priori regions of interest, which I am not a big fan of: you’re biasing your analysis with prior knowledge that is often circular with your hypothesis (i.e. “this brain area is related to this behavior, so we only checked this brain area to see if it was modulated”), and it throws out the rest of the data. As far as I know, this is a general weakness of GLMs and is a big part of what a lot of people raise as issues with reproducibility (see any of Russ Poldrack’s work, for instance).

One different approach is to turn it into a classification task and do a train-test split (with hyperparameter optimization on the training side, if there are any hyperparameters). The way I think of this is that you can p-hack all you want and test a million different models on the training data, but then you get your real results on unseen testing data. If the testing data is appropriately independent, that means your model has found something that is not idiosyncratic to the training set but generalizes to human brains more generally. This is basically the motivation for machine learning in neuroscience as I understand it.
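The train-test idea can be sketched with plain numpy (hypothetical data and a closed-form ridge fit; nothing here comes from the actual analysis). Fitting on one half and scoring on the held-out half means any overfitting on the training side cannot inflate the reported score:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[0] = 0.5                      # one genuinely predictive column
y = X @ beta + rng.standard_normal(n)

# Split: first half train, second half test
Xtr, Xte, ytr, yte = X[:1000], X[1000:], y[:1000], y[1000:]

# Ridge regression, closed form: (X'X + lam*I)^-1 X'y
lam = 1.0
w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(r2(yte, Xte @ w))            # held-out R^2, clearly above 0 here

# Shuffling y destroys the relationship: held-out R^2 drops to ~0 or below
w_shuf = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p),
                         Xtr.T @ rng.permutation(ytr))
print(r2(yte, Xte @ w_shuf))
```

scikit-learn wraps the same split/fit/score logic, but the held-out score is the whole trick.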

Yeah, I thought that running the stats on the beta values might be a bit iffy. At some point, I do need to compare two conditions, and see if the correlation between neural activity and the prediction error measure are the same or different, in which case I might have to use the beta values. I kind of came to the same conclusion, that I should use some kind of SVM approach.

But I would also like to get to the bottom of why the linear regression didn’t work when I ran it across all the subjects simultaneously (i.e. how I think you are supposed to do it, with a regressor for subject, and a regressor for the variable of interest). As a little follow-up, I added this bit to my code:

design_store[:, 1] = np.random.random_sample(len(design_store))
design_store[:, 2] = np.random.random_sample(len(design_store))

To generate random values for the predictive error measures, and I am still getting all time points and all voxels as significantly correlated (all p < 1x10^-308, which is pretty ridiculous). The input is 15803 STC files (not sure I mentioned that before). But I also tried splitting the input into two files, a high predictive error and a low predictive error condition, and then running a t-test on the difference between the STC reconstructions per subject (i.e. the input to the t-test was a subject x voxel x timepoint array). The result was again a huge number (not all, but a lot) of significant clusters. So maybe the source recons are the problem… the whole garbage in, garbage out phenomenon.
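A guess on my part, since the full design matrix isn’t shown: if it lacks an intercept (constant) column, the slope test partly absorbs the signal’s nonzero mean, and even a random regressor comes out wildly significant. A numpy sketch of that failure mode, using simulated data and a large-n normal approximation for the p-value:

```python
import math
import numpy as np

def slope_pval(X, y):
    """OLS t-test on the last column of X (two-sided, normal approx for large n)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t = beta[-1] / math.sqrt(cov[-1, -1])
    return math.erfc(abs(t) / math.sqrt(2))

rng = np.random.default_rng(1)
n = 15000
y = 1.0 + rng.standard_normal(n)   # signal with a nonzero mean, like evoked data
x = rng.random(n)                  # random predictor, truly unrelated to y

p_no_intercept = slope_pval(x[:, None], y)
p_with_intercept = slope_pval(np.column_stack([np.ones(n), x]), y)
print(p_no_intercept)    # essentially 0: the nonzero mean leaks into the slope
print(p_with_intercept)  # behaves like a proper null p-value
```

If the design matrix going into `linear_regression` looks like this (no constant column), that alone would explain “everything significant” even for random regressors.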

Haha, sorry, no real question in there, but any advice would be great!


Ah okay, if you have a lot of values you can get a really low p-value even for a small effect. That’s another big issue with linear regressions. I think we’re getting somewhere. I would try picking one voxel at one time point and plotting it across all subjects/stcs to give you an idea of the linear trend (and maybe run a linear regression and see what kind of p-value you get). Sounds like a very cool big data project!

Ah, good idea! I just ran this code (to be clear) to get a random voxel and a random time point, across all subjects and trials, and plotted the activity (y-axis) against one measure of predictive error (x-axis):

import numpy as np
import matplotlib.pyplot as plt

X = np.array([s.lh_data for s in stc_store])  # just the left hemisphere, for speed
vert = np.random.randint(0, high=X.shape[1])  # random voxel
time = np.random.randint(0, high=X.shape[2])  # random time point
y = X[:, vert, time]

time2 = stc_store[0].times[time]  # convert sample index to seconds for the label
x = design_store[:, 1]
y = y * 1000  # scale up so the fitted equation is legible in the legend
x = x * 10    # same here
plt.plot(x, y, 'o', color='black')
m, b = np.polyfit(x, y, 1)
plt.plot(x, m * x + b, 'r',
         label='y={:.2f}x+{:.2f}, vert={:d}, time={:.2f}'.format(m, b, vert, time2))
plt.legend()

And here are three of the images:

Figure 2022-04-14 145516
Figure 2022-04-14 145438
Figure 2022-04-14 145452

But yeah, I think everything seems to look like a flat line. This could be because the relatively higher predictive error values seem to have fewer data points, but that would bring down a regression line, right? I guess, if I was really, really dedicated to regression, I could do this for every voxel and time point, calculate p-values and adjust for multiple comparisons, but that doesn’t sound like a good idea, right…
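For reference, running the regression at every voxel and time point is mechanical once the data are stacked as (trials, voxels, times); here is a toy numpy sketch using a max-statistic permutation correction (all sizes and the planted effect are made up, nothing here is your data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_vox, n_times = 200, 30, 10
x = rng.standard_normal(n_trials)                  # the continuous predictor
Y = rng.standard_normal((n_trials, n_vox, n_times))
Y[:, 3, 5] += 0.5 * x                              # plant one real effect

def abs_corr(x, Yflat):
    """|Pearson r| between x and every column of Yflat, vectorized."""
    xc = x - x.mean()
    Yc = Yflat - Yflat.mean(axis=0)
    num = xc @ Yc
    den = np.sqrt((xc @ xc) * (Yc ** 2).sum(axis=0))
    return np.abs(num / den)

Yflat = Y.reshape(n_trials, -1)                    # (trials, vox * times)
r_obs = abs_corr(x, Yflat)

# Max-statistic permutation correction: compare each observed |r| to the
# permutation distribution of the maximum |r| over all voxels/times.
n_perm = 500
max_null = np.empty(n_perm)
for i in range(n_perm):
    max_null[i] = abs_corr(rng.permutation(x), Yflat).max()

p_corr = (1 + (max_null[:, None] >= r_obs[None, :]).sum(axis=0)) / (n_perm + 1)
p_corr = p_corr.reshape(n_vox, n_times)
print(p_corr[3, 5])   # the planted voxel/time point survives correction
```

The max-statistic approach controls family-wise error across all voxels and time points at once, which is one way to do mass-univariate regression without FDR.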

Any thoughts?


Oh, I thought I would add that those were with MNE source reconstruction; if I run the same code but use sLORETA, I get much higher coefficients (e.g. at 220 ms on vertex 772, the formula for the effect is 14.9x + 376). Having run that plotting code a few times, it does seem that the coefficients are higher in the ‘active’ windows around 150 ms and 200 ms post stimulus onset than in the baseline period (which averages around 4.5 or so). I wonder if there is a way to adjust the beta values in the active period (any time after stimulus onset) by the values in the baseline period. The data is already baseline corrected, but still…

Anyway, thought I would add that side-tangent on, in case anyone else has ever had a similar question haha.


Looks like the issue is that there is so much data that a very small effect will have a significant p-value which, as far as I know, is just a fundamental limitation of this method of analysis. Maybe someone else can comment further, but with this much data, applying statistical corrections just ends up comparing very small numbers to other very small numbers, which I don’t think works all that well. It does seem odd that a slope of 0.00 would be significant, though…

Hey Alex,

Thanks for your help with this! Even if we don’t get to the bottom of it, it’s been helpful to talk through, and considering the hard time I had getting to the bottom of it, I am sure others will be interested too.

Here is my current ‘thinking’ (I put it in quotation marks because, well, not sure how smart it is haha). I reckon there might be something like three options to follow up on:

  1. I can use cross-validation in some form of machine learning approach. Maybe do something like export to a numpy array and then use a function in scikit-learn. The problem here is the multiple comparison correction method: doing something like FDR would be easy, but I am not sure about implementing something more robust (at least in my opinion) like cluster-based correction post hoc. I get the idea (finding clusters of activity across a spatio-temporal adjacency matrix, then keeping the t-stats that are above a threshold, etc.), but the implementation seems like a bit of work…

  2. Maybe I could try and find significant differences between the baseline values and the post-stimulus values. The problem here is that there is an amount of arbitrariness to this approach. For example, is 1 ms after stimulus onset a reasonable timeframe for frontal activation? Probably not. Is -50 ms a reasonable time for some kind of STG activation? Well, maybe, if there is an expectation effect or something…

  3. I was reading a bit about the backend of the two-stage SPM approach to analysis. It seems to perform regression at the single-subject level, and then bring the beta values forward to the second level. I agree with your earlier point about the distribution of beta values, so maybe some kind of z-transform, or a non-parametric second-level test, might be in order. The problem being, this method still doesn’t explain why cluster-based correction doesn’t work. I’d be much more comfortable if the FDR results and the cluster-based correction results were similar.

Anyway, if anyone out there has a good suggestion, let us know. Otherwise, I’ll hopefully post back and let people know which option I went for, and how it worked out.


I think the cluster permutation would be reasonable and is implemented in MNE in that example I posted earlier. You would have to discretize your variables which is not ideal since it’s a continuous variable. I think it’s actually not too hard to use a cluster permutation on beta values from a linear regression since you can pass in a stats function.
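To illustrate the idea, a sign-flip cluster permutation on per-subject beta time courses can be written out in plain numpy for the 1D (single-voxel) case. This is toy data and a bare-bones implementation; mne.stats.permutation_cluster_1samp_test does this properly, including spatial adjacency:

```python
import numpy as np

def clusters_1d(mask):
    """Return (start, stop) index pairs of contiguous True runs."""
    out, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(mask)))
    return out

def cluster_perm_1samp(betas, thresh=2.0, n_perm=1000, seed=0):
    """Sign-flip cluster permutation test on (n_subjects, n_times) betas."""
    rng = np.random.default_rng(seed)
    n_sub = betas.shape[0]

    def tvals(b):
        return b.mean(axis=0) / (b.std(axis=0, ddof=1) / np.sqrt(n_sub))

    t_obs = tvals(betas)
    obs_clusters = clusters_1d(np.abs(t_obs) > thresh)
    obs_mass = [np.abs(t_obs[a:b]).sum() for a, b in obs_clusters]

    # Null distribution: max cluster mass under random sign flips of subjects
    null_max = np.zeros(n_perm)
    for i in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=(n_sub, 1))
        t_p = tvals(betas * flips)
        masses = [np.abs(t_p[a:b]).sum()
                  for a, b in clusters_1d(np.abs(t_p) > thresh)]
        null_max[i] = max(masses, default=0.0)

    pvals = [(1 + (null_max >= m).sum()) / (n_perm + 1) for m in obs_mass]
    return obs_clusters, pvals

# Toy betas: 20 subjects, 21 time points, real effect at time samples 8-12
rng = np.random.default_rng(3)
betas = rng.standard_normal((20, 21))
betas[:, 8:13] += 1.5
clusters, pvals = cluster_perm_1samp(betas)
print(clusters, pvals)
```

In MNE itself, the same thing is a matter of passing the stacked beta array (subjects × times × vertices) into the cluster function, so the “stats on stats” concern is handled by the permutation rather than by assuming a distribution for the betas.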

Otherwise, the hierarchical model sounds reasonable. There is more documentation in SPM about hierarchical models, but there is some here too: Frontiers | A Reproducible MEG/EEG Group Study With the MNE Software: Recommendations, Quality Assessments, and Good Practices | Neuroscience. Hierarchical modeling hasn’t been added to MNE much as far as I know, but it would be a great addition when someone has time. I personally have done some Bayesian hierarchical modeling, but I had help from a coauthor (Sam) writing the Stan model, which is pretty hard and complicated. That’s a cool approach too, but definitely not the easiest route. SPM is probably a good place to look for hierarchical linear modeling documentation, or maybe scikit-learn, but I don’t see a whole lot. @agramfort, do you know of any documentation with MNE compatibility for hierarchical linear modeling?


Denis is using lme4 in R for this. Clearly, group-level stats is the next big thing that needs to be tackled in MNE. I am not the most experienced person for this, so any help is appreciated.



Best choice IMHO.

The problem with lme4 is that it’s usually not easy to retrieve p-values, which are still some sort of a “must-have” depending on where you intend to publish. For this reason, I would usually use afex, which relies on lme4 under the hood and conveniently gives you p-values by default.


Hey all,

Thanks for your help! I hadn’t thought of exporting to R; not a bad idea. Just to add to what was said above, there is a package, lmerTest, which also gives p-values for lme4 models. I am guessing a GAM would also be an option in this case, but then interpreting the effects might be more difficult.

As far as using an LME goes, do you all reckon I should export the single-trial data as a CSV and run a model like:
amplitude ~ predictive_error * voxel + (1 | subject)

I guess in that case there is still the issue of clustering though, right?
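In case it helps anyone following along, the export step would just be a long-format table with one row per trial/voxel/time point; a stdlib-only sketch with placeholder values (the column names are only a suggestion):

```python
import csv
import io

# Dummy single-trial amplitudes: 2 subjects x 2 trials x 2 voxels x 2 times
rows = []
for subject in ("s01", "s02"):
    for trial in range(2):
        pred_error = 0.1 * trial       # stand-in for the trial's PE measure
        for voxel in range(2):
            for t_ms in (100, 150):
                amplitude = 0.0        # stand-in for stc data at (voxel, time)
                rows.append([subject, trial, pred_error, voxel, t_ms, amplitude])

buf = io.StringIO()                    # swap for open("long_format.csv", "w")
writer = csv.writer(buf)
writer.writerow(["subject", "trial", "pred_error", "voxel", "time_ms", "amplitude"])
writer.writerows(rows)
print(len(rows))   # 16 rows: 2 * 2 * 2 * 2
```

One warning on size: 15k trials × 5124 voxels × 21 time points in long format is over a billion rows, so in practice you would fit per voxel (or per ROI) rather than putting voxel in the formula for the whole brain at once.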