I’m still wondering about the heuristic for setting threshold_muscle. I computed the mean and standard deviation of scores_muscle and the y-value of the red line is approximately the mean plus two standard deviations. Is using the mean plus some multiple of the standard deviation a reasonable approach for setting the threshold?

I don’t know of a general threshold that would work for all datasets. The documentation suggests:

# The threshold is data dependent, check the optimal threshold by plotting
# ``scores_muscle``.

The default threshold is set to 4, which seems to be reasonable for many use cases. But whether it makes sense for you or if you need to adjust it can only be decided by actually looking at the data.

The threshold is already expressed as a z-score, i.e. basically as a number of standard deviations.

Setting the z-score criterion too low results in a high number of false positives for clean datasets and a high number of true positives for noisier datasets.
Setting the z-score criterion too high results in a low number of false positives for clean datasets and a low number of true positives for noisier datasets.
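To make this trade-off concrete, here is a minimal sketch on synthetic data (not real muscle scores, and not how MNE computes them internally). The `scores` list, the burst positions, and the `count_above` helper are all made up for illustration; the point is only that lowering the z-score criterion flags more samples.

```python
import random
import statistics

random.seed(0)

# Synthetic stand-in for muscle scores (assumption): Gaussian noise
# plus two short bursts of much larger values.
scores = [random.gauss(0.0, 1.0) for _ in range(5000)]
for start in (1000, 3000):
    for i in range(start, start + 50):
        scores[i] += 8.0

# Express every sample as a z-score: distance from the mean in units of SD.
mean = statistics.fmean(scores)
sd = statistics.pstdev(scores)
z = [(s - mean) / sd for s in scores]


def count_above(z_scores, threshold):
    """Count samples whose z-score exceeds the threshold (hypothetical helper)."""
    return sum(1 for v in z_scores if v > threshold)


# Sweep a few candidate thresholds and see how many samples each one flags.
counts = {thr: count_above(z, thr) for thr in (2, 3, 4, 5)}
for thr, n in counts.items():
    print(f"z > {thr}: {n} samples flagged")
```

Running the same kind of sweep on your own scores_muscle, next to the plot, is one way to see how sensitive the result is to the threshold you pick.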

So it is hard to generalize this value, but I still use a single value across datasets.

My solution is to be quite conservative with the value (a z-score of 4 is fine, though), but to require a minimum length for each bad segment to rule out spurious false positives.

# Collect the indices of muscle annotations shorter than 0.3 s, then drop them.
remove = []
for a, annotation in enumerate(annot_muscle):
    if annotation['duration'] < 0.3:
        remove.append(a)
annot_muscle.delete(remove)
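If you want to check the duration filter without loading a recording, the same logic can be tried on plain dictionaries. This is only a sketch: the `annots` list and the `short_segment_indices` helper are hypothetical and stand in for the dict-like items you get when iterating over the annotations.

```python
def short_segment_indices(annotations, min_duration=0.3):
    """Return indices of annotations shorter than min_duration (in seconds)."""
    return [
        i for i, annot in enumerate(annotations)
        if annot["duration"] < min_duration
    ]


# Made-up annotations for illustration only.
annots = [
    {"onset": 1.0, "duration": 0.05, "description": "BAD_muscle"},
    {"onset": 4.2, "duration": 0.80, "description": "BAD_muscle"},
    {"onset": 9.5, "duration": 0.29, "description": "BAD_muscle"},
    {"onset": 12.0, "duration": 0.30, "description": "BAD_muscle"},
]

# Indices of segments that are too short to be trusted.
remove = short_segment_indices(annots)

# Keep everything that is not marked for removal.
kept = [a for i, a in enumerate(annots) if i not in remove]

print(remove)
print([a["onset"] for a in kept])
```

Note that the comparison is strict, so a segment of exactly 0.3 s is kept. The 0.3 s cutoff is just the value I use; adjust it to your data.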