The joy of sigma

Anyone who looks in any detail at scientific results may have come across p-values and sigmas being used to determine the significance of an outcome - but what are they, and why is there such a huge disparity between practice in the social sciences and physics?

These are statistical measures of the probability of obtaining the reported results if the 'null hypothesis' is true - which is to say, if the effect being reported doesn't exist. The social sciences, notably psychology, usually consider the marker for statistical significance to be a p-value of less than 0.05, while in physics the aim is often to have a 5 sigma result.

Both these measures depend on creating a probability distribution, showing the likelihood of different values occurring. The p-value is a direct measure of the probability of getting results at least as extreme as those reported if the null hypothesis applies. So, a p-value of 0.05 means there is a one in twenty (1/20 = 0.05) chance of this happening. Sigmas effectively measure the same thing, but expressed as the number of standard deviations - a statistical measure of how spread out the distribution is - between the observed result and what the null hypothesis would predict.
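To make that concrete, here's a minimal Python sketch of what a p-value means, using a hypothetical coin-flipping experiment invented purely for illustration: it simulates the null hypothesis many times and counts how often pure chance produces a result at least as extreme as the one observed.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical experiment: 100 coin flips produce 62 heads.
# Null hypothesis: the coin is fair (probability of heads = 0.5).
n_flips, observed_heads = 100, 62

# Simulate a million runs of the experiment under the null hypothesis.
null_heads = rng.binomial(n_flips, 0.5, size=1_000_000)

# One-sided p-value: fraction of null runs at least as extreme as observed.
p_value = np.mean(null_heads >= observed_heads)
print(f"Estimated p-value: {p_value:.4f}")   # roughly 0.01

In other words, a fair coin would throw up a result this lopsided only around 1 per cent of the time - that's all a p-value is telling us.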

It might seem odd not to use the more straightforward p-value, but sigmas tend to be used because the equivalent p-values become vanishingly small at the kind of levels physicists look for. CERN, for example, actually works with p-values, but converts them to sigmas for easier communication. Here's a look at equivalent values:

Sigma    P-value      Cliff measure
2        0.05         Whiff
3        0.003        Evidence
4        0.0001       Annoying*
5        0.0000003    Discovery
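The conversion itself is just a matter of looking at the tail of the normal distribution. Here's a rough sketch using scipy - noting, as an aside, that conventions differ: particle physics typically quotes one-sided tail probabilities, and the rounded values in the table above mix the two conventions.

from scipy.stats import norm

for sigma in (2, 3, 4, 5):
    one_sided = norm.sf(sigma)       # probability in the upper tail
    two_sided = 2 * norm.sf(sigma)   # probability in both tails
    print(f"{sigma} sigma: one-sided p = {one_sided:.2e}, "
          f"two-sided p = {two_sided:.2e}")

# Going the other way: the sigma level for a given one-sided p-value.
print(norm.isf(0.0000003))           # about 5 sigma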

The 'Cliff measure' used above is a humorous interpretation of sigmas given by particle physicist Harry Cliff in his book Space Oddities. Arguably this is a more effective description of the value of the different levels than the way statistical significance is usually regarded in the social sciences. Choosing 2 sigma/a p-value of 0.05 as the threshold for statistical significance was an arbitrary choice, plucked out of the air by the statistician Ronald Fisher in 1925. However, it should be seen as nothing more than a note that something is worthy of proper investigation - Cliff's whiff - rather than an indicator that the outcome is accepted science.

Such has been the focus on getting a p-value below 0.05 that there has in the past been a significant amount of 'p-hacking' - manipulating the data or analysis until the result falls below the critical level. But Fisher certainly never intended this to be any sort of indicator of a real discovery. Remember, a p-value of 0.05 means that there is a 1 in 20 chance of getting results like these when the effect doesn't exist. That may be a better probability than Russian roulette (p-value equivalent around 0.17), but it's still hardly something you would want to risk your life on.
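That 1 in 20 figure is easy to demonstrate. In this illustrative sketch (pure simulation, no real data), twenty experiments are run where the effect genuinely doesn't exist - and on average, about one of them will still come out 'statistically significant' at p < 0.05.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

significant = 0
for _ in range(20):
    # Two groups drawn from the *same* distribution: no real effect.
    group_a = rng.normal(0, 1, size=50)
    group_b = rng.normal(0, 1, size=50)
    _, p = ttest_ind(group_a, group_b)
    if p < 0.05:
        significant += 1

print(f"'Significant' results out of 20 null experiments: {significant}")

Run enough studies - or slice one dataset enough ways - and 'significant' results appear out of nothing, which is exactly why p-hacking is so insidious.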

Why, then, is there such a disparity between the social sciences and physics? Because in the social sciences it is rarely practical to have enough experimental subjects or experimental runs to come close to a 5 sigma outcome. As a result, the social sciences can't hope for equivalent degrees of apparent certainty. However, there is a strong feeling that they could do better - perhaps aiming for 3 sigma before getting excited. And it means that the outcomes of social science studies should arguably always carry a health warning, and be reported with a clearer statement of the risk of misattributing an outcome to a particular cause.
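To get a feel for why, here's a back-of-envelope sketch - a deliberately simplified model in which the expected z-score of a study grows with the square root of the sample size, ignoring statistical power and measurement noise, both of which push the real numbers considerably higher.

def sample_size_needed(sigma_level, effect_size):
    """Rough sample size for which the expected z-score equals
    sigma_level, given z = effect_size * sqrt(n)."""
    return (sigma_level / effect_size) ** 2

effect = 0.2   # a 'small' effect, in standard-deviation units
for sigma in (2, 3, 5):
    n = sample_size_needed(sigma, effect)
    print(f"{sigma} sigma needs roughly n = {n:.0f}")

# Whatever the effect size, moving from 2 sigma to 5 sigma
# multiplies the required sample by (5/2)^2 = 6.25.

The quadratic scaling is the point: every extra sigma of certainty gets disproportionately more expensive in subjects, and human subjects are far harder to come by than proton collisions.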

One final consideration - even 5 sigma results can be wrong. Scientists can make a mistake with the maths. And there can be confounding factors too - a great example is the BICEP2 study, which looked for polarisation in the Cosmic Microwave Background radiation in the hope of finding direct evidence for the cosmological theory of inflation - evidence that as yet doesn't exist. BICEP2 reported a detection at the 5.9 sigma level. Except it turned out that the results were being distorted by cosmic dust - it was not a discovery after all. There is always the possibility that scientists have not allowed for a factor they were unaware of that has distorted the results - something that sadly tends to disappear from popular science and news reporting, where outcomes are often stated as if they were fact.

Probability and statistics can be hard to get our heads around - but when scientific results are reported, it is essential that this particular aspect is carefully explained up front. To have confidence in scientific results, we need to know what the limitations of a particular study are.


* Cliff's 'annoying' for 4 sigma is not saying it is useless, but rather that it's annoying to get so close to the 'gold standard' 5 sigma without quite making it.

Image from Unsplash by Naser Tamimi

See all of Brian's online articles or subscribe to a weekly digest for free here
