These are statistical measures that indicate how likely it would be to obtain results at least as striking as those reported if the 'null hypothesis' is true - which is to say, if the effect being reported doesn't exist. The social sciences, notably psychology, usually consider the marker for statistical significance to be a p-value of less than 0.05, while in physics the aim is often to have a 5 sigma result.
Both these measures depend on creating a probability distribution, showing the likelihood of different values occurring. The p-value is a direct measure of the probability of getting the reported results if the null hypothesis applies. So, a p-value of 0.05 means there is a one in twenty (1/20 = 0.05) chance of this happening. Sigmas effectively measure the same thing, but expressed as the number of standard deviations - a statistical measure of how spread out the distribution is - by which the result differs from what the null hypothesis would lead us to expect.
It might seem odd not to use the more straightforward p-value, but the reason that sigmas tend to be used is that the p-value equivalent becomes very small at the kind of levels physicists look for. CERN, for example, actually works with p-values, but converts them to sigmas for easier communication. Here's a look at equivalent values:
Sigma    P-value        Cliff measure
2        0.05           Whiff
3        0.003          Evidence
4        0.0001         Annoying*
5        0.0000003      Discovery
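To make that conversion concrete, here's a minimal sketch in Python (using the scipy library, and purely as an illustration rather than CERN's actual procedure). One wrinkle worth flagging: conventions differ over whether the p-value counts chance fluctuations in one direction only or in both, which is why published tables of equivalents can differ slightly in the final digit.

```python
# A minimal sketch: converting between sigmas and p-values via the
# standard normal distribution (illustrative only - not CERN's actual code).
from scipy.stats import norm

for sigma in [2, 3, 4, 5]:
    p_one_tailed = norm.sf(sigma)       # chance of a fluctuation this far in one direction
    p_two_tailed = 2 * norm.sf(sigma)   # chance of a fluctuation this far in either direction
    print(f"{sigma} sigma: one-tailed p = {p_one_tailed:.2g}, two-tailed p = {p_two_tailed:.2g}")

# And the reverse: how many sigmas does a p-value of 0.05 correspond to?
print(f"p = 0.05 is about {norm.isf(0.05):.2f} sigma (one-tailed)")
```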
The 'Cliff measure' used above is a humorous interpretation of sigmas given by particle physicist Harry Cliff in his book Space Oddities. Arguably this is a more effective description of what the different levels are worth than the way statistical significance is usually regarded in the social sciences. Choosing 2 sigma/p-value of 0.05 as being statistically significant was an arbitrary choice, plucked out of the air by statistician Ronald Fisher in 1925. However, it should be seen as nothing more than a note that something is worthy of proper investigation - Cliff's whiff - rather than an indicator that the outcome is accepted science.
Such has been the focus on getting a p-value below 0.05 that there has in the past been a significant amount of 'p-hacking' - manipulating the data with the intention of getting the result below the critical level. But Fisher certainly never intended this to be any sort of indicator of a real discovery. Remember, a p-value of 0.05 means that there is a 1 in 20 chance of getting these results when the effect doesn't exist. It may be a better probability than Russian roulette (p-value equivalent 0.17), but it's still hardly something you would want to risk your life on.
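To see that one-in-twenty risk in action, here's a minimal sketch (assuming a simple two-group comparison on pure noise, with hypothetical group sizes) that runs thousands of 'experiments' in which no effect exists and counts how often a standard test still comes out 'significant' at the 0.05 level.

```python
# A minimal sketch: how often does a non-existent effect look 'significant'?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups are drawn from the same distribution: the null hypothesis is true.
    group_a = rng.normal(loc=0.0, scale=1.0, size=30)
    group_b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1

# Expect a rate of roughly 0.05 - about 1 in 20 results cross the line purely by chance.
print(f"False positive rate: {false_positives / n_experiments:.3f}")
```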
Why, then, is there such a disparity between the social sciences and physics? Because it is very rarely practical to have enough experimental subjects or experimental runs to come close to a 5 sigma outcome. As a result, the social sciences can't hope for equivalent degrees of apparent certainty. However, there is a strong feeling that the social sciences could do better - perhaps aiming for 3 sigma before they get excited. And it does mean that the outcomes of social science studies should arguably always carry a health warning and be reported with more emphasis on the risk of misattributing an outcome to a particular cause.
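As a rough back-of-envelope sketch of why that's impractical - assuming a simple two-group comparison and a purely hypothetical, modest effect size - the amount of data needed grows with the square of the target sigma, so pushing from 2 sigma to 5 sigma multiplies the requirement by roughly six. The exact numbers below depend entirely on the assumed effect size; the scaling is the point.

```python
# A rough back-of-envelope sketch: for a fixed (hypothetical) effect size,
# the expected z-score of a two-group comparison grows with the square root
# of the sample size, so the sample needed grows with the square of the
# target sigma.
import math

effect_size = 0.2   # hypothetical standardised effect (difference / standard deviation)

for target_sigma in [2, 3, 5]:
    # Expected z is roughly effect_size * sqrt(n / 2), where n is the per-group size.
    n_per_group = math.ceil(2 * (target_sigma / effect_size) ** 2)
    print(f"{target_sigma} sigma needs roughly {n_per_group} subjects per group")
```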
One final consideration - even 5 sigma results can be wrong. Scientists can make a mistake with the maths. And there can be confounding factors too - a great example is the BICEP2 study, which aimed to study polarisation in the Cosmic Microwave Background radiation in the hope of finding direct evidence for the cosmological theory of inflation - evidence that as yet doesn't exist. BICEP2 reported just such a result at a 5.9 sigma level. Except it turned out that the results were being distorted by cosmic dust - it was not a discovery after all. There is always the possibility that scientists have not allowed for a factor they were not aware of that has distorted the results - something that sadly tends to disappear from popular science/news reporting, where outcomes are often stated as if they were fact.
Probability and statistics can be hard to get our heads around - but when scientific results are reported, it is essential that this particular aspect is carefully explained up front. To have confidence in scientific results, we need to know what the limitations of a particular study are.
* Cliff's 'annoying' for 4 sigma is not saying it is useless, but rather that it's annoyingly close to the 'gold standard' 5 sigma without quite making it.
Image from Unsplash by Naser Tamimi
See all of Brian's online articles or subscribe to a weekly digest for free here