Data dredging/p-hacking

8/23/2023

In the realm of scientific research, p-hacking is a well-known term, especially after a paper titled ‘p-Hacking and False Discovery in A/B Testing’ was published earlier this year to an overwhelming response on social media. p-Hacking (also known as data dredging) is when a data scientist analyzing a set of results, or following the progress of an A/B test, finds a pattern that could be reported as statistically significant even though, in reality, there is no real underlying effect. In other words, a conclusion is drawn from a set of results that have not been investigated thoroughly enough to treat that conclusion as fact. For example, flipping a coin 5 times and having it land on tails 3 out of those 5 times does not mean the chance of getting tails when flipping a coin is 60%. Results must be investigated more deeply, using a p-value to measure how likely such a result is to arise by chance.

To understand p-hacking, the first thing to know is that the term refers to a p-value. This is the probability of seeing results at least as extreme as the ones you observed, assuming the null hypothesis you set before the study is true. The p-value is compared to your chosen ‘Significance Level’ at the end of the study to see whether your results can be deemed statistically significant or not. Basically, if your eventual p-value is under your chosen significance level, you can reject your null hypothesis; if it is over, you are not able to reject it and must investigate deeper. You can choose any significance level you’d like, but it is important to note that most research papers use 0.05 as the highest p-value at which the null hypothesis is rejected. Any results over this number must be further investigated.

For example, let’s say you are looking to find out how many of your viewers share content they have watched on social media. Let’s say that of your 100,000 viewers this month, 5,000 shared a video on social media. Now if you wanted to look specifically at how many of these were men and how many were women, you’d need to take your p-value into consideration before drawing conclusions from the results.

To find the p-value of your study, you will first need to set a hypothesis containing the results you are expecting. We can then use this to calculate your study’s p-value.

Following the example from earlier, let’s say that you’re expecting your results to be exactly 50/50, so 2,500 of the social media shares would be from men and 2,500 from women. Upon investigation of your data using a data analytics platform like YOUBORA Analytics, you find out that 2,400 of the social media shares are from men and 2,600 are from women. It is tempting to conclude that women are more likely to share video content on social media than men (2,600 shares against 2,400, roughly 8% more). But before you publish these findings, let’s find your p-value.

Now that we have your expected results and your observed results, the last number you’ll need is your ‘Degrees of Freedom’. This is easily calculated with a simple equation: the number of categories minus one. In this example, your degrees of freedom would be 2-1=1 because we had two categories, men and women.

Compare your expected results to your observed results using the chi square. Known as χ² (chi-squared), the chi square measures the difference between your expected and observed values. The equation for chi square is χ² = Σ((o − e)²/e), where “o” is the observed value and “e” is the expected value. Don’t forget that this equation includes a Σ (sigma), so the term is calculated for each category and the results are added together.
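As a rough check of the worked example in this post (2,500 shares expected from each group, 2,400 and 2,600 observed), the chi square and its p-value can be sketched in a few lines of Python. The function names here are illustrative, not from any library, and the shortcut through `math.erfc` is an assumption that only holds for 1 degree of freedom, which is the case in this example.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum (o - e)^2 / e over every category."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_one_degree_of_freedom(chi_square):
    """Upper-tail probability of a chi-square distribution with 1 degree
    of freedom. This identity is only valid for 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


# Figures from the example: shares from men, shares from women.
observed = [2400, 2600]
expected = [2500, 2500]

chi_square = chi_square_statistic(observed, expected)
p_value = p_value_one_degree_of_freedom(chi_square)

print(f"chi-square = {chi_square:.1f}")
print(f"p-value = {p_value:.4f}")
```

With these figures the chi square comes to 8 and the p-value to roughly 0.005, which is under the usual 0.05 significance level. For more than two categories, a statistics library would be a safer choice than this one-line shortcut.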

As much as we want to convince ourselves that numbers on a page are black and white, in reality there is always context: what we are trying to find out, what we’re hoping to discover, and what method we used to get the data. Data does not speak for itself; it must be interpreted.
