All too often in environmental monitoring, we see a change in the environment and have no idea what caused it. For example, fish downstream of an industrial development may be reaching sexual maturity earlier. Is this due to the influence of industry? Climate change? Eutrophication from farming upstream? Results are often frustratingly hard to interpret.

Last year I had the pleasure of representing Sustainability Resources on the Metal Mining Effluent Regulations 10-year Review. In this process, we discussed proposed changes to the Environmental Effects Monitoring (EEM) Program, which requires mines to sample fish upstream and downstream of their operations to see whether they are affecting the receiving environment. Mines that find an effect on downstream communities are triggered into more in-depth monitoring and studies. Predictably, we spent much of the process arguing over how to determine whether a mine has had an "effect." It turns out that answering this question is harder than one might think.

Most EEM studies involve sampling 20 fish upstream and 20 fish downstream and measuring a variety of endpoints, such as length, age, weight, liver weight and gonad weight (EEM studies also involve sampling benthic invertebrates and are paired with a variety of chemical and toxicological tests, but I won't go into detail here) (Dumaresq et al. 2002). At first glance, any "significant" difference between upstream and downstream measurements might seem to mean the mine is having an effect, but in fact any study with sufficient statistical power will find a statistically significant difference. Statistical significance is no guarantee that the difference is meaningful.
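To see why power matters, here's a quick simulation (a minimal sketch using a normal-approximation z-test; the fish weights, sample sizes, and 1% effect are illustrative assumptions, not EEM methodology). A 1% difference in mean weight is invisible with 20 fish per site, but becomes "statistically significant" once enough fish are sampled:

```python
import math
import random

def two_sample_p(a, b):
    """Two-sided two-sample z-test p-value (normal approximation;
    adequate for illustration, though a t-test is more exact at small n)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    # 2 * (1 - Phi(|z|)) simplifies to 1 - erf(|z| / sqrt(2))
    return 1 - math.erf(abs(z) / math.sqrt(2))

random.seed(1)
# True means differ by 1%: 100 g upstream vs 101 g downstream (made-up numbers).
p_small = two_sample_p([random.gauss(100, 15) for _ in range(20)],
                       [random.gauss(101, 15) for _ in range(20)])
p_big = two_sample_p([random.gauss(100, 15) for _ in range(100_000)],
                     [random.gauss(101, 15) for _ in range(100_000)])
# p_small is almost always well above 0.05; p_big is essentially zero.
```

The underlying difference is identical in both comparisons; only the sample size changed. With enough fish, even a biologically trivial difference clears the 0.05 bar.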

The problem here is that as scientists we are trained, perhaps poorly, in the use of statistics to support our decision making. We're taught in undergraduate classes that there are a variety of statistical tests one can use to help answer questions, and to rely on the all-important p-value to determine whether our results are meaningful. We are taught that a sufficiently low p-value, usually below 0.05, means that results as extreme as ours would be unlikely to have occurred by chance alone. Cue the scientist giving a cheer from his or her computer and making plans to publish.

And don't be fooled; it's next to impossible to publish without some statistically significant results to report. As a result, there's a strong incentive for scientists to manipulate data until it shows statistical significance. This practice, called p-hacking, can involve dropping variables, manipulating sample sizes, or even making up data. Surveys of published papers have found a suspiciously high number of experiments reporting p-values just under 0.05, suggesting that a great many studies have been massaged to land just below the magic number that has somehow been accepted as the arbiter of what's true.
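To make "manipulating sample sizes" concrete, here is a hypothetical simulation (stdlib Python; every number is an illustrative assumption) of one common form of it, often called "optional stopping": keep collecting data, re-test after every few new samples, and stop the moment p drops below 0.05. Even when the null hypothesis is true, this inflates the false-positive rate well above the nominal 5%:

```python
import math
import random

def one_sample_p(sample, mu0=0.0):
    """Two-sided one-sample z-test against mean mu0 (normal approximation)."""
    n = len(sample)
    m = sum(sample) / n
    v = sum((x - m) ** 2 for x in sample) / (n - 1)
    z = (m - mu0) / math.sqrt(v / n)
    return 1 - math.erf(abs(z) / math.sqrt(2))

def peeking_experiment(rng, start=10, stop=100, step=5):
    """Simulate optional stopping: the data are pure noise (the null is
    true), but we test repeatedly and stop as soon as p < 0.05."""
    sample = [rng.gauss(0, 1) for _ in range(start)]
    while len(sample) < stop:
        if one_sample_p(sample) < 0.05:
            return True            # declared "significant" and stopped
        sample.extend(rng.gauss(0, 1) for _ in range(step))
    return one_sample_p(sample) < 0.05

rng = random.Random(42)
trials = 2000
rate = sum(peeking_experiment(rng) for _ in range(trials)) / trials
# rate lands well above the nominal 0.05, despite there being no real effect
```

Each individual test is valid on its own; it is the repeated peeking and the stop-when-significant rule that quietly turn a 5% error rate into something much larger.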

How did this come to pass? It turns out that R.A. Fisher, the founder of modern statistical testing, never intended p-values to play such a large role in decision making; he saw them as one of many tools. Ideally, decisions are made on the basis of a number of experiments that together tell a story about the phenomenon being investigated, not on the basis of one experiment that supplies a significant p-value. The situation has become so dire that the American Statistical Association (ASA) has released a statement on the proper use of statistical tests, stating that "[r]esearchers should bring many contextual factors into play to derive scientific inferences… The widespread use of 'statistical significance' (generally interpreted as 'p < 0.05') as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process" (Wasserstein and Lazar 2016). In a recent perspective piece in Science, Steven N. Goodman offers a fantastic overview of the issue, the history of statistical inference testing, and suggestions for how the situation can be improved (Goodman 2016). In essence, decision making must include "prior evidence, understanding of mechanism, [and] experimental design and conduct."

This brings us back to environmental monitoring. When determining whether an industrial process affects the environment, how does one make a decision? In the EEM case, critics were concerned that a single statistically significant difference between upstream and downstream sites could force a facility into management action. That criticism says more about the dominance of the p-value in scientific reasoning than about how the EEM program actually works: a mine "fails," or is triggered into the next phase of monitoring, only if studies show statistically significant changes in the same endpoint, in the same direction, over two cycles of monitoring. In other words, there has to be an interpretable pattern of change to trigger the next phase of monitoring, not simply one failed study (Bosker et al. 2012). (As an aside: in the Pulp and Paper Environmental Effects Monitoring Program, facilities must also show that these differences are of sufficient magnitude. These thresholds, called critical effect sizes, ensure that a change in an endpoint is large enough to be biologically relevant. Critical effect sizes may be implemented in the metal mining program as well.)
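The trigger logic described above can be sketched as a simple decision rule. This is a hypothetical illustration only: the endpoint name, the data layout, and the 25% critical-effect-size threshold (which, as noted, currently applies in the pulp and paper program) are my assumptions, and the real program's criteria are more involved:

```python
def triggers_next_phase(cycle1, cycle2, critical_effect_size=0.25):
    """Each cycle maps endpoint -> (statistically_significant, fractional_change).
    Trigger only when the same endpoint shows a significant change in the
    same direction in both cycles, and the change exceeds the critical
    effect size in both. Hypothetical sketch, not the regulatory text."""
    for endpoint, (sig1, change1) in cycle1.items():
        sig2, change2 = cycle2.get(endpoint, (False, 0.0))
        same_direction = change1 * change2 > 0
        big_enough = (abs(change1) >= critical_effect_size
                      and abs(change2) >= critical_effect_size)
        if sig1 and sig2 and same_direction and big_enough:
            return True
    return False

# A ~30% drop in gonad weight in both cycles triggers the next phase;
# a significant change that reverses direction between cycles does not.
cycle1 = {"gonad_weight": (True, -0.30)}
cycle2 = {"gonad_weight": (True, -0.28)}
reversed_cycle2 = {"gonad_weight": (True, 0.28)}
```

The point of a rule like this is exactly the one made above: no single significant p-value triggers action; a repeated, same-direction, sufficiently large change does.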

As environmental monitoring becomes more common and gains a larger role in decision making in Alberta, it is important to remember that these studies exist to support management decisions, so it is imperative that we spend time considering how to interpret the evidence collected. It would be folly to design monitoring without considering how the evidence will be used, or to forget the pitfalls of statistical inference. In the EEM program, the emphasis needs to be placed back on finding interpretable patterns of change, followed by an investigation of cause, to trigger management action. This is difficult to balance with the imperative of environmental protection, which may suffer while investigations are ongoing. It means building a monitoring and management system that is adaptive, that supports (or at least communicates with) fundamental research, that refers to baseline data and projections from routine monitoring or the impact assessment process, and that is responsive to the concerns of stakeholders. Environmental monitoring supports decision making, so it's worth spending the time thinking about what decisions need to be made, and how.

Works Cited:
Dumaresq C, Hedley K and Michelutti R. 2002. Overview of the Metal Mining Environmental Effects Monitoring Program. Water Quality Research Journal of Canada 37(1): 213-218.
Bosker T, Barrett TJ and Munkittrick KR. 2012. Response to Huebert et al. (2011) “Canada’s Environmental Effects Monitoring Program: Areas for Improvement.” Integrated Environmental Assessment and Management. 8: 381-382.
Wasserstein RL and Lazar NA. 2016. The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician 70(2): 129-133.
Goodman SN. 2016. Aligning statistical and scientific reasoning. Science 352(6290): 1181.

Endeavour Scientific can help you access and understand the latest science. See the Body of Knowledge project.