Life as a grad stu­dent, as many of you already know, is a con­stant bat­tle with com­plet­ing assign­ments on time, keep­ing up the grades for future fund­ing oppor­tu­ni­ties and get­ting enough sleep. My first semes­ter in the Mas­ter of Spa­tial Analy­sis pro­gram is near­ing its end in what seems like the short­est semes­ter of school I’ve ever taken. Dur­ing this time I have had the plea­sure to strengthen some of my skills in the R lan­guage for sta­tis­ti­cal com­put­ing for the pur­pose of water qual­ity trend analy­sis (SA8904GIS Project Man­age­ment). This post will show how I have used R for water qual­ity trend analy­sis, using pub­li­caly avail­able data from the USGS (due to a NDA that pro­hibits me to share my school work).

Download

I retrieved my data from the USGS open data por­tal for water qual­ity analy­sis. You can find your own, but I was specif­i­cally look­ing for a site that would become a good exam­ple for this blog. For the fol­low­ing exam­ples, a “good” site would be one that reported often, reg­u­larly, and for a rea­son­able length of time. This would allow the non-parametric tests to per­form as expected. Sites exhibit­ing large tem­po­ral gaps and anom­alies (out­liers) were avoided. There are many other cri­te­ria to con­sider when per­form­ing water qual­ity trend analy­sis in a research set­ting and this post is not aimed at pro­vid­ing that level of detail or guid­ance (there is plenty of lit­er­a­ture on the subject).

If you want to fol­low along with me, you can prac­tice nav­i­gat­ing the USGS por­tal and locate sta­tion: USGS 07227500 Cana­dian Rv nr Amar­illo, TX. Alter­na­tively, you can down­load the raw data as I retrieved it on 2013-11-22 here: gist.github.com/MichaelMarkieta/7610033/raw/07227500.txt (right-click > save as…) and open in your favourite spread­sheet soft­ware (note: TAB-delimited). As you can see, this water qual­ity sta­tion in Texas has been report­ing from the late 1940s and I believe con­tin­ues to report to this day. I devised the fol­low­ing exam­ples by look­ing at one water qual­ity para­me­ter:
 
”# 70303 — Dis­solved solids, water, fil­tered, tons per acre-foot”

Filter

Using a pivot table in Excel, I extracted the sam­ple dates (sample_dt) and test result val­ues (result_va) for the dis­solved solids (70303) at this water qual­ity report­ing sta­tion. The exact def­i­n­i­tion of dis­solved solids varies but Health Canada pro­vides a thor­ough descrip­tion. While the pres­ence, increase, or vari­ance in dis­solved solids in our drink­ing water is not par­tic­u­larly of con­cern, it does have a direct influ­ence on water palata­bil­ity (taste) and pipe scal­ing. The result of the pivot table has 823 records of dis­solved solids mea­sure­ments (gist.github.com/MichaelMarkieta/7610033/raw/07227500_70303.txt).

Remove Outliers

If you were to sort the 823 records by the result value, you would imme­di­ately see that there are 2 records that are extreme out­liers. Using R we can con­firm this anamolie in our data by test­ing for skew­ness using the moments pack­age and box­plot func­tion like so:

> library(moments)
> data <- read.csv("07227500_70303.csv")
> skewness(data[,2])
[1] 18.65909

> boxplot(data[,2])

Those 2 pesky out­liers can be safely assumed as anom­alies (con­t­a­m­i­nated sam­ples, errors pro­duced by the equip­ment, etc.) and can be removed from the data set.

Date,Result
3/9/1958,89
6/1/1961,67

The cleaned ver­sion can be retrieved here: gist.github.com/MichaelMarkieta/7610033/raw/07227500_70303_clean.txt. For good mea­sure, check the skew­ness and box­plot in R after remov­ing those two val­ues. The remain­ing out­liers rep­re­sent the nature of the data and are within a rea­son­able dis­tance from the mean result values.

> data <- read.csv("07227500_70303_clean.csv")
> skewness(data[,2])
[1] 1.319998

> boxplot(data[,2])

Analysis

Now we can use the Mann-Kendall trend tests on our water qual­ity data to check if there is a monot­o­nic trend (a grad­ual and steady change over time). Here we use the Kendall pack­age and the MannK­endall func­tion to deter­mine if a trend exists on a vec­tor of tem­po­ral data. Presently, our data is just a sim­ple pointer to the csv on our hard drive. We must con­vert the csv table into a tem­po­ral vec­tor of data using the XTS pack­age before run­ning our MannK­endall func­tion. All together:

> library(xts)
> library(Kendall)
> data <- read.csv("07227500_70303_clean.csv")
> data_xts <- xts(data[,2],as.Date(data[,1],"%m/%d/%Y"))
> MannKendall(data_xts)
tau = 0.227, 2-sided pvalue =< 2.22e-16

With this bit of code we’ve deter­mined that there is a sig­nif­i­cant (p-value=0.000) monot­o­nic trend of increas­ing (tau=0.277) dis­solved solids over time at the “USGS 07227500 Cana­dian Rv nr Amar­illo, TX” water qual­ity report­ing sta­tion. We can visu­al­ize this increas­ing trend using a LOWESS line fit­ted to the data on a plot in R. As we can see from the plot, the dis­solved solids result value is not only increas­ing over time but also fanning-out. Of course, this method could be extended for any water qual­ity para­me­ter; detect­ing sea­sonal vari­a­tion; iden­ti­fy­ing shifts in trends (pos­i­tive to neg­a­tive); and so on.

> x <- time(data_xts)
> y <- coredata(data_xts)[,1]
> plot(x,y)
> lines(lowess(x,y, f=0.5),lwd=2, col=2)

Resources

Here are a list of R pack­ages that are use­ful to have for water qual­ity test­ing, as well as other sta­tis­ti­cal com­pu­ta­tion and visualization:

 
 
If this post helped you and you enjoy my site I would hap­pily accept Lite­coin dona­tions:
 
LKPfT772e9HxvXYcA8LVDctTmENoqQxQF3
 

Thank you!

 Leave a Reply

(required)

(required)

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>