Wednesday, November 5, 2014

Module 10 - Special Topics in GIS - Introductory Statistics, Correlation and Bivariate Regression

This week we had a crash course in statistics as it applies to GIS and other fields.  We took advantage of Excel's considerable capabilities for our exercises, which covered basic statistics such as mean, median, and standard deviation.  We also examined correlation: how two or more variables vary together, and how strongly a change in one is associated with a change in the other.  The main task we worked on was a linear regression problem.  Linear regression is a predictive tool: you plot known values of two variables on the X and Y axes, then determine a best-fit line through the points.  The slope and y-intercept of that line help us predict unknown values beyond the sample.
In this problem, we had yearly rainfall data for two weather stations, A and B.  For Station B, the data ran from 1931 to 2004.  For Station A, the data covered only 1950 to 2004.  We plotted the Station B and Station A values against each other for the years both stations have data, with B as X (the independent variable) and A as Y (the dependent variable, which we hope to make predictions on).  The slope and y-intercept are then calculated using functions in Excel.  Y' (the Station A rainfall value) can then be calculated from the known X (Station B) value for each year in which Station A's rainfall data is missing (1931 to 1949).

The equation is very simple, just the usual equation for a line that we learned in algebra:

Y' = bX + a

where b = slope, a = y-intercept of the best fit (regression) line, Y' is the predicted rainfall for a particular year for Station A, and X = known rainfall value for Station B for the same year.

Y' for each year gives us an estimated, hypothetical rainfall for Station A, based on its relationship to Station B.
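
For anyone who wants to see the arithmetic outside of Excel, here is a minimal Python sketch of the same idea.  It assumes the ordinary least-squares formulas for slope and intercept, which is what Excel's SLOPE and INTERCEPT functions compute (I am not claiming those are the exact functions used in the exercise).  The function names and the rainfall numbers are made up for illustration; they are not the class data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (b, a): slope and y-intercept of the least-squares line Y' = bX + a."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products of deviations, and sum of squared X deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept
    return b, a


def predict(b, a, x):
    """Predicted Y' for a known X, using Y' = bX + a."""
    return b * x + a


# Hypothetical rainfall totals for years in which BOTH stations have data.
station_b = [40.0, 50.0, 60.0, 70.0]   # X, independent variable
station_a = [35.0, 45.0, 50.0, 62.0]   # Y, dependent variable

b, a = fit_line(station_b, station_a)

# Hypothetical Station B value for a year in which Station A has no record.
estimated_a = predict(b, a, 55.0)
print(f"slope={b:.2f}, intercept={a:.2f}, estimated Station A rainfall={estimated_a:.2f}")
```

The line is fitted only on the overlapping years, then applied to Station B values from the years where Station A has no record.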

This method does have limitations.  If the known X and Y values scatter too far from the line of best fit, the relationship between the two stations is weak, and the predicted values become less reliable.
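
One common way to put a number on that scatter is the correlation coefficient r and its square, r².  The sketch below is only an illustration under that assumption: the function name and data are hypothetical, and it is not necessarily the measure used in the exercise.  An r² close to 1 means the points sit close to the line; a value close to 0 means the line explains little of the variation in Y.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical rainfall totals for the overlapping years (not the class data).
station_b = [40.0, 50.0, 60.0, 70.0]
station_a = [35.0, 45.0, 50.0, 62.0]

r = pearson_r(station_b, station_a)
r_squared = r ** 2   # share of the variation in Y accounted for by the line
print(f"r={r:.3f}, r^2={r_squared:.3f}")
```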
