Tuesday, November 25, 2014

Module 13 - Special Topics in GIS - Effects of Scale

We began the last three-week topic in Special Topics for this semester, on issues of scale and resolution in GIS.  As an introduction, we examined how the resolution of data decreases as the scale at which it was collected decreases: in other words, 1:100,000-scale data will have much less detail than the same features recorded at 1:1,200 scale.

We examined two different types of DEMs in this exercise: one obtained at 1-m resolution by LIDAR and the other obtained at about 82-m resolution by SRTM, the space-shuttle-borne radar (microwave) mission.  I resampled the 1-meter LIDAR data to pixel sizes of 2, 5, 10, 30, and 90 meters and compared their slopes.  The highest-resolution data (1 m) has the highest slope values, and slope decreases slightly with decreasing resolution and increasing pixel size.  This is because, with more and smaller pixels, the slope computed across each one over a given horizontal distance is steeper, so the average of those slopes is higher than the single slope computed across one larger pixel.
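To make the resolution-versus-slope effect concrete, here is a minimal NumPy sketch of the idea (the lab itself used ArcMap's resampling and Slope tools). The file name and the simple block-averaging resampler are assumptions for illustration only.

```python
import numpy as np

def mean_slope_degrees(dem, cell_size):
    """Average slope (degrees) from a DEM array using simple finite differences."""
    dz_dy, dz_dx = np.gradient(dem, cell_size)            # elevation change per meter along rows/columns
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    return slope.mean()

def block_resample(dem, factor):
    """Coarsen a DEM by averaging non-overlapping factor x factor blocks."""
    rows = (dem.shape[0] // factor) * factor
    cols = (dem.shape[1] // factor) * factor
    trimmed = dem[:rows, :cols]
    return trimmed.reshape(rows // factor, factor, cols // factor, factor).mean(axis=(1, 3))

# Hypothetical 1-m LIDAR DEM exported as a 2-D array of elevations in meters
dem_1m = np.load("lidar_dem_1m.npy")
for factor in (1, 2, 5, 10, 30, 90):
    coarse = block_resample(dem_1m, factor) if factor > 1 else dem_1m
    print(factor, "m pixels -> mean slope:", round(mean_slope_degrees(coarse, factor), 2), "degrees")
```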
The LIDAR 90-meter DEM was then compared to the SRTM DEM, which was reprojected to the same resolution (90 m) and coordinate system. The elevation and its derivatives (1st-order: slope and aspect; 2nd-order: curvature) for both 90-m DEMs were compared to those values for the original 1-meter LIDAR DEM, with the assumption that the higher-resolution original data is more accurate.

The relative errors for all values (elevation, slope, aspect and curvature) between the 90-m LIDAR and the 1-m LIDAR are smaller than those between the SRTM 90-m DEM and the 1-m LIDAR.
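A sketch of how such a comparison could be scripted outside ArcMap, assuming the slope grids have already been aggregated and snapped to a common 90-m grid and exported as arrays (the file names are hypothetical):

```python
import numpy as np

def mean_relative_error(test, reference):
    """Mean absolute difference between two grids, expressed relative to the reference values."""
    mask = np.abs(reference) > 1e-6                        # avoid dividing by near-zero cells
    return np.mean(np.abs(test[mask] - reference[mask]) / np.abs(reference[mask]))

# Slope grids aligned cell-for-cell on the 90-m grid (hypothetical file names)
slope_1m_agg  = np.load("slope_lidar_1m_aggregated_to_90m.npy")
slope_lidar90 = np.load("slope_lidar_90m.npy")
slope_srtm90  = np.load("slope_srtm_90m.npy")

print("LIDAR 90 m vs. 1 m:", mean_relative_error(slope_lidar90, slope_1m_agg))
print("SRTM 90 m vs. 1 m: ", mean_relative_error(slope_srtm90, slope_1m_agg))
```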

Selected results from the comparison of the LIDAR 90-m DEM and the SRTM 90-m DEM to the LIDAR 1-m DEM

It can be noted from the values in the above table that, while the average differences in elevation between the LIDAR 90-meter and SRTM 90-meter DEMs are quite small, the differences are compounded in slope, the 1st derivative of elevation.  The differences are even greater for curvature, the 2nd derivative.

These results make sense: LIDAR collects many more elevation points because of its much more rapid pulse rate, and it operates at a much shorter wavelength than SRTM's microwave radar.

Wednesday, November 19, 2014

Module 12 - Special Topics in GIS - Geographically Weighted Regression

Last week in Special Topics, we learned how to run a linear regression in ArcMap using the OLS (Ordinary Least Squares) tool.  This method estimates a single, global relationship from the entire study area and thus does not take into account spatial variation in the relationships between variables.  In other words, it assumes that the relationship between, for example, median income and the rate of auto theft is the same over the whole study area.  This is probably not the case.

The GWR (Geographically Weighted Regression) tool re-runs the regression analysis repeatedly over small, local areas within the general study area, then produces a local regression coefficient relating each explanatory variable to the dependent variable at every data location in the input.  Thus, we can see if and how relationships between variables change spatially.

In this lab, I examined the rate of auto theft as a function of four explanatory variables: percent Black population, percent Hispanic population, percent renter-occupied housing units, and median income.  By running the Moran's I tool for spatial autocorrelation, I could see that similar values cluster across the study area.

I then ran the same data through the GWR analysis, with the default bandwidth method of AIC.   There was no significant improvement in model performance over the OLS method, based on the AIC score and Adjusted R-squared values (AIC is a relative measure, comparing the distances between various models and an unknown "truth"; Adjusted R-squared is the percentage of the variation of the dependent variable that is explained by the explanatory variables).

The fact that the GWR and OLS tests produced nearly identical results suggested that the GWR analysis was drawing data points from too large an area around each location for each small local regression.  (In the extreme, a GWR analysis that accepts ALL of the points in the study area as neighbors is just the same as the global OLS analysis of the whole area.)  A possible solution is to assign the GWR a specific number of neighbors to use, rather than letting the AIC calculation figure it out.

I tried a few versions of both Fixed and Adaptive GWR and found that Adaptive GWR with 15 neighbors produced the best-performing model to explain the correlation between auto theft and the four explanatory variables, based on AIC and Adjusted R-squared.
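The ArcMap GWR tool handles all of this internally, but as a rough illustration of what one adaptive-bandwidth local regression does under the hood, here is a minimal NumPy sketch.  The data are randomly generated stand-ins for the tract locations and attributes, and the bisquare kernel is just one common choice of weighting function.

```python
import numpy as np

def local_gwr_coefficients(xy, X, y, target_idx, n_neighbors=15):
    """Fit one local weighted regression (adaptive bisquare kernel) at a single location."""
    d = np.linalg.norm(xy - xy[target_idx], axis=1)                  # distances to all observations
    bandwidth = np.sort(d)[n_neighbors]                              # distance to the kth nearest neighbor
    w = np.where(d < bandwidth, (1 - (d / bandwidth) ** 2) ** 2, 0.0)  # bisquare weights, 0 beyond bandwidth
    Xd = np.column_stack([np.ones(len(X)), X])                       # add an intercept column
    W = np.diag(w)
    # Weighted least squares: beta = (X'WX)^-1 X'Wy
    beta = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ y)
    return beta                                                      # [intercept, coef_1, coef_2, ...]

# Hypothetical stand-ins: tract centroids, four explanatory variables, and auto-theft rates
rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(200, 2))
X = rng.normal(size=(200, 4))      # e.g., % Black, % Hispanic, % renter-occupied, median income
y = X @ np.array([0.5, 0.3, 0.8, -0.4]) + rng.normal(scale=0.5, size=200)
print(local_gwr_coefficients(xy, X, y, target_idx=0, n_neighbors=15))
```

Repeating this at every location is what produces the map of spatially varying coefficients; shrinking the number of neighbors makes each regression more local, while letting it grow toward all points collapses the result back to the global OLS fit.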

Wednesday, November 12, 2014

Module 11 - Special Topics in GIS - Multivariate Regression, Diagnostics and Regression in ArcGIS

This week's lab was very interesting and very challenging, as we learned more about statistical analysis of data, this time in ArcGIS, and about multivariate analysis.

We can use several tools in the Spatial Statistics toolbox in ArcMap to analyze our spatial data and determine what sorts of relationships exist among variables, and whether those relationships are statistically significant.  One of the tools is called Ordinary Least Squares, or OLS.  One dependent variable is loaded into the tool, along with one or more explanatory (independent) variables.  The resulting report shows many different statistics about the strength and type of effect the independent variables have on the dependent variable, and how likely those statistics are to be significant.  We can also examine the distribution of residuals on a map; residuals are the degree to which the explanatory variables fail to explain the dependent variable.  We can also check for spatial autocorrelation, which is the degree to which data points that are close to each other on a map have similar values.

There are six criteria for checking the results from the ArcGIS OLS tool, by examination of the various statistics.  They include how well the variables help the model, whether any of the variables are redundant, whether the model is biased (either spatially across the study area or by magnitude of values), whether the relationships between the explanatory and dependent variables are as we expected, and whether we have all of the necessary explanatory variables.

We can also run an additional analysis called Exploratory Regression, which quickly tests all possible combinations of as many of our explanatory variables as we want and gives us the statistics we need to decide which explanatory variables create the best model.  The key statistics are Adjusted R-squared, which tells us what percentage of the variation in the dependent variable is explained by the independent variables, and AIC (Akaike's Information Criterion), a relative measure we use to compare several models and find the best-fitting one.
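Outside of ArcMap, the same key statistics can be reproduced with a standard statistics library.  Here is a minimal sketch using statsmodels with randomly generated stand-in data; the variable names are illustrative, not the lab's actual fields.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in data: the real lab used census-tract attributes loaded in ArcMap
rng = np.random.default_rng(1)
median_income = rng.normal(50_000, 12_000, size=300)
pct_renter = rng.uniform(0, 100, size=300)
theft_rate = 20 - 0.0002 * median_income + 0.05 * pct_renter + rng.normal(scale=2, size=300)

X = sm.add_constant(np.column_stack([median_income, pct_renter]))   # intercept + explanatory variables
model = sm.OLS(theft_rate, X).fit()

print("Adjusted R-squared:", round(model.rsquared_adj, 3))   # share of variation explained
print("AIC:", round(model.aic, 1))                           # lower is better when comparing models
print(model.params)                                          # intercept and coefficients
```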

Tuesday, November 11, 2014

Module 10 - Remote Sensing - Supervised Classification

Supervised Land Cover Classification and Distance File Map
We worked with unsupervised classification last week in Remote Sensing, in which the software determines classes of pixels that share similar spectral signatures, without prior input from the human analyst; any adjustments are made afterwards.  This week we worked with supervised classification.  With this process, we "train" the software by identifying samples of particular classes of land cover, then let the program assign the rest of the pixels in the image based on their similarity to the signatures of the samples.  The map at right was classified from an image using this type of process, after setting up 16 samples representing 8 unique land cover classes (as shown on the map legend at left).

The map was created using a band combination of Red = Band 4, Green = Band 5 and Blue = Band 6.  These are all bands in the Near- to Mid-Infrared part of the spectrum.  They were chosen because they display the classes most distinctly and there is little overlap between the signatures.

The small grey-scale map at the bottom is called a Distance File map.  It shows how well our classification signatures match the signatures of the pixels from the original image.  The dark areas are features whose signatures match one of the class spectral signatures well, while the white areas are those that do not fall neatly into any of the classes created from the samples.  It is called a Distance File because it refers not to spatial distance on the ground, but rather to distance in spectral space between the defined class signatures and the image pixels.
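ERDAS Imagine supports several decision rules for this step; as a rough sketch of where a distance file comes from, here is the simplest one, minimum distance to class means, written in NumPy with made-up data (the real lab used the training signatures collected in Imagine):

```python
import numpy as np

def classify_min_distance(image, class_means):
    """Assign each pixel to the nearest class signature and return the distance image."""
    rows, cols, bands = image.shape
    pixels = image.reshape(-1, bands).astype(float)
    # Euclidean distance from every pixel to every class mean in spectral space
    dists = np.linalg.norm(pixels[:, None, :] - class_means[None, :, :], axis=2)
    classes = dists.argmin(axis=1).reshape(rows, cols)        # classified map
    distance_file = dists.min(axis=1).reshape(rows, cols)     # bright cells = poor match to any class
    return classes, distance_file

# Hypothetical inputs: a 3-band image chip and mean signatures for 8 training classes
image = np.random.randint(0, 255, size=(100, 100, 3))
class_means = np.random.randint(0, 255, size=(8, 3))
classes, distance_file = classify_min_distance(image, class_means)
```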

Wednesday, November 5, 2014

Module 10 - Special Topics in GIS - Introductory Statistics, Correlation and Bivariate Regression

This week we had a crash course in statistics, as it might apply to GIS and other areas.  We took advantage of the considerable capabilities of Excel for our exercises, which covered basic statistics such as mean, median, and standard deviation.  We also examined correlation: how two or more variables might vary together, and how much the variation in one may influence the other.  The task we worked on covered linear regression.  This is a predictive tool in which you plot known values of two variables on the X and Y axes, then determine a best-fit line through the points.  The slope and y-intercept of the line help us predict unknown values beyond the sample.
In this problem, we had yearly rainfall data for two weather stations, A and B.  For Station B, the data ran from 1931 to 2004; for Station A, the data covered only 1950 - 2004.  We plotted the Station B and A values against each other (B as X, the independent variable, and A as Y, the dependent variable we hope to make predictions about).  The slope and y-intercept were then calculated using functions in Excel.  Y' (the Station A rainfall value) can then be calculated from the known X (Station B) value for the years in which rainfall data for Station A is missing.

The equation is very simple, just the usual equation for a line that we learned in Algebra:

Y' = bX + a

where b = slope, a = y-intercept of the best fit (regression) line, Y' is the predicted rainfall for a particular year for Station A, and X = known rainfall value for Station B for the same year.

Y' for each year gives us an estimated, hypothetical rainfall for that station, based on its relationship to the other station.
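The lab did this with Excel's slope and intercept functions, but the same computation looks like this in Python; the rainfall numbers below are made up to show the mechanics, not the actual station records:

```python
import numpy as np
from scipy import stats

# Hypothetical overlap-period data (years when both stations recorded rainfall), in inches
station_b = np.array([45.2, 52.1, 38.7, 60.3, 48.9, 55.4])   # X, the independent variable
station_a = np.array([40.1, 47.8, 35.2, 54.6, 44.0, 50.9])   # Y, the dependent variable

result = stats.linregress(station_b, station_a)
b, a = result.slope, result.intercept                         # same b and a as in Y' = bX + a

# Predict Station A rainfall for an earlier year where only Station B has a record
station_b_1940 = 49.5
predicted_a_1940 = b * station_b_1940 + a
print("Predicted Station A rainfall:", round(predicted_a_1940, 1),
      " r-squared:", round(result.rvalue ** 2, 3))
```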

This method does have limitations.  If the observed values of X and Y scatter too widely around the line of best fit, that error will decrease the reliability of the predicted values.

Tuesday, November 4, 2014

Module 9 - Remote Sensing - Unsupervised Classification

Result of Unsupervised Classification
of an Air Photo of the UWF Campus
If we want to convert raw pixel data (e.g., reflectance values) into thematic material, we need to decide what spectral characteristics equate to what sorts of features.  Both ArcMap and ERDAS Imagine have automated tools for this conversion, and there are two ways of having the software do it.  In unsupervised classification, the software clusters groups of pixels with similar spectral signatures and assigns them to classes, without regard to the types of features the classes represent.  The user specifies how many spectral classes the software should generate, then refines and corrects the final image by assigning those classes to meaningful thematic categories.   We did this type of exercise this week in lab.  We started with a color air photo of the UWF campus with only the visible red, green, and blue bands.  Then, we instructed ERDAS Imagine to classify all the pixels in the image, based on color, into 50 classes.  We then examined the classes and assigned each to one of five thematic classes:  buildings or streets, trees, grass, shadow, and mixed.  The software gave us a considerable head start in this task by grouping the pixels (and features) with similar colors.
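ERDAS Imagine does this clustering with its own algorithm (typically ISODATA), but a rough feel for the workflow can be sketched with k-means in Python; the image array and the 50-to-5 recoding table below are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the UWF air photo: rows x cols x 3 (red, green, blue)
image = np.random.randint(0, 255, size=(200, 200, 3))
pixels = image.reshape(-1, 3).astype(float)

# Cluster all pixels into 50 spectral classes based only on their color values
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(pixels)
spectral_class = kmeans.labels_.reshape(image.shape[:2])

# The analyst's recode step: map each of the 50 spectral classes to one of 5 thematic classes
# (buildings/streets, trees, grass, shadow, mixed) -- shown here with an arbitrary lookup table
recode = np.arange(50) % 5
thematic = recode[spectral_class]
```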

Next week, we'll do supervised classification, in which we'll "train" the software with some samples that we've identified ahead of time.