Saturday, March 28, 2020

Data Classification

The module this week had us exploring different classification methods for data.  The ones we looked at most closely and used for this lab were equal interval, quantile, natural breaks, and standard deviation.  These classification systems determine which groups the data are divided into and thus what color each value corresponds to on a choropleth map.  Therefore, the classification method chosen for a map can greatly affect the message that it conveys to the audience.  It is important to know what each classification system does and when best to apply it so as not to create misleading maps.

The equal interval system divides the data into classes whose ranges are of equal size, based on the number of classes desired.  Each range receives however much data happens to fall within it, whether that is most of the data or none at all.  Because the ranges are equally spaced, they are almost always unequally filled.
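
To make this concrete, here is a minimal Python sketch of equal interval classification. It is not how ArcGIS Pro implements it internally; the function names and the sample percentages are made up for illustration.

```python
def equal_interval_breaks(values, n_classes):
    """Return the upper bound of each class when the full range is cut into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    breaks = [lo + width * i for i in range(1, n_classes)]
    breaks.append(hi)  # use the exact maximum so the top value always fits
    return breaks


def count_per_class(values, breaks):
    """Count how many values fall at or below each upper bound."""
    counts = [0] * len(breaks)
    for v in values:
        for i, upper in enumerate(breaks):
            if v <= upper:
                counts[i] += 1
                break
    return counts


# Hypothetical percent-senior values for ten tracts, with one outlier.
percent_seniors = [2, 3, 4, 5, 5, 6, 7, 8, 9, 42]
breaks = equal_interval_breaks(percent_seniors, 4)
counts = count_per_class(percent_seniors, breaks)
print(breaks)
print(counts)
```

With this sample, every class is 10 percentage points wide, but nine of the ten values land in the first class and the two middle classes are empty.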

The quantile system classifies data into categories with an equal (or nearly equal) number of data values within each category.  The width of each category's range will be uneven, but the quantity of data within each class is equal.
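
A small Python sketch of the same idea, assuming a simple approach of sorting the values and slicing them into equal-sized groups. The function name and sample data are hypothetical, and real GIS software may handle ties and uneven counts differently.

```python
def quantile_classes(values, n_classes):
    """Sort the values and slice them into groups holding (nearly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    classes = []
    for i in range(n_classes):
        start = i * n // n_classes
        end = (i + 1) * n // n_classes
        classes.append(ordered[start:end])
    return classes


# Hypothetical percent-senior values for twelve tracts.
percent_seniors = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 25, 42]
classes = quantile_classes(percent_seniors, 4)
for members in classes:
    print(members[0], "to", members[-1], "holds", len(members), "values")
```

Each class here holds three values, but the last class stretches from 12 to 42 while the others span only a couple of percentage points each.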

The natural breaks system uses a mathematical algorithm to find natural groupings within the data.  The algorithm computes a grouping of classes unique to each data set.  Because it is so tailored to the data, it is a preferred choice for many maps.
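
The lab did not go into the algorithm itself, so the following is only a sketch of one common formulation: choose the class boundaries that minimize the total squared deviation of values from their own class mean. I am assuming this objective for illustration; the function names and sample data are made up, and this brute-force version is only practical for small data sets.

```python
def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def natural_breaks(values, n_classes):
    """Split sorted values into contiguous classes with minimal within-class variation."""
    data = sorted(values)
    n = len(data)
    inf = float("inf")
    # cost[k][i]: lowest total deviation when the first i values form k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                candidate = cost[k - 1][j] + sum_sq_dev(data[j:i])
                if candidate < cost[k][i]:
                    cost[k][i] = candidate
                    split[k][i] = j
    # Walk backwards through the recorded split points to recover the classes.
    classes = []
    i = n
    for k in range(n_classes, 0, -1):
        j = split[k][i]
        classes.append(data[j:i])
        i = j
    classes.reverse()
    return classes


# Hypothetical values with three obvious clusters.
sample = [2, 3, 4, 20, 21, 22, 40, 41]
groups = natural_breaks(sample, 3)
print(groups)
```

For this sample the classes follow the visible clusters (low, middle, high) rather than equal widths or equal counts.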

For standard deviation, classes are created by adding or subtracting the standard deviation from the mean of the dataset.  Standard deviation does a good job of creating (statistically) evenly spaced classes, and the way that the data is broken into classes is statistically logical.  However, a casual audience may not understand what standard deviation is.
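
As a rough illustration, the class boundaries can be computed by stepping away from the mean one standard deviation at a time. This sketch assumes a step of one full standard deviation and the population form of the standard deviation; the software may use other settings. The function name and sample data are hypothetical.

```python
import statistics


def std_dev_breaks(values, n_sd=2):
    """Return boundaries at mean - n_sd*sd, ..., mean, ..., mean + n_sd*sd."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [mean + k * sd for k in range(-n_sd, n_sd + 1)]


# Hypothetical values chosen so the mean is 5 and the standard deviation is 2.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
boundaries = std_dev_breaks(sample)
print(boundaries)
```

The boundaries come out evenly spaced around the mean, which is also why these maps pair naturally with a diverging color ramp.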

For this project we used census tract data for Miami-Dade County to create maps showing the senior population using different classification methods.  I created eight maps in all.  The first set showed the percentage of seniors in each census tract, displayed via equal interval, quantile, natural breaks, and standard deviation classifications.  The second set used the same four classifications but normalized the senior population by square mileage to find the density of the senior population in each census tract.  The purpose of the lab was to get us to understand the differences between classification systems, how to normalize data, and the overall usage of choropleth maps.  All of this work was performed using ArcGIS Pro.
The percent of senior citizens living within different census tracts of Miami-Dade County.
The senior population density within different census tracts of Miami-Dade County.  This data took the count of seniors living in each census tract and normalized it by the square mileage of each tract.
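
The difference between the two sets of maps comes down to which denominator is used. Here is a small Python sketch with two made-up tracts; the field names and numbers are hypothetical and are not taken from the census data used in the lab.

```python
# Two hypothetical tracts: a small one with a high share of seniors,
# and a larger one with a lower share but the same seniors per square mile.
tracts = [
    {"name": "Tract A", "seniors": 1200, "total_pop": 4000, "sq_miles": 0.5},
    {"name": "Tract B", "seniors": 2400, "total_pop": 24000, "sq_miles": 1.0},
]


def percent_seniors(tract):
    """Seniors as a percentage of the tract's total population."""
    return 100 * tract["seniors"] / tract["total_pop"]


def senior_density(tract):
    """Seniors per square mile (count normalized by area)."""
    return tract["seniors"] / tract["sq_miles"]


for tract in tracts:
    print(tract["name"], percent_seniors(tract), senior_density(tract))
```

Tract A is 30 percent seniors and Tract B is 10 percent, yet both have 2,400 seniors per square mile, so the two tracts would look very different on a percentage map and identical on a density map.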

Most of these maps use the same graduated bluish-purple color ramp; the exceptions are the standard deviation maps.  This is because standard deviation maps typically use a diverging color ramp - one color for the classes below the mean and another for those above.

The natural breaks map normalized by square mileage, shown on its own.  This is a strong map for both presentation and data interpretation.
I organized the map frames in this order because I believe that they go from conceptually easiest to understand to hardest.  Equal interval is very straightforward, and quantile is one step of complexity above that.  The actual algorithm behind natural breaks is fairly sophisticated, but it is easy to explain to someone that the data are grouped into classes of the most similar values.  Standard deviation is harder to explain to someone without a statistics primer first.  If your audience isn't mathematically inclined, using this method requires an extra layer of explanation in a way that the others do not.

The best classification method for this map is natural breaks.  Because the algorithm finds the best class divisions within the data, it creates a map with good contrast while taking possible outliers into account.  The equal interval maps for both percent and density make it look like there are hardly any seniors at all.  While the standard deviation classification is mathematically sound, it is difficult to explain to an audience that is not mathematically oriented and may not be interested in a lengthy explanation.  Quantile also produces a pretty solid-looking map, but I like how natural breaks takes the grouping of the data into account.  Where people live is influenced by a lot of factors, such as cost of living and access to health care, so dividing the data up arbitrarily into quantiles has the potential to be problematic.  It makes more sense to use natural breaks, which is able to take natural clusters of data into account.

Additionally, using the natural breaks map that is normalized by square mileage makes the most sense.  Normalized data is a more useful depiction than the percentage maps.  If a county official were using a map to determine funding for geriatric-specific medical care, it would make more sense for that official to consult a population density map.  The density map puts comparisons between counties on an equal footing, while comparisons made on a percent map do not easily transfer between counties.

To drive this point home, imagine that the county official was tasked with allocating funding for ambulances to each tract.  If the official looked at just the percentage, they would assign a large number of ambulances to that rather small dark purple tract that shows up in the northeast quadrant of the county, but hardly any to the much larger tract directly west of it.  On the normalized natural breaks map, these two tracts are in the same class.  Resources should probably be allocated approximately equally between them instead of basing that decision on the less balanced percentage map.


