A Random Tutorial
Taking Samples of Spatial Data for use in Statistical Tools
This tutorial covers a technique for extracting a random sample of entity and attribute information from a GIS feature class for export to another tool for analysis. In this case, we have a homework assignment that asks us to extract 500 random residential parcels from the Somerville Tax Assessor's database and use them in Excel to practice calculating some summary statistics. The second part of the tutorial takes a look at several ways of making segmented or stratified samples.
Note that we could actually run our statistical experiments on the entire population of parcels -- since we have the data -- but the point of the statistical problems we are creating is to understand how to use sampling techniques that may be necessary in cases -- like household surveys -- where surveying the whole population would be impossible. Since we have all of the data in this case, we can easily compare the inferences that we make from the samples with the population as a whole, to understand how well our sample-based inferences match it. In many real situations we must rely on the theoretical properties of distributions and samples to judge the reliability of conclusions drawn from limited data.
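The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "population" is 5,000 synthetic values drawn from a uniform distribution, not the Somerville assessor data, and the variable names are made up for the example. It draws a random sample of 500, then compares the sample mean with the population mean and with the standard error estimated from the sample alone.

```python
import math
import random
import statistics

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for a population of assessed values (NOT real data).
population = [rng.uniform(100_000, 900_000) for _ in range(5000)]

# Draw 500 cases at random, without replacement.
sample = rng.sample(population, 500)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# Standard error of the sample mean, estimated from the sample alone,
# with a finite population correction because we sample 10% of the cases.
n, N = len(sample), len(population)
std_error = statistics.stdev(sample) / math.sqrt(n) * math.sqrt((N - n) / (N - 1))

# How far the sample-based estimate landed from the true population mean.
error = sample_mean - pop_mean
print(f"population mean {pop_mean:.0f}, sample mean {sample_mean:.0f}")
print(f"error {error:.0f}, estimated standard error {std_error:.0f}")
```

When the whole population is available, as it is here, the error can be checked directly. In a real survey only the standard error would be known.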
This tutorial assumes that you remember the basic GIS concepts and techniques from the elementary tutorials:
- Beginning a GIS Database
- Thematic Mapping with Nominal Class Data
- Thematic Mapping with Quantitative Data
References and Deeper Reading
- Spatial Modeling for Scholarship and Decision Support
- Critique of Data and Metadata from the GSD GIS Manual.
- Lecture Notes on Relational Database Management Systems
- Using ArcMap, particularly Chapter 10, Working with Tables.
Download Sample Data
Right Click Here to download the sample dataset. Extract its contents to your C:\temp\yourusername folder.
Explore the Parcels Table, Make a Join, Dump a Random Set of Parcels
Let's take a look at the data. We are looking for residential parcels. Open the attribute table for parcels and look for an attribute that could help us identify the parcels that fit the concept "Residential." The Use-Code attribute looks promising. We will join the parcels table with the Department of Revenue Use-Code Lookup table, and then make an attribute query to select residential parcels. So far so good. But we want to select just 500 parcels, and they have to be random. So we add a new field to the parcels table to hold a random number. This field should be defined as a double-precision number. Then we calculate a random number into this field. Finally, we alter our attribute query to select a random slice of the residential parcels, and export these to a new table.
- Joining Tables
- Selecting Records from a table
- Adding and Deleting Fields from a Table
- Making field calculations
- Exporting Selected Records to a New Table
- Join the parcels table with the dor_lucode_lut table using the foreign key, Use_Code.
- Create a new column in your parcels table named random. Make it a double-precision number.
- Calculate the value of random using the expression rnd()
- Do an attribute query to select the Residential and Apartment parcels whose random value is greater than some lower bound and less than some upper bound. Adjust the slice until the selection contains exactly 500 parcels.
- Export the selected records to a new dbf table.
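The steps above can also be sketched outside the GIS, which may help make the logic concrete. The following Python sketch is an illustration only: the records, the use codes ("R1", "A1", "C1"), and the field names other than Use_Code and random are hypothetical stand-ins, not the real Somerville schema. Instead of adjusting the slice by hand, it sorts by the random value and keeps the first 500, then reports the cutoff that an equivalent attribute query would use.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical lookup table: use code -> land-use category.
use_code_lut = {"R1": "Residential", "A1": "Apartment", "C1": "Commercial"}

# Hypothetical parcels table with a Use_Code foreign key.
parcels = [
    {"parcel_id": i, "Use_Code": random.choice(list(use_code_lut))}
    for i in range(5000)
]

# Join: attach the category from the lookup table through Use_Code.
for p in parcels:
    p["category"] = use_code_lut[p["Use_Code"]]

# New field: a uniform random number in [0, 1) for every parcel.
for p in parcels:
    p["random"] = random.random()

# Attribute query: keep only the Residential and Apartment parcels.
residential = [p for p in parcels if p["category"] in ("Residential", "Apartment")]

# Random slice: the 500 parcels with the smallest random values.
sample_size = 500
residential.sort(key=lambda p: p["random"])
sample = residential[:sample_size]

# The equivalent query would be: random <= cutoff.
cutoff = sample[-1]["random"]
print(f"{len(sample)} parcels selected with random <= {cutoff:.4f}")
```

Because the random values are independent of every other attribute, any slice of them is a random subset of the residential parcels. Sorting simply finds the slice boundary that yields exactly 500 records.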
If we consider all of the residential property in Somerville as a population, a large-enough random sample may be considered, within some level of confidence, to reflect the situation in Somerville in general. And yet, what can you say about a place as large and diverse as that? To say "Residential in Somerville" is painting with a very wide brush. Somerville has distinct neighborhoods, and the overall housing stock is very diverse and segmented. If we could divide the population of residential parcels into meaningful segments and sample these separately, we may find that the data help us to distinguish meaningful patterns (or not). One question that one must always ask is whether this technique is as useful for discovering patterns that have been shaped by nature as it is for testing the extent to which the artificial categorizations that we (or the tax assessor) impose on the data actually divide it into sub-populations that can be recognized as distinct on some measure or other. In any case, the exercise can be useful if we understand what we are doing. To some extent, all categories are artificial when we look at the natural world; the question is whether or not they are useful. If the classes we choose have actual meaning in the real world, this stratification method may be said to guarantee representation of subgroups that might be missed by an unstratified random sample.
We will look at two ways of stratifying a sample by categorical divisions of the data. In both cases we will attempt to achieve a proportionate allocation in our class-wise samples: each stratum's share of the town-wide sample should match that class's share of the population as a whole. The first method will be to stratify the sample according to a nominal class. We will exploit the Department of Revenue Land Use Classification scheme to divide the residential properties into 3 or 4 classes, and take proportionately sized samples from these. The second method will be to divide the landscape of Somerville into categories of distance (buffers) from some set of features (we will choose commercial centers). Then we will assign residential properties to these categories based on which distance buffer the parcel centroid falls within.
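Proportionate allocation can be written compactly. As a notational sketch (the symbols are introduced here only for illustration), let N be the number of residential parcels in the town, N_h the number in class h, and n the total sample size (500 in this exercise). The target sample size for class h is then:

```latex
n_h = n \cdot \frac{N_h}{N}, \qquad \sum_h n_h = n
```

For instance, a hypothetical class holding 30 percent of the parcels would receive 0.30 x 500 = 150 of the sampled parcels. In practice each n_h must be rounded to a whole number, so a small adjustment may be needed to keep the total at exactly 500.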
Stratifying by Nominal Class
This strategy builds on the technique of categorizing by lookup tables and selecting by attributes as practiced above. The difference is that rather than lumping all of the residential parcels together as a single class, we will use the lookup table to maintain three distinct classes of residential property. You could modify the lookup table to define your own finer or different categories if you wanted to. Then, to understand the proportional representation of each subclass in the population, we can create a summary table that counts the number of parcels in each class. This information provides the proportions we need to divide our sample total (500 parcels) into stratified subsamples of the appropriate size.
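The summary-and-allocate step can be sketched in Python as follows. This is a minimal illustration under assumptions: the class names, the res_class field, and the records themselves are hypothetical, and the largest-remainder rounding used to make the subsamples add up to exactly 500 is one reasonable choice rather than a method prescribed by this tutorial.

```python
import random
from collections import Counter

def proportional_allocation(class_counts, total_sample):
    """Split total_sample across classes in proportion to class_counts.

    Uses largest-remainder rounding so the allocations sum to total_sample.
    """
    population = sum(class_counts.values())
    exact = {c: total_sample * n / population for c, n in class_counts.items()}
    alloc = {c: int(x) for c, x in exact.items()}  # round down first
    leftover = total_sample - sum(alloc.values())
    # Give the remaining units to the classes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc

def stratified_sample(records, class_field, total_sample, rng):
    """Draw a proportionately allocated random sample from each class."""
    counts = Counter(r[class_field] for r in records)  # the summary table
    alloc = proportional_allocation(counts, total_sample)
    sample = []
    for cls, n in alloc.items():
        members = [r for r in records if r[class_field] == cls]
        sample.extend(rng.sample(members, n))
    return alloc, sample

# Hypothetical residential parcels in three made-up classes.
rng = random.Random(7)
class_names = ["Single Family", "Two/Three Family", "Apartment"]
records = [
    {"parcel_id": i, "res_class": rng.choices(class_names, weights=[5, 3, 2])[0]}
    for i in range(4000)
]

alloc, sample = stratified_sample(records, "res_class", 500, rng)
print(alloc)
```

The Counter plays the role of the summary table: it holds one count per class, and the allocation is computed from those counts alone.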
Stratifying by Distance from Commercial Centers
In this case we create our own classification of parcels based on their distance from commercial centers. This involves creating a new feature class to represent the concept "Commercial Center". We can then transform these point features into zones of distance, known as buffers. We will then use a spatial join to assign each parcel to the buffer that it falls inside. In doing this, we will find that the association of parcels to distance classes is not entirely clear, since many parcels will fall into more than one buffer. To make this categorization work, we must transform our parcels to points (centroids). Once we have categorized our parcels this way, the method for stratified sampling by proportional allocation can be applied, as above.
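The idea of reducing each parcel to a point and classifying it by distance can be sketched in plain Python. Everything specific here is an assumption for illustration: the center coordinates, the break distances of 250 and 500 map units, and the parcel outlines are made up. The sketch also classifies by distance to the nearest center, which corresponds to ring buffers that have been merged across centers. A GIS buffer and spatial join may be configured differently.

```python
import math

def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = cx = cy = 0.0  # area2 accumulates twice the signed area
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)

def distance_class(point, centers, breaks):
    """Label a point by its distance to the nearest center.

    breaks lists the outer radius of each distance band, in ascending order.
    """
    nearest = min(math.hypot(point[0] - cx, point[1] - cy) for cx, cy in centers)
    for upper in breaks:
        if nearest <= upper:
            return f"within {upper}"
    return f"beyond {breaks[-1]}"

# Hypothetical commercial centers and distance breaks (map units).
centers = [(0.0, 0.0), (1000.0, 0.0)]
breaks = [250, 500]

# Hypothetical parcel outlines, each a list of polygon vertices.
parcels = {
    "A": [(90, 90), (110, 90), (110, 110), (90, 110)],
    "B": [(390, -10), (410, -10), (410, 10), (390, 10)],
    "C": [(490, 590), (510, 590), (510, 610), (490, 610)],
}

# Each parcel gets exactly one class because a point has a single nearest distance.
classes = {
    pid: distance_class(polygon_centroid(ring), centers, breaks)
    for pid, ring in parcels.items()
}
print(classes)
```

The resulting class labels can then be counted and sampled with proportional allocation in the same way as the nominal classes.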