Cultivating Spatial Intelligence
Understanding GIS Data, Referencing Systems and Metadata
The diagram below is introduced on Understanding GIS Models. Here, we will look more closely at the slice of it that must be understood at the outset of the data-gathering phase of a place-based research project. Our approach to understanding the world requires that we have a Conceptual Model. Some of the individual Concepts in the model may refer to real-world Entities and Phenomena, and these, in turn, may be represented by the traces of Observations and Measurements that have been gathered according to some Method or other. In order to become data, the measurements and observations must be encoded using regular, predictable and documented Referencing Systems. Much data is collected and shared by administrative agencies and scholars, and these existing datasets are often re-used to represent concepts in our models. Yet before doing so, one must understand the purposes, methods and referencing systems surrounding the production of a dataset. This necessary information about a dataset is known as its Metadata.
- ArcMap 101: Introduction to GIS Data and Portrayal provides a tour through an ArcMap dataset with many different types of data and metadata.
- Sources of Geographic Data discusses sources and strategies for finding data that can be used in GIS.
- Spatial Data Structures and Formats discusses and compares the various ways that GIS data are structured and exchanged.
- ArcMap 10 Help on Data Formats Supported in ArcMap
In lecture, the relationships among these concepts may be demonstrated with this sample dataset.
Purposes, Questions, Conceptual Model and Concepts
It goes without saying that a purpose is what makes work worth doing, so we should always begin our discussion by stating our purpose. For the purposes of this discussion, we will continue to explore a conceptual model about how wetlands may be affected by the properties of land nearby. We hope that this understanding will help us to evaluate proposed new developments in terms of their potential impact on wetlands. Our purpose is: the evaluation of proposed new developments. Our very simple conceptual model involves two concepts of fact, wetlands and the properties of land, and one concept of relationship: land nearby wetlands. Two of these concepts are entities that may be easily represented with data; the third, nearby, is a concept that needs to be simulated with a procedure.
Observations and Measurements
One would think that the process of understanding wetlands and their relationships with nearby developments would involve getting our feet wet; but, for better or worse, this is not necessarily true. There is a good deal of prior scholarship about the processes of impact, and there are many existing datasets that represent wetlands and development in our study area. For now we will focus on wetlands. We have at least three datasets that represent wetlands, and we could probably find at least two more if we looked around a little bit. Each of these datasets represents records of some observations and measurements that were made for some purpose. We can say without even looking at the data that they are imperfect, if for no other reason than that they reflect a past condition, but we know that there are many other reasons that a dataset is not a perfect representation of reality (we will explore these in detail below). The more important question is whether a given dataset is good enough to represent wetlands for our model. This is what we will attempt to investigate.
If we wish to use data to represent specific concepts in our conceptual model, there are several aspects of the data that we must evaluate in order to judge the dataset's fitness for our purposes. This information about data is called Metadata. In order to evaluate data, it is very useful to have formal metadata documents.
Metadata will tell you many things that may be necessary for evaluating the fitness of the data for your purposes e.g.:
- What sort of real-world entities is this dataset intended to represent?
- What were the methods used to discover, observe and measure these entities?
- Who collected the data? Is the source of data a recognized authority?
- For what purpose were the data collected/intended?
- What time period does the data represent?
- What spatial referencing systems were used to record observations for the geometry of each feature?
- What is the spatial precision employed in these measurements?
- What semantic referencing systems are used for each of the attributes? (This is known as the Data Dictionary.)
- Are the data considered to be complete?
Sometimes formal metadata documents do not exist for a dataset, and we have to make inferences about the quality of the data from things that can be observed while obtaining the data or by looking at it with GIS software. What is the authority and interest of the presumptive primary source of these data? What is the source from which the data were obtained? On what date were the data gathered? Do the data appear to be logically consistent with other datasets that may be better understood? All of these observations should be recorded in a simple readme file and saved with the data if no better metadata can be found.
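As one possible way to keep such notes, the following is a minimal sketch that formats informal observations as plain "Key: value" lines. The field names, the layer name and the dates are hypothetical placeholders, not part of any metadata standard.

```python
from datetime import date


def build_readme(fields):
    """Format informal metadata observations as plain 'Key: value' lines."""
    lines = ["README: informal metadata notes", ""]
    for key, value in fields.items():
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


# Every value below is a hypothetical placeholder to be replaced with
# what was actually observed while obtaining and inspecting the data.
notes = {
    "Dataset": "wetlands (example layer name)",
    "Obtained from": "<site or office where the data were obtained>",
    "Presumed primary source": "<agency or author, and their interest>",
    "Date obtained": date(2024, 1, 15).isoformat(),
    "Date data were gathered": "unknown",
    "Consistency checks": "<how the layer compares with better-known layers>",
}

readme_text = build_readme(notes)
print(readme_text)
```

The resulting text can be written to a file that sits next to the data, for example with open("readme.txt", "w").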
Information Infrastructure: Metadata is important whenever data are intended for serious use. In addition to being essential for understanding an individual dataset, systematically structured metadata is also a key element of data infrastructure such as searchable catalogs of data, or automated systems for mapping and analysis. For this reason there are a couple of important standards for machine-readable metadata. For an interesting point of view on this topic, consider the story of the United States National Spatial Data Infrastructure, which was first chartered in 1990 as a means of coordinating the data collection and maintenance efforts of various federal agencies. The first and largest program of the NSDI was to develop the U.S. Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata. For a demonstration of the power of geospatial information infrastructure, try a search in the Tufts GeoPortal, a federated catalog of geographic information resources kept by many authorities across the world.
Referencing Systems & Their Logic
Datasets are organizations of references to entities and phenomena and their attributes. The previous section discusses a few of the ways that these references may be organized. It is also important to understand the particulars of references themselves and the different sorts of logic that each supports and how a set of references from one system may be transformed into another. Computer systems will allow all sorts of maps and analyses to be done with data, but the only way to understand whether these are useful or garbage is to understand the properties of the referencing systems inherent in the data.
Numeric References can reflect relative sequence, relative magnitude, absolute count or weight, or rational relationships. Understanding which of these is the intent of a numeric referencing system will determine whether arithmetic and algebraic logic applies. It may make sense, for example, to subtract two measures of area, or to divide one into the other to obtain a percentage. The same may not be true of numeric measures of temperature, or class rank. If a dataset records attributes as numeric values, it should provide documentation reflecting the logical type of number that is being used, and the units of measure. It would also be helpful to know the precision that was required by the data collection method.
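The difference between these kinds of numbers can be seen in a small sketch. The parcel and wetland areas and the temperatures are made-up values; the point is only that a ratio of areas survives a change of units while a ratio of Celsius temperatures does not.

```python
# Areas are ratio-scale numbers: zero means "no area", so both
# differences and ratios are meaningful.
parcel_area_ha = 40.0    # hypothetical parcel
wetland_area_ha = 10.0   # hypothetical wetland inside the parcel

remaining_ha = parcel_area_ha - wetland_area_ha
percent_wetland = 100.0 * wetland_area_ha / parcel_area_ha


def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvins."""
    return celsius + 273.15


# Celsius is an interval scale: its zero is arbitrary, so differences
# are meaningful but ratios are not. The "ratio" of the same two
# temperatures changes when the units change.
ratio_in_celsius = 20.0 / 10.0
ratio_in_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

print(f"Remaining area: {remaining_ha} ha, wetland share: {percent_wetland}%")
print(f"20 C / 10 C as Celsius: {ratio_in_celsius:.3f}")
print(f"20 C / 10 C as kelvins: {ratio_in_kelvin:.3f}")
```

Class rank is weaker still: it records order only, so even the difference between two ranks says nothing about magnitude.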
Date and Time References: References to dates allow us to distinguish events that happened before or after some other point in time. Date and time references can be manipulated with logic that allows us to subtract one from another to reveal the interval that elapsed between them. If time references are given in a dataset, the metadata should indicate the time zone that is assumed.
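A short sketch shows why the assumed time zone matters when one time reference is subtracted from another. The two observation times are hypothetical, and a fixed offset of five hours behind UTC is assumed for the first one.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical observation times. The first is recorded in a zone
# assumed to be five hours behind UTC, the second in UTC.
local_zone = timezone(timedelta(hours=-5))
first_obs = datetime(2020, 3, 1, 8, 0, tzinfo=local_zone)
second_obs = datetime(2020, 3, 1, 14, 30, tzinfo=timezone.utc)

# With time zones attached, subtraction yields the true interval.
elapsed = second_obs - first_obs

# If the zone information is ignored, the same clock readings
# suggest a much longer interval.
naive_elapsed = second_obs.replace(tzinfo=None) - first_obs.replace(tzinfo=None)

print("Elapsed with time zones:", elapsed)
print("Elapsed ignoring zones: ", naive_elapsed)
```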
Text Strings: Text strings have very simple logic. Even if the string is made up completely of numerals (as in postal codes), textual referencing systems do not lend themselves to addition and subtraction. However, they can be used in equations whereby two references that are equal are said to refer to the same quality. This can be useful for creating mappings, lookups or crosswalks of one referencing scheme with another. Logical manipulation of text strings can take the form of sorting alphabetically, and also of parsing a string into substrings. When we consider that text strings may be references to categories of objects, we can see how useful this can be. For example, if we look at the Anderson Land Use Categorization System, we see that each Anderson code is a string two characters long. The first character reflects the major class (e.g. Forest) and the second character reflects the minor class (e.g. Coniferous). By parsing such character-string references, complex taxonomic relationships can be encoded and meaningful generalizations can be derived. It is interesting to see how this notion of hierarchical referencing systems has spatial applications, as it is utilized in postal codes and census tracts.
Metadata for text-type references might include a Data Dictionary listing all of the possible codes that may appear, along with the definition of each. Mechanisms for transforming and reclassifying text-based attribute codes are discussed in Reclassifying Data With Lookup Tables.
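The parsing idea can be sketched as follows. The class labels are a small illustrative subset recalled from the USGS Anderson Level I and Level II scheme and should be checked against the data dictionary of the actual dataset; the function names are hypothetical.

```python
# Illustrative subset of labels only; verify against the dataset's
# own data dictionary before relying on them.
MAJOR_CLASSES = {
    "1": "Urban or Built-up Land",
    "4": "Forest Land",
    "6": "Wetland",
}
MINOR_CLASSES = {
    "11": "Residential",
    "41": "Deciduous Forest Land",
    "42": "Evergreen Forest Land",
    "61": "Forested Wetland",
    "62": "Nonforested Wetland",
}


def split_code(code):
    """Return (major, minor) keys parsed from a two-character code."""
    text = str(code)
    if len(text) != 2 or not text.isdigit():
        raise ValueError(f"expected a two-digit code, got {code!r}")
    return text[0], text


def generalize(codes):
    """Reduce detailed codes to their major class labels."""
    return [MAJOR_CLASSES.get(split_code(c)[0], "Unknown") for c in codes]


observed = ["41", "42", "61", "11"]
for code in observed:
    major, minor = split_code(code)
    print(code, "->", MAJOR_CLASSES[major], "/", MINOR_CLASSES[minor])

print(generalize(observed))

# Text logic, not arithmetic: joining codes concatenates them, and
# sorting is alphabetical rather than numeric.
print("41" + "42")
print(sorted(["41", "100", "9"]))
```

Treating the codes as text rather than numbers keeps the first character available for this kind of generalization and avoids accidental arithmetic on labels.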
Geographic Coordinate Systems: Geographic Coordinate References (GCS) refer to a specific point on the earth with a Latitude and a Longitude, which establish a ray with a specific direction from the center of the earth, relative to the Earth's axis of rotation and the Greenwich meridian. Since the earth is not a sphere, references to latitude and longitude must be clarified with a reference to an Earth Model that provides an estimate of the radius of the earth at each point if we hope to make sense of how two pairs of latitude and longitude references relate to each other (for example, to estimate the distance between the points). More information on Geographic Coordinate Systems is provided in Fundamentals of Spatial Referencing Systems. It is important to keep in mind that although Geographic Coordinate References are expressed as pairs of numbers, it is not appropriate to apply the logic of Cartesian Coordinates. For example, the ground distance spanned by a degree of longitude is roughly 111 kilometers at the equator and becomes infinitesimally small as you approach the poles, while a degree of latitude stays close to 111 kilometers everywhere. So plane geometry and graphical ideas of map scale cannot be figured with GCS. Metadata for a GCS must indicate the Earth Model that is assumed.
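To see why Cartesian logic fails for geographic coordinates, the sketch below computes great-circle distances with the haversine formula. It assumes a deliberately simple Earth Model, a sphere with a mean radius of 6371 km, so the figures are approximations; a real dataset would name an ellipsoid in its metadata.

```python
import math

# Assumption: a spherical Earth Model with a mean radius of 6371 km.
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Ground distance covered by one degree, measured at several latitudes.
for lat in (0, 45, 80):
    east_west = great_circle_km(lat, 0, lat, 1)
    north_south = great_circle_km(lat, 0, lat + 1, 0)
    print(f"lat {lat:2d}: 1 degree of longitude = {east_west:6.1f} km, "
          f"1 degree of latitude = {north_south:6.1f} km")
```

On this sphere a degree of latitude has the same length everywhere, while a degree of longitude shrinks with the cosine of the latitude, which is why a difference in degrees cannot be read as a distance.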
Projected Coordinate Referencing Systems: In order to make maps that appear on a two-dimensional surface or that have a regular scale in all directions, it is necessary to transform the spherical coordinates of GCS to a Cartesian coordinate system. There are many methods for doing this, and these are discussed in detail in the page Fundamentals of Spatial Referencing Systems. For now, we will just say that the metadata for a Projected Coordinate System should refer to the Projection Method, the Projection Case, the Earth Model, and the Coordinate Units.
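As an illustration of how those four pieces of metadata enter the calculation, here is a minimal sketch of one projection method, the Mercator projection on a sphere. The sphere radius, the central meridian and the dictionary keys are assumptions chosen for the example, not values taken from any dataset discussed here.

```python
import math

# Hypothetical metadata for the example: method, case, earth model, units.
PROJECTION_METADATA = {
    "method": "Mercator (spherical form)",
    "case": {"central_meridian_deg": 0.0},
    "earth_model": {"sphere_radius_m": 6378137.0},
    "units": "meters",
}


def mercator_forward(lat_deg, lon_deg, metadata=PROJECTION_METADATA):
    """Project latitude/longitude in degrees to planar x, y in meters."""
    radius = metadata["earth_model"]["sphere_radius_m"]
    lon0 = metadata["case"]["central_meridian_deg"]
    lam = math.radians(lon_deg - lon0)
    phi = math.radians(lat_deg)
    x = radius * lam
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y


for lat, lon in [(0.0, 0.0), (42.4, -71.1), (60.0, 10.0)]:
    x, y = mercator_forward(lat, lon)
    print(f"lat {lat:6.1f}, lon {lon:6.1f} -> x {x:14.1f} m, y {y:14.1f} m")
```

Changing any one of the four metadata items changes the resulting x and y, which is why coordinates from a projected dataset cannot be interpreted without them.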
Critical Evaluation of Geographic Information, Maps and Models
Thinking critically about geographic information begins with examining all of the elements of metadata as described above. It should also be noted that some of the most important information we should criticize is the information we generate by combining data as maps or models. Therefore, these critical considerations should be applied to all model inputs and outputs!
Fitness for Modeling Purposes
When evaluating data it is crucial to make reference to a specific purpose. We know that all data are flawed. The question is: How will the flaws in the data impact the analysis that we are intending to make?
Errors of Omission and Commission
Once we have had a look at a dataset and its metadata (if any exists) we may be able to evaluate whether it is Fit to Use for our Purposes or not. Much of the answer to this question can be inferred from the metadata (see above) or from other contextual examinations of the data (such as whether the data appear to be logically consistent with other datasets). A very good exercise to go through (and potentially to put into the documentation of your model) is to consider whether you think that the errors in your dataset will bias your ultimate interpretation, and in which direction. Errors of Omission result when a potentially important entity (or a part of an entity) in the real world has been omitted from the dataset. Errors of Commission result when an entity is reflected in the dataset when, in reality, that entity does not exist. It is likely that any dataset will have both of these sorts of errors. A part of this exercise is to understand whether either of these types of error is systematic, which would indicate a Bias in the data. Ultimately it is very useful if we can predict whether the model as a whole is likely to under-estimate or over-estimate the phenomena in question.
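When a trusted reference is available for at least part of the study area, the two kinds of error can be tallied with simple set logic. The sketch below uses hypothetical wetland identifiers; in practice the comparison would be made between features matched on the ground or in GIS software.

```python
# Hypothetical identifiers: wetlands confirmed by a trusted reference
# (for example a field survey) and wetlands recorded in the dataset.
reference_ids = {"W1", "W2", "W3", "W4"}
dataset_ids = {"W2", "W3", "W5"}

# Omission: real wetlands that are missing from the dataset.
omissions = reference_ids - dataset_ids
# Commission: wetlands in the dataset that do not exist in reality.
commissions = dataset_ids - reference_ids

omission_rate = len(omissions) / len(reference_ids)
commission_rate = len(commissions) / len(dataset_ids)

# A rough indication of the direction of bias in the wetland count.
net_difference = len(dataset_ids) - len(reference_ids)
if net_difference < 0:
    direction = "under-estimates"
elif net_difference > 0:
    direction = "over-estimates"
else:
    direction = "matches"

print("Omitted:", sorted(omissions), f"({omission_rate:.0%} of reference)")
print("Committed:", sorted(commissions), f"({commission_rate:.0%} of dataset)")
print(f"The dataset {direction} the number of wetlands.")
```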
Another important sort of error to look for is an error of logical consistency. An example of this might be if our dataset shows wetlands existing in a place where our terrain model indicates a steep slope. Errors of logical consistency let us know that there must be an error in the data even if we have no knowledge of the actual condition on the ground. Often these errors arise from a difference in geometric precision in one or more layers, or a wholesale displacement of the coordinate referencing systems. It is interesting to consider that even data that are completely fictitious might be useful for testing a model, provided that the relationships among the 'facts' portrayed in the data create a logically plausible model of the types of relationships we would like to explore. Thus we might use very flawed data to create models that would be useful once we find adequate data. Conversely, if the data we are using portray relationships that are illogical with regard to our conceptual model, then it is very unlikely that we will be able to make any useful interpretation of the model.
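The wetland and slope example can be expressed as a simple consistency check. In this sketch the records, the slope values and the 15 percent threshold are all hypothetical; in practice the slope would be sampled from the terrain model at each wetland's location.

```python
# Assumption: wetlands are implausible on slopes steeper than this.
MAX_PLAUSIBLE_SLOPE_PCT = 15.0

# Hypothetical wetland records with a slope value taken from a terrain model.
wetlands = [
    {"id": "W1", "slope_pct": 1.5},
    {"id": "W2", "slope_pct": 32.0},
    {"id": "W3", "slope_pct": 4.0},
    {"id": "W4", "slope_pct": 18.5},
]


def find_inconsistent(records, max_slope_pct):
    """Return ids of wetlands that sit on implausibly steep ground."""
    return [r["id"] for r in records if r["slope_pct"] > max_slope_pct]


suspect = find_inconsistent(wetlands, MAX_PLAUSIBLE_SLOPE_PCT)
print("Logically inconsistent records:", suspect)
```

A flagged record does not say which layer is wrong, only that the wetland layer and the terrain model cannot both be right at that location.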
Errors Related to Categorical Precision or Granularity
In many applications of reasoning, it is necessary to treat entities having similar properties as a class, so that general rules can be applied to them. Thus we create categories and we classify things. In GIS we encounter categories of the qualitative sort, such as land use classification systems, and we also have categories in a spatial sense, such as zip code boundaries or census tracts. In either case, the coarseness or fineness of our classification system will definitely affect our ability to model certain phenomena and relationships. We know, for example, that the pattern of population density is much different if we use a block-level as opposed to a tract-level aggregation. The same is true for land use classes. One classification system may distinguish 5 classes of housing according to lot size; another may lump all housing and industrial uses into one class called urban land. Given the choice of two datasets that reflect observations of the same entities, the level of granularity in the referencing systems used will render one or the other less fit to use for a given purpose. Therefore all discussions of data should include consideration of the spatial and categorical granularity as these relate to actual ground patterns that are related to your modeling purpose or question.
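The effect of spatial granularity on population density can be shown with a small calculation. The two blocks, their populations and their areas are invented for illustration, and both are assumed to belong to a single tract.

```python
from collections import defaultdict

# Hypothetical blocks belonging to one tract.
blocks = [
    {"block": "B1", "tract": "T1", "population": 900, "area_km2": 0.25},
    {"block": "B2", "tract": "T1", "population": 100, "area_km2": 0.75},
]

# Fine grain: density computed for each block.
block_density = {
    b["block"]: b["population"] / b["area_km2"] for b in blocks
}

# Coarse grain: the same people and land aggregated to the tract.
tract_totals = defaultdict(lambda: {"population": 0, "area_km2": 0.0})
for b in blocks:
    tract_totals[b["tract"]]["population"] += b["population"]
    tract_totals[b["tract"]]["area_km2"] += b["area_km2"]

tract_density = {
    tract: t["population"] / t["area_km2"] for tract, t in tract_totals.items()
}

for name, density in block_density.items():
    print(f"Block {name}: {density:8.1f} people per km2")
for name, density in tract_density.items():
    print(f"Tract {name}: {density:8.1f} people per km2")
```

At the block level one area is dense and the other sparse; at the tract level the same people appear as a single moderate density, and the pattern that may matter to the model has disappeared.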