
Understanding GIS Data, Referencing Systems and Metadata

The diagram below is introduced on Understanding GIS Models. Here, we will look more closely at the slice of it that is necessary to understand at the outset of the data-gathering phase of a place-based research project. Our approach to understanding the world requires that we have a Conceptual Model. Some of the individual Concepts in the model may refer to real-world Entities and Phenomena, and these, in turn, may be represented by the traces of Observations and Measurements that have been gathered according to some Method or other. In order to become data, the measurements and observations must be encoded using regular, predictable and documented Referencing Systems. Much data is collected and shared by administrative agencies and scholars, and these existing datasets are often re-used to represent concepts in our models. Yet before doing so, one must understand the purposes, methods and referencing systems surrounding the production of a dataset. This necessary information about a dataset is known as its Metadata.

A collection of data may be organized as a Database Schema that will play an important part in turning our conceptual model into a Data Model. The data model can serve as the basis of Portrayals such as Maps, and of systematic operations that we can use to simulate and perform experiments. Naturally, if any of this is to have any value, we need to concern ourselves with the details. This page considers the issues involved with understanding Data and their Fitness as representations for concepts. Along the way we will discuss how data are organized as schema, and transformed into Portrayals for visualization.

In lecture, the relationships among these concepts may be demonstrated with one of the following sample datasets:

Purposes, Questions, Conceptual Model and Concepts

It goes without saying that a purpose is what makes work worth doing. So we should always begin our discussion by stating our purpose. For the purposes of this discussion, we will continue to explore a conceptual model about how wetlands may be affected by the properties of land nearby. We hope that this understanding will help us to evaluate proposed new developments in terms of their potential impact on wetlands. Our purpose is: The evaluation of proposed new developments. Our very simple conceptual model involves two concepts of fact: wetlands and the properties of land, and one concept of relationship: land nearby wetlands. Two of these concepts are entities that may be easily represented with data; the third, nearby, is a concept that needs to be simulated with a procedure.

Observations and Measurements

One would think that the process of understanding wetlands and their relationships with nearby developments would involve getting our feet wet; but, for better or worse, this is not necessarily true. There is a good deal of prior scholarship about the processes of impact, and there are many existing datasets that represent wetlands and development in our study area. For now we will focus on wetlands. We have at least three datasets that represent wetlands, and we could probably find at least two more if we looked around a little bit. Each of these datasets represents records of some observations and measurements that were made for some purpose. We can say without even looking at the data that they are imperfect, if for no other reason than that they reflect a past condition, but we know that there are many other reasons that a dataset is not a perfect representation of reality (we will explore these in detail below). The more important question is whether a given dataset is good enough to represent wetlands for our model. This is what we will attempt to investigate.

Metadata

If we wish to use data to represent specific concepts in our conceptual model, there are several aspects of that data that we must evaluate if we are to judge the dataset's fitness for our purposes. This information about data is called Metadata. In order to evaluate data, it is very useful to have formal metadata documents.

Metadata will tell you many things that may be necessary for evaluating the fitness of the data for your purposes e.g.:

  • What sort of real-world entities is this dataset intended to represent?
  • What methods were used to discover, observe and measure these entities?
  • Who collected the data? Is the source of data a recognized authority?
  • For what purpose were the data collected/intended?
  • What time period does the data represent?
  • What spatial referencing systems were used to record observations for the geometry of each feature?
  • What is the spatial precision employed in these measurements?
  • What semantic referencing systems are used for each of the attributes? (This is known as the Data Dictionary.)
  • Are the data considered to be complete?

Sometimes formal metadata documents do not exist for a dataset, and we have to make inferences about the quality of the data from things that can be observed while obtaining the data or by looking at it with GIS software. What is the authority and interest of the presumptive primary source of these data? From what source were the data obtained? On what date were the data gathered? Do the data appear to be logically consistent with other datasets that may be better understood? If no better metadata can be found, all of these observations should be recorded in a simple readme file and saved with the data.
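A readme of this kind can be as simple as a few labeled lines. The following is a minimal sketch in Python that assembles one from a dictionary of observations; the function name `build_readme`, the labels and every example value are hypothetical placeholders, not part of any metadata standard.

```python
def build_readme(notes):
    """Format informal metadata notes as plain 'Label: value' lines."""
    lines = ["README: informal metadata (no formal metadata document was found)"]
    for label, value in notes.items():
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical observations made while obtaining and inspecting a dataset.
notes = {
    "Presumed primary source": "unknown state agency (inferred from file names)",
    "Obtained from": "downloaded from a public data portal",
    "Date obtained": "2020-01-15",
    "Consistency checks": "wetland polygons align with the hydrography layer",
}

readme_text = build_readme(notes)
print(readme_text)
```

The resulting text could then be saved next to the data files, for example with `pathlib.Path("readme.txt").write_text(readme_text)`.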

Information Infrastructure: Metadata is important whenever data are intended for serious use. In addition to being essential for understanding an individual dataset, systematically structured metadata is also a key element of data infrastructure such as searchable catalogs of data, or automated systems for mapping and analysis. For this reason there are a couple of important standards for machine-readable metadata:

Referencing Systems & Their Logic

Datasets are organizations of references to entities and phenomena and their attributes. The previous section discusses a few of the ways that these references may be organized. It is also important to understand the particulars of references themselves and the different sorts of logic that each supports and how a set of references from one system may be transformed into another. Computer systems will allow all sorts of maps and analyses to be done with data, but the only way to understand whether these are useful or garbage is to understand the properties of the referencing systems inherent in the data.

Numeric References can reflect relative sequence, relative magnitude, absolute count or weight, or rational relationships. Understanding which of these is the intent of a numeric referencing system will determine whether arithmetic and algebraic logic applies. It may make sense, for example, to subtract two measures of area, or to divide one into the other to reflect a percent. The same may not be true of numeric measures of Temperature, or class rank. If a dataset records attributes in numeric values, it should provide documentation reflecting the logical type of number that is being used, and the units of measure. It would also be helpful to know the precision that was required by the data collection method.
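A small numeric sketch may make the distinction concrete. The values below are invented for illustration. Dividing one area by another gives a meaningful percent, whereas dividing two Celsius temperatures gives a number that changes when the same temperatures are expressed in kelvins, which suggests the ratio carries no physical meaning (the difference between them, however, does).

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins."""
    return c + 273.15


# Areas (hypothetical, in hectares) are on a ratio scale: division is meaningful.
wetland_area, parcel_area = 50.0, 200.0
percent_wetland = 100.0 * wetland_area / parcel_area

# Celsius temperatures have an arbitrary zero: differences are meaningful,
# but ratios depend on the unit chosen.
t1_c, t2_c = 10.0, 20.0
difference = t2_c - t1_c
naive_ratio = t2_c / t1_c
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(percent_wetland, difference, naive_ratio, round(kelvin_ratio, 3))
```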

Date and Time References References to dates allow us to distinguish events that happened before or after some other point in time. Date and time references can be manipulated with logic that allows us to subtract one from another to reveal the interval that elapsed between them. If time references are given in a dataset, the metadata should indicate the time zone that is assumed.
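As a rough illustration of why the assumed time zone matters, the sketch below uses Python's standard `datetime` module to subtract two observation times. The observation times and the fixed UTC-5 offset are hypothetical; a real dataset would need its own documented zone (and possibly daylight-saving rules, which a fixed offset ignores).

```python
from datetime import datetime, timedelta, timezone

# Assumption: one observation was logged at a fixed offset of UTC-5, the other in UTC.
local_zone = timezone(timedelta(hours=-5))

obs_a = datetime(2020, 3, 1, 9, 0, tzinfo=local_zone)
obs_b = datetime(2020, 3, 1, 15, 30, tzinfo=timezone.utc)

# With time zones attached, subtraction yields the true elapsed interval.
elapsed = obs_b - obs_a

# If the zones are dropped (as when metadata is missing), the interval is wrong.
naive_elapsed = obs_b.replace(tzinfo=None) - obs_a.replace(tzinfo=None)

print(elapsed, naive_elapsed)
```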

Text Strings Text strings are, in one sense, the dumbest of all referencing systems, but some of the ways that they are employed lead to very useful logic and transformations. Logical manipulation of text strings can take the form of sorting alphabetically, and also of parsing, or chopping a string into substrings. When we consider that text strings may be references to categories of objects, we can see how useful this can be. For example, if we look at the Anderson Land Use Categorization System, we see that each Anderson code is a string two characters long. The first character reflects the major class (e.g. Forest) and the second character reflects the minor class (e.g. Coniferous). By parsing such character-string references, complex taxonomic relationships can be encoded and detailed classes can be transformed into meaningful generalizations. It is interesting to see how this notion of hierarchical referencing systems has spatial applications, as it is utilized in postal codes and census tracts.
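The parsing idea can be sketched in a few lines of Python. The lookup table below is a small illustrative subset of major-class labels and should be checked against the data dictionary of the dataset actually in use; the function names are hypothetical.

```python
# Illustrative subset of major classes keyed by the first character of a code.
MAJOR_CLASSES = {
    "1": "Urban or Built-up Land",
    "4": "Forest Land",
    "6": "Wetland",
}


def parse_code(code):
    """Split a two-character land use code into (major, minor) characters."""
    code = code.strip()
    if len(code) != 2:
        raise ValueError(f"expected a two-character code, got {code!r}")
    return code[0], code[1]


def generalize(code):
    """Generalize a detailed code to the label of its major class."""
    major, _minor = parse_code(code)
    return MAJOR_CLASSES.get(major, "Unknown")


codes = ["41", "42", "61", "11"]
generalized = [generalize(c) for c in codes]
print(generalized)
```

Because the hierarchy is encoded by position, sorting the codes as text also groups them by major class.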

Metadata for Text Type references might include a Data Dictionary listing all of the possible codes that may appear, along with the definition of each. Mechanisms for transforming and reclassifying text-based attribute codes are discussed in Reclassifying Data With Lookup Tables.

Geographic Coordinate Systems Geographic Coordinate References (GCS) refer to a specific point on the earth with a Latitude and a Longitude, which establish a ray with a specific direction from the center of the earth, relative to the Earth's axis of rotation and the Greenwich meridian. Since the earth is not a sphere, references to latitude and longitude must be clarified with a reference to an Earth Model that provides an estimate of the radius of the earth at each point if we hope to make sense of how two pairs of latitude and longitude references relate to each other (for example, to estimate the distance between the points). More information on Geographic Coordinate Systems is provided in Fundamentals of Spatial Referencing Systems. It is important to keep in mind that although Geographic Coordinate References are expressed as pairs of numbers, it is not appropriate to apply the logic of Cartesian Coordinates. For example, the ground distance spanned by one degree of longitude is about 111 kilometers at the equator and shrinks toward zero as you approach the poles. So plane geometry and graphical ideas of map scale cannot be figured with GCS. Metadata for a GCS must indicate the Earth Model that is assumed.
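To see why Cartesian logic fails for latitude and longitude, the following sketch computes great-circle distances with the haversine formula. It assumes a deliberately simple Earth Model, a sphere with a mean radius of 6371 km, so the results are approximations; an ellipsoidal model would give slightly different numbers. Under this assumption, one degree of longitude spans less and less ground distance at higher latitudes, while one degree of latitude stays the same.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth Model with mean radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Ground distance spanned by one degree of longitude at several latitudes.
for lat in (0, 45, 80):
    print(lat, round(haversine_km(lat, 0, lat, 1), 1))

# One degree of latitude, by contrast, is the same length on a sphere.
print(round(haversine_km(0, 0, 1, 0), 1), round(haversine_km(80, 0, 81, 0), 1))
```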

Projected Coordinate Referencing Systems In order to make maps that appear on a 2-dimensional surface or that have a regular scale in all directions, it is necessary to transform the spherical coordinates of GCS to a Cartesian coordinate system. There are many methods for doing this, and these are discussed in detail in the page Fundamentals of Spatial Referencing Systems. For now, we will just say that the metadata for a Projected Coordinate System should refer to the Projection Method, the Projection Case, the Earth Model, and the Coordinate Units.
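As one concrete instance of such a transformation, the sketch below applies the spherical form of the Mercator projection, chosen here only because its formulas are short. The metadata dictionary is a hypothetical illustration of the four items listed above; in particular, reading the "case" as the parameters that pin down a specific instance of the method (here, the central meridian) is an assumption of this example.

```python
import math

# Hypothetical metadata record for the projection used in this sketch.
PROJECTION_METADATA = {
    "method": "Mercator (spherical form)",
    "case": "central meridian at 0 degrees longitude",  # assumed interpretation
    "earth_model": "sphere with radius 6378137 m",
    "units": "meters",
}

SPHERE_RADIUS_M = 6378137.0


def project_mercator(lat_deg, lon_deg):
    """Transform latitude/longitude in degrees to planar x, y in meters."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = SPHERE_RADIUS_M * lon
    y = SPHERE_RADIUS_M * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


x, y = project_mercator(45.0, -71.0)
print(round(x), round(y), PROJECTION_METADATA["units"])
```

Without the accompanying metadata, the resulting pair of numbers could not be related back to a position on the earth.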

Schema

Schema is a word that refers to an organization of data. There are a few very basic schemas that provide systematic ways of associating references with locations. These tabular or raster arrangements are what allow us to systematically apply procedures to data. Interesting things happen when we have multiple datasets organized together -- especially when the referencing systems used are well understood and may have relationships with one another. In this case we may be able to use associations between different datasets to generate new information. Such a collection of related datasets is also known as a schema.

Basic GIS Schema for Individual Datasets

In GIS, there are three very basic containers for storing references to entities and their attributes.

Vector-Relational databases represent entities as crisp geometries of points, vertices, lines, and polygons. A Feature Class represents all of the instances of a particular type of entity (e.g. wetland), each with a row in a table. The table can have as many columns as necessary to reflect each of the attributes that have been measured or observed for each entity. One of the attributes holds the geometric properties of the entity (e.g. a polygon). Other attributes may hold numeric, character-string or date-time references. Often these other attributes are encoded in some specific referencing system, such as a land use code, or area measurements in some units or other, which we will need a data dictionary to interpret -- otherwise we may have to guess.
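The row-per-entity idea can be sketched with plain Python data structures. The class name, field names, codes and coordinates below are all hypothetical; real GIS software stores geometry in its own formats. The area calculation uses the shoelace formula and assumes projected (planar) coordinates.

```python
from dataclasses import dataclass


@dataclass
class WetlandFeature:
    """One row of a hypothetical wetlands feature class."""
    feature_id: int
    geometry: list      # polygon vertices as (x, y) pairs in projected meters
    landuse_code: str   # meaning is defined by a data dictionary
    surveyed: str       # date reference as an ISO string


def polygon_area(vertices):
    """Planar polygon area by the shoelace formula (ring given without repeat)."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


feature_class = [
    WetlandFeature(1, [(0, 0), (100, 0), (100, 50), (0, 50)], "61", "1999-06-01"),
    WetlandFeature(2, [(0, 0), (30, 0), (0, 40)], "62", "1999-06-01"),
]

areas = {f.feature_id: polygon_area(f.geometry) for f in feature_class}
print(areas)
```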

Image Raster Geographic Images organize measurements of intensity for an array of congruent locations (pixels). Normally the intensities are scaled to a range that can be represented with 8 binary bits -- or 256 increments of intensity. To get more discrimination, images often use multiple channels to record different parts of the spectrum. This is most often seen with true color images that have a separate channel for Red, Green and Blue. Some images divide the spectrum in other ways and may use many more channels.

Grid Raster The third common container for organizing geographic data is known as a Grid. This is a raster of congruent square cells, but the number of values is not restricted to 256 discrete values. The attributes for grid cells may either be integers or rational numbers with decimal places (floating point).
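The difference between an image raster and a grid raster can be illustrated with nested lists. In this sketch, a small grid of floating point values (hypothetical elevations) is linearly rescaled into the 0 to 255 range of an 8-bit image channel; the function name and the choice of a linear stretch are assumptions made for the example.

```python
def scale_to_8bit(values, lo, hi):
    """Linearly scale values in [lo, hi] to integers 0..255, clamping outliers."""
    scaled = []
    for v in values:
        fraction = (v - lo) / (hi - lo)
        fraction = min(1.0, max(0.0, fraction))
        scaled.append(round(fraction * 255))
    return scaled


# A grid raster can hold floating point cell values (hypothetical elevations).
elevation_grid = [
    [12.5, 13.1],
    [14.8, 20.0],
]

# The same cells as an 8-bit image band: only 256 discrete levels remain.
image_band = [scale_to_8bit(row, 12.5, 20.0) for row in elevation_grid]
print(image_band)
```

The rescaled band is convenient for display, but the original measurements can no longer be recovered exactly from it.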

Portrayal Information

As Yogi Berra once said, we can observe a lot by just looking! But how do we look at spatial data -- which is, after all, just a bunch of 0s and 1s in data files on our hard drive? The answer is that we use some graphical program like a GIS to portray the data. There are many ways to portray any single dataset, or multiple datasets in juxtaposition with each other. For this reason, portrayal information is not an integral part of an individual dataset, but may be considered as part of a schema. You could think of portrayals as more metadata that describes how a dataset should be rendered for a specific purpose.

Layer Portrayals

We may simply want to look at the values of one dataset as attributes in a table. We may want to look at the arrangement of cells or geometric shapes of a single dataset as a layer of pixels or shapes colored according to one of its attributes. As it happens, there will be a very large number of potential ways to portray a dataset. Therefore, portrayal information is stored separately from data. One aspect of a layer portrayal is a reference to the dataset that it is intended to portray, which will usually be a reference to the dataset as a file in the filesystem. The portrayal may also contain symbology, and other aspects.

Map Portrayals

We may want to display more than one geographic dataset in juxtaposition with each other. This requires that we have some sort of portrayal that can serve as a container for multiple layers. We would call this a Map Document. One big question of portrayal in the world of spatial data is what geometric projection we should use to display the data, since we may have different layers that use different coordinate systems (we will talk much more about this important aspect of map documents on another page). Since the map document may hold several layers, it may contain references to many different datasets.

Filesystem References

An important aspect of a schema that includes multiple datasets is that it should behave predictably. This means, for one thing, that the references to the various datasets involved should be stable. If a GIS schema is intended to be moved from one storage device to another, care should be taken that the references to dataset files do not refer to specific storage devices. Rather, all of the data files, metadata and portrayal information should be stored in a structure of folders that is easy to keep in a stable relative relationship with one another, and file system references should be made relative to this structure.
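The benefit of relative references can be shown with Python's standard `pathlib` module. The folder and file names below are hypothetical, and `PurePosixPath` is used so the example behaves the same on any operating system without touching the real filesystem.

```python
from pathlib import PurePosixPath


def resolve_layer(project_root, relative_ref):
    """Resolve a dataset reference stored relative to the project folder."""
    ref = PurePosixPath(relative_ref)
    if ref.is_absolute():
        raise ValueError("layer references should be relative to the project folder")
    return PurePosixPath(project_root) / ref


# The map document stores only this relative reference (hypothetical name).
layer_ref = "data/wetlands.shp"

# The same project folder copied to two different storage devices.
on_laptop = resolve_layer("/home/student/wetland_project", layer_ref)
on_usb = resolve_layer("/media/usb/wetland_project", layer_ref)

print(on_laptop)
print(on_usb)
```

Because only the project root differs, the reference keeps working wherever the folder structure is copied intact.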

Critical Evaluation of Geographic Information, Maps and Models

Thinking critically about geographic information begins with examining all of the elements of metadata as described above. It should also be noted that some of the most important information we should criticize is the information we generate by combining data as maps or models. Therefore, these critical considerations should be applied to all model inputs and outputs!

Errors of Omission and Commission

Once we have had a look at a dataset and its metadata (if any exists) we may be able to evaluate whether it is Fit to Use for our Purposes or not. Much of the answer to this question can be inferred from the metadata (see above) or from other contextual examinations of the data (such as whether the data appear to be logically consistent with other datasets). A very good exercise to go through (and potentially to put into the documentation of your model) is to consider whether the errors in your dataset are likely to be Errors of Omission, where a potentially important entity (or a part of an entity) in the real world may have been omitted from the dataset; or Errors of Commission, which would result in an entity being reflected in the dataset that does not actually exist. It is likely that any dataset will have both of these sorts of errors. A part of this exercise is to understand whether either of these types of error is systematic, which would indicate a Bias in the data.
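When some trusted reference is available for part of the study area (a field survey, say), the two kinds of error can be tallied with simple set logic. This is only a sketch: the cell identifiers and both sets below are invented, and the function name is hypothetical.

```python
def classify_errors(dataset_cells, reference_cells):
    """Return (omission, commission) sets relative to a trusted reference."""
    omission = reference_cells - dataset_cells    # real wetlands the dataset missed
    commission = dataset_cells - reference_cells  # dataset wetlands that do not exist
    return omission, commission


# Hypothetical grid cells (row, column) identified as wetland.
reference_cells = {(0, 0), (0, 1), (1, 1)}   # e.g. from a field survey
dataset_cells = {(0, 1), (1, 1), (2, 2)}     # from the dataset being evaluated

omission, commission = classify_errors(dataset_cells, reference_cells)
omission_rate = len(omission) / len(reference_cells)
commission_rate = len(commission) / len(dataset_cells)

print(omission, commission)
```

If one rate is consistently much larger than the other across the study area, that would suggest a bias toward under- or over-representing wetlands.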

Errors of Logical Consistency

Another important sort of error to look for includes errors of logical consistency. An example of this might be if our dataset shows wetlands existing in a place where our terrain model indicates steep slope. Errors of logical inconsistency let us know that there must be an error, even if we have no knowledge of the actual condition on the ground. Often these errors arise from a difference in geometric precision in one or more layers, or a wholesale displacement of the coordinate references. The logical inconsistency may indicate a problem with the attribute coding of entities.
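A check of this kind can be automated by comparing two co-registered grids cell by cell. In the sketch below, both grids and the 15 degree slope limit are hypothetical values chosen for illustration; an appropriate threshold would depend on the study area and the definitions used by each dataset.

```python
SLOPE_LIMIT_DEGREES = 15.0  # hypothetical threshold for "steep"


def inconsistent_cells(wetland_grid, slope_grid, limit=SLOPE_LIMIT_DEGREES):
    """List (row, col) cells coded as wetland whose slope exceeds the limit."""
    flagged = []
    for r, (wet_row, slope_row) in enumerate(zip(wetland_grid, slope_grid)):
        for c, (is_wet, slope) in enumerate(zip(wet_row, slope_row)):
            if is_wet and slope > limit:
                flagged.append((r, c))
    return flagged


# Hypothetical co-registered grids: 1 = wetland; slope in degrees.
wetland_grid = [
    [1, 0, 0],
    [1, 1, 0],
]
slope_grid = [
    [2.0, 5.0, 30.0],
    [25.0, 1.0, 3.0],
]

flagged = inconsistent_cells(wetland_grid, slope_grid)
print(flagged)
```

A flagged cell does not say which layer is wrong, only that the two cannot both be right.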

Errors Related to Categorical Precision or Granularity

In many applications of reasoning, it is necessary to classify entities having similar properties so that general rules might be applied to them. Thus we create categories and we classify things. In GIS we encounter categories of the qualitative sort, such as land use classification systems, and we also have categories in a spatial sense, such as zip code boundaries or census tracts. In either case, the coarseness or fineness of our classification system will definitely affect our results. We know, for example, that the pattern of population density is much different if we use a block-level as opposed to a tract-level aggregation. The same is true for land use classes. One classification system may distinguish 5 classes of housing according to lot size; another may lump all housing into one class, called urban land. Given the choice of two datasets that reflect observations of the same entities, the level of granularity in the referencing systems used will render one or the other less fit to use for a given purpose.
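The effect of spatial granularity on population density can be shown with a tiny invented example. The two blocks below are imagined to make up a single tract; all names, populations and areas are hypothetical.

```python
# Hypothetical blocks within one tract: (name, population, area in square km).
blocks = [
    ("block_a", 1200, 0.25),
    ("block_b", 60, 0.75),
]

# Fine-grained view: density computed for each block.
block_density = {name: pop / area for name, pop, area in blocks}

# Coarse view: one density for the whole tract.
tract_population = sum(pop for _, pop, _ in blocks)
tract_area = sum(area for _, _, area in blocks)
tract_density = tract_population / tract_area

print(block_density, tract_density)
```

The tract-level figure describes neither block well, which is the sense in which a coarser aggregation can hide the pattern that a model needs.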

Fitness for Modeling Purposes

Even if the data may be biased in one direction or another with regard to completeness, they may still be useful for exploring a model for the model's sake, particularly if a dataset is known to be the best available, or if it is an official source that ought to be evaluated. Remember that as scholars, our chief aim is to try to make a useful model and then to evaluate the result. In this sense a dataset may make for a worthy modeling experience even if we find that it is not likely to produce a precise or credible answer to our question. This is especially true if our inferences of bias can lead us to a prediction that our final answer is likely to be an Overestimate or an Underestimate. Modeling with imperfect data is useful, particularly if it leads to a more practical understanding of what sort of data are required in order to make a more useful model.

It is particularly problematic, however, to try to make an interesting model when data are grossly logically inconsistent. For example, a database schema that reflects wetlands on steep slopes or forests in the ocean, may not really tell us anything useful about the real world or even about modeling the world.