A Critical Framework for Planning and Evaluating Data Models
Much of the work we do with geographic information systems and three-dimensional modeling software is creating arrangements of electronic data to represent a place so that we may better understand important aspects of that place. We call such a representation a Data Model. This sounds simple enough, but practice shows us that the process of creating data models can be very costly and the usefulness of the resulting models varies widely.
How can we use spatial data to learn more about these special places, and about how new projects might affect these special views?
Planning and Building a Data Model: The cost of building a data model (in terms of time or money) and the utility of the result depend largely on the modeler's understanding of the modeling process. For most beginners, this understanding is arrived at through a difficult process of trial and error, while experienced or educated modelers work efficiently by planning the critical steps of their modeling project in advance, thus avoiding details or avenues that are unnecessary or impractical.
Documenting a Model: The same critical framework that is useful for planning and building electronic data models is also essential for anyone who wishes to understand the world through these representations. Since models are appropriate for specific purposes and subject to the limitations of individual components, the user should be informed of these things so that he or she may use or misuse the model responsibly. Therefore, the responsible modeler will provide documentation of the critical aspects of the model.
This page documents some critical aspects of a modeling process. We discuss the context of these modeling aspects and their implications for planning, implementation and evaluation.
An Overview of the Modeling Process
Understanding Models
Models are by definition simplified representations of a situation (real, historic, proposed or imaginary). According to this definition, we can assume that all models are imperfect and inaccurate. Since it is pointless to try to make a perfect model, an experienced modeler begins by examining the analytical tasks that the model must perform. From this we can identify a limited set of entities and relationships that ideally need to be represented. With these ideals in mind we must look to the data sources and the data structures that are available to us, and the software procedures that we may apply to these. These procedures will yield a model that may produce a single output -- a map or rendering -- or may potentially be used in many ways, interactively, or integrated into other, higher-order models. Ultimately, with knowledge of the original modeling goal and the critical intermediate pieces, the model may be evaluated according to its performance and the strengths and weaknesses of its critical intermediate components.
Understanding the Problem
A model is made with a purpose in mind. It is nice when a model can serve multiple purposes, but it is most important that the model serve to illuminate the specific problem at hand and that it be practical to construct. Therefore, it is useful to start by defining the modeling goal. This will help by defining such matters as what aspects of the world we need to represent, at what level of precision.
Example: We may want to build a model that will allow us to evaluate view corridors for the new Zakim Bridge connecting Charlestown with Boston. A view of the Zakim Bridge could be defined as a view that includes at least half of one of the upright pylons and some of the diagonal guy-wires -- a place would not be considered as having a view of the bridge if it could only see the tops of the pylons. A view corridor would be considered as a gap between buildings that permits a view of the bridge for a person on the ground or from an existing or potential building.
Implication: Understanding the problem at this level informs the rest of the modeling process in a couple of different ways. In particular, this problem statement defines exactly what needs to be represented, with some idea of the precision, and more importantly it helps us to disregard the multitude of things that don't need to be represented. Further, this definition of the problem will provide us (and our client, or our critic) a means of understanding and evaluating the final model result.
Identify Specific Entities and Relationships that need to be Represented
Our description of the problem can be further broken down into specific entities or classes of entities and relationships that must be represented. Entities in a model usually correspond with real world things or phenomena. Usually a representation of an entity records some but not all of the attributes of that entity, so at this point we will try to anticipate those attributes of the entities that we need to represent. These entities and attributes will be represented by data and data structures that we will try to fit to the ideals that we describe at this stage.
Relationships between entities or phenomena are often the most critical parts of models that we want to understand. We consider entities and relationships at the same stage because they are often intertwined: an association between entities may be derived by means of their attributes.
Example: To understand view corridors for the Zakim Bridge, we need to represent the bridge, the places from which the bridge can potentially be seen, and the locations of things that would obstruct our view of the bridge. Note that we need not represent the bridge in all of its detail: we can assume that if we can see a point half-way up the upright pylon, we can see the top half of the bridge. Therefore, we have only to represent some points at strategic places on the bridge -- we don't have to model the bridge in detail.
Representing view corridors requires that we model the locations and elevations of things that would obstruct our view of the bridge. To be even more precise, what we are particularly interested in modeling is the spaces between, over or through obstructions -- these would be potential view corridors -- and these form the association, or lack thereof, between our bridge entities and the potential viewpoints. Considering representations in this way permits us to have a better understanding of the precision required. To be considered as a view corridor, a gap between buildings may be as narrow as 5 meters. This means that we should require a precision for our obstructions layer of something like 2.5 meters or so in the xy plane. If we are able to find data of this resolution or better we will probably want to do some experiments with various levels of resolution to check the sensitivity of our model to variations in the precision of the data, as this is likely to be a critical factor in the utility of our analysis.
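The resolution experiment described above can be tried on a toy case before any real data is collected. What follows is only an illustrative Python sketch, not part of any GIS package: it assumes a one-dimensional row of obstruction heights sampled in 0.5 meter cells, a hypothetical 5 meter gap between 20 meter buildings, and a resampling rule that keeps the tallest value in each coarse cell. All function names and numbers are invented for illustration.

```python
# Toy resolution experiment: does a 5 m gap survive coarser cell sizes?
# Assumptions (not from any real dataset): 0.5 m source cells, 20 m buildings,
# and coarse cells that take the tallest value of the fine cells they cover.

def make_profile(n_cells, gap_start, gap_cells, building_height=20.0):
    """Row of obstruction heights with one ground-level gap."""
    return [
        0.0 if gap_start <= i < gap_start + gap_cells else building_height
        for i in range(n_cells)
    ]

def resample_max(profile, factor):
    """Merge each run of `factor` fine cells into one coarse cell (tallest wins)."""
    return [max(profile[i:i + factor]) for i in range(0, len(profile), factor)]

def has_gap(profile, max_height=0.0):
    """True if at least one cell is at or below `max_height`."""
    return any(h <= max_height for h in profile)

def gap_always_survives(factor, n_cells=60, gap_cells=10):
    """Check every possible position of the gap relative to the coarse grid."""
    return all(
        has_gap(resample_max(make_profile(n_cells, start, gap_cells), factor))
        for start in range(n_cells - gap_cells + 1)
    )

# 10 fine cells of 0.5 m make the 5 m gap; factor 5 gives 2.5 m cells,
# factor 10 gives 5 m cells.
for factor in (1, 5, 10):
    print(factor * 0.5, "m cells: gap always detected =", gap_always_survives(factor))
```

Under these assumptions the 5 meter gap is detected at 2.5 meter cells wherever it falls, but at 5 meter cells it is detected only when it happens to line up with the grid. A real raster is two-dimensional and may be resampled by a different rule, so the experiment would need to be repeated with the actual data and tools.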
Implication: Notice that at this stage we have created a very narrowly defined, simplistic universe that has all unnecessary detail eliminated. This will save us a tremendous amount of time that might otherwise be spent collecting information about entities and attributes that are of no consequence. Looking ahead to the selection of data, this specification of entities will have large implications for the practicality of our model implementation.
Specification of Data Sources, Data Structures and Software Procedures
At this stage we move from thinking of formal ideals to actually finding real data to represent our entities and attributes, and real software procedures to model the relationships that we have decided are required for our model. The software procedures that we have available will determine, to a large degree, the data structures that we must use, and this in turn will determine many aspects of the implementation steps that we must follow to transform the data that we have into the structures that we need to represent the entities and relationships that we have identified.
Example: In order to model intervisibility in our GIS software (ArcGIS, in this case) we can represent our objective points (on the bridge) with points having attributes to represent their height offset from the surface. The ArcGIS visibility function requires that we represent obstructions and their heights in a raster. We are lucky in this case that we have a very good raster surface of the Boston area provided by the MassGIS LIDAR survey. This surface has a pixel resolution of 0.5 meters, which, according to our formal description of our entity representation, should be sufficient. The LIDAR survey provides us with two rasters that will prove useful. The first_return layer provides the heights of everything that the scanner saw, including buildings and bridges. The bare_earth layer represents the elevations of the ground with the trees, buildings and other structures removed. The latter layer will come in handy if we need or wish to effectively remove some buildings, or the Zakim Bridge itself, from our model of obstructions.
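To make the roles of the obstruction surface and the height offsets concrete, here is a minimal line-of-sight sketch in Python. It is an assumption-laden illustration and not a description of how the ArcGIS visibility function is implemented: it works along a single one-dimensional transect of surface heights, ignores earth curvature, and uses hypothetical names (`is_visible`, `observer_offset`, `target_offset`) and made-up elevations.

```python
# Minimal line-of-sight test along a 1-D transect (illustrative only).
# surface: heights of the obstruction surface, one value per cell.
# Offsets are heights above that surface, like the offset attributes on
# the bridge points described in the text.

def is_visible(surface, observer_idx, target_idx,
               observer_offset=1.7, target_offset=0.0):
    """Return True if no cell between observer and target rises above the sight line."""
    if observer_idx == target_idx:
        return True
    z_observer = surface[observer_idx] + observer_offset
    z_target = surface[target_idx] + target_offset
    span = target_idx - observer_idx
    step = 1 if span > 0 else -1
    for i in range(observer_idx + step, target_idx, step):
        fraction = (i - observer_idx) / span          # 0 at observer, 1 at target
        sight_z = z_observer + fraction * (z_target - z_observer)
        if surface[i] > sight_z:
            return False                              # this cell blocks the view
    return True

# Made-up transect: flat ground at 5 m with one 30 m obstruction in the middle.
transect = [5.0, 5.0, 30.0, 5.0, 5.0]
print(is_visible(transect, 0, 4, target_offset=40.0))
print(is_visible(transect, 0, 4, target_offset=60.0))
```

In this toy transect a target point 40 meters above the surface is hidden behind the obstruction, while one 60 meters up is seen over it, which is why the choice of offset for the bridge points matters so much.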
Implications: The availability of data is often the biggest factor in establishing the practicality of building a data model. We also commonly see practical problems with data structures and software procedures. For example, if we are using AutoCAD to model three-dimensional objects, we will run up against limitations in our ability to assign attributes to these entities -- AutoCAD's data structures are simply limited this way.
Implementation and User Interaction
Now that we have a notion that our model-making plan is practical, we can begin to plan out the steps for implementing it. At this phase we often have to think about problems of obtaining data and translating it from one format or structure to another. We should think about organizing the data in a sensible, easy-to-understand directory structure. If your model is going to support user interaction, you may want to set up scripts or models or other interfaces that will facilitate this interaction. All of these steps should be documented and automated where possible, in case we (or someone else) should need to recall, understand or repeat the process.
At this stage, several implementation issues may arise that weren't anticipated. These are the pieces that seem to take the most time. They are also the most important aspects in terms of building your own ability to estimate how long it will take to build a data model. Some of these issues are worth documenting in your project, as others will benefit from your experience. When we are just learning to create data models it is surprising how much of this unexpected work is just plain annoying. You need not chronicle all of this in detail, just the interesting bits.
Example: In our example there is little conversion required, but we will plan to clip out pieces of the appropriate layers so that they can be used locally (rather than on the server) and we will resample them to an appropriate cell size. We will create our bridge points layer with its offset attributes. We will also anticipate a problem: our obstructions layer has the Zakim Bridge on it, and if we locate points half-way up the bridge pylons, they will be concealed, effectively underneath the obstructions layer. So in order to use our ArcGIS visibility tools we will need to prepare a version of our LIDAR first_return layer that has the bridge removed from it. This can be accomplished by making a rectangle that covers the bridge and then replacing the pixels underneath this rectangle with values from the bare_earth layer. This same technique will also be handy for experimenting with scenarios where other obstructions (such as the Museum Towers project) are removed. We can use a similar technique to allow the user to create new buildings with heights, and add these to the obstruction layer. In order to facilitate this, we will create routines that allow these steps to be easily repeated after a user updates an add/delete obstructions layer.
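The rectangle-replacement step can be sketched without any GIS software. The Python below is a hedged illustration only: it treats each raster as a plain list of rows, uses made-up elevations, and the function names `replace_with_bare_earth` and `add_building` are hypothetical. In practice the same logic would be carried out with the raster tools of whatever GIS package is in use.

```python
# Illustrative raster patching on plain lists of rows (not real LIDAR data).
# rows and cols are (start, stop) index ranges, with stop excluded.

def replace_with_bare_earth(first_return, bare_earth, rows, cols):
    """Copy first_return, then overwrite the rectangle with bare_earth values."""
    patched = [row[:] for row in first_return]        # leave the input untouched
    for r in range(rows[0], rows[1]):
        for c in range(cols[0], cols[1]):
            patched[r][c] = bare_earth[r][c]
    return patched

def add_building(surface, bare_earth, rows, cols, height):
    """Copy surface, then raise the rectangle to ground elevation plus height."""
    patched = [row[:] for row in surface]
    for r in range(rows[0], rows[1]):
        for c in range(cols[0], cols[1]):
            patched[r][c] = bare_earth[r][c] + height
    return patched

# Made-up 3 x 4 rasters: ground at 2 m, a 60 m "bridge" occupying two cells.
bare_earth = [[2.0] * 4 for _ in range(3)]
first_return = [row[:] for row in bare_earth]
first_return[1][1] = 60.0
first_return[1][2] = 60.0

no_bridge = replace_with_bare_earth(first_return, bare_earth, (1, 2), (1, 3))
with_tower = add_building(no_bridge, bare_earth, (0, 1), (0, 1), 45.0)
print(no_bridge)
print(with_tower)
```

Both functions return a new raster rather than editing the input, so the original first_return layer is preserved and each scenario (bridge removed, building added) can be regenerated from the same sources whenever the add/delete layer changes.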
Implications: Documenting the implementation tasks and discoveries is an important act of scholarship, but in the case of a data model that is intended to be used by others and particularly if the data model is intended to be maintained, it is crucial to document how the model works so that any updates to the data are done according to non-destructive conventions.
When creating new information by mixing together existing information that was gathered for various (other) purposes it is just as easy (easier!) to create garbage as it is to generate something useful. One thing you can say for sure about the quality of the information that you have created with a data model is that it lies somewhere along the scale with Trash at one end and Truth at the other. So the most important phase of your information modeling process is the evaluation. In fact, this is a piece of the process that each user of the data model should understand -- whether they participated in the model building process or not. This evaluation may take many forms. Perhaps the best method of evaluation is verification through Ground Truthing, which entails actually visiting places represented in your model and confirming whether or not the qualities predicted by your model are in fact happening. There are many other measures that you can use to check your data, such as using photographs or other more detailed data that you may have for particular places. In cases where you don't have data -- such as models that are attempting to predict the future -- you may learn some things about the sensitivity of your model to imprecision in your data by casting errors into particular data sources deliberately and then looking at how these errors propagate into your model results.
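The idea of casting errors into a data source deliberately can be sketched as a small, generic experiment. The Python below is a minimal sketch under stated assumptions: errors are drawn from a normal distribution with a chosen standard deviation, the "model" is any function that turns a list of numbers into a result, and the toy model and elevations are invented for illustration. The function name `sensitivity` is hypothetical.

```python
# Deliberate error injection: how often does the model's answer change
# when random noise is added to one data source? (Illustrative sketch.)
import random

def sensitivity(model, data, sigma, trials=200, seed=0):
    """Fraction of noisy trials whose result differs from the unperturbed result."""
    rng = random.Random(seed)                 # fixed seed so runs are repeatable
    baseline = model(data)
    changed = 0
    for _ in range(trials):
        noisy = [value + rng.gauss(0.0, sigma) for value in data]
        if model(noisy) != baseline:
            changed += 1
    return changed / trials

# Toy model: a view is "clear" if every obstruction stays below a 25 m sight line.
def view_is_clear(heights, sight_line=25.0):
    return all(h < sight_line for h in heights)

# Made-up obstruction heights; one of them is only 0.5 m below the sight line.
heights = [10.0, 24.5, 12.0]
for sigma in (0.0, 0.5, 2.0):
    print("sigma", sigma, "-> fraction of results changed:",
          sensitivity(view_is_clear, heights, sigma))
```

A result that flips often under small, plausible errors marks a place where the model's answer should not be trusted without further checking, and it indicates which data source deserves the most attention.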
The important thing is to establish the limits to the utility of your model in terms of the intended purpose as stated in your description of the problem.
Example: In our case, we want to verify that areas defined as having views of the Zakim Bridge in our model actually have views, and areas that don't, don't. Of course, we don't expect the model to be perfect. For example, our model is more or less 2-d, but we may actually be interested in understanding the views from windows of particular buildings. Our model probably won't be this good. Part of this process may cause us to refine our statements about what we are hoping to get from the model. For example, once I have calculated my viewsheds, I will look for areas of particular interest, such as public parks and prospects that are indicated as having views that are particularly good or threatened, and then go visit these sites and take some pictures. If it seems that the model is working at this point, I can make a case that the model is useful, at least in terms of finding potential places of interest with regard to views of the Zakim Bridge.
I can go further with some of our vector-based 3d data, which has decent building massing models, and check views from particular places along the vertical faces of these buildings. This can yield photo-like verification of views without actually having to break into private offices or condos. Again, even if our model isn't perfect, this sort of verification protocol will be enough to establish whether or not the model is good enough to help us sort potentially interesting places from definitely non-interesting ones. In fact, this brings up an interesting aspect of model-building and validation. If we understand that our model is going to have errors in it, we can probably judge what sort of errors we would rather have. That is, if I want to find potentially important and threatened views of the bridge to consider for protection, I would probably rather have errors of commission than errors of omission. Since I am using this model to find Potential sites, I want my errors to identify sites that MAY have good views. To achieve this, I can adjust the sensitivity of factors such as the resolution of my elevation models and the offset heights and locations of my bridge objective points such that we can be sure that no potential site is missed, even though we may mis-identify several non-candidate sites.
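Once some sites have been visited, the two kinds of error can be tallied directly. The following Python is a small sketch with invented data: `predicted` stands for what a model says about a handful of hypothetical sites and `observed` for what was found on the ground. The function name `error_counts` is an assumption for illustration.

```python
# Tally errors of commission and omission against ground-truth visits.
# All site values below are made up for illustration.

def error_counts(predicted, observed):
    """Return (commission, omission) counts for paired True/False lists.

    commission: model predicts a view, but none was observed on site
    omission:   model predicts no view, but one was observed on site
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must be the same length")
    commission = sum(1 for p, o in zip(predicted, observed) if p and not o)
    omission = sum(1 for p, o in zip(predicted, observed) if o and not p)
    return commission, omission

observed = [True, False, False, True, True]           # what we saw on site
strict_model = [True, False, False, True, False]      # misses the last site
generous_model = [True, True, False, True, True]      # flags one extra site

print("strict:  ", error_counts(strict_model, observed))
print("generous:", error_counts(generous_model, observed))
```

For the purpose described here, finding candidate sites that may deserve protection, the generous settings are preferable: they flag a site that turns out to have no view, but they do not miss any site that does.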
Implications: Face it, once you put on the mantle of a university graduate, or begin to charge people money based on your information handling abilities, you assume responsibility not only for producing and presenting information, but for evaluating it. This is critical when you are the information producer, but equally so when you are the information consumer. As a producer, you may as well short-circuit the evaluation process by doing it preemptively. At least this way you can put your own spin on it.
There are three different sorts of users of information systems: those who don't understand error, those who consider that all inaccuracies are flaws, and those who understand that error is inevitable. The last sort knows how to choose the errors he or she prefers.
So now you have seen the framework. You may have guessed that this process of data modeling is not always a simple linear process. Sometimes the process of making a model is a learning experience. One may discover something new about the subject of the model in the process of trying to represent it. Often new things are learned about modeling -- what is practical, what is impossible. These new things we learn may cause us to go back and revise some of the expectations that we have sketched out in the planning stages.
It is almost always the case that the implementation phase takes at least twice as long as was planned, and sometimes the most important aspects of the modeling process are those parts that were discovered in implementation. The amount of unexpected stuff one encounters when modeling is inversely proportional to the amount of experience one has in planning model-building ventures. It is ironic that in the professional world, if competitive bids are taken for a project from an experienced and an inexperienced analyst, the experienced analyst will almost always cost more, yet will encounter far fewer unexpected problems along the way. The same project may appear simple to the inexperienced analyst, who will pay for the learning experience with unpaid overtime!
Another difference between the experienced modeler and the newbie is that the experienced modeler is better able to anticipate problems and opportunities in advance (through the use of a planning framework such as this one). Also, since the experienced modeler expects to have problems, he or she will actually plan a pilot implementation with a scaled-back dataset in order to flush these problems out as quickly and painlessly as possible. The first and sometimes the most important goal is to prove a concept, and in this, your aim should be to simplify the problem and the data as much as possible.