Geographic Data Resources
Formats for Geographic Data
Data are nothing but references to observations and measurements related to real-world or imaginary entities and phenomena. There are three fundamental means for organizing spatial data: Tables, Vector Feature Classes, and Raster Layers. These basic data structures provide a predictable means of organization, or schema, that permits our tools to exchange information, to engage information with operations, and to discover associations among information from different data sources.
- ArcMap 101: Introduction to GIS Data and Portrayal provides a tour through an ArcMap dataset with many different types of data and metadata.
- Sources of Geographic Data discusses sources and strategies for finding data that can be used in GIS.
- Understanding GIS Data, Referencing Systems and Metadata discusses a critical mindset for understanding the quality and fitness of data for use in particular situations.
- ArcMap 10 Help on Data Formats Supported in ArcMap
As we look more deeply into these structures, we will see that in each of the formal categories of data structure (Tabular, Vector and Raster) there are many choices in terms of the technical implementation, that is, the way the data are encoded. Each of these is manifest in different data formats. In choosing one encoding scheme over another, we often trade one virtue for another. For example, we may choose to represent our tabular information as a text file or in Excel; we may choose to represent vector data in an ESRI Shapefile, an Autodesk DWG file or a Geodatabase Feature Class. Raster data may be stored and exchanged in any number of popular formats (TIF, JPEG, SID, JP2, GIF), or we may use a GIS-specific format such as GeoTIFF, ArcInfo GRID or Imagine .IMG. The following provides a brief summary of the comparative advantages and disadvantages of one choice of data structure over another.
Why are there so many choices? Will humans ever agree on one set of data formats that we will use for everything? At the heart of the problem lies the tension between Stability (a strong standard is one that does not change) and Innovation (people keep coming up with better ways of doing things, and new things to do that challenge the capabilities of the standards). Likely as not, this is not going to change, and therefore data wranglers are going to have to keep learning new things about formats for encoding our spatially referenced observations. Old and Stable versus New and Innovative may be thought of as one dimension upon which any data format might be placed. Here are several more considerations that may come into play when evaluating a particular data format.
- Simple, file-based versus Complex database structures
- Open, community-regulated formats versus Proprietary formats
- Supporting georeferenced coordinate systems versus Not Georeferenced.
- Semantic Depth: Shallow and fixed versus Flexible and deep.
- Topology: Simple versus Sophisticated
- Desktop-oriented versus Client-Server based
We have discussed how tables serve as a container for records about entities that can be distinguished by their attributes. Later in the term we will discuss how this simple construction can lead to really interesting models. For now we will consider some of the pros and cons of different ways of authoring and exchanging tables. For more information, see Overview of Tables and Attribute Information from ESRI Online Help.
- Plain Text, Comma Delimited files. The ultimate in Simple, Open data formats. Text files are very shallow in terms of their ability to assign logical types to specific data fields, but software like ArcGIS will try to guess whether a specific column of text refers to Text, Numbers, or Dates. Sometimes the software gets it wrong, such as treating zip codes as numbers, which strips any leading zeros.
- DBase Format. DBase tables are an open format for exchanging tabular information that offers a deeper capacity to assign specific logical datatypes to fields. One drawback of DBase tables is that column names can't be more than 10 characters. DBase is a static specification, which makes it stable and widely used in many tools. Microsoft, however, seems to be abandoning it.
- Excel Worksheets. Excel is a wonderful tool that offers a very deep ability to represent tabular information. The fact that one column may be specified to be a dynamic function of other columns is an example of this depth. A drawback of Excel is that it is much more complex than text or DBase as a file format, and should not be expected to be transferable to many other applications, especially as this format may change at any time at the whim of its owner.
- Desktop-Oriented Database Formats. For example, Microsoft Access is a tool that lets you make very complex arrangements of tables. From the filesystem, the whole complex of tables is represented as a single file. These database applications offer advantages in terms of the size of tables that may be supported. To get inside one of these files, however, a software developer must pay licensing fees to Microsoft for the special programs involved.
- Enterprise Scale, Server-Oriented Databases. In an organization where many people and applications may need to access and even update the latest version of a table, we find server-based database management systems like Oracle, MySQL, PostgreSQL, Microsoft SQL Server, and others. These Relational Database Management Systems (RDBMS) can handle (theoretically) unlimited amounts of data and serve it very efficiently to applications that make requests according to standardized protocols. These tools can also manage passwords and permissions and grant privileges to particular users. It is worth noting that the International Organization for Standardization (ISO) has specific standards for the fundamental logical data types that must be supported in a relational database management system, and for the protocols that such a system must support for exchanging data with applications that request it. Even if the technical data format used to store the data on disk is completely incomprehensible, these systems are considered to be open and stable to the degree that they support the ISO specification. For archival purposes, dumps must be made in open-format exchange files.
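The type-guessing problem described for plain text files above can be illustrated with a minimal Python sketch. The sample table, its column names, and the deliberately naive `guess_type` rule are all hypothetical; real software such as ArcGIS uses its own rules, which are not reproduced here.

```python
import csv
import io

# Hypothetical sample data: column names and values are made up for illustration.
SAMPLE = "name,zipcode,population\nBoston,02108,4000\nCambridge,02138,36000\n"


def read_columns(text):
    """Read comma-delimited text into a dict of column name -> list of strings."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {name: [row[name] for row in rows] for name in rows[0]}


def guess_type(values):
    """Naive guess: a column is a number if every value parses as an integer."""
    try:
        for value in values:
            int(value)
    except ValueError:
        return "text"
    return "number"


columns = read_columns(SAMPLE)
guesses = {name: guess_type(values) for name, values in columns.items()}

# A text file carries no type information, so every value starts out as a string.
as_text = columns["zipcode"][0]

# The naive guess labels zip codes as numbers; converting drops the leading zero.
as_number = int(as_text)

print(guesses)
print(as_text, as_number)
```

Formats with explicit field types, such as DBase or a relational database, avoid this by letting the author declare the zip code column as text.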
Vector Data Formats
With the exception of the popular CAD formats, .dwg and .dxf, vector formats supported in GIS are extensions of the tabular types discussed above. In effect, Points, Lines and Polygons merely extend the range of datatypes and associated logic traditionally offered in tables. In the mid 1990s, the ISO standards for logical datatypes were extended to handle spatial data: points, lines, polygons, and surfaces. In the world of ArcMap, individual datasets representing collections of objects, each having the same type of feature, are known as Feature Classes.
- Text Files. By including numeric fields for X and Y coordinates, you can associate references to point entities in a text file. Note that relating these coordinates to places on the planet requires metadata that specifies the Spatial Referencing System that is employed.
- DWG and DXF Formats. These are proprietary formats of the Autodesk corporation. Though they are considered to be open and documented, there is nothing preventing Autodesk from changing them. The ability to attach semantic information to geometry in these formats is very limited and fixed. This leads people to develop elaborate schema that distinguish among entities according to their layer and color. These formats are also very flexible in terms of the ability to mix all sorts of geometry types in a single dataset, which can make it difficult to create models that function predictably from a GIS perspective. ArcGIS can read DWG and DXF directly, and can export to CAD format files and attach georeferencing metadata to them. See CAD Data Support in ArcGIS and related links. See also Converting GIS Data to CAD.
- Shape Files. Shape files are more or less an extension of the DBase data format to handle spatial data types. The Shapefile format is owned by ESRI, but it is openly documented and considered to be a stable exchange format, for now. Shape files can carry their spatial referencing information in an associated .prj file; however, shape files created before 2003(?) will not have a .prj file and will therefore not cooperate with ArcGIS on-the-fly projection.
- File-Based and Personal Geodatabases. For the Personal Geodatabase, ArcGIS extended the Microsoft Access data format (.mdb files) to handle spatial data types. This gives the flexibility to have long field names, large datasets, and complex relationships among tables. One really useful thing about geodatabases is that the geometric properties of features, i.e. their area and perimeter, are automatically updated when the geometry is changed. In version 9.2, ESRI introduced its own File Geodatabase format, which is not tied to the Microsoft database format.
- Enterprise-Scale Geodatabases. An Enterprise-Scale Geodatabase is a means of using server-based relational database management tools, like Oracle or Postgres, to store feature classes. The nature of relational database management systems is such that these datasets have effectively no limit in terms of size.
- Web Feature Services (WFS) are a means of sending point, line and polygon features to web clients (not supported in simple web browsers). Open protocols allow GIS operations to be performed on these features. A Transactional Web Feature Service (WFS-T) even allows clients to edit features through the web. Web Feature Services are stable community-based standards of the OGC.
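The idea that point features are just table rows with coordinate fields, plus metadata naming the spatial referencing system, can be sketched in a few lines of Python. The sample rows, the field names `x` and `y`, and the choice of EPSG:4326 (longitude and latitude on WGS 84) are assumptions made for illustration; the geometry is written as Well-Known Text, a common plain-text encoding for simple features.

```python
import csv
import io

# Hypothetical point table: the ids and coordinates are illustrative only.
POINTS_CSV = "id,x,y\n1,-71.06,42.36\n2,-71.11,42.37\n"

# Assumption: x and y are longitude and latitude in WGS 84 (EPSG:4326).
# Without this piece of metadata the numbers cannot be placed on the planet.
SPATIAL_REFERENCE = "EPSG:4326"


def rows_to_features(text, srs):
    """Turn rows with numeric x and y fields into simple point features."""
    features = []
    for row in csv.DictReader(io.StringIO(text)):
        x, y = float(row["x"]), float(row["y"])
        features.append({"id": row["id"], "srs": srs, "wkt": f"POINT ({x} {y})"})
    return features


features = rows_to_features(POINTS_CSV, SPATIAL_REFERENCE)
for feature in features:
    print(feature)
```

A shapefile or geodatabase feature class stores the same two ingredients, geometry and attributes, in a typed binary form, with the spatial reference kept alongside.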
Raster Image Formats
While Tabular data structures allow us to distinguish and form associations among different classes of discrete entities, Raster Images provide containers for representations of locations. Locations are identified by cells, or pixels, that can be associated with attributes. From the perspective of GIS, there are a few important aspects of rasters:
- Do they carry georeferencing information, linking the cell locations to specific coordinates or to places on the globe? ArcMap can georeference any sort of image using its own scheme of world files and aux.xml files that can be associated with an image. But some image formats contain their own internal georeferencing information.
- What bit depth is supported by the attribute referencing system?
- Does the raster format support compression, and how lossy is it?
- Bitmap Images. Each cell can have one of two values: 1 or 0, Black or White.
- Color Mapped Images. 8-bit GIF and PNG images typically support 256 different distinctions, which can be mapped to specific colors or transparency.
- 8-Bit Gray Scale Images. Each pixel is assigned an attribute from 0-255, representing a gradient.
- True-Color, Multi-Band Images. TIFF is a format that integrates three channels of 8-bit intensity, which is useful for representing mixtures of Red, Green and Blue. These three 8-bit channels support 16,777,216 distinct combinations of color. There is also a GeoTIFF format, which supports internal georeferencing.
- Compressed Image Formats. JPG is an example of a means of compressing image information. These images have a 24-bit depth, but depending on the compression, the edges of things and the actual assignment of data to pixels are not necessarily predictable, even if the image looks right.
- Wavelet Compression. SID files and JPEG2000 format files use a progressive compression method that responds to how closely you are zoomed in. The compression is much better than plain JPEG, and both of these formats have internal georeferencing capabilities, though they are not always used.
- Deeper Raster Formats. When you have a raster GIS dataset that requires a continuous range of attributes that extends beyond 256 values in a single channel, you need a deeper raster structure. These are provided by the proprietary ArcInfo GRID format or the simpler and more open Imagine IMG format.
- Image Map Services (WMS) are a means of sending georeferenced images to a web browser. These images may be pre-tiled or composed on the fly, effectively making a map on the server, taking an image of it and sending it to the browser. Google Maps is an example of an image map service. The Open Geospatial Consortium has a very stable and popular specification for these, known as the Web Map Service (WMS). These may be viewed in ordinary web browsers.
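Two of the raster properties discussed above lend themselves to a short worked example in Python: how bit depth determines the number of distinct values a cell can hold, and how the six numbers in a world file link a cell's column and row to map coordinates. The world file values below are hypothetical (a 10-unit cell size with no rotation); the line order A, D, B, E, C, F follows the usual world file convention, in which C and F give the center of the upper-left cell.

```python
def distinct_values(bits_per_channel, channels=1):
    """Number of distinct values a cell can hold at a given bit depth."""
    return 2 ** (bits_per_channel * channels)


def parse_world_file(text):
    """Parse the six lines of a world file, in file order: A, D, B, E, C, F."""
    a, d, b, e, c, f = (float(line) for line in text.split())
    return a, d, b, e, c, f


def pixel_to_map(world, col, row):
    """Map a cell's column and row to the coordinates of the cell center."""
    a, d, b, e, c, f = world
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y


# Hypothetical world file: 10-unit cells, no rotation, upper-left cell
# centered at (500005, 4199995). E is negative because rows count downward.
WORLD_FILE_TEXT = "10.0\n0.0\n0.0\n-10.0\n500005.0\n4199995.0\n"

world = parse_world_file(WORLD_FILE_TEXT)

print(distinct_values(1))      # bitmap
print(distinct_values(8))      # color-mapped or gray scale
print(distinct_values(8, 3))   # three 8-bit channels
print(pixel_to_map(world, 0, 0))
print(pixel_to_map(world, 3, 2))
```

Formats with internal georeferencing, such as GeoTIFF, carry the equivalent of these six numbers inside the image file itself, along with a description of the coordinate system.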
Data Models and Schema
Typical GIS databases that we collect from sources are fairly elemental, consisting of discrete collections of vector features or individual raster layers. However, it is also useful to consider how complexes of feature classes and rasters can be organized to make data models that are coherent in terms of the relationships among features, and potentially also engaged with rasters. Higher-order data collections that operate this way are thought of as Schema in the sense of their abstract organization, or as Data Models when they are implemented and used. An advantage of thinking of schema in this way is that toolkits may be developed that draw inferences and perform experiments involving the constitution of elements and the relationships among them. Data models are discussed in more detail on the page Modeling for Decision Support and Scholarship.
Most schema that we make are relatively ad hoc. However, a very important movement in the field of GIS and other information management endeavors is to develop elaborate schema within organizations of people who will then be better able to exchange very deep information and tools. A hallmark of this movement is the use of XML (Extensible Markup Language), which is a sort of meta-schema. The development of XML schema by communities of interest is driving a revolution in collaborative information models that can be systematically exchanged, known as the Semantic Web. A branch of the open source movement, these efforts usually involve cross-disciplinary collaborations in which participants understand that the development of a shared language will enhance their niche in the ever-more diverse information ecology. There are several non-profit collaborations that are very active in developing very useful schema. For example, see the General Transit Feed Specification or CityGML.