A New Discovery Platform for Geospatial Data

Darren Hardy and Jack Reed
Stanford University

The task of rummaging through stacks of CDs or scouring webpages for vital geospatial datasets might soon be over. Too often Geographic Information Systems (GIS) librarians are limited to such search techniques when providing patron consultations, creating metadata, and curating geospatial datasets. Many of these librarians fill these roles while also maintaining software and data services. The issue of data discovery is not new, particularly within the sciences, but GIS datasets still lack the kind of catalog that exists for books and journals.

GeoBlacklight recognizes this gap and is designed to make discovery of geospatial data easier for both patrons and library staff. It is a discovery platform that lets users search and browse GIS data collections through text search, faceted browsing, and spatial search. This article discusses four software projects: the GeoBlacklight discovery platform; Stanford’s EarthWorks service, which is built on the GeoBlacklight platform; OpenGeoMetadata, for metadata sharing; and GeoMonitor, for GIS web service monitoring.

1. GeoBlacklight

The GeoBlacklight platform was initially developed by Princeton University Library, MIT Libraries, and Stanford University Libraries, and is fully open source (http://github.com/geoblacklight). It is a Ruby on Rails engine (Figure 1), based on the popular open source project Blacklight (http://projectblacklight.org), a discovery interface to Apache Solr indexes (http://lucene.apache.org/solr). Blacklight itself is highly customizable by design through overrideable behaviors, themes, and APIs. By default, it supports text and structured searches, faceted searching, stable URLs for each item in the catalog (i.e., items are bookmarkable and sharable), and user authentication.

Figure 1: GeoBlacklight architecture is modular and compatible with production-quality data stores and indexes.

GeoBlacklight is flexible enough to act as a discovery service for a variety of sources, but is designed for GIS data specifically. Geospatial data can take the form of vector and raster data, georeferenced scanned maps, or the geographic location of features on the earth. Traditionally GIS applications have not done a great job at discovery, thus leaving an opportunity for GeoBlacklight to advance the state of GIS data discovery. Building on the Blacklight platform, GeoBlacklight extends its functionality by providing the following:

spatial search with a spatial relevancy algorithm
download functionality using geospatial web services
map view of search results with dynamic updating
extendable to new types of data and functionality

GIS data adds a layer of complexity to discovery because traditional text-based search and faceted refinement are not always sufficient. Because geospatial data can be highly detailed over a large area, place name metadata coupled with text and faceted searching often proves troublesome for a patron during discovery. GeoBlacklight addresses these shortcomings by letting users perform a spatial search using a map view. Importantly, this spatial search may be combined with full text search and faceted refinement to augment discovery. Another enhancement, called “dynamic spatial search” (Figure 2), continuously updates results whenever the user moves the map position. This provides immediate feedback to the user as they zoom in to a specific location, like a city.

Figure 2: Dynamic spatial search in GeoBlacklight algorithmically ranking results
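The idea behind map-driven search can be illustrated with a minimal Python sketch. It is not GeoBlacklight's actual relevancy algorithm, which runs inside the Solr index. It assumes each layer is described by a simple bounding box and uses intersection-over-union as a stand-in ranking measure. The names BBox, overlap_score, and spatial_search, and the sample layers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in decimal degrees.

    Assumption: boxes do not cross the antimeridian.
    """
    west: float
    south: float
    east: float
    north: float

    def area(self) -> float:
        return max(0.0, self.east - self.west) * max(0.0, self.north - self.south)

    def intersection(self, other: "BBox") -> Optional["BBox"]:
        west, south = max(self.west, other.west), max(self.south, other.south)
        east, north = min(self.east, other.east), min(self.north, other.north)
        if west >= east or south >= north:
            return None  # the boxes are disjoint or only touch at an edge
        return BBox(west, south, east, north)


def overlap_score(view: BBox, layer: BBox) -> float:
    """Intersection area divided by union area: 0.0 if disjoint, 1.0 if identical."""
    shared = view.intersection(layer)
    if shared is None:
        return 0.0
    union = view.area() + layer.area() - shared.area()
    return shared.area() / union


def spatial_search(view: BBox, layers: Dict[str, BBox]) -> List[str]:
    """Return names of layers that overlap the map view, best match first."""
    scored = [(overlap_score(view, bbox), name) for name, bbox in layers.items()]
    hits = [(score, name) for score, name in scored if score > 0.0]
    return [name for score, name in sorted(hits, key=lambda hit: hit[0], reverse=True)]


# Hypothetical layers and a city-sized map view over San Francisco.
layers = {
    "california_counties": BBox(-124.5, 32.5, -114.1, 42.0),
    "san_francisco_parcels": BBox(-122.52, 37.70, -122.35, 37.83),
    "new_york_subway": BBox(-74.3, 40.5, -73.7, 40.9),
}
print(spatial_search(BBox(-122.6, 37.6, -122.3, 37.9), layers))
```

With this view, the city-scale layer ranks ahead of the statewide layer because it fills more of the visible map, and the New York layer is excluded. Re-running the search each time the view changes is the essence of the dynamic behavior described above.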

Libraries have a practical limit to the amount of geospatial data, or “layers,” they can hold and maintain, but through partnerships patrons can access much more. GeoBlacklight was designed with the idea that institutions could share metadata and provide a federated catalog to their patrons, so users can facet on a specific institution, or search within the entire catalog.

One of the core GeoBlacklight design decisions is that the application becomes progressively more useful based on the web services that back each layer (Table 1). The minimum requirement for a layer is a metadata record with just a bounding box and a description. The metadata need not have external references to services that will actually provide layer data. However, the software has native support for a variety of data access protocols, including Web Map Service (WMS), Web Feature Service (WFS), International Image Interoperability Framework (IIIF), and Hypertext Transfer Protocol (HTTP). With WMS and WFS, GeoBlacklight enables users to preview, inspect, and download geospatial data in several formats (e.g., shapefile, GeoJSON, KMZ, and GeoTIFF). With HTTP, GeoBlacklight offers direct downloads of files available through URLs. With IIIF, GeoBlacklight provides previews of non-georeferenced scanned maps.

TABLE 1: GeoBlacklight enables features based on the types of external services provided in the metadata records.

Type	Reference URI	Enables in GeoBlacklight
Web Map Service (WMS)	http://www.opengis.net/def/serviceType/ogc/wms	Layer preview, feature inspection in the layer preview, downloads (vector: KMZ, raster: GeoTIFF)
Web Feature Service (WFS)	http://www.opengis.net/def/serviceType/ogc/wfs	Vector downloads in GeoJSON and Shapefile
International Image Interoperability Framework (IIIF) Image API	http://iiif.io/api/image	Image viewer using Leaflet-IIIF
Direct download file	http://schema.org/downloadUrl	Direct file download feature
Full layer description	http://schema.org/url	Further descriptive information about layer
Metadata in ISO 19139	http://www.isotc211.org/schemas/2005/gmd/	Structured metadata in ISO format
Metadata in MODS	http://www.loc.gov/mods/v3	Structured metadata in MODS format
Metadata in HTML	http://www.w3.org/1999/xhtml	Structured metadata in HTML format
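
The progressive behavior in Table 1 can be sketched in a few lines of Python. This is an illustration rather than GeoBlacklight's own code: it assumes a record's external references are available as a dictionary from reference URI to service URL, and the function name enabled_features and the example URLs are hypothetical. The reference URIs are those listed in Table 1.

```python
from typing import Dict, List

# Reference URIs from Table 1, mapped to the user-facing features they enable.
FEATURES_BY_REFERENCE: Dict[str, List[str]] = {
    "http://www.opengis.net/def/serviceType/ogc/wms": [
        "layer preview",
        "feature inspection",
        "downloads (vector: KMZ, raster: GeoTIFF)",
    ],
    "http://www.opengis.net/def/serviceType/ogc/wfs": [
        "vector downloads (GeoJSON, Shapefile)",
    ],
    "http://iiif.io/api/image": ["image viewer"],
    "http://schema.org/downloadUrl": ["direct file download"],
}


def enabled_features(references: Dict[str, str]) -> List[str]:
    """List the features a record unlocks, in the order used by Table 1.

    `references` maps a reference URI to the URL of the backing service.
    URIs that are not in the table above are ignored.
    """
    features: List[str] = []
    for uri, names in FEATURES_BY_REFERENCE.items():
        if uri in references:
            features.extend(names)
    return features


# A record with only a bounding box and description has no references:
# it is still discoverable, but nothing can be previewed or downloaded.
print(enabled_features({}))

# Hypothetical record backed by a WFS endpoint and a zipped file download.
print(enabled_features({
    "http://www.opengis.net/def/serviceType/ogc/wfs": "https://example.edu/geoserver/wfs",
    "http://schema.org/downloadUrl": "https://example.edu/data/layer.zip",
}))
```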

Since sustainability is often a priority for librarians, GeoBlacklight was designed with this in mind. Bridging the gap between the digital library and geospatial communities, GeoBlacklight aims to bring expertise from both fields to provide a better experience for finding geospatial data. GeoBlacklight also tries to leverage the successes of both the Blacklight and OpenGeoPortal communities to build a global community of institutional collaboration. In June 2014 GeoBlacklight development began on GitHub, which provides a rich interface for contributions and collaborators. GeoBlacklight builds on Blacklight by integrating with an ecosystem of production-quality plugins and an already active developer community (http://github.com/projectblacklight). The assumption is that if an institution has a Blacklight installation, then it would already have much of the expertise in house to deploy GeoBlacklight. Likewise, GeoBlacklight builds on Ruby on Rails, another popular open source framework for developing web applications. Because Blacklight and Ruby on Rails are popular, active communities write deployment instructions, helpful blog posts, code examples, and so on, and GeoBlacklight adopters immediately have a large pool of documentation and resources on which to draw.

2. EarthWorks

In April 2015 Stanford launched EarthWorks (https://earthworks.stanford.edu), which is built on the GeoBlacklight platform. It indexes a federated catalog that provides both discovery and access services to over 22,500 geospatial datasets or “layers” (Figure 3). In addition to the geospatial data assets held by Stanford Libraries and deposited by Stanford faculty and students, EarthWorks also indexes datasets located at an ever-growing number of partner institutions, including Harvard, Tufts, Columbia, University of Minnesota, MIT, and UC Berkeley, which participate in the OpenGeoPortal metadata and OpenGeoMetadata communities. Moreover, the catalog spans a vast array of subjects and base layer information. Thousands of datasets in vector and raster format have been described at the individual layer level, allowing for detailed searching and retrieval.

Figure 3: At launch, EarthWorks indexes over 22,500 geospatial datasets across eight institutions.

A common question from patrons is why a particular data layer isn’t available for download. Most data in EarthWorks are public, but restricted data are also discoverable, and download access is tied to the institution that holds the content (Figure 4). Stanford does not actually hold the data from all of the partners; rather, EarthWorks points to their data using links in their catalog records. Discovering the existence of a dataset, however, is the paramount concern rather than providing download access. The GIS librarians may facilitate download access in these cases.

Figure 4: EarthWorks supports restricted data by suppressing preview and download features.

Developing EarthWorks as a tool for discovering and accessing data was a primary goal, but it wasn’t the only challenge that needed to be addressed. “When we spoke with faculty and students, we realized there were deeper, more complex issues than access,” said Bess Sadler, Manager of Application Development for Stanford Libraries. Data was at risk of being lost and, according to Sadler, some data have likely disappeared already. For example, in 2014, the National Atlas, which provided a wealth of data, including information on water, roads, boundaries and environmental factors, stopped production. Years earlier, the atlas had replaced physical printing with publishing the data online. Despite high traffic to the online datasets, the website was also taken offline in 2014. Although that data has since resurfaced on new web pages, Sadler suggests digitally born data are increasingly at risk because online data links are often the only copy of the datasets.

“When a web page is turned off, left unmanaged or a change in the page’s code occurs, the sole copy is in jeopardy if not assigned a persistent URL,” said Michael Keller, University Librarian at Stanford. “In addition to improving access (with EarthWorks), we are protecting the data of our scholarly community so others can build upon those works to further inquiry and scholarship,” said Keller, who has led Stanford Libraries since 1993. “To date, no one yet offers the capability to preserve geospatial data like Stanford’s EarthWorks,” said Keller. When research data is deposited into a preservation repository, as it is with EarthWorks, Julie Sweetkind-Singer, Assistant Director of Geospatial, Cartographic and Scientific Data & Services at Stanford Libraries, adds, “it also enables data sharing, increases reproducibility, and satisfies the requirements being attached to federal funding for long-term access.”

EarthWorks’ preservation strategy requires that all of the Stanford GIS holdings indexed in EarthWorks come from the Stanford Digital Repository (SDR), an institutional long-term repository (Figure 5). As of May 2015, SDR preserves 200 million files across a thousand collections containing dozens of content types from articles to video, and now GIS layers. GeoHydra is the application that deposits GIS layers into SDR. It is an emerging suite of tools for managing geospatial data assets in a Hydra data repository. For all Stanford GIS content, GeoHydra currently manages and ingests vector and raster data and ISO 19139 metadata into SDR. Functionally, it is a set of workflow automation steps (i.e., a pipeline) that performs various functions, such as projecting GIS data into a standardized projection or transforming metadata from ISO 19139 into MODS for descriptive purposes (see https://github.com/sul-dlss/gis-robot-suite).

Figure 5: The Stanford Digital Repository (SDR) provides the data and metadata for Stanford’s GIS holdings indexed in EarthWorks.

3. Metadata sharing and service monitoring

OpenGeoMetadata

OpenGeoMetadata is a new effort to share geospatial metadata in an open and collaborative way, building on the groundwork laid by the OpenGeoPortal Metadata Working Group. Rather than building an application that must be deployed at multiple institutions, OpenGeoMetadata uses GitHub as a common, highly available repository. Using GitHub as a platform also allows software development to focus on conversion and harvesting tools.

A key design decision in OpenGeoMetadata is that the repository is format-agnostic and scalable. Any institution can deposit GIS metadata in whichever format it has operational, such as ISO 19139 or FGDC. Institutions can then use GeoCombine (https://github.com/opengeometadata/geocombine) to harvest metadata and transform it between formats as needed. Much of this work has been community sourced and is documented in the issues of the metadata repository (https://github.com/OpenGeoMetadata/metadatarepository).

GeoMonitor

Not all layers indexed in EarthWorks are always available for download. For example, the partner institutions may be in the process of loading the data on their servers, may have a server down for maintenance, or may be working on preparing the content for delivery. So, EarthWorks also uses GeoMonitor (http://github.com/geoblacklight/geomonitor) to monitor the availability of data layers being provided by the network of collaborating institutions.

GeoMonitor regularly checks WMS endpoints for all layers indexed in EarthWorks. It keeps the status checks in a PostgreSQL database, and generates availability statistics, or “scores,” which are fed into the EarthWorks index. EarthWorks then has knowledge of a given layer’s availability score, and tailors the user experience based on an acceptability threshold. When a layer is deemed unavailable, the user sees a message and the download and preview functions are suppressed. This reduces user frustration from trying to download or preview data layers that are not available. Furthermore, GeoMonitor provides a JSON API to query availability statistics, such as which layers are down for a given institution.
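
The scoring and thresholding described above might look roughly like the following Python sketch. It is not GeoMonitor's implementation: the score is assumed to be the simple fraction of successful checks, and the threshold value, function names, and message text are hypothetical.

```python
from typing import Dict, Optional, Sequence

# Hypothetical acceptability threshold; the real value is a deployment choice.
ACCEPTABILITY_THRESHOLD = 0.8


def availability_score(checks: Sequence[bool]) -> float:
    """Fraction of status checks in which the layer's WMS endpoint responded."""
    if not checks:
        raise ValueError("at least one status check is required")
    return sum(checks) / len(checks)


def layer_features(score: float,
                   threshold: float = ACCEPTABILITY_THRESHOLD) -> Dict[str, Optional[object]]:
    """Decide what the interface offers for a layer with the given score."""
    available = score >= threshold
    return {
        "preview": available,
        "download": available,
        # Shown only when preview and download are suppressed.
        "message": None if available else "This layer is currently unavailable.",
    }


# Three of the last four checks succeeded, so the score is 0.75.
score = availability_score([True, True, False, True])
print(score, layer_features(score))
```

Because the record itself stays in the index, a layer that falls below the threshold remains discoverable. Only its preview and download controls are hidden.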

4. Resources

For further information, please consult these resources:

GeoBlacklight’s website (http://geoblacklight.org) hosts a blog and tutorials.
GeoBlacklight is open source and hosted on GitHub (https://github.com/geoblacklight).
EarthWorks is also open source and hosted on GitHub (https://github.com/sul-dlss/earthworks).
Stay connected to the GeoBlacklight community through the GeoBlacklight Google Group (https://groups.google.com/forum/#!forum/geoblacklight-working-group) or Twitter (https://twitter.com/geoblacklight).
OpenGeoMetadata is available on GitHub (http://github.com/opengeometadata).
GeoMonitor is available on GitHub (http://github.com/geoblacklight/geomonitor).
For more information about GeoHydra, please refer to the Hydra Geospatial Interest Group (https://wiki.duraspace.org/display/hydra/Hydra+Geospatial+Interest+Group).

Finally, GeoBlacklight is part of a larger effort to provide library services to geospatial data users. Several additional software projects augment GeoBlacklight and its discovery capabilities including:

GeoBlacklight-Schema – the metadata schema used in GeoBlacklight
GeoBlacklight-Icons – the custom icon font used in GeoBlacklight
GeoMonitor – a WMS service monitor that provides atomic updates to Solr
GeoBlacklight-Docker – Docker images for GeoBlacklight
