HDF-EOS Tools and Information Center

The Role of Data Formats in Long-Term Preservation of Earth Science Information

Long-term preservation of the information contained in Earth science data requires careful assessment of the risks posed by a variety of threats: institutional instability (and funding loss), operator errors, IT security incidents, hardware and software failures, and the dangers of migrational transformation. The CASPAR project in Europe has proposed using a registry of Representation Networks (RNs) as a starting point for threat assessment. Such networks identify the digital artifacts that the Designated User Community needs in order to understand the information. These include not only the data itself but also the artifacts that make the data visible to the community, such as data formatting software and documentation. It appears likely that an RN can become the foundation for quantitative risk assessment by extending it to a stochastic reliability network.

The key question in this brief note is how to create an RN for a data format. A starting point appears in Annex E of the OAIS Reference Model (which is not part of the ISO standard, but is still quite useful). This Annex suggests a layered Information Model whose layers start at the media level and move up to the application level, where analysis and display programs make data visible to users. For our purposes, we identify the following layers:

The Media Layer that includes disks, tapes, and networks

The Bit Stream Layer that consists of an array of bits

The Data Element Layer that consists of an array of primitive data types (characters, integers, floats, etc.)

The Aggregation Layer, in which the individual data elements of the Data Element Layer are aggregated into structural groupings, such as records, lists, and arrays

The Object Layer, in which the Aggregations identified in the Aggregation Layer are classified into objects that are recognizable and meaningful in the application domain, such as images
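
The five layers above can be illustrated with a minimal Python sketch that walks one small data set upward from stored bytes to a domain object. The sample values, the big-endian 32-bit float encoding, the 2 x 3 row-major layout, and the dictionary used for the "image" are all assumptions made for illustration; none of them comes from the OAIS Annex.

```python
import struct

# Media Layer stand-in: bytes as they might be read from a disk or tape.
# The values and the big-endian float32 encoding are assumptions.
raw = struct.pack(">6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

# Bit Stream Layer: the same content viewed as an undifferentiated run of bits.
bit_stream = "".join(f"{byte:08b}" for byte in raw)

# Data Element Layer: interpret the bits as primitive types (six floats).
elements = struct.unpack(">6f", raw)

# Aggregation Layer: group the elements into a structure (a 2 x 3 array,
# row-major). The shape is representation information that must be known.
rows, cols = 2, 3
aggregate = [list(elements[r * cols:(r + 1) * cols]) for r in range(rows)]

# Object Layer: classify the aggregation as a domain object.
# This dictionary is a hypothetical stand-in for an image type.
image = {"kind": "image", "shape": (rows, cols), "pixels": aggregate}
```

Each step needs information that the layer below does not carry (the encoding, the shape, the meaning of the array), which is the kind of dependency an RN is meant to record.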

More concretely, we can formalize the transformations that occur between the five layers identified above and identify the digital artifacts that transform the representation from one layer to the next. For example, if we were dealing with an ASCII text file created by a FORTRAN program, the two digital artifacts required to read the file and convert it into Earth science information would be the FORTRAN program, which is itself a text file containing the format description, and the FORTRAN compiler, which creates an executable program for reading the ASCII data file. Representation information that determines the aggregations, such as array sizes and order, is explicitly or implicitly contained in the FORTRAN code. On the other hand, if the ASCII text file were written as an XML file, certain of the data elements would be reserved to delimit the tags in the file, and those tags might contain the dimensions and even the indices of the arrays. In this case, the structure that may be only implicit in the FORTRAN program can be made explicit, and the XML parser serves as a translator from the XML text to an alternative data structure that contains, say, an array of floats that was not present in the XML text (at least under normal XML conventions). In the case of HDF and HDF-EOS, the intent is usually to embed the structural data elements, such as array sizes and array data element ordering, explicitly within the file and to allow the HDF software to transform the original data into alternative (and transformed) elements.
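
As a rough illustration of the XML case, the sketch below uses Python's standard xml.etree.ElementTree parser as the translator from XML text to an array of floats. The element name, the attribute names (rows, cols, order), and the sample values are hypothetical; they do not come from any real Earth science schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: the tag and attribute names are assumptions.
xml_text = """
<array name="temperature" rows="2" cols="3" order="row-major">
  280.5 281.0 279.8
  282.1 283.4 281.9
</array>
"""

def parse_array(text):
    """Translate XML text into a nested list of floats (row-major)."""
    node = ET.fromstring(text.strip())
    rows, cols = int(node.get("rows")), int(node.get("cols"))
    # The floats do not exist in the XML text; the parser output is a
    # new representation built from character data.
    values = [float(token) for token in node.text.split()]
    if len(values) != rows * cols:
        raise ValueError("declared dimensions do not match the element count")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

grid = parse_array(xml_text)
```

Because the dimensions travel with the data, the translator can check them instead of relying on knowledge held only in a separate program.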

In dealing with long-term preservation of information, where software may translate from one representation of data to another, it appears important to recognize that possible transformations fall into three categories:

Archivally Safe, where such transformations include bytes -> integers -> long integers, floats -> double precision floats, ASCII -> Unicode, and permutations of data element order

Archivally Risky, where a transformation might separate components of one file into several files or replace a self-documenting file with one that relies on tacit knowledge or encoding to preserve information

Archivally Damaging, where a transformation is certain to lose information, as in long integers -> bytes or doubles -> floats
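
One way to make the difference between safe and damaging transformations concrete is a round-trip test: a transformation loses nothing for a given value if some inverse recovers that value exactly. The Python sketch below applies this test to a few of the conversions named above. The helper names and the sample values are illustrative assumptions, and a real assessment would have to cover the full range of values in a collection rather than single samples.

```python
import struct

def narrow_to_float32(x):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def widen_to_double(x):
    """Store a value as a 64-bit double and read it back."""
    return struct.unpack("<d", struct.pack("<d", x))[0]

def narrow_to_byte(n):
    """Keep only the low 8 bits, as a truncating cast would."""
    return n & 0xFF

def round_trips(value, forward, backward):
    """True if applying forward and then backward recovers value exactly."""
    return backward(forward(value)) == value

# Archivally Safe examples: widening conversions are reversible.
single = narrow_to_float32(0.1)  # a value that fits in single precision
safe_float = round_trips(single, widen_to_double, narrow_to_float32)
safe_int = round_trips(200, int, narrow_to_byte)
safe_text = round_trips(b"HDF-EOS",
                        lambda b: b.decode("ascii"),
                        lambda s: s.encode("utf-8"))

# Archivally Damaging examples: narrowing conversions discard information.
damaging_float = round_trips(0.1, narrow_to_float32, widen_to_double)
damaging_int = round_trips(300, narrow_to_byte, int)
```

In this sketch the three safe cases round-trip exactly, while the double 0.1 and the integer 300 cannot be recovered once narrowed.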

To deal meaningfully with the RN for a particular collection of Earth science data, it will probably be necessary to register the Aggregations in the Object Layer, including such objects as array sizes and element orderings, and then to quantify the risk associated with migrational transformation.
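
Before any risk can be quantified, the artifacts that a data file depends on have to be enumerated. The sketch below models a registry as a plain dependency graph and lists everything needed to interpret one entry, using the FORTRAN example from earlier in this note. The Artifact class, the registry layout, and the entry names are hypothetical; this is not the CASPAR registry design.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """One registered digital artifact (hypothetical record layout)."""
    name: str
    kind: str
    depends_on: list = field(default_factory=list)

# Hypothetical registry for the ASCII-plus-FORTRAN example.
registry = {
    "ascii_data_file": Artifact("ascii_data_file", "data", ["fortran_source"]),
    "fortran_source": Artifact("fortran_source", "format description",
                               ["fortran_compiler"]),
    "fortran_compiler": Artifact("fortran_compiler", "software"),
}

def required_artifacts(registry, start):
    """Return the sorted names of every artifact that start depends on."""
    seen = set()
    stack = list(registry[start].depends_on)
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # guards against cycles and repeated dependencies
        seen.add(name)
        stack.extend(registry[name].depends_on)
    return sorted(seen)

needed = required_artifacts(registry, "ascii_data_file")
```

A list like this is the input that a reliability estimate would need: loss of any required artifact threatens the information, so each entry is a candidate for a risk figure.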

In this brief note, we will attempt to identify the key steps required to identify the digital artifacts involved in preserving Earth science data, as well as the translator objects that convert one representation to another. These identifications are the key to providing quantifiable estimates of the risk of loss that the translators introduce, as well as to developing strategies for mitigating that risk.


Last modified: 06/02/2017