Authenticity and provenance: a view from producers and users

Article dated: 20-Mar-12

The recent involvement of the UK Data Archive with the Alliance for Permanent Access to the Records of Science Network (APARSEN) project raised issues surrounding the concepts of authenticity and provenance as used in the digital preservation community and how social scientists understand them.

The Archive believes that aligning the way in which these terms are understood helps our users and allows us to contribute to best practice across the digital curation landscape.

The Open Archival Information System (OAIS) Reference Model for archives defines authenticity as the degree to which a person (or system) regards a digital object as what it is purported to be. It is judged on the basis of the evidence provided by provenance information.

In the APARSEN project we assessed the level to which data users and data producers thought it necessary to capture and retain provenance information by interviewing a small group from the social science domain. Our aim was to gauge their understanding of certain concepts, their assumptions about what the Archive does with data regarding gathering and documenting evidence, and how useful they thought this evidence was.

We discovered a variety of interpretations of authenticity and provenance, but some consistent themes did emerge.

While most interviewees initially felt that there was little or no need to capture formal authenticity and provenance information, further questioning made it clear that some of this information was necessary to use the data properly, and would be critical for long-term use. Users understood that anonymisation techniques might form part of provenance information, which brought the concepts closer to their own sphere of comprehension. Such information is also critical to the curation of these data.

However, interviewees felt that providing access to any provenance information which did not have immediate relevance to the use, interpretation and analysis of the data would be a low priority. Including additional authenticity and provenance data in the standard data download package would therefore be unhelpful, but the knowledge that it existed provided further confidence in the Archive's activities and in its data collections.

Some provenance information, especially detailed custody and transformation history, was considered to be very low priority, for data capture if not access.

Standard resource discovery metadata, including funding body and data creator, provided users with sufficient detail to confirm the 'history' of the data collection. Users also wanted to know, briefly at least, the relationship with earlier versions of the data, and assumed that we would hold more granular information for internal use, detailing the change history of the data collection.

Data producers are aware of (but unhappy with) scenarios where datasets are circulated among researchers without formal control before being used for secondary research. Producers consider increased direct use of an 'authentic version' for research as the immediate priority, promoted strongly through data citation.

Users had the general expectation that the focus on authenticity and provenance will increase when administrative data are supplied for curation. Interviewees prioritised sufficient context about the business processes used in generating the administrative data, as opposed to more formal authenticity and provenance evidence in OAIS terms.

We discovered that while the language of the two communities differs, the outcomes desired are generally similar. Authenticity and provenance do matter, and will matter more as we move towards 'big data', with more administrative data in use. The provenance of each component of complex hybrid surveys will need to be available to support the data.

This evidence strengthens our resolve to ensure that granular authenticity and provenance information is collected throughout the data lifecycle to benefit the end user. Once collected, this information should be made truly usable by users, so we will be demand-led in considering what information to add to an already complex download package.