Quality control

The Archive is at the forefront of developing international standards for processing both quantitative and qualitative data.

We use different levels of quality control depending on how much ‘additional value’ is to be added to the data.

We assign one of four levels of data processing (A*, A, B or C) to each incoming study, depending on its anticipated future usage. Data review, privacy assessment, and content checks and validation are critical at every level.

This is a summary of our content and validation checks. See UK Data Archive Data Processing Standards (PDF) for full details.

Quantitative data

Dataset dimension checks

Level A*: the number of cases and variables are checked against the documentation
Level A: as for A*
Level B: as for A*
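
As an illustration only, a dimension check of this kind could be sketched as follows. The sketch assumes the data are held as tab-delimited text with a single header row and that the documented counts are supplied by hand; the function names are hypothetical and do not describe the Archive's in-house tools.

```python
import csv
import io


def dataset_dimensions(text, delimiter="\t"):
    """Return (cases, variables) for delimited text with one header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return 0, 0
    header, data = rows[0], rows[1:]
    return len(data), len(header)


def check_dimensions(text, documented_cases, documented_variables, delimiter="\t"):
    """Compare the counts found in the file with the documented counts."""
    cases, variables = dataset_dimensions(text, delimiter)
    problems = []
    if cases != documented_cases:
        problems.append(f"cases: found {cases}, documented {documented_cases}")
    if variables != documented_variables:
        problems.append(
            f"variables: found {variables}, documented {documented_variables}"
        )
    return problems


# Hypothetical three-variable, two-case dataset.
sample = "id\tage\tsex\n1\t34\t1\n2\t51\t2\n"
dimension_problems = check_dimensions(sample, documented_cases=2, documented_variables=3)
print(dimension_problems)
```

An empty list means the file matches its documentation; any entry names the count that differs.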

Metadata checks

Level A*: the dataset must be comprehensible in itself - i.e. all variables should have variable labels and all categorical variables should have value labels
Level A: the dataset must be comprehensible in association with the documentation given to users
Level B: visual checks on quality are undertaken; action is taken for systematic problems
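
The A* requirement (every variable labelled, every categorical variable carrying value labels) can be expressed as a simple rule. The sketch below is a minimal example under assumptions: the metadata layout (a dictionary per variable with "label", "categorical", "value_labels" and "codes" keys) is invented for illustration and is not a format used by the Archive.

```python
def find_missing_labels(variables):
    """List variables that lack a variable label or a value label.

    `variables` maps a variable name to a dict with the hypothetical keys
    "label", "categorical", "value_labels" and "codes".
    """
    problems = []
    for name, info in variables.items():
        if not info.get("label", "").strip():
            problems.append(f"{name}: no variable label")
        if info.get("categorical"):
            labelled = set(info.get("value_labels", {}))
            unlabelled = sorted(set(info.get("codes", ())) - labelled)
            if unlabelled:
                problems.append(f"{name}: no value label for codes {unlabelled}")
    return problems


# Hypothetical metadata for three variables.
variables = {
    "age": {"label": "Age at last birthday", "categorical": False},
    "sex": {
        "label": "",
        "categorical": True,
        "value_labels": {1: "Male", 2: "Female"},
        "codes": {1, 2},
    },
    "region": {
        "label": "Region of residence",
        "categorical": True,
        "value_labels": {1: "North"},
        "codes": {1, 2, 3},
    },
}
label_problems = find_missing_labels(variables)
print(label_problems)
```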

Data validity checks

Level A*: all categorical variables checked for out-of-range values/wild codes; where possible, interval variables checked for improbable or impossible values; variable and value labels need not be present in the data file as long as they can be found in the documentation
Level A: as for A*
Level B: visual checks on quality are undertaken; action is taken for systematic problems
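
To make the two kinds of validity check concrete, here is a minimal sketch. It assumes that the valid codes for a categorical variable are the keys of its value labels and that plausible bounds for an interval variable are chosen by the person processing the data; both function names and the example bounds are hypothetical.

```python
def find_wild_codes(values, valid_codes):
    """Return the distinct values that are not among the defined codes."""
    return sorted(set(values) - set(valid_codes))


def find_out_of_range(values, low, high):
    """Return the values that fall outside the inclusive range [low, high]."""
    return [value for value in values if not (low <= value <= high)]


# Hypothetical categorical variable: code 3 has no definition.
sex = [1, 2, 2, 9, 1, 3]
sex_labels = {1: "Male", 2: "Female", 9: "Refused"}
wild = find_wild_codes(sex, sex_labels)

# Hypothetical interval variable: ages outside 0 to 120 are treated as impossible.
ages = [34, 51, 150, -1, 18]
impossible = find_out_of_range(ages, 0, 120)

print(wild, impossible)
```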

Confidentiality checks

Level A*, A and B: always undertaken

Metadata enhancements

Level A*: for online browsing, the following may be added: literal question text, routing information and interviewers' instructions, frequencies and summary statistics, variable groups; bookmarked PDF user guides are produced; additional related resources may be provided on a dedicated web page; additional notes to users are given in the 'Read file'
Level A: bookmarked PDF user guides are produced; additional related resources may be provided on a dedicated web page; additional notes to users are given in the 'Read file'
Level B: bookmarked PDF user guides are produced; additional notes to users are given in the 'Read file'

ReShare

Level B: sample of 30 + 10 per cent of the remaining categorical variables must be checked for out-of-range values/wild codes; sample of 30 + 10 per cent of the remaining suitable interval variables must be checked for improbable or impossible values
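
One reading of the sampling rule is "check 30 variables, plus 10 per cent of those that remain". The sketch below works through that arithmetic. The reading itself is an assumption, as are rounding the 10 per cent upwards, checking every variable when there are 30 or fewer, and drawing the sample at random.

```python
import random


def sample_size(n_variables, base=30, percent=10):
    """Number of variables to check: `base` plus `percent` per cent of the rest.

    The percentage is rounded up (an assumption), using integer arithmetic
    to avoid floating-point surprises.
    """
    if n_variables <= base:
        return n_variables
    remaining = n_variables - base
    return base + (remaining * percent + 99) // 100


def pick_sample(names, seed=0):
    """Draw a reproducible random sample of variable names to check."""
    rng = random.Random(seed)
    return sorted(rng.sample(names, sample_size(len(names))))


# A dataset with 130 categorical variables: 30 + 10% of 100 = 40 to check.
print(sample_size(130))

# Hypothetical variable names v001 to v100: 30 + 10% of 70 = 37 to check.
names = [f"v{i:03d}" for i in range(1, 101)]
chosen = pick_sample(names)
print(len(chosen))
```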

Qualitative data

The same four levels apply to qualitative data. Most qualitative data collections are processed to A standard, with a select few nominated for enhancement to A*. B and C are seldom used, but apply when handling older paper-based studies.

Level A*

data are fully digitised and anonymised
metadata and documentation are fully digitised and anonymised
for online browsing, data are marked up in XML
enhanced user guide is prepared for QualiBank

Level A

the dataset must be comprehensible in association with the documentation given to users
data are fully digitised and anonymised
metadata and documentation are fully digitised and anonymised

Level B

data are digitised at least to the level of scanned images and anonymised
metadata and documentation are digitised at least to the level of scanned images and anonymised
only major problems with data are resolved

Level C

no checks are made
data remain in the format in which they were received
non-digital collections are not anonymised or digitised and are transferred to another repository
only a basic catalogue record is created

For level C studies, a minimum of dataset dimension checks and confidentiality checks is carried out, with metadata enhancements as for B studies.

Format translation checks

Checks are carried out when converting from:

the ingested format to our preservation format (tagged or delimited text of defined character set)
the preservation format to the dissemination formats (Stata and tab delimited text; or MS Excel, MS Access, SIR and SAS)

We use in-house programs to automate most data format conversions for all levels of processing. These make sure that no data or 'internal metadata' (variable and value labels, missing value definitions, variable format information, etc.) are lost, beyond any loss that is unavoidable because of the differing data handling limits of specific software formats.

The following data format checks are currently performed manually, but will be replaced by automated checking using the QAMyData tool.

Numbers of rows and cases the same

Level A* and A: Relevant checks made, problems corrected
Level B: Relevant checks made, problems corrected
Level C: Format conversion is not usually undertaken for C standard datasets. C standard is rare; one reason for assigning it is that the data file cannot be converted from its original format, so normal processing cannot be undertaken. Where conversion is undertaken, relevant checks must be made and problems corrected

Number of decimal places the same for numeric formats

Level A*, A and B: Relevant checks made, problems corrected

String variables not truncated

Level A*, A and B: Relevant checks made, problems corrected

Date/time variables correctly formatted

Level A* and A: Relevant checks made, problems corrected
Level B: Relevant checks made, problems noted in the user 'Read file'

Internal metadata (variable names, variable labels, value labels and definition of missing values) not lost or altered

Level A* and A: Relevant checks made, problems corrected where possible
Level B: Relevant checks made, problems noted in the user 'Read file'
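
Several of the format translation checks listed above lend themselves to a cell-by-cell comparison of a file before and after conversion. The following is a rough sketch of such a comparison, not a description of the in-house programs or of QAMyData. It assumes both versions have been read into lists of dictionaries mapping variable names to string values, and the helper names are hypothetical.

```python
def is_number(text):
    """True if the string can be read as a number."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def decimal_places(text):
    """Number of digits after the decimal point in a numeric string."""
    return len(text.partition(".")[2])


def compare_conversion(source_rows, converted_rows):
    """Report differences between a dataset before and after conversion."""
    if len(source_rows) != len(converted_rows):
        return [f"case count differs: {len(source_rows)} vs {len(converted_rows)}"]
    problems = []
    for i, (src, out) in enumerate(zip(source_rows, converted_rows)):
        if src.keys() != out.keys():
            problems.append(f"row {i}: variable names differ")
            continue
        for name in src:
            before, after = src[name], out[name]
            if before == after:
                continue
            if is_number(before) and is_number(after):
                if decimal_places(before) != decimal_places(after):
                    problems.append(
                        f"row {i}, {name}: decimal places "
                        f"{decimal_places(before)} -> {decimal_places(after)}"
                    )
                else:
                    problems.append(f"row {i}, {name}: numeric value changed")
            elif before.startswith(after):
                problems.append(f"row {i}, {name}: string truncated")
            else:
                problems.append(f"row {i}, {name}: value changed")
    return problems


# Hypothetical data: one decimal place lost and one string cut short.
source = [
    {"id": "1", "income": "1234.50", "town": "Colchester"},
    {"id": "2", "income": "980.00", "town": "Wivenhoe"},
]
converted = [
    {"id": "1", "income": "1234.5", "town": "Colchest"},
    {"id": "2", "income": "980.00", "town": "Wivenhoe"},
]
conversion_problems = compare_conversion(source, converted)
print(conversion_problems)
```

Whether a reported problem is then corrected or noted in the 'Read file' depends on the processing level, as set out above.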

Data download validation

For data available via the UK Data Service download system, the names of the zip files include an MD5 checksum. This 32-character string can be used to verify that the file we make available is identical to the one the user downloads.
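
A user could carry out this verification with a few lines of code. The sketch below assumes that the checksum appears in the file name as an unbroken run of 32 hexadecimal characters; the exact naming pattern is not specified here, and the example file name is hypothetical.

```python
import hashlib
import os
import re
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_from_name(filename):
    """Extract a 32-character hexadecimal string from a file name, if any."""
    pattern = r"(?<![0-9a-fA-F])[0-9a-fA-F]{32}(?![0-9a-fA-F])"
    match = re.search(pattern, filename)
    return match.group(0).lower() if match else None


def verify_download(path):
    """True if the file's MD5 matches the checksum embedded in its name."""
    expected = checksum_from_name(os.path.basename(path))
    if expected is None:
        raise ValueError("no 32-character checksum found in the file name")
    return md5_of_file(path) == expected


# Demonstration with a stand-in file whose name embeds its own checksum.
with tempfile.TemporaryDirectory() as folder:
    payload = b"example archive contents"
    name = f"1234_{hashlib.md5(payload).hexdigest()}.zip"  # hypothetical name
    path = os.path.join(folder, name)
    with open(path, "wb") as handle:
        handle.write(payload)
    download_ok = verify_download(path)

print(download_ok)
```

A result of False would indicate that the downloaded file differs from the one made available and should be downloaded again.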