The UK's largest collection of digital research data in the social sciences and humanities

Quality control

The Archive is at the forefront of developing international standards for data processing, for both quantitative and qualitative data.

We use different levels of quality control depending on how much ‘additional value’ is to be added to the data.

We assign one of four levels of data processing (A*, A, B or C) to each incoming study, depending on its anticipated future usage. Processing activities are then carried out in accordance with the assigned level, as described below.

Data review, privacy assessment, and content checks and validation are critical.

The main content and validation checks for data and documentation are listed below. For full details, see the UK Data Archive Data Processing Standards (PDF).

Quantitative data

Dataset dimension checks

  • Level A*: the number of cases and variables are checked against the documentation
  • Level A: as for A*
  • Level B: as for A*
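
As an illustration only, a dimension check of this kind could be sketched in Python as follows. The function names, the tab-delimited input with a header row, and the documented counts are all assumptions made for the example; this is not the Archive's own tooling.

```python
import csv
import io


def dataset_dimensions(tab_text: str) -> tuple[int, int]:
    """Return (cases, variables) for tab-delimited text with a header row."""
    # Skip completely empty lines so a trailing blank line is not counted as a case.
    rows = [row for row in csv.reader(io.StringIO(tab_text), delimiter="\t") if row]
    if not rows:
        return 0, 0
    header, data = rows[0], rows[1:]
    return len(data), len(header)


def check_dimensions(tab_text: str, documented_cases: int,
                     documented_variables: int) -> list[str]:
    """Compare the counts found in the data with those stated in the documentation."""
    cases, variables = dataset_dimensions(tab_text)
    problems = []
    if cases != documented_cases:
        problems.append(f"cases: found {cases}, documentation says {documented_cases}")
    if variables != documented_variables:
        problems.append(
            f"variables: found {variables}, documentation says {documented_variables}"
        )
    return problems
```

An empty list from `check_dimensions` means the data file and the documentation agree on both counts.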

Metadata checks

  • Level A*: the dataset must be comprehensible in itself - i.e. all variables should have variable labels and all categorical variables should have value labels
  • Level A: the dataset must be comprehensible in association with the documentation given to users
  • Level B: visual checks on quality are undertaken; action is taken for systematic problems
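
The A* requirement above (every variable has a variable label, every categorical variable has value labels) can be expressed as a simple check. The sketch below is a hypothetical Python example; the dictionary-based representation of labels and the function name are assumptions for illustration, not a description of how the Archive stores metadata.

```python
def find_missing_labels(variable_labels, value_labels, observed_codes):
    """List label problems in a dataset description.

    variable_labels: {variable name: label text, possibly empty or None}
    value_labels:    {categorical variable name: {code: label text}}
    observed_codes:  {categorical variable name: codes that occur in the data}
    """
    problems = []
    # Every variable should carry a non-empty variable label.
    for name, label in variable_labels.items():
        if not label or not label.strip():
            problems.append(f"{name}: no variable label")
    # Every code seen in a categorical variable should have a value label.
    for name, codes in observed_codes.items():
        labelled = value_labels.get(name, {})
        unlabelled = sorted(set(codes) - set(labelled))
        if unlabelled:
            problems.append(f"{name}: no value label for codes {unlabelled}")
    return problems
```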

Data validity checks

  • Level A*: all categorical variables checked for out-of-range values/wild codes; where possible, interval variables checked for improbable or impossible values; variable and value labels need not be present in the data file as long as they can be found in the documentation
  • Level A: as for A*
  • Level B: visual checks on quality are undertaken; action is taken for systematic problems
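
To make the two kinds of validity check concrete, here is a minimal Python sketch. The function names, the example code lists and the plausible range are illustrative assumptions; in practice the valid codes and ranges would come from the study documentation.

```python
def find_wild_codes(values, valid_codes):
    """Return the distinct values of a categorical variable that are not valid codes."""
    return sorted(set(values) - set(valid_codes))


def find_out_of_range(values, minimum, maximum):
    """Return the values of an interval variable that fall outside [minimum, maximum]."""
    return [value for value in values if not (minimum <= value <= maximum)]
```

For example, with valid codes `{1, 2, -8}` the values `[1, 2, 2, 9, -8, 3]` yield the wild codes `[3, 9]`, and an age variable checked against the range 0 to 120 would flag `199` and `-1`.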

Confidentiality checks

  • Level A*, A and B: always undertaken

Metadata enhancements

  • Level A*: for online browsing, the following may be added: literal question text, routing information and interviewers' instructions, frequencies and summary statistics, variable groups; bookmarked PDF user guides are produced; additional related resources may be provided on a dedicated web page; additional notes to users are given in the 'Read file'
  • Level A: bookmarked PDF user guides are produced; additional related resources may be provided on a dedicated web page; additional notes to users are given in the 'Read file'
  • Level B: bookmarked PDF user guides are produced; additional notes to users are given in the 'Read file'

ReShare

  • Level B: sample of 30 + 10 per cent of the remaining categorical variables must be checked for out-of-range values/wild codes; sample of 30 + 10 per cent of the remaining suitable interval variables must be checked for improbable or impossible values
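
The sample size rule can be worked through in code. The sketch below assumes one reading of the rule: check 30 variables, plus 10 per cent of however many remain beyond those 30, rounding up, and check every variable when there are 30 or fewer. The rounding direction, the small-dataset behaviour and the random selection are assumptions, not stated policy.

```python
import random


def reshare_sample_size(n_variables: int, base: int = 30, percent: int = 10) -> int:
    """Number of variables to check: `base` plus `percent` per cent of the remainder."""
    if n_variables <= base:
        return n_variables
    remaining = n_variables - base
    # Integer arithmetic avoids floating-point surprises; this rounds up.
    return base + (remaining * percent + 99) // 100


def choose_sample(variable_names, seed=None):
    """Pick which variables to check, using a seed so the choice can be repeated."""
    names = sorted(variable_names)
    return random.Random(seed).sample(names, reshare_sample_size(len(names)))
```

Under these assumptions, a file with 130 suitable variables would have 30 + 10 = 40 of them checked.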

Qualitative data

In addition to the levels above, most qualitative data collections are processed to A standard, with a select few being nominated for enhancement to A*. B and C are seldom used, but apply when handling older paper-based studies.

Level A*

  • data are fully digitised and anonymised
  • metadata and documentation are fully digitised and anonymised
  • for online browsing, data are marked up in XML
  • enhanced user guide is prepared for QualiBank

Level A

  • the dataset must be comprehensible in association with the documentation given to users
  • data are fully digitised and anonymised
  • metadata and documentation are fully digitised and anonymised

Level B

  • data are digitised at least to the level of scanned images and anonymised
  • metadata and documentation are digitised at least to the level of scanned images and anonymised
  • only major problems with data are resolved

Level C

  • no checks are made
  • data remain in the format in which they were received
  • non-digital collections are not anonymised or digitised and are transferred to another repository
  • only a basic catalogue record is created

For level C studies, a minimum of dataset dimension checks and confidentiality checks is carried out, with metadata enhancements as for B studies.

Format translation checks

Checks are carried out when converting from:

  • the ingested format to our preservation format (tagged or delimited text of defined character set)
  • the preservation format to the dissemination formats (Stata and tab delimited text; or MS Excel, MS Access, SIR and SAS)

We use in-house programs to automate most data format conversions for all levels of processing. These make sure that no data or 'internal metadata' (variable and value labels, missing value definitions, variable format information, etc.) are lost, beyond any loss that is unavoidable because of the differing data handling limits of specific software formats.

Data formats

The checks below are currently performed manually, but will be replaced by automated checking using the QAMyData tool.

Numbers of rows and cases the same

  • Level A* and A: Relevant checks made, problems corrected
  • Level B: Relevant checks made, problems corrected
  • Level C: Format conversion is not usually undertaken for C standard datasets. C standard is rare, but one of the reasons for it is that the data file cannot be converted from its original format, so normal processing cannot be undertaken. Where conversion is undertaken, relevant checks must be made and problems corrected

Number of decimal places the same for numeric formats

  • Level A*, A and B: Relevant checks made, problems corrected

String variables not truncated

  • Level A*, A and B: Relevant checks made, problems corrected

Date/time variables correctly formatted

  • Level A* and A: Relevant checks made, problems corrected
  • Level B: Relevant checks made, problems noted in the user 'Read file'

Internal metadata (variable names, variable labels, value labels and definition of missing values) not lost or altered

  • Level A* and A: Relevant checks made, problems corrected where possible
  • Level B: Relevant checks made, problems noted in the user 'Read file'
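
Several of the checks listed above can be sketched as a comparison between the data before and after conversion. The Python example below is a hypothetical illustration, not the in-house software or QAMyData: it assumes each file has already been read into a list of rows, where each row is a dictionary of variable name to string value.

```python
def is_number(text: str) -> bool:
    """True if the string parses as a number."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def decimal_places(text: str) -> int:
    """Number of characters after the decimal point in a numeric string."""
    _, _, fraction = text.partition(".")
    return len(fraction)


def compare_conversion(original, converted):
    """Report differences between a table and its converted copy."""
    if len(original) != len(converted):
        return [f"case count differs: {len(original)} vs {len(converted)}"]
    problems = []
    for index, (before, after) in enumerate(zip(original, converted)):
        if before.keys() != after.keys():
            problems.append(f"row {index}: variable names differ")
            continue
        for name, old in before.items():
            new = after[name]
            if is_number(old) and is_number(new):
                # Numeric values: the number of decimal places should be preserved.
                if decimal_places(old) != decimal_places(new):
                    problems.append(
                        f"row {index}, {name}: decimal places changed "
                        f"({decimal_places(old)} -> {decimal_places(new)})"
                    )
            elif new != old and old.startswith(new):
                # The converted string is a proper prefix of the original.
                problems.append(f"row {index}, {name}: string truncated")
    return problems
```

This sketch covers only case counts, variable names, decimal places and string truncation; date/time formats and label metadata would need format-specific handling.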

Data download validation

For data available via the UK Data Service download system, the names of the zip files include an MD5 checksum. This 32-character hexadecimal string can be used to verify that the file the user downloads is identical to the one we make available.
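
A user could carry out this verification with a short script such as the Python sketch below. The exact naming pattern of the zip files is not given here, so the example simply assumes that the checksum is the first run of 32 hexadecimal characters in the file name; the function names are likewise illustrative.

```python
import hashlib
import re
from pathlib import Path

# Assumption: the checksum is the first run of 32 hexadecimal characters in the name.
MD5_PATTERN = re.compile(r"[0-9a-fA-F]{32}")


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_from_name(filename):
    """Extract the checksum embedded in a file name, or None if there is none."""
    match = MD5_PATTERN.search(filename)
    return match.group(0).lower() if match else None


def verify_download(path):
    """True if the file's MD5 digest matches the checksum in its name."""
    expected = checksum_from_name(Path(path).name)
    if expected is None:
        raise ValueError("no 32-character checksum found in file name")
    return md5_of_file(path) == expected
```

If `verify_download` returns `False`, the file was probably altered or truncated in transit and should be downloaded again.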