HOW WE CURATE DATA
OUR QUALITY CONTROL
The Archive is at the forefront of developing international standards for data processing, for both quantitative and qualitative data. We use different levels of quality control depending on how much ‘additional value’ is to be added to the data.
We assign one of four levels of data processing to each incoming study, dependent on anticipated future usage. A processing standard can be A*, A, B or C. Processing activities are then carried out in accordance with each processing level, as described in the tables below.
For most data, processing activities fall into two groups: validation and content checks, and format translation checks. Both vary by level of processing.
Validation and content checks
The main validation and content checks for data and documentation are listed below. Further details may be found in the UK Data Archive Data Processing Standards document.
Qualitative data
Most qualitative data collections are processed to A standard, with a select few being nominated for enhancement to A*. B and C are seldom used but apply when handling older paper-based studies.
| Level A* | Level A | Level B | Level C |
|---|---|---|---|
Quantitative data
| | Level A* | Level A | Level B |
|---|---|---|---|
| Dataset dimension checks | | | |
| Metadata checks | | | |
| Data validity checks | | | |
| Confidentiality checks | | | |
| Metadata enhancements | | | |
For level C studies, a minimum of dataset dimension checks and
confidentiality checks is carried out, with metadata enhancements
as for B studies.
Format translation checks by level of processing
These checks are carried out on conversion from the ingest
format (the format in which the data arrive) to the preservation
format (tagged or delimited text of a defined character set). They
are also carried out on conversion from the preservation format to
the dissemination formats (typically Stata and tab-delimited text,
but also sometimes MS Excel, MS Access, SIR and SAS).
At the Archive we have in-house programs to automate most data
format conversions for all levels of processing. These ensure that
no data or 'internal metadata' (variable and value labels, missing
value definitions, variable format information, etc.) are lost
beyond any that would occur because of differential data handling
limits in specific software formats.
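As an illustration of what "no internal metadata are lost" could mean in practice, the sketch below compares the internal metadata of a file before and after conversion. It is not the Archive's in-house program: the function name `metadata_differences` and the dictionary layout (one entry per variable, with `label`, `values` and `missing` fields) are assumptions made for this example only.

```python
def metadata_differences(before, after):
    """Return sorted (variable, field) pairs lost or altered by a conversion.

    `before` and `after` are hypothetical dictionaries mapping each variable
    name to its internal metadata, e.g. label, value labels, missing values.
    """
    differences = []
    for variable, fields in before.items():
        if variable not in after:
            # The whole variable disappeared during conversion.
            differences.append((variable, "<variable missing>"))
            continue
        for field, value in fields.items():
            # A field counts as altered if it is absent or has a new value.
            if after[variable].get(field) != value:
                differences.append((variable, field))
    return sorted(differences)
```

An empty list would indicate that every variable and every metadata field recorded before conversion is still present and unchanged afterwards. Differences caused by the handling limits of a particular software format would still need to be judged by a person.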
The checks below are performed manually for the few types of data
conversion that do not have a quality-checked automated conversion
program.
Data processing format conversion checks:
| Check | Level A* | Level A | Level B | Level C |
|---|---|---|---|---|
| Numbers of rows and cases the same | R + C | R + C | R + C | R + C |
| Number of decimal places the same for numeric formats | R + C | R + C | R + C | |
| String variables not truncated | R + C | R + C | R + C | |
| Date/time variables correctly formatted | R + C | R + C | R + N | |
| Internal metadata (variable names, variable labels, value labels and definition of missing values) not lost or altered | R + C where possible | R + C where possible | R + N | |
R = relevant checks must be made
C = problems encountered must be corrected
N = problems encountered need not be corrected but must be noted
in the 'Read file' supplied to users with each order
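The first three checks in the table can be sketched in code for the simple case where both files are available as tab-delimited text. This is a minimal sketch under stated assumptions, not the Archive's procedure: the function names are hypothetical, the dimension check is interpreted as comparing row and column counts, and truncation is detected only when the converted string is a shorter prefix of the original.

```python
import csv
import io


def load_tab_delimited(text):
    """Parse tab-delimited text into (header, rows), all values as strings."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    return rows[0], rows[1:]


def is_number(value):
    """Return True if the string can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def decimal_places(value):
    """Count the digits after the decimal point in a numeric string."""
    return len(value.split(".", 1)[1]) if "." in value else 0


def conversion_problems(source, converted):
    """List the problems found when comparing a converted file to its source."""
    src_header, src_rows = source
    out_header, out_rows = converted
    problems = []
    # Dimension check: both files should have the same shape.
    if len(src_rows) != len(out_rows):
        problems.append("number of rows differs")
    if len(src_header) != len(out_header):
        problems.append("number of columns differs")
    # Cell-level checks on the rows both files have in common.
    for row_no, (src_row, out_row) in enumerate(zip(src_rows, out_rows), start=1):
        for name, src, out in zip(src_header, src_row, out_row):
            if is_number(src):
                if decimal_places(src) != decimal_places(out):
                    problems.append(f"row {row_no}, {name}: decimal places changed")
            elif out != src and src.startswith(out):
                problems.append(f"row {row_no}, {name}: string truncated")
    return problems
```

Whether each reported problem must then be corrected (C) or only noted (N) depends on the processing level, as set out in the table and key above.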