STUDY-LEVEL DOCUMENTATION
Study-level documentation covers the data collection as a whole, rather than individual files or variables, and is key to enabling secondary users to make informed use of the data. It provides an overview of the research context and design, data collection methods, data preparation, and results or findings, and is used to produce the User Guide that accompanies archived data. Much of this information is likely to be available from final reports (e.g. the ESRC End of Award Report), technical reports, working papers, laboratory books, grant applications and publications. Study documentation can be submitted either as a series of documents that together contain all the necessary information (which UKDA staff compile into the User Guide) or as a purpose-built user guide prepared by the researcher or data depositor. It is always useful to submit the final End of Award Report alongside the deposited data and documentation. Further information about many of the individual elements mentioned below, and how they may be documented at the level of individual data elements, may be found on the data-level documentation page.
The following types of information should be included:
- contextual information
- data collection methods
- structure of the dataset
- variable list, coding and classification schemes
- derived variables
- weighting and grossing methods
- data sources used
- confidentiality and anonymisation details
- data validation details
- whether the dataset is part of a repeated cross-sectional, longitudinal or time series data collection
- new editions
Contextual information
Providing context about the research and the dataset helps explain the conditions under which the data were collected and the uses to which they were put. It also forms a historical record for future researchers using the data. Contextual information can cover:
- the history of the project or dataset (intellectual, financial and organisational origins and developments)
- the aims, objectives and hypotheses of the research
- for repeated cross-section surveys or experiments or time series datasets, information describing any changes in the variable content, question text, variable labelling or sampling procedures
Data collection methods
Information should describe, where relevant:
- data collection process (survey, experiment, data compilation, digitisation, transcription, interviews, etc.)
- details of sampling design and method
- observations recorded or measured and how they were recorded or measured
- pilot research or monitoring processes undertaken
- scale and resolution of data collection
- geographic coverage of the dataset (country, region, place, latitude/longitude, etc.)
- time coverage of the dataset
- equipment, hardware and software used
- confidentiality and consent agreements in place
- quality controls used during the data collection (see also quality control)
Certain documents will aid interpretation of the study methodology and data. Depending on the methodology, these could include:
- copy of the questionnaire used, annotated with variable names and derived variables
- copy of the interviewers' instructions
- topic guides and interview schedules
- copy of original source material
- experimental protocols
Structure of the data
Information about the structure of the dataset:
- a list of data files, records, tables, images, etc. in the dataset
- relationships between individual files or tables
- entity-relationship diagram for relational databases
- relationships between records, cases, fields, individuals
- number of records and variables in each data file
- medium on which the data are stored
All individual data files, such as images, audio files and text files, should be clearly labelled with name, description, date, author, etc.
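As an illustration only, the number of records and variables in a delimited text file can be counted with a few lines of Python. The sketch below assumes a comma-separated file whose first row holds the variable names; the function name summarise_delimited and the sample data are hypothetical and are not part of any UKDA tool.

```python
import csv
import io


def summarise_delimited(text, delimiter=","):
    """Count variables (columns in the header row) and records (non-empty data rows)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # assumption: the first row holds the variable names
    n_records = sum(1 for row in reader if row)  # blank lines are not counted
    return {"variables": len(header), "records": n_records}


# Invented sample data standing in for the contents of one data file.
sample = "id,age,sex\n1,34,2\n2,51,1\n"
summary = summarise_delimited(sample)
print(summary)
```

For a file on disk, the text can be read in first and the resulting counts copied into the list of data files.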
Variable list, coding and classification schemes
In addition to data-level documentation, a variable list describing all variables (or fields) in the dataset is likely to be useful for the User Guide, especially when abbreviations or non-intuitive variable names are used. Any coding and classification schemes used (including the version of each scheme) should be recorded, indicating which variables they apply to. It is preferable to refer to schemes with a bibliographic reference.
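One possible way to keep a variable list consistent is to hold it as structured records and generate the text from them. The following is a minimal sketch under that assumption; the Variable class, the variable names agegrp and occ, the codes and the scheme shown are all illustrative rather than a prescribed UKDA format.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str                  # variable (or field) name as it appears in the data
    label: str                 # plain-language description
    scheme: str = ""           # coding or classification scheme, with version, if any
    codes: dict = field(default_factory=dict)  # code -> meaning


def render_variable_list(variables):
    """Return a plain-text variable list, one variable per line, codes indented below."""
    lines = []
    for v in variables:
        line = f"{v.name}: {v.label}"
        if v.scheme:
            line += f" [scheme: {v.scheme}]"
        lines.append(line)
        for code, meaning in sorted(v.codes.items()):
            lines.append(f"    {code} = {meaning}")
    return "\n".join(lines)


# Illustrative entries only.
variables = [
    Variable("agegrp", "Age of respondent, grouped",
             codes={1: "16-24", 2: "25-44", 3: "45 and over"}),
    Variable("occ", "Occupation of respondent", scheme="SOC 2000"),
]
variable_list_text = render_variable_list(variables)
print(variable_list_text)
```

Recording the scheme and its version next to each variable makes it straightforward to add the bibliographic reference in the User Guide.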
Derived variables
New variables derived from original data may be as simple as age grouped into intervals, or as complex as derivations using self-constructed algorithms, queries or commands. Whatever method is employed, it is important that the logic of each derivation is made clear and documented. See data-level documentation for details.
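The simple case mentioned above, grouping age into intervals, could be documented by keeping the derivation logic in one readable place. The sketch below is an assumption about how this might look in Python: the band boundaries, the codes and the use of -9 for missing or out-of-range ages are invented for illustration.

```python
# (lowest age, highest age, code) for each band; boundaries are inclusive.
AGE_BANDS = [(16, 24, 1), (25, 44, 2), (45, 120, 3)]
MISSING = -9  # assumed code for a missing or out-of-range age


def derive_agegrp(age):
    """Derive the grouped-age code from age in whole years."""
    if age is None:
        return MISSING
    for low, high, code in AGE_BANDS:
        if low <= age <= high:
            return code
    return MISSING


ages = [16, 24, 25, 44, 45, None, 15]
agegrp = [derive_agegrp(a) for a in ages]
print(agegrp)
```

Because the bands are stored as data rather than buried in conditions, the same table can be reproduced in the documentation as the definition of the derived variable.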
Weighting and grossing
Weighting and grossing applied to variables must be fully documented, explaining the construction of the variables with a clear indication of the circumstances in which they should be used.
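To show the kind of distinction the documentation needs to draw, the following minimal sketch contrasts a weight with a grossing factor. It assumes a weight that adjusts each respondent's contribution to sample estimates and a grossing factor that scales each respondent up to a population count; the function names and all values are invented.

```python
def weighted_proportion(flags, weights):
    """Proportion of cases with a true flag, each case counted by its weight."""
    return sum(w for f, w in zip(flags, weights) if f) / sum(weights)


def grossed_total(flags, grossing_factors):
    """Estimated population count of cases with a true flag."""
    return sum(g for f, g in zip(flags, grossing_factors) if f)


# Four invented respondents: whether in work, a weight and a grossing factor.
in_work = [True, False, True, True]
weight = [0.5, 1.5, 1.0, 1.0]
gross = [500, 1500, 1000, 1000]

unweighted = sum(in_work) / len(in_work)      # 0.75
weighted = weighted_proportion(in_work, weight)  # 0.625
population_in_work = grossed_total(in_work, gross)  # 2500
print(unweighted, weighted, population_in_work)
```

In this example the weight is the one to use for proportions and the grossing factor for population totals; the documentation should state the equivalent rule for the real variables.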
Data sources used
If data are derived or summarised from other data or sources, the provenance of the materials should be made clear, including the history of data ownership and copyright. If the copyright situation of any data is uncertain, checks should be made before the data are deposited. Further information may be found on the copyright and the use of existing sources page. If in doubt, please ask for advice at acquisitions@esds.ac.uk.
Confidentiality and anonymisation
It is important to record whether the data contain any confidential information concerning individuals, households, organisations or institutions. See the guidance on the consent, confidentiality and ethics pages. Where secondary analysis requires confidential or otherwise sensitive information to remain in the dataset, any special access conditions for end users should be discussed with acquisitions@esds.ac.uk.
Data validation
This comprises details of known data errors and of any data checking, expert quality control or data cleaning carried out. For quantitative data, information should be provided on the resolution of the data and on repetitions of measurements or sampling. For data gathered by scientific instruments, validation refers to checking for equipment and transcription errors, calibration, resolution and repetitions. For transcriptions of interviews or historical sources, it should be noted whether they have been proofread or quality controlled in any way.
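One common form of data checking is a range check, whose results can be listed in the documentation as known data errors. The sketch below is illustrative only: the function check_ranges, the variable names and the valid ranges are assumptions, and a missing value is treated here as a problem to be reported.

```python
def check_ranges(records, valid_ranges):
    """Return (record number, variable, value) for each missing or out-of-range value."""
    problems = []
    for number, record in enumerate(records, start=1):
        for name, (low, high) in valid_ranges.items():
            value = record.get(name)
            if value is None or not (low <= value <= high):
                problems.append((number, name, value))
    return problems


# Invented records and ranges for illustration.
records = [
    {"age": 34, "sex": 2},
    {"age": 340, "sex": 1},   # likely keying error
    {"age": 51},              # sex not recorded
]
valid_ranges = {"age": (16, 120), "sex": (1, 2)}
problems = check_ranges(records, valid_ranges)
print(problems)
```

The list of problems, together with a note of any corrections made, is the sort of detail that belongs in the data validation section of the study documentation.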
Serial and time series datasets
For ongoing or repeated cross-sectional or longitudinal surveys or experiments, and for time series data collections, additional information describing any changes in the methodology, variable content, question text, variable labelling, measurements or sampling procedures is enormously helpful.
New editions
If a data deposit forms a new edition of a data collection already available from the UKDA, all changes to the data should be documented, and any obsolete parts of the study documentation updated.
For further help with documenting data contact UKDA acquisitions@esds.ac.uk.
















