Data Management Guidelines for Researchers

These guidelines are intended to give an introduction to the principles of data management. They are aimed primarily at academic researchers who collect, create, store and share data, to show how you can make sure your data is genuinely reusable, particularly for text and data mining (TDM) projects. However, the general principles of good data management apply to all cases of storing and sharing content.

Accessing and using content for TDM often involves quite different processes to those used by an individual reader or researcher. New TDM technologies are being developed every day, and managing your data with TDM in mind means you will be better able to use these technologies to discover new knowledge from your data in the future.

Data management for TDM

Many TDM activities are carried out using content that is the intellectual property of other people, and subject to intellectual property (IP) rights.

According to DAMA International, the Global Data Management Community:

“Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets.”

In the specific case of research data, good data management fundamentally aims to make sure that research data are managed according to legal, statutory, ethical and funding body requirements. This means that good data management is relevant to all stages of the data lifecycle, from the procedures of data planning (specifying the types of data to be used) through:
• data generation, collection and organisation;
• documentation and metadata usage;
• curation, maintenance and preservation; and
• ultimately, policies for publishing, sharing and providing access to data.

Data management, particularly the curation and preservation of data, is valuable for two key reasons. Firstly, it allows third parties to validate experimental methods and results. And secondly, it allows for the re-use and re-purposing of data in other contexts, including other disciplines, with different research goals. Many would say that the data on which a research project has been based are as important as the scientific results themselves!

TDM cannot happen without access to large amounts of data. This data could be of any type – from scientific data, to data related to aspects of everyday life, in domains from meteorological, to biological, to economic and geographical data. With this in mind, it is clear that data management is of crucial importance for TDM.

Although huge amounts of data are produced globally on a daily basis, only a small part of that data is widely known, let alone published and accessible in realistic and practical terms. Data creation is a time-consuming and expensive process, involving not just simple data collection, but additional steps of data curation, metadata addition and annotation, maintenance and preservation, and – last but not least – legal clearance of data.

In many scientific fields, we have already seen that data and related services create added value when they are opened and shared for secondary purposes, from fundamental research to the development of innovative technologies and applications.

Therefore, there is a need for appropriate tools and mechanisms (scientific, technical, legal, organisational – and even social) which will allow efficient access to, sharing of, re-use and re-purposing of data. This all starts with an appropriate Data Management Plan, which we will discuss in the following sections.

As the EU Guidelines on FAIR Data Management in Horizon 2020 (Version 3.0) tell us, a Data Management Plan (DMP) “…describes the data management life cycle for any data to be collected, processed and/or generated by a Horizon 2020 project. As part of making research data Findable, Accessible, Interoperable and Re-usable (FAIR), a DMP should include information on:
• the handling of research data during and after the end of the project;
• what data will be collected, processed and/or generated;
• which methodology and standards will be applied;
• whether data will be shared/made open access; and
• how data will be curated and preserved (including after the end of the project).”

The EU Guidelines particularly stress the importance of:
• open access (while respecting existing copyright restrictions);
• data discoverability, through metadata and persistent identifiers;
• interoperability allowing data exchange and re-use, by adherence to common standards and best practices for data description;
• usage of standard vocabularies; and
• use of certified deposition mechanisms and infrastructural facilities that cater for data curation, maintenance, security, storage and long term preservation, as well as for user management (authentication and authorisation).

Data Management Plans (DMPs) are produced by organisations, projects and companies dealing with data of any type, as well as by their funders. Researchers need to follow DMPs when preparing their research data, both to organise their data and to deposit their data to a repository or infrastructure. DMPs define the purpose of data collection and generation, the types and formats of the data to be collected, their size and the target users, the mode of distribution (if planned), and the preservation model adopted.
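To make this concrete, the core contents a DMP defines (purpose, data types and formats, size, target users, distribution mode, preservation model) can be captured in a machine-readable record. The sketch below is a hypothetical, simplified illustration in Python – the field names are invented for this example and do not follow any standard DMP schema:

```python
# Hypothetical sketch of a machine-readable DMP record.
# Field names are illustrative only, not a standard schema.

REQUIRED_FIELDS = {"purpose", "data_types", "formats", "size_estimate",
                   "target_users", "distribution", "preservation"}

def validate_dmp(dmp: dict) -> list:
    """Return a sorted list of required DMP fields that are missing or empty."""
    return sorted(f for f in REQUIRED_FIELDS if not dmp.get(f))

example_dmp = {
    "purpose": "Corpus of news articles for TDM experiments",
    "data_types": ["text"],
    "formats": ["XML", "plain text"],
    "size_estimate": "2 GB / ~500,000 articles",
    "target_users": "researchers in computational linguistics",
    "distribution": "open access via institutional repository",
    "preservation": "repository-managed, retained 10 years after project end",
}

print(validate_dmp(example_dmp))  # an empty list means all required fields are present
```

Even a simple check like this makes it easy to verify that every dataset deposited under the plan documents the same core information.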

Why care about data management for TDM?

Don’t let your valuable data lie underused in poorly accessible formats – start thinking about and planning data management for TDM!

What are the benefits of data management?

Imagine a common scenario: a researcher has produced or collected data for their research, which need to be submitted to their organisation’s repository, or to the repository of the organisation that funded their research. If their organisation has an efficient Data Management Plan in place, the DMP’s guidelines can help the researcher and their data to benefit from:

Discoverability: When the data are registered in the organisation’s inventory, catalogue, or repository, they become visible and discoverable by others.
Documentation: When the data adhere to common standards for documentation, including metadata descriptions, they become valuable not only to human users, but to machine processes as well.
Security: When the data are securely stored in the organisation’s platform (which could be a repository or other type of organised storage facility), they are safeguarded and the risk of data loss is minimised.
Maintenance and preservation: When the data are maintained and preserved by the procedures put in place by the storage facility, individual researchers are relieved of this burden.
Deployment of powerful computational facilities: The computational facilities of the organisation, in terms of storage capacity and processing power, greatly exceed those of any individual researcher.
Processability and interoperability: By adhering to standards and by using metadata descriptions, the data become interoperable with TDM tools and technologies, and processable for further investigations.
Lawful sharing: By adhering to the repository’s deposition guidelines, the researcher (in collaboration with the repository) ensures that access, sharing and distribution of the data respect all relevant legislation and legal procedures.
Recognition: The data and the provider are permanently connected through the repository; in other words, the researcher’s ownership of the data is manifest and unquestionable.
Citation and publicity: The data and their provider appear in the organisation’s catalogues. This brings them publicity, and the common practice of harvesting among infrastructures significantly increases this publicity.
Added value: Re-use and re-purposing of the data adds value to it, through the discovery of new modes of use and research perspectives.
New collaborations: By sharing their data, the researcher increases their chances of discovering new collaborations, possibly even across disciplines, which can lead to new discoveries, shed light on different aspects of the original data, and produce new research results or technological applications.

Almost anyone can be a user of data, from researchers, to private companies, to the general public and citizen scientists. When data are stored in accordance with a good Data Management Plan, all potential users can benefit from:

Access to large amounts of data, tools and technologies: Sharing data provides users with access to much more data than they could ever create or collect on their own.
Ease of identification and access: Data stored in official catalogues (rather than personal computers) and accessible through a simple user interface are easier to find. When they are accompanied by metadata descriptions and relevant documentation, users can easily identify and assess how appropriate the data is for their needs.
Persistence: When data are permanently stored by a repository committed to their maintenance and preservation, there is less risk of users finding and identifying a dataset which later disappears.
Licences or explicit terms of use: When data come with a licence or terms of use that explicitly define the actions a user can legally perform with the data, users face less uncertainty about whether they are limited to personal use, or may also re-distribute the data, produce derivative datasets, etc.
New collaborations: The opportunity for creating new collaborations is bilateral; users may identify interesting datasets and/or tools and technologies, which could lead to new collaborations with the data owner.

Data management guidelines for researchers

If you are a researcher who has generated or collected data, how can you make sure your data is genuinely useful and re-usable when you deposit it in a repository? This section provides a set of guidelines to help you make your data as valuable as possible for future re-use.

In The Guidelines on the Implementation of Open Access to Scientific Publications and Research Data in Projects supported by the European Research Council under Horizon 2020, the European Research Council (ERC) strongly encourages ERC-funded researchers to use discipline-specific subject repositories for their publications, and provides a list of recommended repositories.

Subject repositories (also called thematic or disciplinary repositories) host depositions of publications and/or research data in a specific domain, regardless of the author’s institutional affiliation. A well-known example is Europe PubMed Central. You should try to identify the most appropriate subject repository in your domain, where you can deposit your data. Subject repositories provide requirements for what kinds of data they accept, which will be reflected in the repository’s metadata schema; you can use these as guidelines for submitting your data.

If there is no appropriate discipline-specific repository, you can also make your data available in an institutional repository or in domain-independent centralised repositories such as Zenodo.

It is important to use the right metadata elements for your dataset. Some metadata elements are common to all data types; these are usually administrative elements, and give information on phases of the resource’s life cycle (e.g. creation, validation, usage, distribution and licensing). Other metadata elements are only relevant to specific types of data, such as captures for audio, video and image resources, linguistic annotation for textual corpora, etc.

Some elements can therefore be inappropriate for the description of certain datasets – for example, minutes is an appropriate unit when referring to the size of an audio dataset, but not when referring to the size of a textual dataset.
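The distinction between administrative elements shared by all data types and type-specific elements can be illustrated with a small sketch. The element names below are hypothetical and are not drawn from any particular metadata schema:

```python
# Hypothetical metadata records: administrative elements common to all
# data types, combined with type-specific elements. Element names are
# illustrative only, not taken from a real metadata schema.

common = {
    "title": "Example dataset",
    "creation_date": "2016-03-01",
    "owner": "Example University",
    "contact": "data-office@example.org",
    "licence": "CC-BY-4.0",
}

# Type-specific elements: captures for audio, annotation for text, etc.
audio_specific = {"size_unit": "minutes", "size": 340, "sample_rate_hz": 44100}
text_specific = {"size_unit": "words", "size": 1_200_000, "annotation": "part-of-speech"}

audio_record = {**common, **audio_specific}
text_record = {**common, **text_specific}

# A unit such as "minutes" only makes sense for audio/video, not for text:
assert audio_record["size_unit"] == "minutes"
assert text_record["size_unit"] != "minutes"
```

Keeping the administrative elements identical across all records is what lets a repository catalogue and harvest heterogeneous datasets in a uniform way.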

The guidelines below list the most important requirements for the creation and management of metadata for all types of data.

Data quality must be defined in terms of a particular user and use case; a dataset might be perfect for one user’s use case, but not so good for another. For example, a dataset from a medical database might be appropriate for a medical researcher who works on diabetes, but quite useless to a political scientist searching for patterns in protest movements.

Content-wise, data quality needs to be defined as ‘operational usability’. Data quality metrics are therefore domain-specific, based on data type, research domain and intended use.

Data quality extends to and is affected by metadata quality: the data should bear valid metadata, as detailed as possible, including production date, ownership and contact information.

Metadata should also be accompanied by a licence (preferably an open licence, such as CC-BY) to maximise their usability for TDM, and should also be harvestable, in order for them to be included in the inventories of other infrastructures, aiding data visibility and publicity.
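Harvesting between infrastructures is commonly done over OAI-PMH, which typically exchanges Dublin Core metadata. The sketch below builds a simplified Dublin Core record including a rights statement pointing to an open CC-BY licence; it is a minimal illustration with placeholder values, and real OAI-PMH responses wrap such elements in additional container elements:

```python
import xml.etree.ElementTree as ET

# A simplified Dublin Core record of the kind exchanged by metadata
# harvesters. Values are illustrative placeholders; the "record" wrapper
# element is a simplification of the real OAI-PMH container structure.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element(f"{{{DC}}}record")
for element, value in [
    ("title", "Example dataset"),
    ("creator", "Example University"),
    ("date", "2016-03-01"),
    # An open licence stated in the metadata maximises usability for TDM:
    ("rights", "https://creativecommons.org/licenses/by/4.0/"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

xml_bytes = ET.tostring(record, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Because the licence is carried inside the harvestable metadata itself, any infrastructure that indexes the record can tell users immediately what they may do with the data.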

The data provider or creator’s responsibility lies in preparing datasets with the extensive documentation described above, and with accurate and up-to-date metadata. Data security, curation, maintenance and sustainability are the responsibility of the hosting infrastructure or repository.

Summary of key points

Good data management is a prerequisite for sharing research data effectively. Sound data management procedures result in:
• increase of data quality
• increase of research efficiency
• exposure of research data and results through sharing and dissemination
• facilitation of reproducibility of experimental procedures
• facilitation of validation and verification of results
• increase of interoperability between data and between data and tools
• improvement of repositories’ and infrastructures’ operation
All of these help to create scientific and economic value. Particularly given the tremendous potential of TDM technologies to create value, it is important to design and follow a good Data Management Plan when starting any research project, to ensure the data you create and collect will be as valuable as possible.

Human vs. Machine access and use of data

Access to and use of content and data in the framework of TDM requires an entirely different approach to data: not only in terms of the tools used for accessing and processing it, but also in terms of data management, which needs to be reflected in Data Management Plans.