Phase I: Content Availability

Without data, there can be no data analytics. The first step towards increasing the use of TDM is to ensure that there is as much data available to practitioners as possible.


Ensure more large datasets are genuinely available to as many practitioners of TDM as possible.


Without large datasets that are genuinely available for TDM – that is, legally and practically discoverable and re-usable – there can be no text or data mining. While large industries may have significant stores of private and proprietary content to work with, academic and smaller commercial practitioners can be severely limited by the lack of large, open datasets. The first step to increasing the uptake of TDM in Europe must be to increase the amount of data that is open for all TDM practitioners to use.


Compared with countries such as the USA, whose ‘Fair Use’ exception to copyright extends to the processing and re-use of data for ‘transformative’ TDM activities, the lawfulness of TDM in Europe is often unclear (See FutureTDM D3.3: Baseline Report of Policies and Barriers of TDM in Europe). Where copyright exceptions exist, they are limited to specific member states, and the precise scope of their beneficiaries is unclear.

Without legal clarity on what can or cannot be done with third parties’ datasets without obtaining explicit permissions from rights holders, such datasets are in effect unusable to many TDM practitioners.

Obtaining explicit permissions from rights holders is in many cases prohibitively resource-intensive for start-ups, SMEs, and academic researchers, and in other cases simply impossible to achieve in practice. For example, for a company that wishes to scrape and analyse content from the open web, finding, contacting, and obtaining permission from the rights holder of every website scraped would simply not be feasible.

Data Protection laws also vary across member states, with different national interpretations of EU directives, making it difficult for TDM practitioners to know how to comply.

Although the Open Data and Open Knowledge movement is a growing influence, particularly among publicly-funded researchers, much of the focus of sharing research data is on supplying data to human readers, rather than making it available for machine processing. A dataset that may be ‘open’ for a human to access and use is useless to a computer algorithm if it does not have machine-readable metadata including, among other things, clear licensing information. See FutureTDM Guidelines for Data Management.
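To make that distinction concrete, the sketch below shows how clear, machine-readable licensing metadata lets a program decide, without human intervention, whether a dataset may be mined. All names, URLs, and the licence whitelist are invented for illustration; the record follows the general shape of a schema.org-style JSON-LD description:

```python
import json

# A minimal, hypothetical schema.org-style "Dataset" record in JSON-LD.
# The dataset itself is invented; only the licence URLs are real.
record = """
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example corpus",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "encodingFormat": "text/csv"
}
"""

# Licences a hypothetical TDM pipeline is permitted to mine.
OPEN_LICENCES = {
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/publicdomain/zero/1.0/",
}

metadata = json.loads(record)
# Because the licence is a stable identifier in a known field, the
# decision to mine can be automated across millions of records.
minable = metadata.get("license") in OPEN_LICENCES
print(minable)
```

Without the `license` field, a crawler encountering this dataset would have to treat it as all-rights-reserved, however ‘open’ it may be to a human reader.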

While academic institutions are increasingly offering repositories for researchers to store and share data, at a technical level these are extremely diverse, often lacking machine-readable metadata and licensing information. Therefore, although more and more research data is nominally being made available under open licences, these data may be effectively isolated in non-interoperable ‘silos’ which are difficult or impossible to discover and integrate with other data to form large datasets for TDM. Heterogeneity of data sources is not an insurmountable problem, but it is made much more difficult by poor-quality annotations, metadata, and architecture, all of which hinder TDM practitioners from combining datasets from multiple sources.
Datasets may also exist in formats that are difficult for machines to read and re-use, such as PDFs or images, or use varying technical standards and protocols, requiring significant manual effort to homogenise and make available for large-scale analysis.
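As a small illustration of that homogenisation effort, the sketch below (with invented data) merges two exports of the same kind of records that happen to use different delimiters, using Python's standard csv module to detect each file's dialect before combining them:

```python
import csv
import io

# Two hypothetical exports of similar records, one semicolon-separated
# and one tab-separated -- a tiny instance of the heterogeneity problem.
semicolon_dump = "id;title\n1;Example A\n2;Example B\n"
tab_dump = "id\ttitle\n3\tExample C\n"

def rows(raw: str) -> list:
    """Detect the delimiter, then parse into a uniform list of dicts."""
    dialect = csv.Sniffer().sniff(raw, delimiters=";\t,")
    return list(csv.DictReader(io.StringIO(raw), dialect=dialect))

# Once normalised, records from both sources share one structure.
merged = rows(semicolon_dump) + rows(tab_dump)
print(len(merged))
```

Real-world homogenisation is of course far harder (character encodings, units, column semantics, OCR from PDFs), which is why consistent standards at the point of publication matter so much.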


To maximise the re-usability of datasets, the findings of the FutureTDM project make it clear that Europe needs to broaden, harmonise, and clarify exceptions to copyright and database laws, as well as harmonising and clarifying the balance between privacy and big data in data protection regimes.

Initiatives aimed at encouraging open data sharing must highlight the needs of machine reading as well as potential human readers, in particular the importance of machine-readable licensing metadata.

Europe must continue to support the creation and use of data and metadata standards, centralised access to data, and other ways of connecting data from multiple sources.


The rights of intellectual property rights holders must be respected, and exceptions should not apply unless TDM practitioners have lawful access to rights holders’ content. But in cases where TDM practitioners do not trade on the underlying creative or artistic expressions of the content they analyse and process, it is not reasonable for their activities to be restricted by copyright.
Limiting exceptions to ‘non-commercial’ purposes or ‘research organisations’ would in practice introduce significant legal uncertainty, given that university researchers often collaborate with or are partly funded by commercial partners. It would also exclude all commercial players from carrying out any TDM activities in cases when it is simply not possible to obtain permissions or licences from all relevant rights holders (e.g. mining the open web). This would have a significant impact on the potential economic benefits of TDM in Europe, as much of the economic value created in Europe in the sphere of data analytics comes from the private sector.
In order to be genuinely effective, exceptions introduced must apply to both copyright and database rights. They must be mandatory across Europe, to minimise the possibility that different implementations and interpretations by member states will continue to fragment the TDM landscape. They must not be overridable by contract, and in order to ensure the integrity of research outcomes they must allow copies of datasets to be retained for the specific purpose of verifiability and reproducibility of research results.

Rights holders must be allowed to protect the security and technical integrity of their content supply mechanisms through the use of reasonable and proportionate technical protection measures (TPMs). But lawmakers must clarify and provide guidance on what constitutes a ‘reasonable and proportionate’ technical protection measure, without unduly restricting legitimate machine access to content.
European universities currently spend around one billion euros each year on subscriptions to electronic journal content, much of it funded from public sources. The need of content owners to protect their systems must be balanced against the need of subscribers, in practical terms, to be able to use the content they subscribe to for large-scale TDM activities.
Any other limitations to legal exceptions must likewise be explicitly defined to minimise legal uncertainty.

The General Data Protection Regulation must be supplemented with clear guidance on terms such as ‘archiving purposes’ and ‘statistical and scientific purposes’ so that TDM practitioners may understand to what extent Data Protection regulations restrict their use of personal data. As with any other legal frameworks affecting TDM, the GDPR must seek to balance the interests of the various affected parties, bearing in mind the public interest in having a vibrant Big Data environment in Europe.

Many potentially highly valuable applications of TDM, for example in the health sector, will inevitably require the use and processing of personal data. Offering a clear process by which organisations carrying out TDM on personal data can have their Data Protection processes evaluated and approved will help reduce legal uncertainty for those organisations, and help them to comply with Data Protection measures.

The European Commission has committed to supporting the European Open Science Cloud, whose first report emphasised the importance of machine-readable and machine-actionable data to enable automation of data processing (See Realising the European Open Science Cloud). The Commission’s FAIR Data Expert Group likewise aims to evaluate the European Commission template for FAIR Data Management Plans with a view to making DMPs more machine-actionable (See FAIR Data Expert Group Call for Contributions). Machine actionability must be a core consideration for any initiative which encourages open sharing of content.

Many public funding bodies for research across the EU are adopting policies that mandate that researchers make their research data open; such mandates should specify the ways in which data can be made re-usable for machines as well as humans. For example, mandates should specify the use of standard open licences wherever possible; oblige researchers and publishers receiving public money to include licensing information in machine-readable metadata; encourage content creators to use open, machine-readable standards for their data; and supply guidance for researchers on how to do this.

The best way to ensure datasets are discoverable, accessible, and interoperable is to aggregate as many as possible in centralised, standardised, and integrated content repositories, which make clear the rights associated with the materials ingested. Such repositories should be supported at national and international levels, use open standards for content and metadata wherever possible, and ideally provide access to data via open, user-friendly APIs.
Such repositories also provide a place for researchers and other content creators to share their content and data in cases where they may not have a dedicated institutional repository.
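Standards-based APIs make this discoverability concrete. As a sketch, OAI-PMH, the harvesting protocol already exposed by many institutional repositories, returns metadata as XML with Dublin Core fields; the response below is an invented example of such a record, parsed with Python's standard library:

```python
import xml.etree.ElementTree as ET

# A fragment of an OAI-PMH ListRecords response, the standard metadata
# harvesting protocol used by many institutional repositories. The
# record content itself is invented for illustration.
response = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
          <dc:rights>https://creativecommons.org/licenses/by/4.0/</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

DC = "{http://purl.org/dc/elements/1.1/}"
root = ET.fromstring(response)

# Pull out the Dublin Core fields a TDM harvester needs: what the
# dataset is, and under what terms it may be re-used.
titles = [e.text for e in root.iter(DC + "title")]
rights = [e.text for e in root.iter(DC + "rights")]
print(titles, rights)
```

In practice a harvester would fetch such responses from a repository's OAI-PMH endpoint and page through them with resumption tokens; the point is that the rights statement travels with the record in a form a machine can act on.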

In some cases, organisations may still prefer to host their own content repositories rather than depositing content and data into a central repository. In such cases, providing access to open, machine-readable metadata that conforms to consistent standards at least ensures that the content of those repositories is still discoverable by machines and automated processes, and therefore visible to the TDM world.