TDM data holders and aggregators: European infrastructures offering open access to their big collections of data

The FutureTDM project recently published deliverable D4.1 (European Landscape of TDM Applications Report), which investigates TDM in the European Union and maps the landscape of TDM research, development and applications across economic sectors, scientific areas and domains of activity. It does so by exploring data on the available technology and (research) infrastructures, R&D investment (mostly in terms of funded projects), research output (in terms of scientific publications produced), the resources and tools available for TDM, and commercial activity (in terms of companies and organizations investing in, using and offering TDM services). Because TDM is currently a very active field in research, development and business applications, closely entangled with the big data hype and therefore constantly changing, the report aims for a landscape that is representative of the status quo but not exhaustive. Among other topics, the report addresses the main TDM data holders and aggregators, which this blog post describes in more detail.
The full report is available at:

There are three main data holders and providers:

  • The commercial-private sector, including social media companies (e.g. Facebook, Twitter, LinkedIn): they collect data, mine it and use the results for strategic planning, to optimise internal processes, to economise on resources and to make predictions, in order to offer improved services and increase profits.
  • The public sector: the "PSI Directive" on the reuse of Public Sector Information (PSI)1, issued in 2003, has given a tremendous impetus to the trend for open data that is appropriately organised, stored and available for re-use.
  • The scientific research domain: the data available for TDM are research publications and research data. The recent past has witnessed the development of Research Infrastructures (RIs), an important pillar of EU research. The importance of RIs lies not only in the expertise of their human resources or in the technology and tools developed and used but, most notably, in the volume of data collected from experiments, measurements and observations. RIs adhere to Open Access policies, following the principles of Open Science2 in the EU, the Open Access pilot actions3 launched in both the FP7 and Horizon 2020 EU programmes and the Open Access to scientific information4 initiative, which aim to maximise the impact of scientific results in the Digital Single Market5. RIs consist of repositories and aggregators, providing facilities, resources and related services used by the scientific community to conduct top-level research, either domain-specific or interdisciplinary, ranging from social sciences to astronomy and from genomics to nanotechnologies6.
        • Repositories are digital archives that collect, preserve and disseminate the intellectual output of mainly academic and research institutions. Until recently the institutional repositories stored publications, reports, journal articles, books etc. Current practice includes the storage and curation of research data as such; this is used (and re-used) for mining, training or evaluation of tools and also re-purposed, i.e. data produced in the framework of a research field might be reused in another scientific domain with different goals; for instance, social data produced within a social sciences experiment can be valuable for linguistic research. This led to the development of interdisciplinary data repositories, providing valuable insights on data concerning interdisciplinary research. Examples of linguistic repositories are:
          CLARIN7, the Common Language Resources and Technology Infrastructure, which provides access for scholars in the humanities and social sciences to digital language resources, i.e. links to data (in written, spoken, or multimodal form), and to processing tools. CLARIN harvests metadata on language resources residing in repositories across Europe, covering all European languages.
          META-SHARE8, a European infrastructure for sharing and exchanging language data, tools and related web services73. It is designed as a network of distributed repositories of language resources (LRs), which makes available datasets documented with a common metadata schema as well as language processing services. It is devoted to the sustainable sharing and dissemination of Language Resources for the Human Language Technologies domain but also for all domains where language plays a critical role.
          The significance of linguistic repositories lies in their outcome: making available methodologies, techniques, tools and services that can be used for mining across sectors (e.g. a part-of-speech tagger for medical texts or a tokenizer and sentence splitter for chemical texts) but also data and other resources for training and evaluating tools and services.
        • Aggregators harvest existing repositories, providing single points of access to a great variety of repositories and their data. Indicative cases of interdisciplinary aggregators are:
          OpenAIRE9, an EU-wide infrastructure, developed a repository facility and scientific data management services, as well as an e-Infrastructure for accessing scientific publications. It is integrated with Zenodo10, a digital repository for everything not served by a dedicated service, which enables researchers, scientists, EU projects and institutions to share, preserve and showcase multidisciplinary research results (data and publications) of any size, any format and from any science.
          CORE11 (COnnecting REpositories)'s mission is to aggregate all open access research outputs from repositories and journals worldwide and make them available to the public. It aims to facilitate free access and reuse of open access research outputs distributed across many systems, providing services for different stakeholders including academics and researchers, repository managers, funders and developers.
          EUDAT12, the European Data Infrastructure, is creating a pan-European infrastructure for e-science to enable European researchers and practitioners from any research discipline to preserve, find, access, and process data in a trusted environment, combining numerous community-specific data repositories with the permanence and persistence of some of Europe’s largest scientific data centres.
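
To make the relationship between repositories and aggregators more concrete, here is a minimal sketch of the idea, not a description of how OpenAIRE, CORE or EUDAT are actually built. It assumes a hypothetical `Record` type with made-up fields and two invented repositories ("repo-a", "repo-b") held in memory; a real aggregator would harvest metadata over the network. The sketch merges records from several repositories into one index, keeps a single entry per identifier, and offers one search function as the single point of access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A metadata record; the fields are illustrative, not a real schema."""
    identifier: str
    title: str
    repository: str


def aggregate(repositories):
    """Merge records from several repositories into a single index.

    `repositories` maps a repository name to its list of records. A record
    whose identifier was already seen is skipped, so each item is listed once.
    """
    index = {}
    for name in sorted(repositories):
        for record in repositories[name]:
            index.setdefault(record.identifier, record)
    return index


def search(index, term):
    """Return records whose title contains `term`, case-insensitively."""
    term = term.lower()
    return [r for r in index.values() if term in r.title.lower()]


# Hypothetical sample data standing in for harvested metadata.
repositories = {
    "repo-a": [
        Record("id:1", "Annotated corpus of medical texts", "repo-a"),
        Record("id:2", "Tokenizer for chemical texts", "repo-a"),
    ],
    "repo-b": [
        Record("id:2", "Tokenizer for chemical texts", "repo-b"),
        Record("id:3", "Survey data from a social sciences experiment", "repo-b"),
    ],
}

index = aggregate(repositories)
hits = search(index, "texts")
```

In this toy setup the tokenizer deposited in both repositories appears only once in the index, which is the kind of consolidation an aggregator provides on top of the underlying repositories.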



