One clear result of the recent FutureTDM Knowledge Cafe series is that researchers have little awareness of what text and data mining (TDM) is and, in particular, how to go about it. This is a major barrier to the uptake of TDM in Europe, and one that ContentMine aims to tackle pragmatically.

We have run training workshops for over 100 researchers and recently supported six early career researchers as ContentMine Fellows to help them use TDM in their work. As part of these efforts to raise awareness of TDM, we wanted to illustrate the process in a way that resonates with people from many backgrounds, so we chose to deploy our software on Open Access papers about Zika virus. You can see a video demonstration below and follow along with an in-depth walkthrough of mining some literature!

Getting the Content

First, we search for ‘Zika’ in the Open Access subset of Europe PubMedCentral using ‘getpapers’. The software automatically retrieves all of the associated documents, and within a couple of seconds a total of 123 files are downloaded. One of the best features of automated mining is the ability to use complex queries and draw on other datasets. For example, instead of searching for just ‘Zika’, we could take a public dataset of other viruses from the same Flaviviridae family and search for each of them in turn with some simple code.[1]
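The idea of searching for each related virus in turn can be sketched in a few lines of Python. This is a minimal illustration rather than ContentMine's own tutorial code: the virus list is a small hand-picked sample standing in for a public dataset, the output directory names are hypothetical, and the `-q`, `-o` and `-x` flags are assumed to be getpapers' query, output-directory and XML-download options (check `getpapers --help` for your installed version).

```python
import shlex

# Hypothetical sample of Flaviviridae names; a real run would load these
# from a public dataset rather than hard-coding them.
VIRUSES = ["Zika virus", "Dengue virus", "West Nile virus", "Yellow fever virus"]


def build_command(virus):
    """Build one getpapers call for a virus name (flags are assumptions)."""
    # One output directory per virus, e.g. "zika_virus" (naming is illustrative).
    outdir = virus.lower().replace(" ", "_")
    # Quote the name so the search treats it as a single phrase.
    query = f'"{virus}"'
    return ["getpapers", "-q", query, "-o", outdir, "-x"]


def build_all(viruses):
    """Build one command per virus, in the order given."""
    return [build_command(v) for v in viruses]


if __name__ == "__main__":
    for cmd in build_all(VIRUSES):
        # Print instead of running, so the sketch works without getpapers.
        # To execute for real: subprocess.run(cmd, check=True)
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to inspect the queries before spending time on downloads.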

If we were searching more sources, each would require either an integration with getpapers or the use of another tool, ‘quickscrape’, which finds the files online using a scraper. Every publisher requires a ‘scraper definition’ to deal with its particular website, and we estimate that around 40 scrapers would cover 70% of the academic literature. There is a very long tail of smaller journals and publishers, so achieving 90% coverage would push that up to ~100 scrapers. This demonstrates the challenge facing individual researchers who wish to conduct TDM across the comprehensive literature.
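To give a feel for what a per-publisher ‘scraper definition’ does, here is a deliberately simplified sketch in Python. It is not quickscrape's actual definition format, which is richer; the definition below is a hypothetical mapping from field names to the `<meta>` tags that one imagined publisher uses, and the sample page is invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical scraper definition: which <meta> tag holds each field
# on one imagined publisher's article pages.
EXAMPLE_DEFINITION = {
    "title": "citation_title",
    "fulltext_pdf": "citation_pdf_url",
}


class MetaCollector(HTMLParser):
    """Collect the name -> content pairs of every <meta> tag on a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]


def scrape(html, definition):
    """Apply a definition to a page; missing fields come back as None."""
    collector = MetaCollector()
    collector.feed(html)
    return {field: collector.meta.get(name) for field, name in definition.items()}


if __name__ == "__main__":
    # Invented page standing in for a publisher's article landing page.
    page = """<html><head>
    <meta name="citation_title" content="An example article">
    <meta name="citation_pdf_url" content="https://example.org/article.pdf">
    </head><body></body></html>"""
    print(scrape(page, EXAMPLE_DEFINITION))
```

A publisher that lays out its pages differently would need a different definition, which is why the number of definitions grows with the number of publishers covered.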

Recent initiatives such as Crossref Text and Data Mining Services are also attempting to address this issue for some publishers by providing a consolidated access point to request full text papers for mining. It is worth highlighting that, because we used only Open Access papers, we avoided the legal barriers that restrict or prohibit this first step of mining in many contexts.

Formatting the Content

A single command using ‘norma’ will normalise the downloaded documents to ensure they are all in a comparable format for analysis. This step requires a style sheet that describes the document and again must be created per publisher and updated frequently as publishers change their layouts or publishing systems. At a practical level this once more makes individual efforts to mine the literature challenging, so we provide instructions for researchers wanting to make their own scrapers and style sheets for use with our open source pipeline.

Mining Data!

The analysis step is undertaken by ‘AMI’, which searches the documents using a range of queries, from pattern searches to dictionaries to text indexing. For example, species can be distinguished because they are in italics and follow a standard form, human genes might be identified from a list, and word frequencies are obtained using standard text mining techniques. The extracted facts can be explored and manipulated by the researcher to deliver insights, either in their raw or condensed form. One of the outputs is a table of key data, and with very basic coding knowledge it is possible to get ranked lists of data from across all papers.
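The three kinds of query described above can be illustrated with a short Python sketch. This is not how AMI is implemented; it is a minimal stand-in that assumes italic text is wrapped in `<i>` tags in the normalised files, uses a tiny illustrative gene list and stopword list, and ranks genera across whatever documents it is given.

```python
import re
from collections import Counter

# Assumption: italics appear as <i>...</i> in the normalised documents.
ITALIC = re.compile(r"<i>(.*?)</i>", re.S)
# Binomial form: capitalised genus followed by a lower-case species epithet.
BINOMIAL = re.compile(r"^([A-Z][a-z]+) ([a-z]+)$")
# Tiny illustrative dictionary; a real list would hold thousands of symbols.
GENE_DICTIONARY = {"AXL", "TP53", "BRCA1"}
# Minimal stopword list, for illustration only.
STOPWORDS = {"the", "of", "and", "in", "a", "is", "by", "to"}


def strip_tags(html):
    """Replace markup with spaces so only the running text remains."""
    return re.sub(r"<[^>]+>", " ", html)


def find_species(html):
    """Pattern search: italic phrases that look like 'Genus species'."""
    candidates = (text.strip() for text in ITALIC.findall(html))
    return [c for c in candidates if BINOMIAL.match(c)]


def find_genes(html, dictionary=GENE_DICTIONARY):
    """Dictionary search: tokens that appear in a known list."""
    tokens = re.findall(r"[A-Za-z0-9-]+", strip_tags(html))
    return [t for t in tokens if t in dictionary]


def word_frequencies(html):
    """Word counts over the text, ignoring stopwords."""
    words = re.findall(r"[a-z]+", strip_tags(html).lower())
    return Counter(w for w in words if w not in STOPWORDS)


def rank_genera(papers):
    """Ranked list of genus mentions across all papers."""
    counts = Counter()
    for html in papers:
        for name in find_species(html):
            counts[name.split()[0]] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Invented snippets standing in for normalised papers.
    papers = [
        "<p><i>Aedes aegypti</i> transmits Zika virus; <i>Wolbachia pipientis</i> blocks it.</p>",
        "<p><i>Wolbachia pipientis</i> and the gene AXL in <i>in vitro</i> assays.</p>",
    ]
    print(rank_genera(papers))
    print(find_genes(papers[1]))
    print(word_frequencies(papers[0]).most_common(3))
```

Note that an italic phrase such as "in vitro" is rejected by the pattern because it lacks a capitalised first word, which is the kind of rule that lets a standard form separate species names from other italic text.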

The Results of our Exploration

As might be expected, the most common mentions in Zika papers are viruses and mosquitoes, with many papers mentioning related diseases, insecticides and associated insecticide resistance genes. It is interesting that, to the trained eye, the data table provides a very good overview of a paper without even seeing its title. One surprising result was that the top-ranked genus is Wolbachia, a bacterium that has been found to prevent transmission of dengue virus in the mosquito Aedes aegypti, which also transmits Zika virus. In less than five minutes, the potential importance to Zika virus of a novel control method for mosquito-borne disease has been highlighted, something that a researcher would likely have had to read several full-text papers to appreciate.

In some ways, Zika virus is a poor example of the power of TDM, because we know so little about it that only 120 of the 1.2 million papers in the Europe PubMedCentral Open Access subset mention it at all. This makes statistical analysis and correlations difficult, and does not exploit the potential of TDM to treat the scientific literature as ‘big data’ for larger scale analyses.

Nonetheless, our simple exploration has highlighted the practical side of undertaking TDM and demonstrated that there are accessible entry points to get researchers started with applying or deciding how to apply the technique in their own work.

See code tutorial at


