
The Text and Data Mining process

February 23, 2016


Information Retrieval (IR) extracts words from the text in order to classify and index documents. For example, the frequencies of words in the body text are often effective at determining the subject area of a document. This can be very useful for researchers who need to identify a core body of documentation to work from. Other information commonly located includes named entities such as the names of people, places and organizations. In science and medicine, genes, species, chemicals, software and diseases are frequently indexed.
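As a rough illustration of how word frequencies can point to a subject area, the sketch below counts words and scores a text against small keyword lists. The keyword lists, the subject names and the function names are all hypothetical, chosen only for this example; a real IR system would normally learn its vocabulary from a large collection of documents.

```python
import re
from collections import Counter

# Hypothetical keyword lists, one per subject area (assumption for this sketch).
SUBJECT_KEYWORDS = {
    "genetics": {"gene", "genes", "nucleotide", "species", "genome"},
    "computing": {"software", "algorithm", "computer", "database"},
}


def word_frequencies(text):
    """Lowercase the text, split it into alphabetic words, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def guess_subject(text):
    """Return the subject whose keywords occur most often, or None if none occur."""
    freqs = word_frequencies(text)
    scores = {
        subject: sum(freqs[word] for word in keywords)
        for subject, keywords in SUBJECT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_subject("The nucleotide variation was analyzed for five species"))
```

The same word counts could also feed an index that maps each word to the documents containing it.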

Information Extraction (IE) extracts specific words, numbers and phrases, often using templates for stock phrases. For example, given “This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 1223456”, TDM tools can be trained to extract the grant number and the sponsor. Similarly, given “The nucleotide variation was analyzed for five species, H. agilis, H. lar, H. pileatus, N. leucogenys and S. syndactylus.”, the mining software can be trained to recognise that italic type represents species names, and can look the results up in European Bioinformatics Institute (EBI) databases.
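A template for a stock phrase such as the funding statement can be approximated with a regular expression. The sketch below is one possible way to do this, not the method used by any particular TDM tool; the pattern, the function name and the field names are assumptions made for the example.

```python
import re

# Assumed template for the funding sentence: the sponsor is whatever sits between
# "received funding from" and "under grant agreement No"; the number follows.
GRANT_TEMPLATE = re.compile(
    r"received funding from (?P<sponsor>.+?) under grant agreement No\.?\s*(?P<number>\d+)"
)


def extract_grant(text):
    """Return the sponsor and grant number found in text, or None if no match."""
    match = GRANT_TEMPLATE.search(text)
    if match is None:
        return None
    return {"sponsor": match.group("sponsor"), "grant_number": match.group("number")}


if __name__ == "__main__":
    sentence = (
        "This project has received funding from the European Union's Horizon 2020 "
        "research and innovation programme under grant agreement No 1223456"
    )
    print(extract_grant(sentence))
```

A fixed pattern like this only catches sentences that follow the template closely, which is one reason extraction tools are usually trained on many variants of a phrase.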

The machine has to guess the meaning (semantics) of a term: is “mouse” an animal or part of a computer? It helps greatly to know the structure of the document. In a scientific publication, “E. coli” might be a human pathogen in the Introduction; in the Methods it is probably a research tool. It is also difficult to interpret tables, lists, figures and figure captions, which are often the most valuable parts. Mining is never 100% correct, and users and re-users of the output have to accept that both recall (the proportion of the relevant items that is extracted) and precision (the proportion of the extracted items that is correct) may fall well short of 100%.
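Recall and precision can be made concrete by comparing what a tool extracted with a hand-checked list of what it should have found. The sketch below assumes both are available as simple collections of strings; the function name and the sample data are illustrative only.

```python
def precision_recall(extracted, relevant):
    """Compare extracted items with the items that should have been found.

    precision = correct extractions / all extractions
    recall    = correct extractions / all relevant items
    """
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical run: the tool found two of four species plus one wrong term.
    found = ["H. agilis", "H. lar", "E. coli"]
    expected = ["H. agilis", "H. lar", "H. pileatus", "N. leucogenys"]
    precision, recall = precision_recall(found, expected)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up run two of the three extracted terms are correct and two of the four expected terms were found, so neither measure reaches 100%.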

Mining depends critically on the quality of the input.

TDM has the potential to facilitate knowledge discovery, helping to address some of society’s ‘grand challenges’ through better understanding and use of existing data.

One example of this is finding cures for diseases. TDM has already been used to discover how existing drugs can be used to treat other conditions. Its use is not limited to the field of health. TDM is recognised as having potential in a range of sectors, boosting jobs and growth in the EU and enriching our heritage and education through innovation.

TDM is useful and necessary in scientific research and R&D to keep pace with the growing literature and ‘data deluge’. TDM is also used in digital humanities, business intelligence, market research, patent information and is of growing interest in other sectors. Despite this, TDM in the EU is currently less prevalent than in other parts of the world, such as the US and Japan. The FutureTDM project aims to identify and help remove the barriers to TDM uptake in the EU.