Text and data mining (TDM) lies at the intersection of several research communities: computational linguistics (CL), information retrieval (IR), artificial intelligence (AI), and information science (IS).1 This collaboration yields results in both scientific and commercial applications across various economic sectors.
Within the wide range of potential TDM use scenarios, TDM for scientific publications represents just one niche case. However, successful TDM implementations and adaptations of algorithms for this specific purpose have the potential to influence overall scientific development in several crucial aspects, such as efficiency and quality control.
The research in this sub-field is often presented at specialized workshops co-located with main conferences in the field. The workshops provide a venue for targeted discussions, and invite the wider scientific community to engage in dialogue.
During the early 2000s a range of yearly workshops started individual efforts to build communities around specific tasks within TDM for scientific publications. These include, for instance, the automatic identification of citations for constructing underlying reference networks, and of qualitative aspects of these citations in context (reflecting the opinion or stance of the author). These tasks often focus on specific types of scientific texts, such as biomedical texts. Below we highlight five regularly recurring events of this type, and we list three one-off events from the recent past as well.
1. BioNLP/BioASQ/BioNLP-ST workshop at Annual Meeting of the Association for Computational Linguistics (ACL) / Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT) (2009, 2011, 2013, 2016)
This workshop series brings together researchers from the CL community, with the aim of focusing on TDM applied to specific types of biomedical content. The growth and consistency of this community is ensured by the shared-task format, in which one collection per subtask is collected, curated and licensed by the organizers and shared across participants. The choice of task topics is driven primarily by the needs of biologists and by hot topics in their field. The biomedical community is involved in data annotation and results assessment, while CL researchers adapt their algorithms to the challenges of specific biomolecular language conventions.
Over the years, the tasks followed a typical TDM development path, starting with information extraction at the sentence level and gradually moving towards the generalisation of discovered knowledge and the construction of knowledge bases. To support progressive algorithm development and later implementation, the tasks start with analysis of parts of papers; only once the techniques are mature enough are they applied to full texts.
The challenges have been about:
- extraction of specific items, e.g. proteins and genes;
- classification of events and relations holding between the items according to biomedical taxonomies;
- negations and speculations regarding extracted events;
- co-reference resolution to ensure proper anaphoric reference assignments to proteins or genes.
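As a rough illustration of the first and third challenge types above, the following minimal sketch flags known entity names in a sentence and checks for negation and speculation cues. The lexicons and the function name `annotate_sentence` are hypothetical and chosen only for this example; the shared-task systems themselves rely on curated corpora and far richer models, and nothing here reflects any participant's actual method.

```python
import re

# Hypothetical toy lexicons, assumed for illustration only.
ENTITY_TERMS = {"p53", "BRCA1", "MDM2"}
NEGATION_CUES = {"not", "no", "failed", "absence"}
SPECULATION_CUES = {"may", "might", "suggest", "suggests", "possibly"}


def annotate_sentence(sentence):
    """Return matched entities plus sentence-level negation/speculation flags."""
    # Split into alphanumeric tokens; punctuation is discarded.
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    lowered = {token.lower() for token in tokens}
    # Entity matching is case-sensitive against the toy lexicon.
    entities = [token for token in tokens if token in ENTITY_TERMS]
    return {
        "entities": entities,
        "negated": bool(lowered & NEGATION_CUES),
        "speculative": bool(lowered & SPECULATION_CUES),
    }


negated_example = annotate_sentence("MDM2 does not bind p53 in these cells.")
speculative_example = annotate_sentence(
    "These results suggest that BRCA1 may regulate p53."
)
```

The sketch treats cues at the level of the whole sentence; determining which extracted event a cue actually applies to (its scope) is the harder part of the task and is not attempted here.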
2. International Workshop on Bibliometric-enhanced Information Retrieval (BIR) at European Conference on Information Retrieval (ECIR) (2014, 2015, 2016) and at Joint Conference on Digital Libraries (JCDL) (2016)
This workshop aims to connect the CL, IR and IS communities by raising questions about the use of text analytics in combination with bibliometrics. Citation network mining is carried out from two perspectives:
- positioning a paper in the context of a general research field;
- definition of research field trends, and forecasting the direction of further discoveries in the field.
While the majority of papers at this workshop focus on improving search within a collection of scientific publications using co-citation clusters, another type of experiment uses knowledge of the rhetorical structure of papers when examining the context of citations. Implementations of these algorithms are then packaged into tools for wider community use.
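To make the notion of co-citation concrete, here is a minimal sketch that counts how often two references are cited together by the same paper. The input format (a mapping from a citing paper to its reference list) and the function name `cocitation_counts` are assumptions made for this example; the workshop papers use their own data models and clustering methods on top of such counts.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count, for each pair of references, how many papers cite both."""
    counts = Counter()
    for references in reference_lists.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for first, second in combinations(sorted(set(references)), 2):
            counts[(first, second)] += 1
    return counts


# Hypothetical citing papers P1..P3 and cited works A..C.
papers = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C"],
}
counts = cocitation_counts(papers)
strongest_pairs = counts.most_common(2)
```

Pairs with high counts are candidates for belonging to the same cluster; any clustering step applied afterwards is a separate design choice that this sketch leaves open.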
This workshop brings together researchers from IR and IS communities that have an interest in data visualization of scholarly material. Scientific publications are considered as sources for visual representation of the underlying fields and their developments, beyond the textual content. Interests of this community revolve around the following main topics:
- Track and visualize the evolution of a scientific field using TDM algorithms, detecting the potential of embryonic research topics and establishing trends;
- Promote technologies for mining non-textual information present in scientific publications, e.g. figures, graphs and tables, or in demos and presentation videos;
- Introduce a ‘crowd’ or ‘social’ component to the visualisation of the context of scientific publications, e.g. visualize discussions about publications on social networks such as Twitter;
- Support scientific literature analysis by contextualising the text of publications, as well as authors' track records and citation networks.
1. Negation and Speculation in Natural Language Processing (NeSp-NLP) Workshop at ACL’10
This one-time workshop brought together researchers within the CL community who focus on the identification of negation and speculation in natural language, with the processing of negation and speculation in scientific papers, whether abstracts or full texts, as the main use case. The experiments were carried out mostly on biomedical scientific texts, and covered topics ranging from annotation issues to system implementations.
2. Detecting Structure in Scholarly Discourse (DSSD) workshop at Annual Meeting of the Association for Computational Linguistics (ACL’12)
The goal of the DSSD workshop was to discuss and to compare TDM techniques and principles, to consider ways in which they can complement each other when applied to scholarly materials, and to initiate collaborations to develop standards for annotating appropriate levels of discourse.
3. Panel on Detecting and using document structure in scientific text at ACM International Health Informatics Symposium (IHI ’12)
The panel format is often used to raise awareness of a technological area within a community. It allows researchers from the CL and IS communities to present their vision of the development of the field to experts in the area of application, such as the life sciences in this particular case.