The Future is all Mine!
On 7 December 2015, the text and data mining projects FutureTDM and OpenMinTeD organised a joint workshop about the text and data mining challenges for cultural heritage institutions. This workshop took place at the DISH conference, a biennial international conference on digital heritage and strategies for heritage institutions. The aim of the workshop was to stimulate discussion with cultural heritage stakeholders about their TDM experiences so that information could be fed back into the projects in their development of policy and technical solutions.
Presentation on Text and Data Mining: Examples and Projects
In the first part of the workshop, the participants were given a quick peek into the world of text and data mining. Hege van Dijke (LIBER) gave a presentation about the need for text and data mining and about the European projects OpenMinTeD and FutureTDM, which both work on this topic (see Slideshare). Steven Claeyssens (National Library of the Netherlands) presented how researchers can mine the data of the National Library and how the National Library has made this possible (see Slideshare).
Interactive Session: Text and Data Mining, What does it mean for cultural heritage institutions?
In the second part of the workshop, the participants were invited to give their input on what they perceive as the biggest barriers to making their text and data available for mining. The identified barriers can be divided into three categories:
The cultural heritage institutions felt they don’t have the in-house knowledge to understand what researchers want or technically need in order to work with their data. There is a clear knowledge gap, and both the institutions and the researchers would benefit from more cooperation and interaction. Researchers can also play a role in convincing the institutions of the benefits of text and data mining of cultural heritage data. The institutions are protective of their data. They want to prevent it from being misused or used to make a profit, and in that regard they see opening up their text and data for mining as a risk. They also find it important that they are credited when their data is used.
The cultural heritage institutions find the current copyright laws very difficult to understand. Questions they are dealing with include: How do we deal with personal records among our data? When do we need to get permission from authors? What if the authors are deceased? What are the copyright rules on old images, old newspapers, and legal documents? The institutions also hold data with many different kinds of licenses attached. Some items even have a per-item license instead of a per-dataset license.
The cultural heritage institutions mentioned that the quality of their data differs greatly from item to item. Some items are well suited to optical character recognition (OCR), others not at all. The institutions are afraid that the OCR will not turn out 100% perfect. However, other institutions pointed out that it’s always better to try: the OCR doesn’t have to be 100% perfect, which is almost impossible anyway. Also, researchers can improve datasets along the way by pointing out mistakes in the OCR (open source).
The institutions also saw cost barriers: the price of creating and maintaining metadata is quite high, and investments in harmonisation of formats and licenses would be needed.
The cultural heritage institutions also brought to the table that they need external support: they need IT skills, sustainable tools, and TDM knowledge.
Over the coming months, FutureTDM and OpenMinTeD, which focus on finding policy and technical solutions respectively, will be working hard to meet stakeholders across the EU and gather input on barriers and solutions to TDM uptake. Keep an eye on our websites and follow us on Twitter for more on upcoming stakeholder events and opportunities to give feedback.