UKSG breakout sessions on text and data mining

11 Apr 2012

Maurits van der Graaf and Eefke Smit

During the UKSG conference in Glasgow, two breakout sessions on text and data mining were held, attended by a total of 60 participants representing both publishers and the library community. After a short presentation of the results of the Publishing Research Consortium study on journal article content mining, participants discussed the various issues in this domain. This short report highlights the main outcomes of these group discussions.

Obstacles and solutions

The attendees concurred with the main obstacles to text and data mining highlighted in the study and added a number of potentially important issues from their own perspective. When a potential obstacle was identified, a solution was often suggested as well. The results can be summarised as follows:

  • Sample licence of STM seen as first step: the variety in permission rules for content mining between publishers was seen as a major obstacle for many text mining initiatives. The sample licence recently published by STM was therefore seen as a major step forward. However, participants suggested that this sample licence should be discussed in detail with librarians and with a group of text mining technologists to see whether further adaptations and adjustments are necessary. Also, as the sample licence was seen as difficult for non-lawyers to comprehend, it was suggested that use cases be added to illustrate, in non-legal terms, the kind of text mining covered by the licence.
  • Aggregation of content needed: the spread of content over different platforms was also seen as a major obstacle for text mining applications. This is especially true for the nearly 5000 small journal publishers (fewer than ten journal titles each) that together account for 30% of the journal articles published. Suggestions were made to give a role to existing aggregators of content: uniform discovery tools such as Summon, Primo and/or OCLC WorldCat were mentioned as examples. Other participants doubted whether these aggregators could deliver the quality levels needed for text mining applications.
  • Issues with derivative products: many participants discussed the difficulties with derivative products, especially from the publishers' perspective. Will a derivative product increase or decrease traffic to the original content? Some suggested that a derivative product should always provide links to the original content to mitigate its substitution effect. There were also questions, and sometimes doubts, about the methodology of text mining and the validation of its results. In addition, language issues and the lack of standard vocabularies were mentioned. One respondent remarked, "not all articles are equal". He meant that derivative products based on text mining of articles could lead to wrong conclusions: some articles that reach a certain conclusion might later be proven scientifically wrong. The trustworthiness of derivative products could therefore be an issue. Another issue raised was the use of snippets of information from the original content in derivative products: ‘intelligent’ snippets can easily substitute for the original information. How should the original content owner respond to this phenomenon?
  • Role for librarians: most participants felt that, at the moment, the issues around text mining are not yet a focus for most librarians. However, many participants saw a role for librarians as a liaison between researchers and publishers.
  • From the publishers' perspective, there was the notion among participants that more active use of the content is always in the interest of the publisher. In that sense, content mining is not a threat but rather an opportunity.
  • Among the librarians in the audience there was great curiosity about how mining would develop further and about the role the librarian can take in this new area.

