The First International Workshop on Recent Trends in News Information Retrieval (NewsIR’16) was held as part of the ECIR 2016 conference in Padua (Italy) on 20 March 2016. The workshop provided an opportunity for a diverse group of stakeholders—researchers, professionals and practitioners—involved in news-related information to come together and discuss the latest and most powerful uses of IR technology applied to news sources. Interest in the subject, combined with careful organisation and good publicity, resulted in a fairly successful turn-out: more than 40 attendees were present in the room throughout all sessions. To take full advantage of the wide range of perspectives brought by the participants, the presentations were kept brief, leaving time for questions. Additionally, there were plenty of opportunities for networking during the coffee and lunch breaks, the poster and break-out sessions, and the welcome reception at the end of the day.
1 The Origins of NewsIR’16
The use of news collections to support IR research has been common practice for many years. Indeed, the Reuters-21578 collection contains Reuters newswire material dating from 1987. Other collections employed for IR research, such as those used in some tracks of the Text REtrieval Conference (TREC) and the CLEF Initiative, also comprise feeds from news broadcasters and other news-related data. However, it would be a mistake to assume that all problems associated with news search have been successfully “solved”. Moreover, mainstream media outlets are often among the most prominent sources of information—ranging from the influence that newspapers have on elections to the damage that a negative article on a popular blog can cause to a brand’s reputation.
Realising the importance of news datasets for the IR community, a group of participants who attended ECIR 2015 in Vienna started to talk about the use of news articles as input to various tasks, ranging from news recommendation to temporal summarisation and real-time clustering. It emerged that there was a good number of researchers interested in news-related IR. Yet it seemed that the collections available for this kind of research were outdated and usually represented biased versions of the news—because they did not contain enough articles, or because only a few sources were considered. Additionally, in some cases the data had been cleaned and filtered, as opposed to the noisy entries found in the real world, which creates an unrealistic environment for evaluation. Consequently, Miguel Martinez-Alvarez—Co-Founder and Head of Research at Signal Media Ltd—set up a Google group to continue the discussion on news-related IR after the end of ECIR 2015.
Initially, 55 people joined the Google discussion forum and talked about subjects such as news bias, summarisation, clustering, topic classification, entity linking, entity recognition and disambiguation, event detection and social media integration. A couple of questions that permeated the discussion from the very beginning were whether news-related information had been relegated to be “less” important than it really is, and whether it would be worth organising a workshop combining news and IR.
Eventually, a proposal for an international workshop was drafted and submitted to ECIR 2016, where it was well received. A total of 9 full papers and 3 short papers were selected by the Programme Committee from 19 submissions—each submitted paper was reviewed by at least 3 members of an international reviewing group made up of 30 members. In addition to the selected papers, two keynote speakers joined the workshop: Jochen L. Leidner (Thomson Reuters) and Julio Gonzalo (National University of Distance Education). The NewsIR’16 workshop website remains available online.
2 The Signal Media One-Million News Articles Dataset
To accompany the workshop, and to facilitate research on news articles, Signal Media released a dataset intended to serve and encourage the news retrieval research community. The dataset consists of approximately one million news articles from a wide range of sources.
The articles in the dataset were originally gathered by Moreover Technologies over a period of one month—precisely, 1-30 September 2015. Most of the articles are in English, but there are a few non-English and multi-lingual articles. The sources include major providers such as Reuters, as well as local news sources and blogs; the number of unique sources exceeds 93,000. The dataset contains 265,512 blog articles and 734,488 news articles, and the average length of an article is 407.75 words.
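As an aside for readers who wish to reproduce such statistics, a minimal Python sketch is given below. It assumes the dataset's JSON-lines distribution format, with one article per line and fields named "content", "source" and "media-type"; the file name and the exact field names are assumptions based on the dataset description rather than verified details.

```python
import gzip
import json
from collections import Counter

# Minimal sketch: per-type article counts, unique sources and average length,
# assuming a gzipped JSON-lines file with one article per line and fields
# named "content", "source" and "media-type" (assumed, not verified).
counts = Counter()
sources = set()
total_words = 0
n_articles = 0

with gzip.open("signalmedia-1m.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        counts[article["media-type"]] += 1            # e.g. "News" or "Blog"
        sources.add(article["source"])
        total_words += len(article["content"].split())
        n_articles += 1

print(counts)                        # articles per media type
print(len(sources))                  # number of unique sources
print(total_words / n_articles)      # average words per article
```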
3 Session 1: Media Monitoring
The first session was chaired by Gabriella Kazai (Lumi, UK) and included the first keynote, by Jochen L. Leidner, Director of Research at Thomson Reuters. Short descriptions of the keynote and papers presented in this session are listed below.
Keynote: Recent Advances in Information Access at Thomson Reuters R&D – News and Beyond
By Jochen L. Leidner (Corporate Research & Development, Thomson Reuters—London, UK)
Jochen is currently responsible for research and development at the Thomson Reuters site in London, UK, where approximately 40 scientists and developers are engaged in research activities—strategic forward-looking research as well as applied and contract research (but no product development). As Jochen pointed out, his keynote was largely an “ideas” talk, as opposed to a technical talk.
Jochen reported briefly on some recent developments by the Corporate R&D Group at Thomson Reuters, for example NewsPlus, a news recommender system, and REUTERS Tracer, a real-time Twitter rumour detection tool for journalists. From the pharma area within IP & Science, Jochen talked about SoMeDoSEs, a pharmaco-vigilance system that mines Twitter for adverse events associated with medical drugs and is also used for drug repositioning. From the area of law, Jochen discussed the advanced search engine technology that powers the Westlaw search engine. Finally, in the area of Finance & Risk, he introduced Risk Mining, a computer-supported application for extracting risk registers.
According to Jochen, the future of news-related information retrieval lies in the ability to transform news into actionable intelligence. This is critical for proactively preventing and reacting to future events.
Boolean Queries for News Monitoring: Suggesting New Query Terms to Expert Users
By Suzan Verberne (Radboud University, the Netherlands), Thymen Wabeke (TNO, the Netherlands) and Rianne Kaptein (TNO, the Netherlands).
The paper evaluates query suggestions for Boolean queries in a news monitoring system. Users of the system receive news articles that match their queries on a daily basis—but the queries need regular updates, as the news changes continuously.
During her presentation, Suzan Verberne emphasised the importance of recall-oriented tasks—i.e., tasks where missing a single relevant document is not acceptable. One of the traditional ways to address such tasks is by means of long and complex Boolean queries. This research, however, introduces a method that suggests candidate query terms extracted from the retrieved documents.
After presenting experimental results and qualitative feedback obtained through a questionnaire answered by the participants in the experiments, the authors conclude that using relevance ranking instead of Boolean retrieval, together with a post-filtering mechanism for removing non-relevant terms, would improve user satisfaction.
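As an illustration of the general idea, rather than the authors' exact method, the sketch below scores candidate terms from the retrieved documents by how over-represented they are relative to the collection as a whole, and proposes the top-scoring terms that are not already part of the Boolean query. The function name, its parameters and the scoring formula are assumptions made purely for illustration.

```python
from collections import Counter

def suggest_query_terms(retrieved_docs, collection_docs, current_query_terms, k=10):
    """Illustrative sketch: propose candidate expansion terms for a Boolean
    monitoring query by contrasting term frequencies in the retrieved set
    against the whole collection (a simple relative-frequency score)."""
    retrieved_tf = Counter(w for doc in retrieved_docs for w in doc.lower().split())
    collection_tf = Counter(w for doc in collection_docs for w in doc.lower().split())

    def score(term):
        # Terms that are comparatively over-represented in the retrieved
        # documents are more likely to be useful additions to the query.
        return retrieved_tf[term] / (1 + collection_tf[term])

    candidates = [t for t in retrieved_tf
                  if t not in current_query_terms and len(t) > 3]
    return sorted(candidates, key=score, reverse=True)[:k]
```

In practice the suggested terms would be shown to the expert user, who decides whether to add them to the monitoring query.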
Detecting Attention Dominating Moments Across Media Types
By Igor Brigadir, Derek Greene and Pádraig Cunningham (Insight Centre for Data Analytics, University College Dublin).
This paper focuses on identifying attention dominating moments in online media—i.e., moments when everyone seems to be talking about the same issue. To explore attention dominating news stories, three different media sources were studied: mainstream news, blogs and tweets. For the first two sources, the Signal Media dataset was used. For the final source, the authors collected a Twitter corpus comprising a large set of newsworthy sources curated by journalists, instead of retrieving tweets based on keywords.
The paper suggests that it might be possible to identify and track major developments with global impact by linking attention dominating moments across multiple sources on different platforms. Social media communities both influence and are influenced by traditional news media—in fact, stories break simultaneously on both Twitter and traditional news publishers.
Exploiting News to Categorize Tweets: Quantifying the Impact of Different News Collections
By Marco Pavan, Stefano Mizzaro, Matteo Bernardon, Ivan Scagnetto (University of Udine – Italy).
The last paper of the first session also received the workshop’s best paper award. The paper is part of a longer-term research effort that aims to understand the effectiveness of enriching tweets with information derived from news, rather than from the whole Web, as a knowledge source.
Stefano Mizzaro delivered the presentation and explained how news articles can be exploited to enhance tweet categorisation, using sets of words extracted from news articles that share the same temporal context as the tweet. Three features of the news were tested as part of this research: volume, variety and freshness. The experiments confirmed the importance of all three features.
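A rough sketch of this idea is given below; the field names and the simple bag-of-words enrichment are illustrative assumptions, not the authors' implementation. Words are taken from news articles published within a window around the tweet's timestamp and appended to the tweet before it is passed to a standard text classifier.

```python
from collections import Counter
from datetime import timedelta

def enrich_tweet(tweet_text, tweet_time, news_articles, window_hours=24, top_n=20):
    """Illustrative sketch: enrich a short tweet with frequent words drawn
    from news articles that share its temporal context, so that a standard
    text classifier has more textual evidence to work with."""
    window = timedelta(hours=window_hours)
    # Keep only articles published within +/- window_hours of the tweet.
    in_context = [a["content"] for a in news_articles
                  if abs(a["published"] - tweet_time) <= window]
    word_counts = Counter(w for text in in_context for w in text.lower().split())
    context_words = [w for w, _ in word_counts.most_common(top_n)]
    return tweet_text + " " + " ".join(context_words)
```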
Future work will look at the impact of the number of documents extracted from the news collection to categorise short texts. There are also plans to investigate which kinds of news are important to consider and which are marginal.
4 Session 2: News Events
Semi-Supervised Events Clustering in News Retrieval
By Jack G. Conrad (Thomson Reuters Corporate Research & Development, Minnesota – USA) and Michael Bender (Thomson Reuters Global Resources, Switzerland)
The authors introduced a news retrieval system, eventNews, which employs an event-centric algorithm, allowing users to monitor developing stories based on events rather than by examining an exhaustive list of retrieved documents.
News articles are clustered around an editorially supplied topical label, called a “slugline”. Decisions about merging related documents or clusters are made according to two distinct signals: a digital signature based on the unstructured text of the document, and the named-entity tags assigned by the Thomson Reuters Calais engine, a named-entity tagger. Human assessments were used to evaluate the system on a 5-point scale, and the average quality achieved was around 80%.
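A simplified sketch of such a two-signal merge decision follows; the TF-IDF signature, the entity-overlap measure and the thresholds are illustrative assumptions rather than the actual eventNews algorithm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def should_merge(texts_a, texts_b, entities_a, entities_b,
                 text_threshold=0.5, entity_threshold=0.3):
    """Illustrative sketch: merge two candidate event clusters only if both
    their textual signatures and their named-entity sets agree sufficiently."""
    # Textual signal: cosine similarity between TF-IDF signatures of the clusters.
    vectoriser = TfidfVectorizer(stop_words="english")
    tfidf = vectoriser.fit_transform([" ".join(texts_a), " ".join(texts_b)])
    text_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

    # Entity signal: Jaccard overlap between the named entities in each cluster.
    union = entities_a | entities_b
    entity_sim = len(entities_a & entities_b) / len(union) if union else 0.0

    return text_sim >= text_threshold and entity_sim >= entity_threshold
```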
The development of a more robust working model for eventNews is anticipated in the near future, while further work will focus on testing the recall of the system—i.e., how many events are captured and represented from all the possible news events in the dataset or sample.
Cross-Lingual Trends Detection for Named Entities in News Texts with Dynamic Neural Embedding Models
By Andrey Kutuzov (University of Oslo – Norway) and Elizaveta Kuzmenko (National Research University Higher School of Economics – Moscow, Russia)
This paper discusses the use of vector space models, particularly “neural embeddings”—prediction-based distributional models—to detect real-world events as manifested in news texts. Temporal shifts of the embeddings might potentially predict specific events.
The models are trained on a large corpus of English and Russian news: the English text is derived from the Signal Media dataset, and the Russian text comes from a corpus of articles published in September 2015, comprising 500,000 extracts from about 1,000 Russian-language news websites—unfortunately not publicly available due to copyright restrictions.
After training the models on the ‘reference’ corpus, they are successively updated with new textual data from daily news. The approach effectively retrieves meaningful temporal trends for named entities regardless of language. Plans to continue this work by experimenting with different algorithms and parameter sets for different languages are already under way, and preliminary tests show promising results.
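As a rough sketch of how such incremental updates can be implemented with off-the-shelf tooling, the snippet below uses gensim's Word2Vec (an assumption about tooling, not necessarily what the authors used) to train a reference model, continue training on one day's news, and measure how much a named entity's nearest-neighbour set changes.

```python
from gensim.models import Word2Vec

def neighbour_shift(reference_sentences, daily_sentences, entity, topn=20):
    """Illustrative sketch: train a reference embedding model, update it with
    one day's news, and measure how much the entity's nearest neighbours
    change (a large shift may signal a real-world event)."""
    model = Word2Vec(vector_size=100, window=5, min_count=5, workers=4)
    model.build_vocab(reference_sentences)
    model.train(reference_sentences, total_examples=model.corpus_count, epochs=5)
    before = {w for w, _ in model.wv.most_similar(entity, topn=topn)}

    # Incrementally update the vocabulary and the vectors with the new day's data.
    model.build_vocab(daily_sentences, update=True)
    model.train(daily_sentences, total_examples=len(daily_sentences), epochs=5)
    after = {w for w, _ in model.wv.most_similar(entity, topn=topn)}

    # Fraction of nearest neighbours replaced after the update.
    return 1.0 - len(before & after) / topn
```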
Using News Articles for Real-time Cross-Lingual Event Detection and Filtering
By Gregor Leban, Blaž Fortuna and Marko Grobelnik (Jozef Stefan Institute – Ljubljana, Slovenia).
This presentation described a system called Event Registry, which is able to group articles about an event across different languages and extract core event information from them in a structured manner.
The immediate advantage that Event Registry offers to news readers and analysts is a significant reduction in the amount of content that has to be reviewed when gathering the global coverage of a particular event. Moreover, since all the event information is structured, Event Registry provides several options for searching and filtering that are not available in existing news aggregators.
According to the presenter, Gregor Leban, traditional news aggregators overwhelm users with duplicate news articles—articles referring to the same event—whereas the approach followed by Event Registry, based on semantic annotation per document, event clustering and the extraction of main event facts, is capable of showing events rather than articles, providing a better alternative and a more general understanding of particular events.
Exploring a Large News Collection Using Visualisation Tools
By Tiago Devezas (INESC TEC and DEI – Porto, Portugal), José Devezas (DEI – Porto, Portugal) and Sérgio Nunes (INESC TEC and DEI – Porto, Portugal).
The final paper of the session explored the Signal Media dataset using the visualisation tools provided by the MediaViz platform. MediaViz aims to assist in gaining insight from large archives of news through interactive visualisation tools.
The visualisation analysis of the Signal Media dataset revealed the following:
* News and blog sources assess the importance of similar events differently, granting them distinct amounts of coverage.
* There are both dissimilarities and overlaps in the publication patterns of the two source types—news and blog sources.
* The content direction and diversity behave differently over time.
More precisely, a keyword analysis allowed the researchers to see that news and blog sources granted different levels of importance to a given set of keywords related to major global events that took place in September 2015. A source analysis showed that the temporal publication patterns of the two media types behaved differently—blogs published a higher percentage of their content during the weekend than news sources did—though both source types followed an identical curve over the 24-hour cycle. Finally, a diversity analysis indicated variations in the dynamics of topical diversity over time.
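A minimal pandas sketch of such a source analysis is shown below; the field names are assumed from the dataset description, and the file name is a placeholder.

```python
import pandas as pd

# Illustrative sketch: compare hourly and weekday publication patterns of
# news and blog sources, assuming JSON-lines data with a "published"
# timestamp and a "media-type" column, as described for the dataset.
articles = pd.read_json("signalmedia-1m.jsonl", lines=True)
articles["published"] = pd.to_datetime(articles["published"])

# One column per media type, one row per hour of the day.
by_hour = (articles
           .groupby([articles["media-type"], articles["published"].dt.hour])
           .size()
           .unstack(0))

# One column per media type, one row per weekday (0 = Monday).
by_weekday = (articles
              .groupby([articles["media-type"], articles["published"].dt.dayofweek])
              .size()
              .unstack(0))

print(by_hour)
print(by_weekday)
```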
5 Poster Session
The poster session took place after the conclusion of the second session. The accepted posters are listed below:
- Temporal Random Indexing: A Tool for Analysing Word Meaning Variations in News. By Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro (Department of Computer Science, University of Bari Aldo Moro – Italy)
- Visualising the Propagation of News on the Web. By Svitlana Vakulenko (MODUL University Vienna, Austria), Max Göbel (Vienna University of Economics and Business, Austria), Arno Scharl (MODUL University Vienna, Austria) and Lyndon Nixon (MODUL University Vienna, Austria)
- Comparative Analysis of GDELT Data Using the News Site Contrast System. By Masaharu Yoshioka (Hokkaido University – Japan) and Noriko Kando (National Institute of Informatics – Tokyo, Japan)
Some of the papers included in the proceedings of the workshop also had posters, and each poster presented had to be explained and defended by its authors, which encouraged interaction among the participants and greatly enhanced the discussion.
6 Session 3: Analysis and Visualisation
The second keynote, delivered by Julio Gonzalo, opened the afternoon sessions. Short descriptions of the keynote and papers presented in this session are listed below.
Monitoring Reputation in the Wild Online West
By Julio Gonzalo (National Distance Education University (UNED) – Madrid, Spain)
Julio elaborated on some of his recent work on online reputation management, which has become a key part of public relations (PR) for organisations and individuals. Julio explained how PR companies start by analysing what topics have been mentioned on Twitter, the “central nervous system for PR companies”. Afterwards, filtering is applied, focusing on the topics that are relevant. PR work should be approached from a recall-oriented point of view, where every tweet or news article counts.
Special emphasis was given to the RepLab Evaluation Campaign, which is co-organised by Julio and for which a collection of tweets was compiled in collaboration with the UNED research group. The collection provides over half a million manual annotations by reputation experts—570,108 in total—and 208,000 URLs derived from tweets for which manual annotations are available. As in the case of PR companies, RepLab concentrates on Twitter content, because it is the key medium for the early detection of potential reputational issues. Nevertheless, online monitoring pervades all media: news, social media, the blogosphere, etc.
Polarity for reputation, as opposed to sentiment analysis, was illustrated with examples during the keynote. Polarity for reputation attempts to identify statements and opinions that have negative or positive implications for the reputation of a person or company, and it is involved in author profiling, categorisation and ranking.
Finally, Julio referred briefly to his interest in adaptive learning and in systems that implement it as a means of correcting potential failures. Julio expressed that it is not critical for a system to fail once, as long as the failure is not recurrent—knowledge about the failure can be incorporated to prevent the same error in the future.
What do a Million News Articles Look like?
By David Corney, Dyaa Albakour, Miguel Martinez and Samir Moussa (Signal Media – London, UK)
The final presentation of the workshop was delivered by Dyaa Albakour from Signal Media, who offered a comprehensive description of the Signal Media dataset.
As explained by Dyaa, there are 407,754,159 words in the Signal Media dataset; there are 2,003,254 distinct words in the dataset; and the average number of words per article is 407.75. Just 144 articles in the collection have more than 10,000 words. The longest article has 12,450 words and it is a transcript of a US college football match. Other long articles include an instalment of a serialised novel; detailed personal memoirs; and a list of fishing reports from Florida. The articles were collected from a variety of news sources, including Reuters, the BBC and the New York Times, along with many sources that have fewer readers—such as news magazines, blogs, local outlets and specialist publications. The dataset is shared under a Creative Commons licence, while the copyright of the articles remains with the original publishers.
The Signal Media dataset is text-only and does not contain links to the original articles. This is partly due to issues around image licensing, but also to avoid storing links that might eventually become obsolete. An open-source repository on GitHub has been created to host useful tools and scripts for processing the dataset; it provides scripts to index the data with Elasticsearch and to convert it to the TREC format for compatibility with other IR tools.
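For readers who prefer to roll their own, the following is a minimal sketch of bulk-indexing the articles with the official Elasticsearch Python client. It is an illustration written for this report rather than code taken from the workshop repository, and the file name, index name and field names are assumptions.

```python
import gzip
import json
from elasticsearch import Elasticsearch, helpers

# Illustrative sketch: bulk-index the Signal Media articles into a local
# Elasticsearch instance, assuming a gzipped JSON-lines file with "id",
# "title", "content", "source" and "published" fields (assumed names).
es = Elasticsearch("http://localhost:9200")

def actions(path, index_name="signalmedia"):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            yield {
                "_index": index_name,
                "_id": article["id"],
                "_source": {
                    "title": article["title"],
                    "content": article["content"],
                    "source": article["source"],
                    "published": article["published"],
                },
            }

helpers.bulk(es, actions("signalmedia-1m.jsonl.gz"))
```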
7 Break-out Session
At the end of the third session, the audience was divided into three groups to discuss, separately, the challenges that the news IR community faces, the data that would be useful to have to continue the current research, and the tasks that the community should focus on in the short and long term. Then, the entire audience reconvened and a representative of each group presented the outcomes.
Generally, the participants expressed their interest in extending the time period covered by the Signal Media dataset, as a single month—September 2015—makes it unsuitable for investigations of temporal dynamics, where longer time spans are indispensable. The participants also recommended integrating multimedia content and multilingual sources into the Signal Media dataset, though these features were outside the scope of the original plan. Talks about combining a Twitter dataset with the existing news articles over the same time period, to obtain a unified collection of news, blogs and tweets, have already taken place in collaboration with Igor Brigadir from the Insight Centre for Data Analytics. Arguably, this would be a very useful dataset for future work.
Another issue that came up in the discussion was the verification of news, which includes fact checking, controversy detection and the determination of news bias. These were suggested as possible tasks for next year’s workshop.
8 Panel and Closing Remarks
A panel composed of Gabriella Kazai, Stefano Mizzaro, Jochen Leidner and Julio Gonzalo addressed the final questions from the audience and offered their closing thoughts at the end of the day.
Evaluation was the first topic addressed by the panel. Stefano and Julio reminded us of the multiple evaluation metrics already devised for IR systems—precision and recall are not the only ones. Since each metric says something different about the data, we may not need new metrics, but rather to combine the existing ones correctly to provide better explanations. In this context, Jochen proposed inviting journalists to future NewsIR events, as their input would offer valuable insight that we might otherwise miss.
The interconnection among the different elements of online media—blogs, news websites, Twitter and social media in general—also sparked discussion during the panel session. All the channels in which news is published today are so intrinsically linked that some bloggers have as much influence as major newspapers. The panel agreed that there are clear dependencies between news, blogs and social media: a tweet might prompt someone to write a blog post, which in turn might lead someone else to write a short piece in a local newspaper that is later picked up by a worldwide publication. As Jochen explained, ascertaining the quality of news material is therefore critical, as is the need for mechanisms to separate real news from pseudo-news, rumours and clickbait. In this sense, the explosion of new sources has increased the difficulty of determining which sources are legitimate or trustworthy. Moreover, defining what trustworthy means is becoming harder, as a single event can be seen from several points of view, some of which might be contradictory.
Finally, Gabriella commented on the business model that should be adopted to protect news publishers: now that ordinary people have become content creators, who should receive the revenue from advertising in the future, and who should be responsible for the distribution of news?
Anyway… the discussion continues online. Plans for the next event, and comments on all the topics of interest for the news IR community—summarisation, clustering, topic classification, duplicate identification, entity recognition and disambiguation, event detection and social media integration—are part of the regular chatter at the Google discussion forum.
Acknowledgements
This article was written by Marco Palomino and Ayse Göker. Pictures were kindly provided by Udo Kruschwitz.