Perspectives on the EU Horizon DoSSIER Project

This feature item presents three excellent contributions from members of the DoSSIER project. DoSSIER is an EU Horizon 2020 ITN/ETN on Domain Specific Systems for Information Extraction and Retrieval.

The three contributions from members of the project team are:
  • A Summary of the First DoSSIER Training School, which took place in September 2022 and is jointly authored by Florina Piroi, Mike Salampasis, and Allan Hanbury
  • A sub-project on the exploration of ‘relevance’ by Georgios Peikos, an early-stage researcher in the project team
  • The application of machine learning in the healthcare and biomedical domain by Wojciech Kusa

Before reading these contributions, it might be useful if I extract a description of the project from the project web site.

The objectives of DoSSIER are to elucidate, model, and address the different information needs of professional users. It mobilizes an excellent and highly synergistic team of world-leading Information Retrieval (IR) experts from 5 EU States who, together with 3 academic partners (universities in the US, Japan, and Australia), and 11 industrial partners (dynamic SMEs and large corporations) will produce fundamental insights into how users comprehend, formulate, and access information in professional environments.

DoSSIER research is structured in three areas:

  • Models: fundamental models of users and domain specificity,
  • Methods: contextual and personalized search, and
  • Applications: workflow, task and the interface.

Each area, individually and in cross-field fertilisation, will produce breakthroughs in our understanding of computer-supported human information search workflows.

DoSSIER groups its research activities into the three general areas (Applications, Methods, and Models). These feed into each other to generate new hypotheses, identify new experimental procedures, and bring about a better understanding of knowledge and information needs, and the processes by which the two interact. The results of this research will provide the vital know-how and tools to the professional search industry.
The project leaders are Allan Hanbury, Elaine Toms, Arjen P. de Vries, Gabriella Pasi, Leif Azzopardi, Stefan Gindl, Suzan Verberne and Mike Salampasis.
The project site lists the research papers that have already been published as an outcome of the project.

A Summary of the First DoSSIER Training School

Contributed by Florina Piroi, Mike Salampasis, and Allan Hanbury

The DoSSIER project, an EU Horizon 2020 ITN/ETN, officially started in November 2019. As the Covid pandemic started in early 2020, our project was markedly affected right from the beginning. One of the main challenges was the need to move the networking and training of the selected PhD students into an on-line ecosystem. The efforts to integrate the PhD students into the advisors’ teams and their research networks were rewarded with successful representation of our project at the main IR conferences within and outside Europe. By the time travel to conferences became possible again, the DoSSIER students had already been introduced to the IR communities.

The education and training of Early Stage Researchers is a key objective of the DoSSIER EU project. After too many months of on-line lecturing, training, and screen time with advisors and DoSSIER colleagues, we were happy to organize and conduct the first DoSSIER Summer Training School from 25 to 30 September 2022. With the Intelligent Systems Laboratory of the International Hellenic University as the local organizer, the school took place in the idyllic village of Olympiada (Chalkidiki, Greece).

The 1st DoSSIER Training School was a week-long event consisting of a series of lectures and seminars. The school was aimed at and designed for the DoSSIER Early Stage Researchers, who now actively conduct research in the area of Domain Specific Systems for Information Extraction and Retrieval. The aim of the school was to give a grounding in core research topics (e.g. IR Experimentation), but also to provide training in other subjects relevant to their research activities (e.g. licensing practices, search in industry settings). The school lecturers were highly qualified and recognised experts in their areas of research, including advisors from within the DoSSIER network. In total, the school consisted of 12 talks presented by professors and researchers from several countries.

On the first day, two experts in Information Retrieval Experimentation and Evaluation, Nicola Ferro (University of Padua) and Norbert Fuhr (Universität Duisburg-Essen), gave advanced lectures on their favourite topics, giving recommendations on correct and robust IR measuring practices. The mornings of the second and third days were dedicated to lectures on the characteristics of User Studies and Information Interaction, Interfaces & Evaluation of User Studies, with two experts in these domains giving their insights to the DoSSIER students: Katriina Byström from OsloMet University and Elaine Toms from the University of Sheffield. The afternoons of days two and three combined student-led annotation sessions with focused discussions about topics relevant to the PhD students. Two lectures dedicated to developing search systems in commercial and enterprise settings were given by Antonis Makropoulos (Contextflow GmbH) and Udo Kruschwitz (Universität Regensburg).

We took the opportunity to visit the Ancient Stageira archaeological site, which is a short, pleasant walk from the village of Olympiada. Ancient Stageira is the birthplace of Aristotle, probably the greatest and most influential thinker of all time and the first genuine scientist in history. On the evening of the third day, the school agenda included a panel discussion on how useful academia is to the current use of AI in commercial and enterprise settings (panel title: “Do we need academia in AI?”). The various opinions and ideas were presented in a lively and engaging way at the archaeological site of Ancient Stageira.

Days four and five of the event contained lectures on search solutions and systems that originated in academic research and are now implemented commercially. These solutions were presented by Roberto Cornacchia (Spinque), Jakub Zavrel (Zeta Alpha), and Petr Knoth (Open University UK, Core.UK). Reproducibility in IR experimentation was covered twice on the DoSSIER agenda: once as part of Norbert Fuhr’s lecture and once in a lecture given by Andreas Rauber from the Technische Universität Wien.

Legal aspects of working with software and data were presented by Marianna Katrakazi from Athena RC, who showed how to choose and check the possible licensing models for a code release.

In total, 12 PhD students who are part of the DoSSIER training network attended the school. In addition to taking the opportunity to have direct discussions with the lecturers, they coordinated their efforts to develop a search engine prototype for domain-specific tasks in scheduled sessions on the school’s agenda.

Attendees at the DoSSIER Training School

As the summer school reached its end, after five days of intense attendance and tutoring, the general feeling was positive and optimistic, making the 1st DoSSIER Training School a very successful event. All of the school participants, both lecturers and students, left the site with fruitful discussions in mind, leading to research ideas and input to their own research projects, as well as wonderful memories of Sirtaki dances and quick dips into a sea that was too cold for some, a bit too warm for others, and yet perfect for the large majority.

 

An exploration of ‘relevance’

Contributed by Georgios Peikos, University of Milano-Bicocca

I am currently an Early Stage Researcher (ESR) in the DoSSIER project, doing my PhD research at the University of Milano-Bicocca. Having an Electrical & Computer Engineering diploma, I have dedicated the past few years to working in the fields of natural language processing, data analytics and information retrieval. Currently, my PhD research is focused on domain-specific and professional search, aiming to create novel retrieval models that exploit the characteristics of real-life professional search tasks.

The main objective is to identify how professional users interact with information and to model their interactions in a decision-theoretic setting. To that end, my research is based on an interpretation of relevance in which various (often conflicting) relevance factors are identified and used to estimate the overall relevance of an information item with respect to the situation at hand. For example, when a user searches for scientific publications to support her current research work, she may assess the relevance of an information item, i.e., a document, by considering not only topical similarity but also the number of citations, the publication venue, and the timeliness of the contained information, among others. In this case, therefore, the user assesses the relevance of the retrieved documents by simultaneously accounting for multiple criteria (i.e., relevance factors).

In my project, I develop IR models for multidimensional relevance estimation that are inspired by several Multi-Criteria Decision Making (MCDM) methods. These models assume that the IR system is the decision-maker, and that its goal is to estimate an overall relevance value for an information item by quantifying various objective or subjective relevance factors and incorporating them into the retrieval process. To achieve that, I study the characteristics of various domain-specific and complex search tasks, incorporate appropriate decision-theoretic methods, and model the relevance estimation by considering the decision behaviour of professional users while they undertake the studied task.
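To make the idea of multidimensional relevance estimation concrete, here is a minimal sketch that uses a weighted sum, one of the simplest MCDM methods. It is not the model developed in the project: the factor names, the weights, the min-max normalisation and the candidate documents are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical relevance factors and weights, chosen only for illustration.
WEIGHTS: Dict[str, float] = {
    "topical": 0.5,
    "citations": 0.2,
    "venue": 0.1,
    "recency": 0.2,
}


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_documents(
    docs: Dict[str, Dict[str, float]],
    weights: Dict[str, float] = WEIGHTS,
) -> List[Tuple[str, float]]:
    """Rank documents by a weighted sum of normalised factor scores."""
    doc_ids = list(docs)
    overall = {d: 0.0 for d in doc_ids}
    for factor, weight in weights.items():
        # Normalise each factor across the candidate set so that
        # factors on different scales become comparable.
        column = min_max([docs[d][factor] for d in doc_ids])
        for d, score in zip(doc_ids, column):
            overall[d] += weight * score
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


# Made-up candidate documents with raw factor values.
candidates = {
    "doc_a": {"topical": 0.9, "citations": 10, "venue": 0.3, "recency": 2015},
    "doc_b": {"topical": 0.7, "citations": 250, "venue": 0.9, "recency": 2021},
    "doc_c": {"topical": 0.4, "citations": 40, "venue": 0.5, "recency": 2019},
}
ranking = rank_documents(candidates)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In this toy example the most topically similar document, doc_a, is ranked below doc_b, because doc_b scores higher on the three other factors. This is the kind of trade-off between conflicting factors that a multi-criteria model has to resolve.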

Currently, my research is focused on optimizing the efficiency of several MCDM methods so that they can be applied to information retrieval. Moreover, to investigate the extent to which these decision-theoretic multidimensional relevance models affect retrieval effectiveness, the proposed models have been applied to a complex search task in the medical domain, i.e., the task of eligibility screening for clinical trials. Future work will focus on implementing the proposed models in other contextual situations that involve different types of users, tasks and relevance factors in the medical, academic and legal domains.

Therefore, I welcome suggestions about information retrieval applications that may benefit from multidimensional relevance estimation. My email is georgios.peikos@unimib.it.

The application of machine learning in the healthcare and biomedical domain

Contributed by Wojciech Kusa, TU Wien

I am a PhD student at TU Wien, supervised by Prof. Allan Hanbury and Dr Petr Knoth. I am a member of Project DoSSIER, a Marie Skłodowska-Curie Innovative Training Network on Domain Specific Systems for Information Extraction and Retrieval (https://dossier-project.eu/). My research focuses on the application of machine learning in the healthcare and biomedical domain. I am interested in tools and methods that increase the efficiency and productivity of domain experts.

My PhD topic is about methods for automating the systematic literature review process. Systematic reviews aim to find, assess and combine all relevant items concerning a specific subject. Systematic reviews follow strict criteria, as their conclusions are considered the gold standard in evidence-based medicine. Completing systematic reviews is a slow, repetitive, and time-consuming process that relies primarily on human labour. I focus on the citation screening step, which involves reviewing thousands of scientific papers for their eligibility to the research question. From the perspective of machine learning, the task of citation screening is usually presented as a binary classification or ranking problem. However, this is just a convenient simplified model of reality, considering that the manual process involves multiple decisions concerning various inclusion and exclusion criteria.

For several years, there was very little progress on this task. Even large neural models could not improve over simple logistic regression, mainly because the datasets are highly imbalanced. Usually, as few as 5% of documents are classified as relevant. Moreover, the decisions made by automated screening algorithms are not explainable, which can be a significant obstacle to adoption within the medical community.
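A small worked example shows why such imbalance makes the task hard to learn and to evaluate. The sketch below uses synthetic labels, assuming 1,000 candidate papers of which 5% are relevant, and a trivial screener that excludes everything. The numbers are illustrative and do not come from any real review.

```python
from typing import List, Tuple


def accuracy_and_recall(labels: List[int], predictions: List[int]) -> Tuple[float, float]:
    """Return (accuracy, recall on the relevant class) for binary labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    relevant = sum(labels)
    accuracy = correct / len(labels)
    recall = true_positives / relevant if relevant else 0.0
    return accuracy, recall


# Synthetic screening set: 50 relevant papers (1) and 950 irrelevant ones (0).
labels = [1] * 50 + [0] * 950
# A trivial screener that excludes every paper.
predictions = [0] * len(labels)

accuracy, recall = accuracy_and_recall(labels, predictions)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The trivial screener reaches 95% accuracy while finding none of the relevant papers. This is why recall on the relevant class matters far more than accuracy in citation screening, where a systematic review aims to find all relevant items.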

My project attempts to improve the screening task by an approach involving a set of fine-grained classification and extraction steps instead of asking a single ‘Is the review relevant to the research topic?’ question. In the context of medical reviews, these decisions can concern a study’s population, interventions and outcomes (the PICO framework). I experiment with prompt-based learning, which could also enable the usage of once-trained models for new systematic reviews without collecting new annotations.

In the current phase, I am working on a new dataset which could be used to train and validate the system. I plan to gather annotations for several systematic reviews, where instead of binary ‘include/exclude’ decisions, annotators would compare each paper to a set of fine-grained eligibility criteria. Breaking this problem into smaller tasks would make it easier to explain the final decision. Unfortunately, collecting these detailed annotations is time-consuming and often requires special customised software.
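The sketch below illustrates how criterion-level decisions could make the final screening decision explainable. It is a hypothetical illustration rather than the system being built: the `CriterionDecision` structure, the rule that every criterion must be satisfied, and the example paper are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CriterionDecision:
    criterion: str   # e.g. a PICO element; the names used here are hypothetical
    satisfied: bool  # would come from a classifier or a human annotator
    evidence: str    # text supporting the criterion-level decision


def screen(decisions: List[CriterionDecision]) -> Tuple[str, List[str]]:
    """Include a paper only if every criterion is satisfied (an assumed rule).

    The failed criteria are returned as the explanation of an exclusion.
    """
    failed = [f"{d.criterion}: {d.evidence}" for d in decisions if not d.satisfied]
    return ("exclude" if failed else "include", failed)


# Made-up criterion-level decisions for a single candidate paper.
paper = [
    CriterionDecision("population", True, "adults with type 2 diabetes"),
    CriterionDecision("intervention", True, "metformin"),
    CriterionDecision("outcome", False, "no HbA1c outcome reported"),
]
decision, reasons = screen(paper)
print(decision, reasons)
```

Instead of an unexplained ‘exclude’, the output names the criterion that failed together with the supporting evidence, which a reviewer can then check.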

I also aim to generalise to systematic reviews from domains other than medicine. Successful outcomes of this experiment could make it possible to conduct more “systematic” literature reviews in disciplines often restricted by human and financial resources. However, I am concerned about whether we can identify and group eligibility criteria in other domains.

I welcome all suggestions on collecting fine-grained annotations and recommendations on how we could generalise the approach for non-medical literature reviews. Feel free to reach out to me via wojciech.kusa@tuwien.ac.at and https://wojciechkusa.github.io
