Graduate Student Corner – Pedro Ruas, University of Lisbon

I am currently a PhD student in the PhD Programme in Informatics at the Faculty of Sciences of the University of Lisbon. My main research interests are biomedical text mining and natural language processing. I have published several journal and workshop articles that focus on improving the named entity linking task and other text mining tasks, such as scientific recommendation and text classification.

I first encountered the large body of scientific data during my biomedical academic path (a bachelor’s degree in Biochemistry and a master’s degree in Medical Parasitology). Then, during the Bioinformatics master’s program, I realized that computational approaches were a very efficient way of gaining new biological insights, in particular from text.

Consequently, my PhD project focuses on the biomedical text mining field and, in particular, on the named entity linking task. The goal of the task is to automatically link entities mentioned in a text to entries in target repositories, such as ontologies, knowledge bases, or vocabularies. This type of system is essential in information extraction and retrieval pipelines.

Biomedical entities like genes, chemicals, or diseases are often ambiguous (for example, there are several names to designate the same entity), which raises challenges when one is searching for specific information. Named entity linking can improve question answering systems and semantic search. Semantic search engines can include representations of the indexed texts enriched with the output of a named entity linking system, i.e., the entities associated with the identifiers of the target repository. When a user inputs a query, the system matches it with the texts according to semantics rather than lexical similarity. Imagine you were searching for documents using the query “COVID-19”. A search engine relying on lexical similarity would return the documents mentioning “COVID-19” and ignore documents mentioning “SARS Coronavirus-2 infection”, a synonym of “COVID-19” with low lexical similarity. A semantic search engine would include enriched representations of the indexed texts, where both “COVID-19” and “SARS Coronavirus-2 infection” would be linked to the same entry/identifier, so the same query would return the documents mentioning either “COVID-19” or “SARS Coronavirus-2 infection”.
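The difference between the two kinds of search can be illustrated with a toy sketch in Python. This is not the tool developed in my project: the lexicon, the documents, the function names, and the identifier "DISEASE:0001" are all made up for illustration, and the entity linker is reduced to a simple dictionary lookup.

```python
# Toy lexicon mapping surface forms to identifiers.
# "DISEASE:0001" is a made-up placeholder, not a real repository identifier.
LEXICON = {
    "covid-19": "DISEASE:0001",
    "sars coronavirus-2 infection": "DISEASE:0001",
}

# Hypothetical indexed documents.
DOCS = {
    "doc1": "New treatments for COVID-19 were evaluated.",
    "doc2": "Patients with SARS Coronavirus-2 infection were followed for a year.",
    "doc3": "Malaria remains a major burden in tropical regions.",
}


def link_entities(text):
    """Return the identifiers whose surface forms occur in the text."""
    lowered = text.lower()
    return {ident for mention, ident in LEXICON.items() if mention in lowered}


def lexical_search(query, docs):
    """Return documents that contain the query string itself."""
    q = query.lower()
    return sorted(doc_id for doc_id, text in docs.items() if q in text.lower())


def semantic_search(query, docs):
    """Return documents that share at least one linked identifier with the query."""
    query_ids = link_entities(query)
    return sorted(
        doc_id for doc_id, text in docs.items() if query_ids & link_entities(text)
    )


lexical_hits = lexical_search("COVID-19", DOCS)
semantic_hits = semantic_search("COVID-19", DOCS)
print(lexical_hits)   # ['doc1']
print(semantic_hits)  # ['doc1', 'doc2']
```

In a real system the dictionary lookup would be replaced by a named entity recognition and linking model, and the identifiers would come from the target repository.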

My project attempts to address two limitations of existing named entity linking models: decreased performance when dealing with incomplete repositories, and the scarcity of resources devoted to non-English languages and specific biomedical domains.

The first limitation is tackled by directly extracting semantic relations between entities expressed in the scientific literature and partially associating NIL entities, i.e. entities with no matching entries in the target repository, with the best entries available.

The second limitation is tackled by developing a multilingual biomedical dataset including Spanish, Portuguese, and English texts. Most of the existing resources focus on the English language. Spanish and Portuguese rank second and sixth, respectively, among languages by number of native speakers, yet far fewer resources are available for them. Tools developed for English text perform poorly when applied to Portuguese and Spanish, and even worse in specific domains. For instance, clinical text is mostly written in the native language of its authors, so language-specific tools are needed for information extraction. Even for related languages such as Portuguese and Spanish, which have a lexical similarity of around 90%, performance does not transfer across languages.

The output of the project will be an out-of-the-box modular tool that can extract biomedical entities from text and link them to target repositories. The target audience includes researchers in the information retrieval field looking to improve their entity-based systems, as well as biomedical researchers and curators of scientific repositories who search for specific information in a given scientific article or organize information through a knowledge graph. In addition, a new biomedical dataset focusing on the Portuguese and Spanish languages will be available for text mining researchers.

I welcome suggestions to further improve the performance of the named entity linking tool, as well as information retrieval applications that can benefit from its integration. I have a website, and my email is psruas@fc.ul.pt.
