The Know-Center is Austria’s leading research centre for data-driven business and big data analytics, with the purpose of bridging the gap between industry and academia. To this end, the state of the art in science is applied to problem settings within industry to spur innovation, and research is actively conducted in areas not currently covered by the research community.
Within the Know-Center, Knowledge Discovery is one of four research areas. Information retrieval is one of the three main pillars of our research activities; the other two fields are machine learning and natural language processing. The remaining research areas of the Know-Center are Knowledge Visualization, where new approaches for visual analytics and novel augmented reality tools are developed; Social Computing, where interaction within social networks and emerging structures is studied; and Ubiquitous Personal Computing, where personal data are collected, analysed and processed using desktop and mobile sensor technology. Finally, the services and development area is responsible for adapting the research output to the needs and quality expectations of the industry.
With regard to information retrieval at the Know-Center, the focus lies on research in domains directly derived from the needs of industry and other large organisations. In addition to the functional aspects, we also put an emphasis on the scalability of our proposed methods and hence on the efficiency of the algorithms. This is particularly challenging for information retrieval scenarios where the amount of data is large and volatile. For example, the set of accessible documents might depend on roles and rights, which may change quickly. Under no circumstances is confidential information allowed to leak. At the other end of the spectrum, each user may have private data, which again need to be protected.
In addition to projects in close co-operation with our partners in industry, we also conduct funded research projects, on a national as well as on a European level. In this article, just a few projects are briefly presented to give an overall impression of our research work in the field of information retrieval and closely related domains.
EEXCESS – Cross-Vertical Aggregated Search
The goal of the FP7 EEXCESS project is to bring long-tail content closer to the users. This is motivated by the observation that most users revert to the same Web information sources, because harvesting multiple information sources would require considerable effort. Therefore, many sources remain untapped, especially in the cultural heritage domain. The approach followed in this project is to reuse the existing search infrastructure of content providers and to combine them using a cross-vertical aggregated search. In addition, the user is not expected to explicitly state her information need; instead, a just-in-time information retrieval approach is followed: the user’s interactions, for example the browsing history, are monitored and the user’s context is inferred. A crucial aspect of this is to balance the potentially private data of a user against the quality of the results. The approach we follow is to keep the user in the position to decide exactly which information is disclosed and which remains confidential. For example, the user might allow the exact age to be used, or just a rough figure.
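The following minimal sketch illustrates such a user-controlled disclosure step. The attribute names, generalisation rules and policy values are invented for illustration and are not taken from the EEXCESS implementation.

```python
# Sketch of a user-controlled disclosure filter (illustrative only).
# Each context attribute carries a policy chosen by the user:
#   "exact"  - send the value as observed
#   "coarse" - send a generalised value (e.g. an age range)
#   "hidden" - do not send the value at all

def generalise(attribute, value):
    """Return a coarser version of a context value."""
    if attribute == "age":
        decade = (value // 10) * 10
        return f"{decade}-{decade + 9}"          # e.g. 30-39 instead of 34
    if attribute == "location":
        return value.split(",")[-1].strip()       # keep only the country part
    return "unspecified"

def disclose(context, policy):
    """Build the context that is actually sent to content providers."""
    disclosed = {}
    for attribute, value in context.items():
        rule = policy.get(attribute, "hidden")    # default: keep it private
        if rule == "exact":
            disclosed[attribute] = value
        elif rule == "coarse":
            disclosed[attribute] = generalise(attribute, value)
        # "hidden" attributes are simply omitted
    return disclosed

if __name__ == "__main__":
    context = {"age": 34, "location": "Graz, Austria", "browsing_topic": "baroque painting"}
    policy = {"age": "coarse", "location": "coarse", "browsing_topic": "exact"}
    print(disclose(context, policy))
    # {'age': '30-39', 'location': 'Austria', 'browsing_topic': 'baroque painting'}
```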
As a first step within the project we need to find out what sources of information might be interesting or helpful for the user. Therefore we conducted a series of user-based evaluations to gain a deeper understanding of how users perceive search result lists that are generated by assembling the individual result lists of the content providers. The content is mainly rooted in the cultural, scientific and economic domain, with an emphasis on narrow topics and a focus on expert users. To conduct these experiments we first had to develop a custom evaluation framework that enables us to test various algorithms for query reformulation, result aggregation and methods to diversify the search result list.
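The concrete algorithms under evaluation are beyond the scope of this article; as a simple baseline for result aggregation, the ranked lists returned by the individual providers can be interleaved round-robin while removing duplicates. The sketch below illustrates this baseline with made-up provider data.

```python
from itertools import zip_longest

def round_robin_merge(result_lists, k=10):
    """Baseline aggregation: interleave the providers' ranked lists,
    skipping duplicates, until k results have been collected."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):          # one result per provider per round
        for result in tier:
            if result is None or result["id"] in seen:
                continue
            seen.add(result["id"])
            merged.append(result)
            if len(merged) == k:
                return merged
    return merged

if __name__ == "__main__":
    # Invented result lists from two hypothetical providers; "a1" is a duplicate.
    provider_a = [{"id": "a1", "title": "Baroque altarpiece"}, {"id": "a2", "title": "Fresco cycle"}]
    provider_b = [{"id": "b1", "title": "Survey on aggregated search"}, {"id": "a1", "title": "Baroque altarpiece"}]
    for rank, result in enumerate(round_robin_merge([provider_a, provider_b], k=3), start=1):
        print(rank, result["title"])
```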
Diversity and serendipity play an important role in the context of the target user group of the project. This is in part a consequence of the way queries are generated, as the search is not initiated by an explicit interaction of the user. Therefore, the relevance of results cannot easily be determined, as it depends on the current context and other factors that decide upon the helpfulness of the search results.
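A common, generic way to trade relevance off against diversity is maximal marginal relevance (MMR), sketched below. This is intended as an illustration of the general idea rather than the diversification method developed in the project; lowering the parameter lam shifts the balance towards more diverse, potentially serendipitous results.

```python
def mmr(candidates, relevance, similarity, lam=0.7, k=5):
    """Maximal marginal relevance: greedily pick items that are relevant
    but dissimilar to what has already been selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(doc):
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance(doc) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    # Toy data: doc1 and doc2 are near-duplicates, doc3 is less relevant but different.
    relevance = {"doc1": 0.9, "doc2": 0.88, "doc3": 0.5}.get
    similarity = lambda a, b: 0.95 if {a, b} == {"doc1", "doc2"} else 0.1
    print(mmr(["doc1", "doc2", "doc3"], relevance, similarity, lam=0.5, k=2))
    # ['doc1', 'doc3'] -- the near-duplicate doc2 is displaced by the more diverse doc3
```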
Another set of challenges stems from the fact that we do not gather the data from the individual content providers, but instead reuse their search services, typically available as REST APIs. Therefore all the technological issues of federated search need to be addressed, for instance dealing with uncooperative providers, latency and domain-specific issues.
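As an illustration of how such latency and robustness issues are typically handled, the sketch below queries several provider REST APIs in parallel and degrades gracefully when a provider is slow or fails. The endpoint URLs and the response format are placeholders, not the actual partner APIs of the project.

```python
import concurrent.futures

import requests

# Placeholder endpoints; the real provider APIs and parameters differ.
PROVIDERS = {
    "provider_a": "https://provider-a.example.org/search",
    "provider_b": "https://provider-b.example.org/search",
}

def query_provider(name, url, query, timeout=2.0):
    """Query one provider's REST API; treat errors and slow answers as empty results."""
    try:
        response = requests.get(url, params={"q": query}, timeout=timeout)
        response.raise_for_status()
        return name, response.json().get("results", [])
    except requests.RequestException:
        return name, []   # uncooperative or slow provider: degrade gracefully

def federated_search(query):
    """Fan the query out to all providers concurrently and collect their results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(PROVIDERS)) as pool:
        futures = [pool.submit(query_provider, name, url, query)
                   for name, url in PROVIDERS.items()]
        return dict(f.result() for f in concurrent.futures.as_completed(futures))

if __name__ == "__main__":
    print(federated_search("baroque painting"))
```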
CODE – Mining Scientific Articles
Scientific articles represent the main source of information for the FP7 CODE project. The overall goal of the project is to generate machine-readable Linked Open Data that serve as the foundation for web-based, commercially oriented ecosystems. The necessary steps towards this goal are: parsing scholarly publications, extracting factual information, disambiguating and semantically labelling the extracted information, storing and aggregating the information using semantic technologies, and finally allowing visual analytics on the extracted data. This processing pipeline allows for a number of use cases, with semantic search within the repository of scientific articles as one example.
The first necessary step to achieve the project’s goal is the parsing of the scientific articles. Here the PDF format is predominantly used as exchange and storage format. The PDF format has been engineered so that content is displayed correctly independent of device or operating system. This comes at a price: the structural and semantic information is lost once a document is converted to PDF, which merely consists of a stream of processing instructions that draw single characters or other graphical elements. Therefore, we had to devise methods to reconstruct the eliminated information from the graphical representation of a scientific article.
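To illustrate how little structure the PDF format exposes, the following sketch uses pdfminer.six, one of several open-source PDF libraries and not necessarily the tooling used in the project, to dump positioned text boxes together with their average font size, one of the few remaining cues for reconstructing headings and reading order.

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

def dump_text_boxes(pdf_path):
    """Print every text box with its bounding box and average font size."""
    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            # Collect the font sizes of the characters in this box; the logical
            # role (heading, body, caption, ...) must be inferred from such cues.
            sizes = [ch.size for line in element
                     for ch in line if isinstance(ch, LTChar)]
            avg_size = sum(sizes) / len(sizes) if sizes else 0.0
            x0, y0, x1, y1 = element.bbox
            print(f"page {page_no} bbox=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f}) "
                  f"size={avg_size:.1f}: {element.get_text().strip()[:60]!r}")

if __name__ == "__main__":
    dump_text_boxes("article.pdf")  # path is a placeholder
```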
Our PDF processing pipeline extracts the main text, identifies decorations such as page numbers, detects headings, extracts captions of tables and figures, and reconstructs the reading order over multiple columns. Next, the headings are automatically grouped to recreate the table of contents. Paragraphs that span multiple columns or pages are joined and hyphens are removed. An important source of information are tables, as they are expected to contain factual data. Here we developed multiple approaches to identify the table boundaries and to extract the table structure and contents. Specifically for scientific publications, we utilize supervised machine learning approaches to extract the core meta-data, including the title, journal name, author names, affiliations and e-mail addresses. For the extraction of citation and reference information we enhanced state-of-the-art algorithms by integrating a richer representation of the input features. In the course of the project we managed to publish a number of papers in this domain. Our work is well received in the respective research community; for example, one of the papers was awarded best paper at the Theory and Practice of Digital Libraries (TPDL) 2013 conference.
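The following toy example conveys the supervised flavour of these steps: text blocks described by simple layout features are classified as heading, body text or decoration with scikit-learn. The features, training values and labels are invented for the example; the actual pipeline relies on far richer features and models.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy feature vectors for text blocks: [average font size, is_bold, number of words].
# The values and labels are invented; they only illustrate the idea.
X_train = [
    [18.0, 1,  4],   # large, bold, short  -> heading
    [17.0, 1,  6],   # heading
    [10.0, 0, 45],   # small, long         -> body text
    [10.0, 0, 60],   # body text
    [ 9.5, 0,  3],   # small and short     -> decoration (e.g. page number)
]
y_train = ["heading", "heading", "body", "body", "decoration"]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classify two unseen blocks: a likely section heading and a likely paragraph.
print(model.predict([[16.5, 1, 5], [10.0, 0, 52]]))  # -> ['heading' 'body']
```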
Another important source of factual information is the textual content of scientific publications. Here we focused on two domains: the bio-medical domain, where a pool of existing work can be drawn upon, and the computer science domain. For the latter we had to conduct additional work; more specifically, we created ontological structures and prepared data sets that contain manually annotated examples for the key concepts. For the information extraction itself we adapted state-of-the-art sequence classification algorithms together with a rich set of features. The final system is able to extract entities and relations between entities with satisfying performance across domains, without the need for tedious parameter tuning.
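As a rough illustration of sequence classification with hand-crafted features, the sketch below labels tokens with a conditional random field using the sklearn-crfsuite package; the feature set, the entity type and the two training sentences are invented and far simpler than what is used in the project.

```python
import sklearn_crfsuite  # CRF sequence labeller; a stand-in for the project's own models

def token_features(sentence, i):
    """A small, generic feature set for one token; the real feature set is much richer."""
    word = sentence[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "suffix3": word[-3:],
        "prev": sentence[i - 1].lower() if i > 0 else "<s>",
        "next": sentence[i + 1].lower() if i < len(sentence) - 1 else "</s>",
    }

# Two invented training sentences with BIO labels for an "algorithm" entity type.
sentences = [["We", "apply", "conditional", "random", "fields", "."],
             ["Support", "vector", "machines", "perform", "well", "."]]
labels = [["O", "O", "B-ALG", "I-ALG", "I-ALG", "O"],
          ["B-ALG", "I-ALG", "I-ALG", "O", "O", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)

test = ["We", "use", "random", "forests", "."]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```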
Once the factual information is extracted, either from tabular structures or from natural text, a disambiguation step is conducted and the information is then stored and managed using semantic technologies. These allow the data to be further aggregated and sophisticated retrieval functionality to be applied. Finally, the user is able to interact with the information through visual tools. We developed mechanisms that allow users to analyse the data without the need to learn any query language; instead, visual metaphors based on interactive multiple coordinated views are used.
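A minimal illustration of this storage and retrieval step, assuming rdflib and an ad-hoc example vocabulary rather than the project’s actual triple store and ontologies, is given below: extracted measurements are stored as triples and retrieved with a SPARQL query.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/code/")   # ad-hoc namespace for the sketch

g = Graph()
obs = URIRef(EX["observation/1"])
g.add((obs, RDF.type, EX.Measurement))
g.add((obs, EX.metric, Literal("F1 score")))
g.add((obs, EX.value, Literal(0.87, datatype=XSD.double)))
g.add((obs, EX.reportedIn, URIRef(EX["paper/42"])))

# Retrieve all measurements above a threshold with SPARQL.
results = g.query("""
    PREFIX ex: <http://example.org/code/>
    SELECT ?obs ?metric ?value WHERE {
        ?obs a ex:Measurement ;
             ex:metric ?metric ;
             ex:value ?value .
        FILTER (?value > 0.8)
    }
""")
for row in results:
    print(row.obs, row.metric, row.value)
```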
i-KNOW – Conference on Knowledge Technologies and Data Driven Business
The i-KNOW conference series aims at advancing research at the intersection of disciplines such as Knowledge Discovery, Semantics, Information Visualization, Visual Analytics, Social (Semantic) and Ubiquitous Computing. The goal of integrating these approaches is to create cognitive computing systems that will enable humans to utilize massive amounts of data. Since 2001, i-KNOW has successfully brought together leading researchers and developers from these fields and attracted over 450 international attendees every year. The international conference is held annually in Graz, Austria, and is organized by the Know-Center and Graz University of Technology.