Webis Group, Weimar

The Web Technology and Information Systems group, Webis for short (www.webis.de), is part of the media faculty at the Bauhaus-Universität Weimar. The faculty offers, among others, the degree program Computer Science and Media (at both bachelor's and master's level) and has a strong commitment to research in the field of digital media and information technology.

The Webis group was founded in 2005. The group conducts basic and applied research at the intersection of information retrieval, data mining, and knowledge processing, with a main focus on algorithm development. Our contributions include original work on density-based cluster analysis and cluster labeling, query segmentation and session handling, hash-based search and efficient indexing, the cross-language ESA retrieval model, retrieval-specific one-class classification and domain transfer, Web genre categorization, as well as algorithms for text forensics that address the detection of Wikipedia vandalism and text plagiarism.

The Webis group runs a Hadoop cluster and hosts several corpora of its own as well as relevant corpora from the fields of information retrieval and digital libraries (www.webis.de/research/corpora). Moreover, the group is involved in organizing various activities related to information retrieval; an overview is available on the group's website (www.webis.de).

The following list gives examples of our research activities. See www.webis.de/research for a more comprehensive description.

Netspeak (www.netspeak.org).
Netspeak is a word search engine that reuses the web as a corpus of writing examples. It allows authors to complete unfinished sentences and to search for alternative phrasings. Netspeak shows how the world speaks English; visit www.netspeak.org.
PAN (pan.webis.de).
PAN is a network of the world’s leading researchers in plagiarism detection and author identification. The Webis group is the initiator and one of the main driving forces behind PAN.
ChatNoir (chatnoir.webis.de).
ChatNoir is a search engine used for research and teaching on the state of the art in information retrieval in big data settings. It runs on a Hadoop cluster and indexes half a billion web pages, which can be searched in seconds.
TIRA (tira.webis.de).
TIRA, short for “Testbed for Information Retrieval Algorithms”, provides a means to embed executable and parameterizable experiments directly into a web page, making the experiments reproducible and comparable for other researchers. TIRA is used as the evaluation platform for the annual PAN competitions.
Matilda (www.webis.de/research/projects/matilda).
Matilda, short for “Mining Artificial Data”, is a line of research in which retrieval and mining technology is applied to simulation data. We consider simulation data mining, in combination with visual analytics technology, as a means to master the increasing amount of automatically generated data.

A long-term vision is the integration of technologies from information retrieval and artificial intelligence, combining the research approaches of digital library search, user modeling, and unstructured information management from the former field with problem-solving strategies and symbolic computation from the latter.