The MUMIA Summer School: “Building Next Generation Search Systems”.

A Summer Training School entitled “Building Next Generation Search Systems” was organised by the MUMIA (Multi-lingual and Multifaceted Interactive Information Access) COST Action (www.mumia-network.eu), and held from 24th to 28th September 2012 in Chalkidiki, Greece. Twenty-one PhD students and early-stage researchers attended the training school. A unifying theme across several of the talks was patent search. Patent searches typically involve long queries over documents that are classified by topic, authored by professionals and domain-specific. In contrast, general web searches involve short queries over documents that are not classified by topic, are often authored by amateurs, and (in the case of web pages and blogs) tend to be domain-independent.

The opening talk was given by John Tait, who spoke about the unique nature of patent search. Since a patent can be declared invalid if the idea first appeared in another language, multilingual searching is essential. Patent search is also time-critical, as patents can go out of date. Traditionally patent search has been considered recall-oriented, since we need to know whether anything, anywhere, has been written on the subject of a patent. However, there is also a need to limit the size of the hit list: in practice patent examiners do not want to examine more than 200 to 300 documents. The concept of patent landscaping, a current research area, goes beyond mere search, as it takes into account what the information is needed for. For example, we may want to understand our competitors better – which components does Samsung own patents on? We may wish to identify business opportunities in patented areas, or avoid doing research in areas where patents are unlikely to be granted. We can also predict future product development trends through patent activity.

Georgios Paltoglou of the University of Wolverhampton spoke about distributed information retrieval for patent search, also known as federated patent search. Patent searches invariably involve trawling through several collections of documents, such as medical records, text documents and databases. Federated search involves three research areas: representation of the source collections; source selection, the job of the “broker”, where the bulk of research has been done; and result merging. Previously, federated systems had been seen as a means of accessing the “invisible web”, but there are characteristics of patent search which make it amenable to distributed IR. Patent offices around the world are widely geographically distributed, and patent documents are manually assigned IPC (International Patent Classification) codes, a five-level hierarchy, so patents from each IPC subsection can be placed in their own collection. Choosing the right source of documents increases the efficiency of the search, and also avoids less relevant collections introducing noise into the hit list.
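The broker's source-selection step can be sketched as follows; the simple log-df scoring formula and the collection-statistics format are illustrative assumptions, not the specific method covered in the talk.

```python
from math import log

def select_sources(query_terms, collections, k=2):
    """Score each collection for a query and return the top-k names.

    collections: dict of name -> {"df": {term: doc_freq}, "size": n_docs}
    The log-df score is a simple stand-in for CORI-style belief scores.
    """
    scores = {}
    for name, stats in collections.items():
        score = 0.0
        for term in query_terms:
            df = stats["df"].get(term, 0)
            # Collections in which the query term occurs in many
            # documents score higher, normalised by collection size.
            score += log(1 + df) / log(1 + stats["size"])
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Only the selected collections are then queried, and their result lists passed on to the merging stage.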

Gabriella Pasi spoke on contextualising search through the representation and exploitation of user context. The traditional approach in IR has been “query-centred”, a “one size fits all” viewpoint in which the same query from different users produces the same results. This ignores user expertise and experience, and does not consider the task at hand. The emphasis is on topical relevance, but we should also take into account such things as the desired geographic source or author, the user’s trust in certain sources, timeliness, the user’s knowledge of the domain, emotional state and gender. Modelling these contexts requires the creation of user models, which raises two issues: firstly, we must keep the interaction with the system as simple as possible (or unchanged) while collecting information about the user; secondly, we must exploit the user profile to enhance search quality. There is a tension between personalisation and privacy: many states have privacy laws that restrict user profiling. One solution, client-side personalisation, is to keep user profiles local to the user’s own machine. There is a need for user-based evaluation, in which the two issues of user profile accuracy and retrieval effectiveness are considered separately.
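Client-side personalisation can be sketched as a purely local re-ranking step. The linear score-plus-boost combination and the result/profile formats below are illustrative assumptions:

```python
def personalise(results, profile):
    """Re-rank server results on the client using a private profile.

    results: list of (doc_id, server_score, doc_terms) tuples
    profile: dict of term -> interest weight; it stays on the user's
    machine, so no personal data is sent to the search server.
    """
    def combined(item):
        doc_id, score, terms = item
        # Boost the server score by the user's interest in the
        # document's terms; an empty profile leaves the ranking as-is.
        return score + sum(profile.get(t, 0.0) for t in terms)
    return sorted(results, key=combined, reverse=True)
```

The trade-off is that the profile cannot follow the user across devices, but the privacy concerns raised by server-side profiling largely disappear.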

Pavel Braslavski of Kontur Labs and the Ural Federal University gave a talk entitled “Genres and other non-topical features for IR”. Genre-aware search engines could refine the relevance model by returning documents that are both topically relevant and in the desired genre, enrich the query language by letting the user specify the desired genre, and enable the building of vertical indexes – a separate index for each genre. Using principal component analysis (PCA) he found five main genres, which form a spectrum: legal, scientific, publications, literature and chat. Moving along this spectrum from the legal end to the chat end, the adverb ratio increases while the adjective ratio and average word length fall. He also considered the possible role of text readability in personalising IR searches: results could be re-ranked when there is a mismatch between the user’s reading profile and the readability of the returned documents. The user’s reading proficiency could be estimated from the average query length in words, or from the average reading level of satisfied clicks (those where the user lingers on a returned document for 30 seconds or more).
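The satisfied-click heuristic can be sketched directly; the 30-second dwell threshold is from the talk, while the click-log format and numeric reading levels are assumptions:

```python
def user_reading_level(clicks, min_dwell=30):
    """Estimate a user's reading proficiency from satisfied clicks.

    clicks: list of (dwell_seconds, doc_reading_level) pairs.
    A click counts as 'satisfied' when the user lingers on the
    document for min_dwell seconds or more; the estimate is the
    mean reading level of those documents (None if there are none).
    """
    levels = [level for dwell, level in clicks if dwell >= min_dwell]
    return sum(levels) / len(levels) if levels else None
```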

Stephanos Vrochidis from the Centre for Research and Technology Hellas spoke about multimedia search engines, in particular how to make them interactive through explicit and implicit user feedback. With explicit feedback, user responses become training data for future searches via machine learning techniques, but in practice users are unwilling to expend the effort required to provide this data. Implicit feedback can be obtained from patterns of user interaction (such as mouse clicks) in log files, or from physiological signals such as eye movements and heart rate, treated as indicators of user interest. Multimedia techniques have been used in patent search, where, for example, figure (diagram) labels associate text with images.
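One classic way to turn explicit feedback into a revised query is Rocchio refinement; the talk did not name a specific algorithm, so this is an illustrative sketch using term-weight vectors:

```python
def rocchio(query_vec, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query refinement from explicit relevance judgements.

    All vectors are dicts of term -> weight. Documents the user marks
    relevant pull the query vector toward their centroid; documents
    marked non-relevant push it away. Negative weights are dropped.
    """
    def centroid(docs, term):
        if not docs:
            return 0.0
        return sum(d.get(term, 0.0) for d in docs) / len(docs)

    terms = set(query_vec)
    for doc in relevant + nonrelevant:
        terms |= set(doc)
    new_query = {}
    for term in terms:
        weight = (alpha * query_vec.get(term, 0.0)
                  + beta * centroid(relevant, term)
                  - gamma * centroid(nonrelevant, term))
        if weight > 0:
            new_query[term] = weight
    return new_query
```

The same update works whether the judgements come from explicit clicks on “relevant” buttons or are inferred from implicit signals such as dwell time.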

Mike Salampasis spoke on how to model and analyse search behaviour, drawing on his practical experience of designing and using patent search systems. Information seeking is more human-oriented and open-ended than information retrieval, since IR technology is focused on algorithms and on issues such as recall and precision. User-centred measures, over and above traditional IR measures, take into account that the user wants maximum information for minimum effort: efficiency (how many steps must the user take?), accuracy (how many mistakes does the user make?), recall in the sense of what the user remembers afterwards about the search session, and the user’s emotional response. Experimentally, it is necessary to set up search scenarios to see how people function in a realistic setting, observe them directly, administer pre- and/or post-search questionnaires, use think-aloud protocols, or measure physiological responses such as eye movements. However, techniques based on collecting, analysing and visualising the discrete steps (choices and decisions) that information seekers make during an information-seeking episode can provide additional views which cannot easily be obtained from more traditional usability testing methods.

Mihai Lupu talked about the evaluation of IR systems: without evaluation there is no research. He covered retrieval effectiveness rather than efficiency, since effectiveness is more difficult to define, requires user involvement in its study, and attracts more current research. He covered the history and relative merits of recall- and precision-based measures, many of which are implemented in the trec_eval tool available from the TREC website (http://trec.nist.gov/trec_eval). Recently, Smucker and Clarke (2012) developed a time-based calibration, in which a decay factor is a function of the time required for the user to reach item k in the ranked list. Although there is no single answer to “which measure is best?”, it is possible to group measures by computing the correlations between the rankings they produce using Kendall’s tau. There are also considerations of which measures discriminate best between systems, and of their stability between runs. The choice of measure depends on the task at hand, e.g. for a known-item search we would need a measure that looks only at the rank of that one relevant document. Mihai spoke of some challenges in measuring relevance, such as approaches to pooling, and Voorhees’s finding that inter-annotator agreement on the relevance of documents is rarely above 80%. To determine whether the improvement of one system over another is statistically significant, we can use Student’s t-test to compare the sets of effectiveness scores, over 25 or more topics, for the two systems under comparison. However, it may be better to look for “substantive” rather than statistical significance, and ask ourselves such things as “does a 2% increase in retrieval performance actually make a user happier?”
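The paired significance test can be sketched over per-topic scores (the scores in the example below are invented for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired Student t statistic over per-topic effectiveness scores.

    scores_a, scores_b: scores (e.g. average precision) of two systems
    on the same topics, in the same order. With 25 topics (24 degrees
    of freedom), |t| above roughly 2.06 indicates a significant
    difference at the 5% level, two-tailed.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # mean of per-topic differences divided by its standard error
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

Even a statistically significant t value says nothing by itself about whether the difference would be noticeable to a user, which is exactly the substantive-significance caveat.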

Michael Oakes from the University of Sunderland spoke about Natural Language Processing, in particular the levels of language (syntactic, semantic, discourse etc.) at which texts can be studied, and their relative difficulties with respect to the bag-of-words model. The second part of his talk was on the overlap between text classification techniques and standard IR, which also classifies documents, as relevant or non-relevant.

In his first session, Branimir Reljin of the University of Belgrade gave an overview of image retrieval systems; in the second, he talked about his group’s work at the University of Belgrade on ways to accelerate the initial search without significant loss of accuracy. One approach is to cluster the image set beforehand using minor component analysis (MCA), and then retrieve as the first result set the entire cluster closest to the query. They have also worked on reducing the size of the feature vector, for example by finding a more equal balance between colour and line/texture features, finding dominant components by statistical methods, and combining sets of features into one using the Gabor mean of means. They found that it was possible to reduce an initial set of 556 features to one tenth of its size without greatly affecting the set of images retrieved in response to a query.
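The cluster-first speed-up can be sketched as below; the talk's clustering used minor component analysis (MCA), whereas here the centroids are simply assumed to have been precomputed offline:

```python
def nearest_cluster(query_vec, centroids, members):
    """Cluster-first retrieval: compare the query only to cluster
    centroids and return the whole nearest cluster as the initial
    result set, avoiding a full scan of the image collection.

    centroids: dict of cluster_id -> feature vector (list of floats)
    members:   dict of cluster_id -> list of image ids
    """
    def dist2(u, v):  # squared Euclidean distance
        return sum((a - b) ** 2 for a, b in zip(u, v))
    best = min(centroids, key=lambda c: dist2(query_vec, centroids[c]))
    return members[best]
```

With C clusters and N images, the query is compared against C centroids instead of N feature vectors; finer ranking can then be applied within the returned cluster.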