A preprint of an ECIR 2020 paper by Chris Kamphuis, Arjen de Vries, Leonid Boytsov and Jimmy Lin caught my eye recently, as it reports on an investigation of eight variants of the BM25 scoring function. BM25 is where information retrieval research meets search practice: most search vendors will claim they are using BM25 but will never go into the details. As the authors note, BM25 is now often incorporated into a multi-stage reranking process, and for commercial vendors the exact basis of this reranking is their magic dust when it comes to relevance ranking.
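Since the papers discussed here assume the formula is familiar, a minimal sketch in Python may help. This is one common formulation (roughly the Lucene flavour, whose IDF cannot go negative); the eight variants examined by Kamphuis et al. differ in details such as the exact IDF and the length normalisation.

```python
import math

def bm25_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """One common BM25 formulation, for a single term in a single document.

    tf: term frequency in the document
    df: number of documents containing the term
    doc_len: document length in terms; avg_doc_len: collection average
    num_docs: total number of documents in the collection
    k1: term-frequency saturation; b: document-length normalisation
    """
    # Lucene-style IDF: the +1 inside the log keeps it non-negative.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Saturating tf component, normalised by relative document length.
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# A query is scored by summing bm25_score over its terms.
```

The two free parameters are where much of the variation (and the tuning pain discussed below) comes in: k1 controls how quickly repeated occurrences of a term stop adding to the score, and b controls how strongly long documents are penalised.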
The best place to start on the BM25 journey is a 2008 paper in the Journal of Information Science by Stephen Robertson, in which he gives a fascinating account of its back-history without using any mathematical formulae. Since I am writing this on Brexit day, it is especially nice to report that BM25 is one of the major UK contributions to the development of information retrieval. (Others include the work of Cyril Cleverdon at Cranfield and Steve Pollitt at the University of Huddersfield.)
Although the Lucene implementation is often cited as the master implementation, BM25 was in fact only adopted in 2015, for Solr 6. Even at that stage the role of document length in the scoring was the subject of much discussion.
The challenge in talking about BM25 is the number of variants that have been proposed, notably BM25F, which takes into account the fields within a document (so much for ‘unstructured information’!) and is now quite widely used. Over the last few years there have been several papers that explore both the attributes of and the issues with BM25 and its variants. In a 2014 paper, Andrew Trotman, Antti Puurula and Blake Burgess looked at Improvements to BM25 and Language Models Examined. Four years later, Colin Wilkie and Leif Azzopardi presented The Impact of Fielding on Retrieval Performance and Bias at the Australasian Document Computing Symposium, looking in particular at whether fields should be scored independently and then combined (Model 1), or first combined and then scored (Model 2). To me a mystery about this paper is that there is no reference to BM25 in the title, and I suspect that it has not received the attention it deserves.
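To make the Model 1 / Model 2 distinction concrete, here is a toy sketch; the field names and weights are illustrative assumptions of mine, not taken from the paper, and it reuses the bm25_score sketch above.

```python
# Illustrative field weights (assumed, not from the paper).
FIELD_WEIGHTS = {"title": 2.0, "body": 1.0}

def model1_score(per_field_scores):
    # Model 1: each field is scored independently with BM25,
    # then the per-field *scores* are combined.
    return sum(FIELD_WEIGHTS[f] * s for f, s in per_field_scores.items())

def model2_pooled_tf(per_field_tf):
    # Model 2 (the BM25F idea): weighted term *frequencies* are pooled
    # across fields first; BM25's saturation is then applied once
    # to the pooled value.
    return sum(FIELD_WEIGHTS[f] * tf for f, tf in per_field_tf.items())

# For a term with tf 2 in the title and 5 in the body:
#   Model 1: 2.0 * bm25(tf=2, title stats) + 1.0 * bm25(tf=5, body stats)
#   Model 2: bm25 applied once to the pooled tf of 2*2.0 + 5*1.0 = 9.0
```

The practical difference is that Model 2 saturates term frequency only once, so a term repeated across several fields is not rewarded several times over.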
If you want to put BM25 in a wider perspective I can recommend Learning to Adaptively Rank Document Retrieval System Configurations by Romain Deveaud, Josiane Mothe, MD Zia Ullah and Jian-Yun Nie. (Please note that MD is not a typing error.) They observe that ‘Modern Information Retrieval (IR) systems have become more and more complex, involving a large number of parameters. For example, a system may choose from a set of possible retrieval models (BM25, language model, etc.), or various query expansion parameters, whose values greatly influence the overall retrieval effectiveness. Traditionally, these parameters are set at a system level based on training queries, and the same parameters are then used for different queries.’ They also observe that ‘it may not be easy to set all these parameters separately, since they can be dependent. In addition, a global setting for all queries may not best fit all individual queries with different characteristics. The parameters should be set according to these characteristics.’
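As a concrete, if simplistic, illustration of the traditional approach they contrast with, here is a hedged sketch of a global grid search over BM25’s k1 and b on training queries. Here evaluate_map is a hypothetical helper that runs the queries with the given parameters and returns a mean effectiveness score; the adaptive approach in the paper would instead condition the configuration on features of each query.

```python
import itertools

def tune_bm25_globally(train_queries, evaluate_map):
    """Pick a single (k1, b) pair to use for all queries.

    evaluate_map(queries, k1, b) is assumed to run the retrieval system
    with those parameters and return e.g. mean average precision."""
    grid = itertools.product(
        [0.6, 0.9, 1.2, 1.5, 2.0],  # candidate k1 values (assumed range)
        [0.3, 0.5, 0.75, 0.9],      # candidate b values (assumed range)
    )
    return max(grid, key=lambda kb: evaluate_map(train_queries, *kb))
```

This is exactly the kind of one-size-fits-all setting the authors argue against: the winning pair is tuned to the training queries as a whole, not to the characteristics of any individual query.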
These are just a few of the papers that I have found of interest in gaining a better appreciation of BM25, and especially of the fact that the choice of variant, and the way it is included in multi-stage reranking, have to be taken into consideration to achieve the best possible ranking in a search system. Google Scholar lists 14,100 results for “BM25”. I think I will leave the rest to another day.
The point I would like to emphasise (wearing my consultant/Professor hat) is that there could be interesting opportunities to assess the performance of BM25 variants in enterprise implementations, especially where the collections are in multiple languages. I suspect that the open source community in particular (Lucene/Solr and Elasticsearch) would welcome a dialogue with the research community.