And finally… the pleasures and pain of inverting files

The topic of search index development came floating across my mind when reading the work of the IR Anthology team on identifying over 53,000 research papers. For a commercial search software company to take advantage of the outcomes of this research would almost always require its customers to re-index their document collections (which might run to 500 million+ documents), and that is something they are not going to risk. This, in my view, is a much greater barrier to the adoption of novel and potentially very valuable IR methodologies than whether or not they are published in open access journals. Perhaps all IR courses should invite along a search manager to give students a sense of just what is involved in managing large-scale internal search applications.

Building the initial index is not a time to be around either an enterprise or an e-commerce search team. Not so very long ago it might take a couple of weeks to index a reasonable-sized document collection. Halfway through, a rogue document (often a PDF) might bring the process to a halt, or a post-index check might show that some stop words had not been correctly recognized. Either might require starting all over again. Working out how best to add new content to an index is also hard work. It is not unusual for employees to be unable to find a document they wrote in the last few days because it is in a collection that is only re-indexed on a weekly basis. The duration and fragility of this process inhibit search vendors from making changes to the index structure that would require either a full or at least a partial re-index.

We are of course familiar with the designation of ‘inverted file’ for the most common architecture for a search index but might not be aware that the early work on developing this architecture was undertaken by Alfonso Cárdenas at the IBM Research Laboratory in San Jose back in 1974/1975 (Communications of the ACM 1975, 18(5), 253-263). However, the first use of the concept of an inverted index arguably dates back to 1947 and W. E. Batten at the ICI Research Laboratories working with Hollerith punched cards, where one card recorded all the document numbers relating to a particular topic. By the time I started my career each card referred to up to 10,000 records. Four cards down on the illuminated screen reveal 43 results. Add a fifth and ZERO! What to do next? That is the question.
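For readers who have never met one, the card-per-topic idea translates almost directly into code. What follows is only a minimal sketch in Python, not how any particular search product works: the function names `build_inverted_index` and `search_all` and the three toy documents are my own assumptions for illustration. Each term plays the part of one card holding a set of document numbers, and laying several cards on top of each other becomes a set intersection.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenisation (lowercase, split on whitespace), for illustration only.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, terms):
    """Return the documents containing every term (a Boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    # Stacking the "cards": only document numbers present on all of them survive.
    return set.intersection(*postings)


# Hypothetical toy collection, invented for this example.
documents = {
    1: "polymer coating corrosion test",
    2: "polymer coating adhesion",
    3: "corrosion of steel pipes",
}

index = build_inverted_index(documents)
two_terms = search_all(index, ["polymer", "coating"])
three_terms = search_all(index, ["polymer", "coating", "steel"])

print(sorted(two_terms))    # documents on both cards
print(sorted(three_terms))  # one card too many: nothing left
```

Even this toy version hints at why re-indexing hurts: change the tokenisation rule and every entry has to be rebuilt from the original documents.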

There is a very good 2006 introduction and review on inverted files from Justin Zobel and Alistair Moffat, and I’d also commend The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes by Nils Reimers and Iryna Gurevych for a 2021 perspective.
