Relevance under uncertainty – the commercial realities of IR development

Relevance Under Uncertainty – How Loop54 does software engineering to advance relevance

Loop54 (on the market under the name FactFinder Infinity) is a technology that integrates with e-commerce stores and determines, based on visitor interactions and in real time, which products are most relevant for each individual user at every moment. It attempts to perform the function a really good salesperson would if you stepped into a brick-and-mortar store: figure out as quickly as possible exactly what you are interested in and guide you directly to it. Just as with a really good salesperson, the visitor is not meant to notice that anything out of the ordinary happened. This is not the business of definitive rights and wrongs, but ever so many shades of roughly correct.

John Carmack put it fairly well when he said about neural networks that “It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made.”

Loop54 uses machine learning techniques to do all of this fully automatically, which makes it difficult to know whether changes to the software improve relevance or degrade it. This aspect affects how development of Loop54 happens, but before we get into that, let’s look at what Loop54 is, more specifically.

Information Retrieval Is Conditional Probability

Loop54 was initially created as a search engine, answering “How likely is this product to be relevant given this search phrase?” Over the past ten years it has evolved to support more types of relevance. On a high level, every user-facing feature is a question of conditional probability:

How likely is this product to be relevant, given that the visitor …

  • … looks for alternatives to an electric kettle?
  • … looks for things that go well together with a guitar?
  • … looked at a smartphone minutes ago and is now searching for “Samsung”?
  • … has a history of purchasing blue clothing, and is now looking in the category of dresses?

I believe this type of question of conditional probability is at the heart of information retrieval. By looking at it this way, we avoid common traps like suggesting five pairs of headphones to a user who has just bought a pair of headphones, or five pairs of sneakers to a user because they looked at sneakers once two months ago. But perhaps more importantly, we avoid treating the search phrase as the only input modality. In many situations, users also give off other signals of intent that allow us to better understand what is likely to be relevant right now.
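
To make the framing concrete, here is a minimal sketch of relevance as a conditional probability that combines several signals of intent. The naive independence assumption and the toy probability tables are my own simplifications for illustration, not Loop54’s actual model:

    # Illustrative sketch: relevance as a conditional probability combining
    # several signals of intent. The independence assumption and the toy
    # probability tables are simplifications, not Loop54's actual model.

    # Toy per-signal likelihood tables, keyed by (product, signal).
    QUERY_SIGNAL = {("samsung-phone", "samsung"): 0.9, ("electric-kettle", "samsung"): 0.01}
    SESSION_SIGNAL = {("samsung-phone", "viewed:smartphone"): 0.7}
    HISTORY_SIGNAL = {("blue-dress", "bought:blue-clothing"): 0.6}

    def relevance(product, query=None, session=(), history=(), prior=0.1):
        """Rough P(product relevant | query, session, history), assuming independent signals."""
        score = prior
        if query is not None:
            score *= QUERY_SIGNAL.get((product, query), 0.05)
        for signal in session:
            score *= SESSION_SIGNAL.get((product, signal), 0.05)
        for signal in history:
            score *= HISTORY_SIGNAL.get((product, signal), 0.05)
        return score

    # "Looked at a smartphone minutes ago and is now searching for 'Samsung'":
    print(relevance("samsung-phone", query="samsung", session=["viewed:smartphone"]))
    print(relevance("electric-kettle", query="samsung", session=["viewed:smartphone"]))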

E-Commerce Is Data-Limited

This is where e-commerce is both blessed and cursed. Blessed, because the end user provides the software with a very strong signal of relevance: the purchase. Cursed, on the other hand, because purchase data for most small-to-mid-sized e-commerce businesses is very limited; there is not enough purchase information on individual products to determine their relevance in the wide variety of contexts in which they are potentially relevant. New products take a while to gather purchase data (the “cold start problem”), and in some verticals (particularly fashion and technology), new products make up a significant chunk of the active product catalogue.

To get around this problem, Loop54 is based on the hypothesis that similar products can be treated as one unit when it comes to relevance. Loop54 very rarely deals with individual products, instead operating on clusters of similar products, structured hierarchically based on domain-specific similarity measures.

To give an example, if a toy store visitor purchases one specific passenger car after searching for “train”, then maybe that counts as a vote for the relevance of all rail-bound vehicles conditional on the search phrase “train”, and not just that specific passenger car. If a few visitors of another shop have bought steak thermometers together with barbeque grills, then maybe it is relevant to suggest steak thermometers as complements to barbeque grills more generally, even those pairs of products for which there is not yet any individual purchase data. (Complementary products like these are especially difficult for traditional techniques, because if the data for n products is sparse, you can guess what the data for n² pairs of products is like.)
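
As a rough sketch of how that generalisation could work in code, a purchase can be counted as evidence for a whole cluster of similar products rather than for the purchased item alone. The cluster contents and the simple count-based scoring below are invented for the example; Loop54’s actual clusters are hierarchical and built from domain-specific similarity measures:

    # Illustrative sketch: a purchase counts as evidence for a whole cluster of
    # similar products, not just the purchased item. Cluster contents and the
    # count-based scoring are assumptions made for this example only.

    CLUSTERS = {
        "rail-vehicles": ["passenger-car", "locomotive", "freight-wagon"],
        "plush-toys": ["teddy-bear", "plush-rabbit"],
    }
    PRODUCT_TO_CLUSTER = {p: c for c, ps in CLUSTERS.items() for p in ps}

    # Purchases observed after a given search phrase: (query, purchased product).
    observations = [("train", "passenger-car"), ("train", "passenger-car"), ("teddy", "teddy-bear")]

    # Tally evidence at the cluster level, conditional on the query.
    cluster_votes = {}
    for query, product in observations:
        cluster = PRODUCT_TO_CLUSTER[product]
        cluster_votes[(query, cluster)] = cluster_votes.get((query, cluster), 0) + 1

    def relevance(product, query):
        """Every member of a cluster inherits the cluster's votes for this query."""
        cluster = PRODUCT_TO_CLUSTER[product]
        return cluster_votes.get((query, cluster), 0)

    # The locomotive has never been bought after a "train" search, but still scores.
    print(relevance("locomotive", "train"))    # 2
    print(relevance("plush-rabbit", "train"))  # 0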

This raises another concern, though. While the generalisation hypothesis is market-tested in the sense that Loop54 tends to significantly outperform the solution it is replacing, it is very difficult to determine whether small tweaks to the system improve or degrade relevance. If we had a separate gold standard to compare to, it would be easy, but in this case Loop54 itself is our best idea of a gold standard. The brief version is that we don’t have a solution to that problem. We put humans in the loop to verify the relevance of sample queries after relevance-affecting changes, but this is a slow process whose cost grows linearly with both the number of changes and the number of installations. That leads to some uncertainty around changes, which has shaped how we work with product development.

Innovation Comes From Easy Prototyping

Since we don’t have concrete proof that the things we do are right or wrong, we need to remain flexible in the face of diffuse evidence over a long period of time. This means it is critical to make it easy to experiment, because cheap experiments lead to flexibility in technical direction. The development organisation around Loop54 has always been focused on a high rate of experimentation and innovation: from the start, Loop54 has had a modular architecture, with a configuration system capable of rewiring most of the program logic without a single line of code. Such a flexible configuration system can be a liability without discipline and an eye for long-term maintenance, but success would have been more expensive without built-in support for tinkering.
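
To give a flavour of what configuration-driven rewiring looks like in spirit, here is a generic component-registry sketch. The component names and configuration format are invented for the illustration; Loop54’s actual configuration system is its own thing:

    # Generic sketch of a configuration-driven pipeline: components register
    # themselves under names, and a plain data structure decides which ones run
    # and in what order. Names and config format are invented for this example.

    REGISTRY = {}

    def component(name):
        """Register a pipeline step under a configuration-visible name."""
        def wrap(fn):
            REGISTRY[name] = fn
            return fn
        return wrap

    @component("lowercase_query")
    def lowercase_query(state):
        state["query"] = state["query"].lower()
        return state

    @component("add_synonyms")
    def add_synonyms(state):
        state.setdefault("expansions", []).append(state["query"] + "s")  # toy "synonym"
        return state

    def run_pipeline(config, state):
        """Rewiring behaviour means editing the config list, not the code."""
        for step in config["pipeline"]:
            state = REGISTRY[step](state)
        return state

    config = {"pipeline": ["lowercase_query", "add_synonyms"]}
    print(run_pipeline(config, {"query": "Kettle"}))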

The distinction between software engineering and feature development has lived on as the organisation has grown. Most of the feature development is done by a small team of product specialists, i.e. experts on Loop54 configuration – not because they have any sort of privilege within the organisation, but because they are intimately familiar with the requirements of many customers, as well as what the competition looks like. When they have ideas, they can create prototypes for even relatively large features in the span of hours to a few days – without engaging software engineers at that stage. Sometimes these prototypes are inspired by direct requests from customers, in which case the prototype can be tested out on production workloads in collaboration with that customer.

Once a prototype seems successful, the software engineering team helps build the feature into Loop54 in a way that optimises performance and future maintainability, and that can be rolled out to every customer that benefits from it. This takes significantly longer – on the order of days to months – which is why it is important to do this only for ideas that appear successful.

Software Engineering Stays Responsive

In order to remain responsive both to changing circumstances and to requests from the rest of the organisation, software engineering resources are committed to efforts on short time frames only, and slack is maintained in the system so that unforeseen problems can be addressed quickly. For the same reason, the amount of work in progress is kept low, with priority given to finishing ongoing work before starting new efforts. This means that things such as internal peer review are taken seriously and feedback is rapid – hours is the norm. In addition to planning over short time frames only, responsiveness requires both an open mind towards contradictory evidence and an appropriate level of detachment from the work product. The people working on a feature need to be able to quickly realise when things aren’t going the way they expected, and be willing to throw out past work to try a different approach instead. This does not come easily, and takes understanding from management all the way to the top.

Quality and Velocity Are Correlated

Once an approach seems promising, it is important to invest in quality early on, even at the expense of functionality.

There are two reasons for this. The first is mundane: it is much easier to add features to high-quality software than it is to add quality to featureful software. The second is less obvious but becomes especially important given the way we work with experimentation. Any issues with the core Loop54 functionality contaminate experimental results directly, but they also make it harder for non-programmers to experiment with new business logic because of e.g. unexpected interactions between components.

Software quality also helps software engineers move faster, by allowing them to focus on smaller chunks of functionality at a time. If improvements to one area of the software often trigger problems in another area, fewer improvements will be made, because attention is divided between fixing problems and making improvements. Spending time on achieving quality up front allows one to move significantly faster down the line. That, however, requires a decent idea of which things are valuable, and finding that out is why we have a lower quality bar for experiments.

Avoiding Research Is Competitive Differentiation

When people are introduced to the technical details of how Loop54 works, it is common for them to ask, “Why do you not do popular thing X?” or “Have you tried research idea Y?” These are fair questions, because at every level Loop54 can seem technically primitive. There are three reasons for this apparent lack of sophistication:

  • Reliability. It is our experience that sticking with tried-and-true mechanisms where possible means there are fewer surprises when components are integrated with each other, and it is easier to troubleshoot the problems that do occur.
  • Simplicity. Unsophisticated solutions are usually simpler, which makes it easier to achieve high quality, with the benefits already discussed.
  • Differentiation. Many of the ideas that are published are those that our competition has a head start on, either because a competitor was the one to publish the result, or because a competitor has more resources to spend on achieving the desired result. A recurring theme when choosing which way to take the product has been to avoid replicating what our competition does, and instead find out how we can complement or improve on what they are doing.

That said, we are taking into account research from multiple areas, primarily by encouraging individuals in the organisation to follow their interests and pitch ideas for which direction to take development in. Sometimes such inspiration leads us to unexpected places, like using image processing techniques to create a palatable blend of search result sets from different categories, or borrowing from bioinformatics in comparing result sets to each other.
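
As an example of the kind of borrowing meant here, a sequence-alignment style comparison of two ranked result lists might look roughly like this. This is a generic Needleman-Wunsch sketch with toy scores of my own choosing, not the comparison Loop54 actually performs:

    # Sketch of a sequence-alignment-inspired similarity between two ranked
    # result lists (global alignment with toy match/mismatch/gap scores).
    # A generic illustration of the idea, not Loop54's actual measure.

    def alignment_score(a, b, match=2, mismatch=-1, gap=-1):
        """Global alignment score: higher means the two rankings agree more."""
        rows, cols = len(a) + 1, len(b) + 1
        dp = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):
            dp[i][0] = dp[i - 1][0] + gap
        for j in range(1, cols):
            dp[0][j] = dp[0][j - 1] + gap
        for i in range(1, rows):
            for j in range(1, cols):
                diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
        return dp[-1][-1]

    before = ["kettle-a", "kettle-b", "teapot", "thermos"]
    after  = ["kettle-a", "teapot", "kettle-b", "thermos"]
    print(alignment_score(before, after))   # small reorderings still score fairly highly
    print(alignment_score(before, before))  # identical lists score highest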

Innovation Takes Verification

All of this explains how we develop Loop54 to take advantage of innovations that appear to work, but we have already seen that one component is missing: we still haven’t mechanised the judgment of small relevance differences, and we don’t know how to. Traditional methods for inspecting the relevance of a result set often assume Boolean retrieval. This can be adapted to probabilistic retrieval by, for example, looking at the problem through information theory and scoring the Shannon information of the observation given our prediction at various percentiles of the result set. However, even this suffers from the problem that it requires a human to judge definitively whether a result is relevant or not. In the non-trivial cases, this is surprisingly difficult even to get hallway-testing participants to agree on.
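
Concretely, scoring the observation against the prediction could look something like the sketch below: a straightforward surprisal calculation at a few percentiles of the result set. The predicted probabilities and judgments are made up, and the approach still presupposes a binary human judgment:

    # Sketch: Shannon information (surprisal) of a human relevance judgment given
    # the engine's predicted probability of relevance, evaluated at a few
    # percentiles of the result set. All numbers are invented for illustration.
    import math

    def surprisal(judged_relevant, predicted_p):
        """-log2 P(observation | prediction): low when the prediction agreed with the judge."""
        p = predicted_p if judged_relevant else 1.0 - predicted_p
        return -math.log2(max(p, 1e-12))

    # (predicted probability of relevance, human judgment) at three percentiles.
    result_set = {
        "p10": (0.95, True),   # near the top: predicted relevant, judged relevant
        "p50": (0.60, True),
        "p90": (0.20, False),  # near the bottom: predicted irrelevant, judged irrelevant
    }

    for percentile, (p, judged) in result_set.items():
        print(percentile, round(surprisal(judged, p), 2))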

Determining categorically whether a product is relevant or not also gets combinatorially complicated when considering what the query is conditioned on. If a visitor to a grocery store has previously bought a lot of vegan products, has shown interest in cow-based dairy this session, is searching for “milk yoghurt”, and the grocery store really wants to push every visitor to the dairy section to buy cheese, which products should be in the top five results? Only vegan alternatives? Only yoghurts? Only cow-based milk? Dare I ask about the cheese? Or a combination of the above? An obvious way to prioritise would be based on immediacy (or what Herbert Weisberg would call ambiguity) of information: the search query is very immediate, so it gets priority over everything else, and what the grocery store wants to push is the least immediate, so it is considered least important. But this is just one possible way to prioritise, and it may not optimise relevance for the visitors of all businesses.
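
One way to express that immediacy ordering is as a simple weighting of per-signal relevance scores, roughly as below. The weights and scores are invented for the example, and, as said above, this is just one possible rule:

    # Sketch of an immediacy-based priority rule: more immediate signals get
    # larger weights when combining per-signal relevance scores. Weights and
    # scores are invented for the example; this is one possible rule, not the rule.

    IMMEDIACY_WEIGHTS = {
        "search_query": 1.0,      # most immediate
        "session_interest": 0.5,
        "purchase_history": 0.25,
        "merchant_push": 0.1,     # least immediate
    }

    def combined_relevance(per_signal_scores):
        """Weighted sum of per-signal relevance scores, ordered by immediacy."""
        return sum(IMMEDIACY_WEIGHTS[s] * score for s, score in per_signal_scores.items())

    # The "milk yoghurt" example: a cow-milk yoghurt vs. a vegan yoghurt vs. cheese.
    candidates = {
        "cow-milk yoghurt": {"search_query": 0.9, "session_interest": 0.8, "purchase_history": 0.1, "merchant_push": 0.0},
        "vegan yoghurt":    {"search_query": 0.9, "session_interest": 0.2, "purchase_history": 0.9, "merchant_push": 0.0},
        "cheese":           {"search_query": 0.1, "session_interest": 0.7, "purchase_history": 0.1, "merchant_push": 1.0},
    }

    for product, scores in sorted(candidates.items(), key=lambda kv: -combined_relevance(kv[1])):
        print(product, round(combined_relevance(scores), 2))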

Labeling a data set for all these combinations of contexts given just one priority rule would be daunting. With different priority rules for information it becomes even more difficult. Then as we discover new ways to think of relevance, the labels may need to be updated. Suffice it to say that this option has been ruled out as unscalable so far. So we currently don’t have an easy answer to this question. It is difficult enough to determine with statistical significance when relevance has changed at all, and much harder to determine whether a change is an improvement or deterioration.

This is a problem that sometimes keeps me awake at night. If you know of pertinent research that I’ve missed, I’d love to hear more. If you think you can contribute fresh research in this direction, even better!

 
