{"id":7848,"date":"2023-04-19T12:21:29","date_gmt":"2023-04-19T11:21:29","guid":{"rendered":"https:\/\/irsg.bcs.org\/informer\/?p=7848"},"modified":"2023-04-19T12:21:29","modified_gmt":"2023-04-19T11:21:29","slug":"relevance-under-uncertainty-the-commercial-realities-of-ir-development","status":"publish","type":"post","link":"https:\/\/archive-irsg.bcs.org\/informer\/?p=7848","title":{"rendered":"Relevance under uncertainty &#8211; the commercial realities of IR development"},"content":{"rendered":"<p><strong>Relevance Under Uncertainty &#8211; How Loop54 does software engineering to advance relevance<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">Loop54 (on the market under the name <\/span><a href=\"https:\/\/www.fact-finder.com\/blog\/loop54-becomes-infinity-plus-new-merchandising-capabilities\/\" target=\"_blank\" rel=\"noopener noreferrer\"><i><span style=\"font-weight: 400;\">FactFinder Infinity<\/span><\/i><\/a><span style=\"font-weight: 400;\">) is a technology that integrates with e-commerce stores and <\/span><b>determines based on visitor interactions, in real time, which the most relevant products are for each individual user at every moment<\/b><span style=\"font-weight: 400;\">. It attempts to perform the function a really good salesperson would if you step into a brick-and-mortar store: figure out as quickly as possible exactly what you are interested in and guide you directly to that. Just as with a really good salesperson, the visitor is not meant to notice that anything out of the ordinary happened. <\/span><span style=\"font-weight: 400;\">This is not the business of definitive rights and wrongs, but ever so many shades of roughly correct. 
<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/John_Carmack\" target=\"_blank\" rel=\"noopener noreferrer\">John Carmack<\/a> put it fairly well when he said about neural networks that <\/span><span style=\"font-weight: 400;\">&#8220;It is interesting that things still train even when various parts are pretty wrong \u2014 as long as the sign is right most of the time, progress is often made.&#8221;<\/span><\/p>\n<p><!--more--><\/p>\n<p><span style=\"font-weight: 400;\">Loop54 uses machine learning techniques to perform its functionality fully automatically, which makes it difficult to know if changes to the software improve relevance or make it worse. This aspect affects how development of Loop54 happens, but before we get into that, let&#8217;s look at what Loop54 is, more specifically.<\/span><\/p>\n<p><strong>Information Retrieval Is Conditional Probability<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">Loop54 was initially created as a search engine, answering &#8220;<\/span><b>How likely is this product to be relevant given this search phrase?<\/b><span style=\"font-weight: 400;\">&#8221; Over the past ten years it has evolved to support more types of relevance. 
On a high level, every user-facing feature is a question of conditional probability:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">How likely is this product to be relevant, given that the visitor &#8230;<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">&#8230; looks for alternatives to an electric kettle?<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">&#8230; looks for things that go well together with a guitar?<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">&#8230; looked at a smartphone minutes ago and is now searching for &#8220;Samsung&#8221;?<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">&#8230; has a history of purchasing blue clothing, and is now looking in the category of dresses?<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">I believe this type of question of conditional probability is at the heart of information retrieval. By looking at it this way, we avoid common traps like suggesting five pairs of headphones to a user who has just bought a pair of headphones, or five pairs of sneakers to a user because they looked at sneakers once two months ago. But perhaps more importantly, <\/span><b>we avoid restricting ourselves to focusing on the search phrase as the only input modality<\/b><span style=\"font-weight: 400;\">. In many situations, users also give off other signals of intent that allow us to better understand what is likely to be relevant right now.<\/span><\/p>\n<p><strong>E-Commerce Is Data-Limited<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">This is where e-commerce is both blessed and cursed. Blessed, because the end user provides the software with a very strong signal of relevance: the purchase. 
On the other hand cursed, because purchase data for most small-to-mid size e-commerce businesses is very limited; there is not enough purchase information on individual products to determine their relevance in the wide variety of contexts in which they are potentially relevant. New products take a while to gather purchase data (&#8220;cold start problem&#8221;), and in some verticals (particularly fashion and technology), new products make up a significant chunk of the active product catalogue.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To get around this problem, Loop54 is based on the hypothesis that <\/span><b>similar products can be treated as one unit when it comes to relevance<\/b><span style=\"font-weight: 400;\">. Loop54 very rarely deals with individual products, instead operating on clusters of similar products, structured hierarchically based on domain-specific similarity measures.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To give an example, if a toy store visitor purchases one specific passenger car after searching for &#8220;train&#8221;, then maybe that counts as a vote for the relevance of all rail-bound vehicles conditional on the search phrase &#8220;train&#8221;, and not just that specific passenger car. If a few visitors of another shop have bought steak thermometers together with barbeque grills, then maybe it is relevant to suggest steak thermometers as complements to barbeque grills more generally, even those pairs of products for which there is not yet any individual purchase data. (Complementary products like these are especially difficult for traditional techniques, because if the data for n products is sparse, you can guess what the data for n\u00b2 pairs of products is like.)<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This raises another concern, though. 
While the generalisation hypothesis is market-tested in the sense that Loop54 tends to significantly outperform the solution it is replacing, it is very difficult to determine whether small tweaks to the system improve or degrade relevance. If we had a separate gold standard to compare to, it would be easy, but in this case Loop54 itself is our best idea of a gold standard. <\/span><span style=\"font-weight: 400;\">The brief version is that we don&#8217;t have a solution to that problem. We put humans in the loop to verify the relevance of sample queries after relevance-affecting changes, but this is a slow process that scales only linearly with both number of changes and number of installations. <\/span><span style=\"font-weight: 400;\">That leads to some uncertainty around changes, which has shaped how we work with product development.<\/span><\/p>\n<p><strong>Innovation Comes From Easy Prototyping<\/strong><\/p>\n<p><b>Since we don&#8217;t have concrete proof that the things we do are right or wrong, we need to remain flexible in the face of diffuse evidence over a long period of time.<\/b><span style=\"font-weight: 400;\"> This means it is critical to make it easy to experiment, because cheap experiments lead to flexibility in technical direction.\u00a0 <\/span><span style=\"font-weight: 400;\">The development organisation around Loop54 has always been focused on a high rate of experimentation and innovation: Loop54 has from the start had a modular architecture, with a configuration system that is capable of rewiring most of the program logic without a single line of code. Such a flexible configuration system can be a liability without discipline and an eye for long-term maintenance, but success would have been more expensive without built-in support for tinkering.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The distinction between software engineering and feature development has lived on as the organisation has grown. 
Most of the feature development is done by a small team of product specialists, i.e. experts on Loop54 configuration \u2013 not because they have any sort of privilege within the organisation, but because they are intimately familiar with the requirements of many customers, as well as what the competition looks like. <\/span><b>When they have ideas, they can create prototypes for even relatively large features in the span of hours to a few days <\/b><span style=\"font-weight: 400;\">\u00a0\u2013 without engaging software engineers at that stage. Sometimes these prototypes are inspired by direct requests from customers, in which case the prototype can be tested out on production workloads in collaboration with that customer.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Once a prototype seems successful, the software engineering team helps build the feature into Loop54 in a way that optimises performance, limits future maintenance demands, and allows it to be rolled out to every customer that benefits from it. This takes significantly longer \u2013 on the order of days to months, which is why it&#8217;s important to do this only for ideas that appear successful.<\/span><\/p>\n<p><strong>Software Engineering Stays Responsive<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">In order to remain responsive both to changing circumstances and to requests from the rest of the organisation, <\/span><b>software engineering resources are committed to efforts on short time frames only, and slack is maintained in the system to allow unforeseen problems to be addressed quickly<\/b><span style=\"font-weight: 400;\">. For the same reason, the amount of work in progress is kept low, with priority given to finishing ongoing work before starting new efforts. This implies that things such as internal peer review are taken seriously and feedback is rapid \u2013 hours is the norm. 
<\/span><span style=\"font-weight: 400;\">In addition to planning over short time frames only, <\/span><b>responsiveness requires both an open mind towards contradictory evidence and an appropriate level of detachment from the work product<\/b><span style=\"font-weight: 400;\">. The people working on a feature need to be able to quickly realise when things aren&#8217;t going the way they expected, and be willing to throw out past work to try a different approach instead. This does not come easily, and takes understanding from management all the way to the top.<\/span><\/p>\n<p><strong>Quality and Velocity Are Correlated<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">Once an approach seems promising, it is important to invest in quality early on, even at the expense of functionality. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">There are two reasons for this, the first of which is mundane: <\/span><b>it is much easier to add features to high-quality software than it is to add quality to featureful software<\/b><span style=\"font-weight: 400;\">. <\/span><span style=\"font-weight: 400;\">The second is not so obvious but becomes especially important given the way we work with experimentation. Any issues with the core Loop54 functionality contaminate experimental results directly, but they also make it harder for non-programmers to experiment with new business logic because of e.g. unexpected interactions between components.<\/span><\/p>\n<p><b>Software quality also helps software engineers move faster, by allowing them to focus on smaller chunks of functionality at a time.<\/b><span style=\"font-weight: 400;\"> If improvements to one area of the software often trigger problems in another area, fewer improvements will necessarily be made because attention is divided between fixing problems and making improvements. Spending time on achieving quality up front allows one to move significantly faster down the line. 
That, however, requires that one has a decent idea of which things are valuable, and finding that out is why we have a lower quality bar for experiments.<\/span><\/p>\n<p><strong>Avoiding Research Is Competitive Differentiation<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">When people are introduced to the technical details of how Loop54 works, it is common for them to ask, &#8220;Why do you not do popular thing X?&#8221; or &#8220;Have you tried research idea Y?&#8221; This is a fair question, because at every level, Loop54 can seem technically primitive. There are three reasons for this apparent lack of sophistication:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Reliability<\/b><span style=\"font-weight: 400;\">. It is our experience that sticking with tried-and-true mechanisms where possible means there are fewer surprises when components are integrated with each other, and it is easier to troubleshoot the problems that do occur.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Simplicity<\/b><span style=\"font-weight: 400;\">. The unsophisticated is usually simpler and makes it easier to achieve high quality, with the benefits already discussed.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Differentiation<\/b><span style=\"font-weight: 400;\">. Many of the ideas that are published are those that our competition has a head start on, either because a competitor was the one to publish the result, or because a competitor has more resources to spend on achieving the desired result. 
A recurring theme when choosing which way to take the product has been to avoid replicating what our competition does, and instead find out how we can complement or improve on what they are doing.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">That said, we are taking into account research from multiple areas, primarily by <\/span><b>encouraging individuals in the organisation to follow their interests and pitch ideas for which direction to take development in<\/b><span style=\"font-weight: 400;\">. Sometimes such inspiration leads us to unexpected places, like using image processing techniques to create a palatable blend of search result sets from different categories, or borrowing from bioinformatics in comparing result sets to each other.<\/span><\/p>\n<h3><strong>Innovation Takes Verification<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">All of this explains how we develop Loop54 to take advantage of innovations that appear to work, but we have already seen there is one component missing: <\/span><b>we still haven&#8217;t mechanised the judgment of small relevance differences, and we don&#8217;t know how to<\/b><span style=\"font-weight: 400;\">. <\/span><span style=\"font-weight: 400;\">Traditional methods for inspecting the relevance of a result set often assume Boolean retrieval. This can be easily adapted to probabilistic retrieval by e.g. looking at the problem through information theory and scoring the Shannon information of the observation given our prediction at various percentiles of the result set. However, even this suffers from the problem that it requires that a human judges definitively whether a result is relevant or not. 
<\/span><b>In the non-trivial cases, it is surprisingly difficult to get even hallway-test participants to agree on this.<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Determining categorically whether a product is relevant or not also gets combinatorially complicated when considering what the query is conditioned on. If a visitor of a grocery store has previously bought a lot of vegan products, has shown interest in cow-based dairy this session, is searching for &#8220;milk yoghurt&#8221;, and the grocery store really wants to push every visitor of the dairy section to buy cheese, which products should be in the top five results? Only vegan alternatives? Only yoghurts? Only cow-based milk? Dare I ask about the cheese? Or a combination of the above? An obvious way to prioritise would be based on immediacy (or what Herbert Weisberg would call ambiguity) of information: the search query is very immediate, so gets priority over everything else, and what the grocery store wants to push is the least immediate, so is considered least important. But this is just one way to prioritise, and may not optimise relevance for the visitors of all businesses.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Labelling a data set for all these combinations of contexts given just one priority rule would be daunting. With different priority rules for information it becomes even more difficult. Then as we discover new ways to think of relevance, the labels may need to be updated. Suffice it to say that this option has been ruled out as unscalable so far. So we currently don&#8217;t have an easy answer to this question. It is difficult enough to determine with statistical significance when relevance has changed at all, and much harder to determine whether a change is an improvement or a deterioration.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is a problem that sometimes keeps me awake at night. 
If you know of pertinent research that I&#8217;ve missed, I&#8217;d love to hear more. If you think you can contribute fresh research in this direction, even better!<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Relevance Under Uncertainty &#8211; How Loop54 does software engineering to advance relevance Loop54 (on the market under the name FactFinder Infinity) is a technology that integrates with e-commerce stores and determines based on visitor interactions, in real time, which the most relevant products are for each individual user at every moment. It attempts to perform&hellip; <a class=\"more-link\" href=\"https:\/\/archive-irsg.bcs.org\/informer\/?p=7848\">Continue reading <span class=\"screen-reader-text\">Relevance under uncertainty &#8211; the commercial realities of IR development<\/span><\/a><\/p>\n","protected":false},"author":88,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[201,217],"tags":[],"class_list":["post-7848","post","type-post","status-publish","format-standard","hentry","category-feature-article","category-spring-2023","entry"],"_links":{"self":[{"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/wp\/v2\/posts\/7848","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/wp\/v2\/users\/88"}],"replies":[{"embeddable":true,"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7848"}],"version-history":[{"count":0,"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/w
p\/v2\/posts\/7848\/revisions"}],"wp:attachment":[{"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7848"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7848"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7848"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}