A framework for chatbot evaluation

Unless you’ve been on another planet for the last year or so, you’ll almost certainly have noticed that chatbots (and conversational agents in general) became quite popular during the course of 2016. It seemed that every day a new start-up or bot framework was launched, no doubt fuelled at least in part by the growth in the application of data science to language data, combined with a growing awareness of machine learning and AI techniques more generally. So it’s not surprising that we now see on a daily basis all manner of commentary on various aspects of chatbots, covering everything from marketing and design to development and commercialisation.

But one topic that doesn’t seem to have received quite as much attention is that of evaluation. It seems that in our collective haste to join the chatbot party, we risk overlooking a key question: how do we know when the efforts we have invested in design and development have actually succeeded? What kind of metrics should be applied, and what constitutes success for a chatbot anyway?

In a commercial setting, the short answer might be to extend traditional business metrics and KPIs based around retention and engagement, and to optimise through A/B testing and so on. But not all chatbots are deployed in such a setting, and this kind of approach risks conflating technical merit with a host of other circumstantial, confounding influences. So the question remains: is there a principled way to determine which metrics to apply when evaluating conversational agents, and crucially, how should those metrics be deployed as part of a feedback loop to iteratively improve chatbot design?

One initial way to answer this may be to consider the various perspectives involved in developing chatbots. For example:

  • The information retrieval perspective: many chatbots offer a type of question answering service, so we can measure their performance in terms of accuracy, e.g. using precision, recall, F-measure and so on (a minimal sketch of these measures follows this list). This is a common approach, but it focuses on relatively low-level metrics that don’t always align with the overall user experience. In addition, it is predicated on traditional notions of functional utility and relevance, which aren’t always appropriate for applications designed for discretionary or leisure use.
  • The user experience perspective: a chatbot is essentially an interactive software application, so we can evaluate it from a human factors or usability point of view, focusing on measures such as task completion, user satisfaction and so on. This is also quite a popular approach, but it is expensive and difficult to scale beyond a handful of user sessions.
  • The linguistic perspective: chatbots offer a conversational interaction, which can be evaluated by measuring the degree to which they support Grice’s conversational maxims and other cooperative principles. Unlike the other metrics, these principles begin to accommodate some of the critical elements that underpin effective human discourse, but they rely on expert judgement in their application and are thus difficult to automate or use at scale.
  • The AI perspective: a chatbot is (arguably) an attempt to offer a human-like interaction, so we could apply AI-oriented measures such as the Turing Test. However, apart from being something of an abstraction, this metric offers an essentially binary outcome, and in its purest form provides little in the way of quantitative feedback to improve the design.
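
To make the IR perspective a little more concrete, here is a minimal sketch (in Python) of how precision, recall and F-measure might be computed over a set of labelled question-answer exchanges. The log format and field names are assumptions made purely for illustration, not a reference to any particular toolkit.

```python
# A minimal sketch of IR-style metrics for a question-answering chatbot.
# The exchange format (answers the bot returned vs. answers judged relevant
# by an assessor) is hypothetical, chosen purely for illustration.

def precision_recall_f1(returned: set, relevant: set) -> tuple:
    """Compute precision, recall and F1 for a single exchange."""
    if not returned or not relevant:
        return 0.0, 0.0, 0.0
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned)
    recall = true_positives / len(relevant)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labelled exchanges.
exchanges = [
    {"returned": {"a1", "a2", "a3"}, "relevant": {"a1", "a4"}},
    {"returned": {"b1"}, "relevant": {"b1"}},
]

scores = [precision_recall_f1(e["returned"], e["relevant"]) for e in exchanges]
mean_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
print(f"Mean F1 across exchanges: {mean_f1:.2f}")
```

Even this simple setup makes the limitation noted above visible: the scores say nothing about whether the conversation as a whole felt satisfying to the user.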

My personal background and experience would suggest that the IR perspective offers the most principled approach for quantitative evaluation, and the UX perspective for qualitative evaluation. But it would seem that no single perspective can deliver a universal framework for chatbot evaluation, and that instead some kind of hybrid evaluation scheme may be required. If so, what kind of hybrid, and what principles should we apply in determining its composition?

In order to answer this question, we ought first to define the problem space a little more formally. Not all chatbots are the same, so we should consider the dimensions along which they can vary (a minimal sketch of these dimensions as a simple data structure follows the list). For example:

  1. Is its purpose solely utilitarian, or does it have some degree of discretionary, leisure use?
  2. Is it designed to be purely passive, reacting only to user input, or can it be proactive and take some degree of initiative (both in a functional and conversational sense)?
  3. Does its dialog consist of simple, stateless exchanges or can it apply complex models of linguistic and real-world context?
  4. Is the scope of its knowledge confined to a narrow domain or is it designed to cover a broad range of expertise?
  5. Is the scope of its functionality confined to simple, atomic tasks or is it designed to support higher-level, composite tasks which can be decomposed into a series of subtasks necessary to complete the overall goal?
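
As promised above, here is a minimal sketch (in Python) of how a chatbot might be profiled against these five dimensions as a simple data structure. The field names are my own shorthand, chosen purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class ChatbotProfile:
    """A chatbot's position on the five dimensions of variation.

    False corresponds to the first option in each question above,
    True to the second; the field names are illustrative shorthand.
    """
    leisure: bool = False          # 1. utilitarian vs. discretionary/leisure use
    proactive: bool = False        # 2. purely passive vs. takes some initiative
    stateful: bool = False         # 3. stateless exchanges vs. models of context
    broad_domain: bool = False     # 4. narrow domain vs. broad range of expertise
    composite_tasks: bool = False  # 5. atomic tasks vs. composite, multi-step tasks

# Example: a typical FAQ-style bot sits at the left-hand side of every dimension.
faq_bot = ChatbotProfile()
print(faq_bot)
```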

Evidently these dimensions alone won’t give us a definitive answer to how we should evaluate chatbots. But they do give us a foundation to start thinking about metrics and, more importantly, a set of evaluation frameworks that we can begin to explore and apply. For example:

  • For discretionary, leisure-oriented chatbots, traditional notions of utility and effectiveness from a classical IR perspective may not be appropriate. Instead, it may be better to consider search as fun and apply a different set of evaluation metrics and principles, such as:
    • Spending more time searching may be a positive rather than negative signal
    • Finding something of interest may be the trigger to start a new session rather than end the current one
  • For chatbots that offer some degree of proactive behaviour or initiative, traditional patterns of ‘query then response’ behaviour may no longer be appropriate. Instead, it may be better to apply evaluation approaches from the ‘zero query’ paradigm, which considers the utility of content pushed to the user based on the context around time, location, environment and user interests.
  • One of the interesting things about the hype surrounding chatbots (as with many hype cycles) is that the protagonists tend to assume that today is day zero of the revolution. So for folks who have been around a little longer, a sense of déjà vu is somewhat inevitable, and may actually be a good thing: it offers an immediate point of reference to communities such as ACL and Sigdial, which have decades of experience in evaluating conversational systems. For chatbots that claim to offer some degree of linguistic sophistication, this would be a good place to start. (It is interesting to note that one of the more recent papers concludes that we must use human evaluations together with metrics that only roughly approximate human judgements.)
  • While many chatbots may be built for leisure or even frivolous purposes, the ones that offer the greatest long-term value are likely to be those that deliver some kind of functional utility bound to a transactional product or service. In these instances, it seems natural to resort to classical human factors principles and usability metrics such as task completion, effectiveness, efficiency and so on (a minimal sketch of such session-level measures follows this list). However, one of the key drivers for conversational interactions is that such dialogs do not have to be framed as purely functional transactions; they may instead embody a more playful, game-like style of interaction. In this context, a range of games UX techniques and evaluation metrics may be more appropriate.
  • For chatbots that aim to support the completion of composite tasks in service of a higher-level goal, it may be necessary to measure task understanding in addition to (sub)task completion. This in turn can be quantified in terms of usefulness and relevance as outlined in evaluation frameworks such as the TREC Tasks track.
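
As an illustration of the session-level measures mentioned above, the sketch below (in Python) computes a task completion rate alongside mean session duration, leaving it to the evaluator to decide whether longer sessions count as a positive or a negative signal for the chatbot in question. The session records are hypothetical.

```python
# A minimal sketch of session-level measures: task completion rate and mean
# session duration. Whether longer sessions are a good sign depends on whether
# the chatbot is utilitarian or leisure-oriented (see the discussion above).
# The session records below are hypothetical.

sessions = [
    {"completed_task": True, "duration_seconds": 95},
    {"completed_task": False, "duration_seconds": 240},
    {"completed_task": True, "duration_seconds": 310},
]

completion_rate = sum(s["completed_task"] for s in sessions) / len(sessions)
mean_duration = sum(s["duration_seconds"] for s in sessions) / len(sessions)

print(f"Task completion rate: {completion_rate:.0%}")
print(f"Mean session duration: {mean_duration:.0f}s")
```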

So now that we have some initial dimensions of variation, we can use this framework to classify chatbots and hence better understand what kind of evaluation techniques may be appropriate. For example, many popular chatbots would be classified as instances of the left-hand side of all five dimensions: utilitarian, passive, stateless, narrow in domain, and atomic. In these cases, some combination of the classical IR and UX metrics (alongside commercial metrics where data permits) would probably suffice. Alternatively, for leisure-oriented chatbots that provide proactive, stateful dialogs in support of complex, composite tasks, it may be time to consider some of the richer evaluation techniques and approaches described above.
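
As a rough illustration of how this classification might drive the choice of evaluation techniques, here is a sketch (in Python) that maps a position on the five dimensions to the candidate approaches discussed in this post. The mapping is a starting point for discussion rather than a prescription, and the parameter names mirror the illustrative ChatbotProfile sketch above.

```python
def suggest_evaluation(leisure=False, proactive=False, stateful=False,
                       broad_domain=False, composite_tasks=False):
    """Map a position on the five dimensions to candidate evaluation approaches."""
    # Classical IR and UX metrics serve as a baseline in every case.
    approaches = [
        "IR metrics: precision, recall, F-measure",
        "UX metrics: task completion, user satisfaction",
    ]
    if leisure:
        approaches.append("'search as fun' signals, e.g. dwell time as a positive")
    if proactive:
        approaches.append("zero-query evaluation of proactively pushed content")
    if stateful:
        approaches.append("dialogue-level evaluation with human judgements")
    if composite_tasks:
        approaches.append("task understanding and usefulness (e.g. TREC Tasks track)")
    # broad_domain is left unmapped here: no specific evaluation approach is
    # suggested above for narrow vs. broad domain coverage.
    return approaches

# A typical utilitarian, passive, stateless, narrow, atomic chatbot:
print(suggest_evaluation())

# A leisure-oriented, proactive, stateful bot supporting composite tasks:
print(suggest_evaluation(leisure=True, proactive=True, stateful=True,
                         composite_tasks=True))
```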

No doubt as the field progresses other dimensions will become pertinent, and indeed other evaluation metrics and frameworks will be needed. So what we present above is very much a work in progress, and in that spirit I invite comments and feedback (and of course references to relevant scientific papers). And if you know of any conversational agents that credibly instantiate the right-hand side of the above dimensions, I would love to hear about them.

Acknowledgements

I’d like to acknowledge the contributions of the following individuals who shared a breakout group with me at the ELIAS workshop in London where the initial thinking behind this piece was conceived: Jochen Leidner, Evangelos Kanoulas, Ingo Frommholz, Haiming Liu, and Michiel Hildebrand.

CC Image courtesy of Ben Husman on Flickr.
