{"id":2993,"date":"2014-11-03T12:32:32","date_gmt":"2014-11-03T12:32:32","guid":{"rendered":"https:\/\/irsg.bcs.org\/informer\/?p=2993"},"modified":"2014-11-03T12:32:32","modified_gmt":"2014-11-03T12:32:32","slug":"mining-search-logs-for-usage-patterns-pt-2","status":"publish","type":"post","link":"https:\/\/archive-irsg.bcs.org\/informer\/?p=2993","title":{"rendered":"Mining search logs for usage patterns (pt 2)"},"content":{"rendered":"<p><a href=\"https:\/\/isquared.files.wordpress.com\/2014\/06\/slide-20.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/isquared.files.wordpress.com\/2014\/06\/slide-20.png?w=300\" alt=\"Expectation Maximization applied to a new sample of 100,000 sessions\" width=\"300\" height=\"155\" \/><\/a><\/p>\n<p>In a <a href=\"https:\/\/irsg.bcs.org\/informer\/?p=2721\">previous post<\/a> I discussed some initial investigations into the use of unsupervised learning techniques (i.e. clustering) to identify <a href=\"http:\/\/isquared.wordpress.com\/2014\/03\/18\/a-taxonomy-of-search-sessions\/\">usage patterns in web search logs<\/a>. As you may recall, we had some initial success in finding interesting patterns of user behaviour in the <a href=\"http:\/\/techcrunch.com\/2006\/08\/06\/aol-proudly-releases-massive-amounts-of-user-search-data\/\">AOL log<\/a>, but when we tried to extend this and replicate a <a href=\"http:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/meet.1450440232\/abstract\">previous study<\/a> of the <a href=\"http:\/\/faculty.ist.psu.edu\/jjansen\/academic\/transaction_logs.html\">Excite log<\/a>, things started to go somewhat awry. 
In this post, we investigate these issues, present the results of a revised procedure, and reflect on what they tell us about searcher behaviour.<\/p>\n<p><!--more-->So to recap, last time we got to the point where we\u2019d applied the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Expectation%E2%80%93maximization_algorithm\">expectation maximization<\/a> algorithm to a sample of 10,000 sessions from the AOL log, and were hoping to replicate the findings from <a href=\"http:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/meet.1450440232\/abstract\">Dietmar Wolfram\u2019s 2008 study<\/a>. But our results were very different: three clusters instead of four, and some very different patterns. Moreover, our results weren\u2019t even replicable within themselves: a further three samples of 10,000 sessions produced widely different outcomes (7, 10 and 10 clusters respectively). Even increasing the sample size to 100,000 seemed to make little difference (despite the suggestion in Wolfram\u2019s paper that subsets of 50k to 64k sessions should produce stable clusters).<\/p>\n<p>So why are we seeing such different results? One interpretation, of course, is that these results are indeed an authentic reflection of changes in user behaviour due to differences in context (e.g. a different search engine, time period, demographic, etc.). But before we explore that possibility, we should take steps to discount the effect of other confounding factors. For example, is our data truly representative of the population? 
Taking a further sample of sessions is a relatively straightforward test of this, and indeed, applying EM to a fresh sample of 10,000 sessions produced the following 4 clusters [note that I have changed the display order of the features to facilitate comparison with Wolfram\u2019s results, and simplified their names]:<\/p>\n<div>\n<dl id=\"attachment_1978\">\n<dt><a href=\"https:\/\/isquared.files.wordpress.com\/2014\/06\/slide-13.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/isquared.files.wordpress.com\/2014\/06\/slide-13.png?w=300\" alt=\"Expectation Maximization using Wolfram\u2019s 6 features on 10,000 sessions from AOL\" width=\"300\" height=\"151\" \/><\/a><\/dt>\n<dd>Expectation Maximization using Wolfram\u2019s 6 features on 10,000 sessions from AOL<\/dd>\n<\/dl>\n<\/div>\n<p>This outcome seems to offer some interesting insights, but again, it fails to repeat across the other samples; with 5, 6 and 7 clusters produced each time. Moreover, increasing the sample size to 100,000 also fails to produce a stable result, with 7, 13, 6 and 6 clusters produced on each iteration.<\/p>\n<p>But let\u2019s pause for a moment and examine the pattern in more detail. There is something very odd happening with term popularity now: we see a small cluster (just 3% of the sessions) where this feature seems to be something of an outlier, compressing the remaining traces into a narrow band. Indeed, the phenomenon becomes even more pronounced when we take a sample of 100,000 sessions:<\/p>\n<div>\n<dl id=\"attachment_1979\">\n<dt><a href=\"https:\/\/isquared.files.wordpress.com\/2014\/06\/slide-14.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/isquared.files.wordpress.com\/2014\/06\/slide-14.png?w=300\" alt=\"Expectation Maximization applied to 100,000 sessions\" width=\"300\" height=\"151\" \/><\/a><\/dt>\n<dd>Expectation Maximization applied to 100,000 sessions<\/dd>\n<\/dl>\n<\/div>\n<p>Perhaps this is an artefact of the clustering algorithm? 
Let\u2019s try <a href=\"http:\/\/weka.sourceforge.net\/doc.packages\/XMeans\/weka\/clusterers\/XMeans.html\">XMeans<\/a> instead (which is a variant of <a href=\"http:\/\/en.wikipedia.org\/wiki\/K-means_clustering\">kMeans<\/a> where the value for k is determined by the algorithm). In this iteration, XMeans finds a local optimum at k=10, so the number of clusters is different. But the overall pattern, with a small cluster (1% of sessions) representing outlier values for term popularity, is again clearly visible:<\/p>\n<div>\n<dl id=\"attachment_1980\">\n<dt><a href=\"https:\/\/isquared.files.wordpress.com\/2014\/06\/slide-15.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/isquared.files.wordpress.com\/2014\/06\/slide-15.png?w=300\" alt=\"XMeans (k&lt;=10) applied to 100,000 sessions\" width=\"300\" height=\"165\" \/><\/a><\/dt>\n<dd>XMeans (k&lt;=10) applied to 100,000 sessions<\/dd>\n<\/dl>\n<\/div>\n<p>So something else must be at play. It turns out that there is indeed an artefact in the data which is causing this. Long story short, there are a small number of sessions which contain just a single query, consisting solely of the character \u2018-\u2019. Precisely why they are there is a matter for speculation: they may have been the default query in some popular search application, or an artefact of some automated service or API, etc. We\u2019ll probably never know. But sessions like these, along with other robot-generated sessions, aren\u2019t generally helpful when trying to understand human behavioural patterns. Instead, they are best removed prior to analysis. Of course, there are no 100% reliable criteria for differentiating robot traffic from human, and what should be removed is a matter for judgement, often on a case-by-case basis. 
In this case, including these single character queries appears to be counter-productive.<\/p>\n<p>So now, with a new sample of 100,000 sessions excluding these outlier queries, we see EM produce the following output:<\/p>\n<div>\n<dl id=\"attachment_1981\">\n<dt><a href=\"https:\/\/isquared.files.wordpress.com\/2014\/06\/slide-20.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/isquared.files.wordpress.com\/2014\/06\/slide-20.png?w=300\" alt=\"Expectation Maximization applied to a new sample of 100,000 sessions\" width=\"300\" height=\"155\" \/><\/a><\/dt>\n<dd>Expectation Maximization applied to a new sample of 100,000 sessions<\/dd>\n<\/dl>\n<\/div>\n<p>This pattern is much more stable, with four iterations producing 7, 7, 7 and 9 clusters respectively. At this point we can start to speculate on what these patterns may be telling us. For example:<\/p>\n<ul>\n<li>Cluster 6 appears to be a group of users that engage in longer sessions, with many queries and many page views (clicks), but few repeating terms<\/li>\n<li>Cluster 4 appears to be a smaller group who seem to specialise in relatively long but popular queries (an odd combination!), also with few repeating terms<\/li>\n<li>Cluster 3 appears to be a relatively large group who make greater use of repeated terms, but are otherwise relatively unengaged (with shorter sessions and fewer page views)<\/li>\n<\/ul>\n<p>And so on. Evidently, the patterns above are somewhat hard to interpret due to the larger number of clusters and lines on the chart. So what would happen if we tried to determine the optimum number ourselves, rather than letting XMeans find one for us? One way of investigating this is to specify different values for k <em>a priori<\/em>, and see how the within-cluster sum of squared errors (which is calculated by <a href=\"http:\/\/www.cs.waikato.ac.nz\/ml\/weka\/\">Weka<\/a> as part of its output) varies on each iteration. 
For example, varying k from 2 to 10 gives us the following result:<\/p>\n<div>\n<dl id=\"attachment_1982\">\n<dt><a href=\"https:\/\/isquared.files.wordpress.com\/2014\/06\/slide-21.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/isquared.files.wordpress.com\/2014\/06\/slide-21.png?w=300\" alt=\"Sum of squared errors by k\" width=\"300\" height=\"171\" \/><\/a><\/dt>\n<dd>Sum of squared errors by k<\/dd>\n<\/dl>\n<\/div>\n<p>As we can see, there is an \u2018elbow\u2019 around k=4 and another around k=7. This implies that either of these two values may be a good choice for a local optimum. We\u2019ve already seen the output for k=7 (which is the optimum that XMeans found), so now let\u2019s try kMeans with k=4:<\/p>\n<div>\n<dl id=\"attachment_1983\">\n<dt><a href=\"https:\/\/isquared.files.wordpress.com\/2014\/06\/slide-22.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/isquared.files.wordpress.com\/2014\/06\/slide-22.png?w=300\" alt=\"kMeans (k=4) applied to 100,000 sessions\" width=\"300\" height=\"168\" \/><\/a><\/dt>\n<dd>kMeans (k=4) applied to 100,000 sessions<\/dd>\n<\/dl>\n<\/div>\n<p>This time the groups are somewhat easier to differentiate. For example, we might infer that:<\/p>\n<ul>\n<li>Cluster 3 represents a baseline or \u2018generic\u2019 cluster of users who hover around the average for all measures<\/li>\n<li>Cluster 4 represents a relatively large group of users who engage in longer sessions (with more queries and page views) but are diverse in their interests, with few repeated terms<\/li>\n<li>Cluster 1 represents a smaller group who are the converse to cluster 4, engaging in shorter sessions but with more repeated terms<\/li>\n<li>Cluster 2 represents a tiny group (2%) of users who are similar to cluster 1 but focus on highly popular queries<\/li>\n<\/ul>\n<p>Evidently, there are other ways we could analyse this data, and there are other ways we could interpret the output. 
In fact, I hope to write more about search log analysis in the coming weeks, taking advantage of a new source of data, which should further validate the methodology and allow us to explore some very different behaviour patterns. But for now, let\u2019s draw some of the threads together and review what we\u2019ve learnt.<\/p>\n<h2>Conclusions<\/h2>\n<ul>\n<li><strong>Replicate to validate<\/strong>: As researchers, our instincts are to explore the unknown, to solve the unsolvable, and to favour novelty over repetition. But sometimes it befits us to focus on replication: by applying new techniques to old data, we validate our methodology and build a more reliable baseline for our own experimental work.<\/li>\n<li><strong>Features describe, but behaviours explain<\/strong>: It\u2019s tempting to select features based on whatever a particular data source offers, and include as many as possible in the learning process. But not all are equally useful, and some can indeed \u2018drown out\u2019 the influence of more important signals. So rather than starting from what the data can offer, identify the information seeking behaviours you\u2019d like to explore, and try to find the features that most closely align with them.<\/li>\n<li><strong>There is no \u2018right answer&#8217;<\/strong>: As in many investigations of naturalistic phenomena, there is a tendency to look for patterns that make sense or in some way align with our expectations. But those expectations themselves are a subjective, social construct. The fact that we can produce multiple interpretations of the same data underlines the need for a common perspective when comparing patterns in search logs, and to apply recognised models of information seeking behaviour when interpreting the outputs.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In a previous post I discussed some initial investigations into the use of unsupervised learning techniques (i.e. 
clustering) to identify usage patterns in web search logs. As you may recall, we had some initial success in finding interesting patterns of user behaviour in the AOL log, but when we tried to extend this and replicate&hellip; <a class=\"more-link\" href=\"https:\/\/archive-irsg.bcs.org\/informer\/?p=2993\">Continue reading <span class=\"screen-reader-text\">Mining search logs for usage patterns (pt 2)<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[183,201],"tags":[258,310,332,356,357],"class_list":["post-2993","post","type-post","status-publish","format-standard","hentry","category-autumn-2014","category-feature-article","tag-clustering","tag-log-analysis","tag-site-search","tag-web-search","tag-weka","entry"],"_links":{"self":[{"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/wp\/v2\/posts\/2993","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2993"}],"version-history":[{"count":0,"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=\/wp\/v2\/posts\/2993\/revisions"}],"wp:attachment":[{"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2993"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/archive-irsg.bcs.org\/informer\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2993"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/archive-irsg.bcs.org\/info
rmer\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2993"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}