According to the IDC whitepaper “The High Cost of Not Finding Information”, knowledge workers spend 2.5 hours per day searching for information. Whether they eventually find what they are looking for or give up and make a sub-optimal decision, both outcomes carry a high cost. The recruitment industry, for example, relies on Boolean search as the foundation of the candidate sourcing process, and yet finding candidates with appropriate skills and experience remains an ongoing challenge. Similarly, patent agents rely on accurate prior art search as the foundation of their due diligence process, and yet infringement suits are filed at a rate of more than 10 a day due to the later discovery of prior art that their original search tools missed.
What these professions have in common is a need to develop search strategies that are accurate, repeatable and transparent. The traditional solution to this problem is to use line-by-line query builders which require the user to enter Boolean strings that may then be combined to form a multi-line search strategy:
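To make this concrete, here is a minimal sketch in Python of how such a strategy is typically evaluated; the toy documents and the numbered-line combination syntax are invented for illustration rather than taken from any particular product. Each line is either a term search or a Boolean combination of earlier lines.

```python
# A minimal, hypothetical sketch of a line-by-line search strategy.
# The documents and the line-reference syntax are invented for illustration.

documents = {
    1: "senior java developer with spring experience",
    2: "registered nurse, intensive care unit",
    3: "java and kotlin engineer, banking domain",
}

def match(term: str) -> set:
    """Return the ids of documents containing the (lower-cased) term."""
    return {doc_id for doc_id, text in documents.items() if term.lower() in text.lower()}

# Each line is either a term search or a Boolean combination of earlier lines.
strategy = [
    ("TERM", "java"),      # line 1
    ("TERM", "kotlin"),    # line 2
    ("OR",   [1, 2]),      # line 3 = 1 OR 2
    ("TERM", "banking"),   # line 4
    ("AND",  [3, 4]),      # line 5 = 3 AND 4
]

results = []
for op, arg in strategy:
    if op == "TERM":
        hits = match(arg)
    elif op == "AND":
        hits = set.intersection(*(results[i - 1] for i in arg))
    else:  # "OR"
        hits = set.union(*(results[i - 1] for i in arg))
    results.append(hits)
    print(f"line {len(results)}: {op} {arg} -> {len(hits)} hits")
```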
However, such query builders typically offer limited support for error checking or query optimization, and their output is often compromised by errors and inefficiencies. In this post, we review three early but highly original and influential alternatives, and discuss their contribution to contemporary issues and design challenges.
Alternative approaches
The application of data visualization to search query formulation can offer significant benefits, such as fewer zero-hit queries, improved query comprehension, and better support for exploration of an unfamiliar database. An early example of such an approach is that of Anick et al. (1989), who developed a system that could parse natural language queries and represent them using a “Query Reformulation Workspace”. Although early work, this system introduced a number of key design ideas:
- The query was represented as a set of ‘tiles’ on a visual canvas, which could be (re)arranged by direct manipulation
- Query elements could be made ‘active’ or ‘inactive’
- The layout had a left-to-right reading, with tiles stacked vertically in the same column being ORed and tiles in successive columns being ANDed
For example, the natural language query ‘Copying backup savesets from tape under ~5.0’ would be represented as follows:
and the Boolean semantic interpretation (shown in the lower half) would be:
(“copy” AND “BACKUP saveset” AND “tape” AND (“~5.0” OR “version 5.0”)).
The set of results retrieved was defined as all those documents that contained some combination of terms from any possible left-to-right path through the chart. Crucially, the user was at liberty to re-arrange those tiles to reformulate the expression, and to activate or deactivate alternative elements to optimise the query. In addition, the system offered support for integration with thesauri and displayed the number of hits in the lower-left corner of each tile. These are remarkably prescient ideas, and themes to which we return in our own work.
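To illustrate those layout semantics, here is a minimal sketch in Python; the Tile and column representation, the toy documents and the hit counts are our own invention rather than a description of the original system’s internals. Tiles stacked in the same column are ORed, the columns are ANDed from left to right, and each tile can report its own hit count.

```python
from dataclasses import dataclass

@dataclass
class Tile:
    term: str
    active: bool = True   # tiles can be made 'active' or 'inactive'

# The workspace is a left-to-right sequence of columns; tiles stacked in the
# same column are alternatives (OR), and the columns are combined with AND.
documents = [
    "how to copy a backup saveset from tape under version 5.0",
    "restoring a backup saveset under version 5.0",
    "copying files from tape under ~5.0",
]

def hits(term: str) -> set:
    """Per-tile hit set, akin to the count shown in the corner of each tile."""
    return {i for i, text in enumerate(documents) if term.lower() in text.lower()}

def evaluate(workspace) -> set:
    """Documents matching some left-to-right path through the active tiles."""
    result = set(range(len(documents)))           # start from the whole collection
    for column in workspace:
        active = [tile for tile in column if tile.active]
        if not active:                            # a fully deactivated column is skipped
            continue
        column_hits = set().union(*(hits(tile.term) for tile in active))  # OR within a column
        result &= column_hits                     # AND across columns
    return result

workspace = [
    [Tile("copy")],
    [Tile("backup saveset")],
    [Tile("tape")],
    [Tile("~5.0"), Tile("version 5.0")],          # vertically stacked alternatives
]
print(evaluate(workspace))   # -> {0}: the only document on a valid left-to-right path
```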
In subsequent work, Fishkin and Stone (1995) investigated the application of direct manipulation techniques to the problem of database query formulation, using a system of ‘lenses’ to refine and filter the data. Lenses could be combined by stacking them and applying a suitable operator, e.g. AND or OR. For example, a user could search a database of US census data to find cities that have high salaries (the upper filter) AND low taxes (the lower filter):
Moreover, these lenses could be combined to create compound lenses, and hence support the encapsulation of queries of arbitrary complexity. This is a further theme to which we return in our own work.
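As a rough sketch of that composition model (with hypothetical Lens objects over a toy table of cities, rather than the original census data), each lens can be treated as a named predicate, stacking two lenses applies an operator, and the result is itself a lens that can be stacked again:

```python
from dataclasses import dataclass
from typing import Callable

City = dict  # e.g. {"name": "A", "salary": 70000, "tax": 0.03}

@dataclass
class Lens:
    """A movable filter: a named predicate over rows of the data."""
    name: str
    predicate: Callable[[City], bool]

    def __call__(self, city: City) -> bool:
        return self.predicate(city)

def stack(a: Lens, b: Lens, op: str = "AND") -> Lens:
    """Stacking two lenses yields a compound lens, itself stackable."""
    if op == "AND":
        return Lens(f"({a.name} AND {b.name})", lambda c: a(c) and b(c))
    return Lens(f"({a.name} OR {b.name})", lambda c: a(c) or b(c))

cities = [
    {"name": "A", "salary": 70000, "tax": 0.03},
    {"name": "B", "salary": 48000, "tax": 0.02},
    {"name": "C", "salary": 72000, "tax": 0.09},
]

high_salary = Lens("high salary", lambda c: c["salary"] > 60000)   # upper filter
low_tax     = Lens("low tax",     lambda c: c["tax"] < 0.05)       # lower filter

compound = stack(high_salary, low_tax, "AND")        # an encapsulated compound lens
print(compound.name, [c["name"] for c in cities if compound(c)])   # -> ['A']
```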
A further influential work is that of Jones (1998), who reflected upon the difficulties that users experience in dealing with Boolean logic, noting in particular the disconnect between query specification and result browsing, and the inefficiency caused by a lack of feedback regarding the effectiveness of individual terms. He proposed an alternative in which concepts are expressed using a Venn diagram notation combined with integrated query result previews. Queries could be formulated by arranging and overlapping objects within the workspace to express intersections and disjunctions, and subsets could be selected to facilitate execution of subcomponents of an overall query:
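As a rough illustration of those semantics (with invented concepts and hit counts, not Jones’s own data), each concept can be treated as a set of matching documents; overlapping regions correspond to intersections, separate active objects to disjunctions, and per-region counts serve as the integrated result previews:

```python
# Toy postings: each concept maps to the set of matching document ids.
# The concepts and counts are invented for illustration.
postings = {
    "search":  {1, 2, 3, 5, 8},
    "boolean": {2, 3, 5, 9},
    "venn":    {3, 5, 7},
}

# Overlapping two objects in the workspace corresponds to intersection;
# keeping them separate (both active) corresponds to union.
overlap = postings["search"] & postings["boolean"]
union   = postings["search"] | postings["venn"]

# Integrated result previews: show the hit count of each region before running
# the full query, so ineffective terms are visible immediately.
for name, ids in postings.items():
    print(f"{name}: {len(ids)} hits")
print(f"search AND boolean: {len(overlap)} hits")   # a selectable subcomponent
print(f"search OR venn: {len(union)} hits")
```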
Crucially, Jones noted that although the representation offered a degree of universality of expression, the semantic interpretation would necessarily be tied to the particular collection being searched, and thus independent adapters would be required for each such database. This is also a theme to which we return in our own work.
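A minimal sketch of that adapter idea, assuming a hypothetical backend-neutral query tree and two invented target syntaxes (neither is the syntax of any real database): each adapter translates the same representation into the form a particular collection expects.

```python
from dataclasses import dataclass

# A small, backend-neutral query representation (hypothetical).
@dataclass
class Term:
    text: str

@dataclass
class Op:
    operator: str      # "AND" or "OR"
    children: list

def to_infix(node) -> str:
    """Adapter for a back-end that expects infix Boolean strings."""
    if isinstance(node, Term):
        return f'"{node.text}"'
    joined = f" {node.operator} ".join(to_infix(c) for c in node.children)
    return f"({joined})"

def to_json_filter(node) -> dict:
    """Adapter for a back-end that expects a structured (JSON-like) filter."""
    if isinstance(node, Term):
        return {"match": node.text}
    key = "must" if node.operator == "AND" else "should"
    return {key: [to_json_filter(c) for c in node.children]}

query = Op("AND", [Term("prior art"), Op("OR", [Term("patent"), Term("IPC")])])
print(to_infix(query))        # ("prior art" AND ("patent" OR "IPC"))
print(to_json_filter(query))  # {'must': [{'match': 'prior art'}, {'should': [...]}]}
```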
In summary
In this short piece we have reviewed some of the challenges involved in articulating complex search strategies and Boolean expressions, and examined three early but highly original alternative approaches. Given the period in which these systems were developed (the first of which pre-dates the web by several years), this is extraordinary work, offering design insights and principles of enduring value. In our next post, we’ll review some of the more recent approaches, and reflect on how their design ideas and insights may be used to address contemporary search challenges.