Search strategies considered harmful

Over the last few months I’ve been looking in detail at the process of search strategy formulation, i.e. the various ways in which professionals go about resolving complex information needs.

Some professionals (e.g. recruiters) use complex search queries to address sourcing needs, generating queries such as this:

(“business analyst” or “systems analyst” or “system analyst”
or “data analyst” or “requirements analyst” or “functional
analyst”) and crystal and report* and analy* and data near
analy* and not inventory and not retail and not (ecommerce
or “e-commerce” or b2b or b2c)

This particular query is designed to retrieve candidates who match a typical client brief. As you can see, it’s essentially a complex Boolean expression, and the challenge of creating and optimising such expressions is the subject of a number of social media forums.

Other professions adopt a different approach. Healthcare professionals, particularly those who are involved in the creation of systematic (literature) reviews, tend to adopt a line-by-line approach such as this (the published Medline strategy for Oral protein calorie supplementation for children with chronic disease):

randomized controlled trial.pt.
controlled clinical trial.pt.
randomized.ab.
placebo.ab.
clinical trials as topic.sh.
randomly.ab.
trial.ti.
1 or 2 or 3 or 4 or 5 or 6 or 7
(animals not (humans and animals)).sh.
8 not 9
exp Child/
ADOLESCENT/
exp infant/
child hospitalized/
adolescent hospitalized/
(child$ or infant$ or toddler$ or adolescen$ or teenage$).tw.
or/11-16
Child Nutrition Sciences/
exp Dietary Proteins/
Dietary Supplements/
Dietetics/
or/18-21
exp Infant, Newborn/
exp Overweight/
exp Eating Disorders/
Athletes/
exp Sports/
exp Pregnancy/
exp Viruses/
(newborn$ or obes$ or “eating disorder$” or pregnan$ or childbirth or virus$ or influenza).tw.
or/23-30
10 and 17 and 22
32 not 31

In this type of formalism, the search strategy is built up incrementally, as a set of discrete expressions which are referred to by line number and combined using various operators. This type of procedural approach has the advantage that strategies can be built up using techniques such as successive fractions, building blocks, and so on. It also allows the searcher to review the number of results returned at each step, and refine the expression accordingly.
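To make the mechanics concrete, here is a minimal sketch in Python of how a line-numbered strategy of this kind might be resolved. It is an illustration only, not how any real search platform works: the document IDs, the LEAF_RESULTS table and the evaluate function are all hypothetical, only the "or/1-2" and "3 not 4" styles of combination are handled, and mixed operators on one line are simply applied left to right.

```python
import re

# Hypothetical result sets: the document IDs each leaf search would match.
# A real system would query a bibliographic database instead.
LEAF_RESULTS = {
    "trial.ti.": {1, 2, 3, 4},
    "placebo.ab.": {3, 4, 5},
    "(animals not (humans and animals)).sh.": {4},
    "exp Child/": {1, 3, 4, 9},
}

# A shortened strategy in the same style as the one above.
STRATEGY = [
    "trial.ti.",                               # line 1
    "placebo.ab.",                             # line 2
    "or/1-2",                                  # line 3
    "(animals not (humans and animals)).sh.",  # line 4
    "3 not 4",                                 # line 5
    "exp Child/",                              # line 6
    "5 and 6",                                 # line 7
]


def evaluate(strategy, leaf_results):
    """Evaluate each line in order and return one result set per line."""
    results = []
    for text in strategy:
        range_match = re.fullmatch(r"(or|and)/(\d+)-(\d+)", text)
        if range_match:
            # Range form, e.g. "or/11-16": combine a run of earlier lines.
            op = range_match.group(1)
            start, end = int(range_match.group(2)), int(range_match.group(3))
            sets = [results[i - 1] for i in range(start, end + 1)]
            combined = set.union(*sets) if op == "or" else set.intersection(*sets)
        elif re.fullmatch(r"\d+( (and|or|not) \d+)+", text):
            # Infix form, e.g. "8 not 9": apply operators left to right.
            tokens = text.split()
            combined = set(results[int(tokens[0]) - 1])
            for op, ref in zip(tokens[1::2], tokens[2::2]):
                other = results[int(ref) - 1]
                if op == "and":
                    combined &= other
                elif op == "or":
                    combined |= other
                else:
                    combined -= other
        else:
            # Anything else is treated as a leaf search.
            combined = set(leaf_results[text])
        results.append(combined)
    return results


if __name__ == "__main__":
    for number, result in enumerate(evaluate(STRATEGY, LEAF_RESULTS), start=1):
        print(number, len(result))
```

Printing the size of each set mirrors the way a searcher reviews the number of results at each step. Note that every combination depends purely on the position of a line in the list, so inserting a new line near the top silently changes the meaning of every reference below it.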

Over the last few months I’ve got used to seeing some quite complex search strategies, often extending over a hundred lines or more. However, a few things about the formalism still strike me as being a bit odd.

Firstly, the use of logical statements connected via numbered lines does rather remind me of first-generation BASIC. I’m not saying that the language didn’t have its place, but several decades on we’d like to think we now have recourse to rather more structured approaches. But more to the point, what’s happening with all those line numbers – are they really the best way to organize a collection of logical expressions? Just when we most need a principled mechanism for structuring our approach, it seems we are forced to rely on something as arbitrary as a line number. As any undergraduate computer scientist will tell you, the liberal use of ‘goto’ statements, which such line references rather resemble, is indeed considered harmful.

Secondly, and continuing the programming language metaphor, I wonder just how much support there is for constructing expressions that are syntactically correct and semantically transparent. A well-designed (programming) language, for example, should support concepts such as:

Encapsulation: the concept whereby data and functions are packed into a single component. To a degree, this is true of the line-by-line approach above, but it is compromised by the lack of any facility for naming and invoking discrete elements of computation (other than by an arbitrary number).
Abstraction: the ability to generalize from a set of behaviours, e.g. the use of a template which can be populated for a given instance. In the example above we can see that lines 11 to 17 are probably intended to express the population element of the PICO process. So why not abstract this component out? That way, when we need to (re)use it, it could be instantiated on a case-by-case basis, e.g. male adults in strategy X, female infants in strategy Y, and so on. (OK, I know that some people equate abstraction with hiding implementation details, but I think the generalization sense is more pertinent here.)
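As a rough sketch of what naming and abstraction could look like, the Python below builds strategy fragments from named, reusable functions rather than numbered lines. Everything here is hypothetical: the helper names (any_of, all_of, population) are invented for illustration, and the strings they produce only imitate the notation used above, so they are not guaranteed to be accepted by any particular search interface.

```python
def any_of(*terms):
    """OR a group of terms together into one parenthesised expression."""
    return "(" + " or ".join(terms) + ")"


def all_of(*blocks):
    """AND a group of blocks together into one parenthesised expression."""
    return "(" + " and ".join(blocks) + ")"


def population(subject_headings, free_text_stems):
    """Template for a population block: subject headings plus truncated free text.

    Both arguments are supplied per strategy, so the same template can be
    instantiated for children in one strategy and adults in another.
    """
    free_text = "(" + " or ".join(stem + "$" for stem in free_text_stems) + ").tw."
    return any_of(*subject_headings, free_text)


# Two instances of the same template (the terms are illustrative only).
children = population(["exp Child/", "exp infant/"], ["child", "infant", "toddler"])
adults = population(["exp Adult/"], ["adult"])

# A named intervention block, combined with a population by name, not by number.
intervention = any_of("exp Dietary Proteins/", "Dietary Supplements/")
strategy = all_of(children, intervention)

if __name__ == "__main__":
    print(strategy)
```

Because each block has a name, swapping children for adults is a one-word change, and nothing else in the strategy needs renumbering.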

Likewise, I can imagine cases where we would want our search strategy to encompass other concepts such as inheritance, modularity, etc. So I am left thinking: why has the design of search strategies apparently changed so little when programming languages have changed so rapidly?

Of course, if you’re writing the control software for an Airbus A320 you might argue that you need tools and approaches that deal with a few orders of magnitude more complexity than your ‘average’ search strategy. But both endeavours are trying to find elegant and parsimonious ways to express complex logical constructs, both are concerned with syntactic correctness, and both need semantic transparency and pragmatic effectiveness. I wonder – is this formalism a bit like the QWERTY keyboard – a flawed and outdated design, but one that remains ubiquitous through little more than convention?