Search: Emergent and Extrinsic Semantics

Semantics is a term often used in the search technology and information retrieval community these days. A distinction is drawn between semantic and traditional search, implying that somehow semantic search is a more advanced or sophisticated form.

My claim in this article is that there are actually two forms of semantic search: emergent and extrinsic. Further, I want to claim that they are related, and that one of them (emergent) is not new but has been in widespread use since the 1980s, when “natural language querying” (as embodied in Google, for example) started to supplant pure Boolean querying as the usual query form for search on unstructured data.

My dictionary defines semantics as the branch of the science of language related to meaning. In search technology and information retrieval it has come to be associated with two very distinct ideas and communities.

The first, which I will call extrinsic, is the idea of using ontologies to enhance querying, especially by enriching the original texts or other data with metadata added automatically or semi-automatically during the indexing and document ingestion cycle. This approach has been developed through the dogged work of those who have spent many years trying to bring some reality to Tim Berners-Lee’s (2001) vision of the semantic web. Their motivation has always been to relate the relatively unstructured, human-friendly world of language (whether text documents, natural language queries, or whatever) to the computer-friendly world of highly structured data, for example relational databases.

The second, which I will call emergent, is the use of the term semantics in Latent Semantic Analysis and a group of related techniques, like Principal Component Analysis, mainly adopted from statistics. I want to include within this group classical information retrieval techniques for term weighting (like tf*idf), which have been used by the research community since the early 1970s (Sparck Jones, 1972; Salton & Buckley, 1988). Latent Semantic Analysis, it is worth noting, has a rather different pedigree, in that it originated (Deerwester et al., 1990; Landauer, Foltz and Laham, 1998) as a mathematically sound technique to explain certain findings from psychology on human memory and language processing, whereas the others make no such claims about their relation to human language processing and cognition. Many people who work in information retrieval and search, whom I would class as working with emergent semantics, would be surprised at the attribution of this label to their work.
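To make the term-weighting idea concrete, here is a minimal sketch of one common tf*idf variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf`, the whitespace tokenizer, and the toy documents are assumptions made for illustration; the cited papers discuss several weighting variants and this is not meant as the exact formulation of any one of them.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy collection, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tf_idf(docs)
```

In this toy collection a word that occurs in every document (“the”) gets weight zero, while a word confined to one document (“mat”) gets the highest weight. Nobody told the system which words carry meaning; the distinction emerges from the collection statistics.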

Now it is very common to see emergent and extrinsic semantics as somehow contrasting and irreconcilable. I want to claim instead that they are really two sides of the same coin, and further that they are complementary and mutually supporting.

It is common for those in the extrinsic community (really the semantic web community) to be somewhat dismissive towards the emergent community, seeing the basis of their work as lacking (real) semantics. This misses the point, which is that there must be some notion of semantics in emergent systems, because even simple word matching is dealing with semantic notions like synonymy: crudely, the same words (space-delimited strings of characters in English text) in similar contexts often mean the same thing. The problem is that emergent semantics are obscure, hidden, and difficult to access.
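A small sketch may help show how a notion like synonymy can emerge from term statistics alone. It builds a context vector for each word (counts of the words seen near it) and compares vectors by cosine similarity. The function names, the one-word stop list, the window size, and the toy sentences are all assumptions made for this illustration; it is a simplified stand-in for the kind of statistics such systems collect, not a description of any particular search engine.

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the"}  # tiny illustrative stop list (an assumption)

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t not in STOPWORDS]
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy sentences, invented for this example.
sentences = [
    "the doctor treated the patient",
    "the physician treated the patient",
    "the doctor examined the child",
    "the physician examined the child",
    "the bank approved the loan",
]
vectors = context_vectors(sentences)
sim_related = cosine(vectors["doctor"], vectors["physician"])
sim_unrelated = cosine(vectors["doctor"], vectors["bank"])
```

Here “doctor” and “physician” never appear in the same sentence, yet their context vectors are identical, while “doctor” and “bank” share no context at all. The relation is real but implicit: it lives in the counts, not in any structure a person could read off directly, which is exactly the sense in which emergent semantics are obscure.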

In my view, the difficulty of making visible the knowledge hidden in term weighting schemes and indexing systems has led people to jump to the mistaken conclusion that these systems contain no semantics. My claim is that they do have semantics, but emergent semantics are generally obscure.

One of the things which has become apparent as the semantic web has developed is the sheer impossibility of maintaining large ontologies solely by manual effort. Perhaps, with a little more reflection, it would have been apparent that ontologies would be continuously under revision. Our understanding of the world, and the language we use to talk about it, is under continuous revision (and hopefully improvement): indeed science might be regarded as an activity which is intrinsically revising knowledge structures.

The solution to this problem in recent years has been to adopt machine learning of one sort or another to modify or extend ontologies (Wong, Liu and Bennamoun, 2012). But fundamentally many of the machine learning algorithms rely on exactly the same statistical term occurrence models that have been in long-term use in IR.

My point is that in practice there is no fundamental difference between semantic search and more traditional, purely statistical forms of IR. They began from different starting points: semantic search was initially focussed on improving the operation of computer systems by incorporating human knowledge relevant to the task in a comprehensible way, whereas the objective of traditional IR was to improve retrieval of documents by fully automated processing. This difference in starting points led to a divergence of tracks, but these tracks are converging, as is witnessed by the introduction of the Google Knowledge Graph.

Personally, I have no doubt that some additional extrinsic knowledge improves search. For example, the simple recognition that “George Bush” does not always refer to the same person can eliminate many irrelevant documents from some searches, improve relevance feedback, be exploited by results diversity measures, and so on. Traditional IR systems, reliant on emergent knowledge extracted during the indexing process, have great difficulty with this problem.
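As a rough illustration of the “George Bush” case, the sketch below uses a tiny hand-built knowledge base to decide which person a document is about. The `KNOWLEDGE_BASE` structure, the cue terms, the `disambiguate` function, and the sample documents are all hypothetical and chosen only for this example. A real system would draw on a much richer ontology and a better matching method than counting overlapping words.

```python
# Hypothetical hand-built knowledge base: entity name -> context cue terms.
KNOWLEDGE_BASE = {
    "George H. W. Bush": {"1989", "1991", "gulf", "quayle"},
    "George W. Bush": {"2001", "2003", "cheney", "katrina"},
}

def disambiguate(text, knowledge_base=KNOWLEDGE_BASE):
    """Return the entity whose cue terms overlap most with the text, or None."""
    # Crude tokenization (an assumption): strip commas and full stops, lowercase.
    tokens = set(text.lower().replace(",", " ").replace(".", " ").split())
    scores = {entity: len(cues & tokens) for entity, cues in knowledge_base.items()}
    best = max(scores, key=scores.get)  # ties go to the first entity listed
    return best if scores[best] > 0 else None

# Sample documents, invented for this example.
docs = [
    "George Bush and Dan Quayle took office in 1989.",
    "George Bush responded to Hurricane Katrina.",
    "George Bush gave a speech.",
]
labels = [disambiguate(d) for d in docs]
```

The first two documents are assigned to different people, and the third is left unresolved because it contains no cue. A search restricted to one of the two entities can then drop documents labelled with the other. The knowledge doing the work here was supplied from outside the collection, which is what makes it extrinsic.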

Further, ontologies and other forms of extrinsic knowledge have great potential for improving enterprise search: departmental boundaries, company-specific document types, and key points in a company’s evolution (changes of CEO, mergers, acquisitions, and so on) are very hard to capture through fully automated analysis of document collections. Providing environments which can effectively and simply combine extrinsic knowledge of this sort, other forms of content management, and the emergent knowledge used in traditional information retrieval will be a major challenge for the enterprise search industry in the next few years.

References

  1. Berners-Lee, T., Hendler, J. and Lassila, O. (2001). “The Semantic Web”. Scientific American.
  2. Sparck Jones, K. (1972). “A statistical interpretation of term specificity and its application in retrieval”. Journal of Documentation 28 (1): 11–21.
  3. Salton, G. and Buckley, C. (1988). “Term-weighting approaches in automatic text retrieval”. Information Processing and Management 24 (5): 513–523.
  4. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990). “Indexing by Latent Semantic Analysis”. Journal of the American Society for Information Science 41 (6): 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  5. Landauer, T. K., Foltz, P. W. and Laham, D. (1998). “Introduction to Latent Semantic Analysis”. Discourse Processes 25 (2–3): 259–284. doi:10.1080/01638539809545028
  6. Wong, W., Liu, W. and Bennamoun, M. (2012). “Ontology learning from text: A look back and into the future”. ACM Computing Surveys 44 (4), Article 20, 36 pages.