When I started the London Text Analytics meetup group some seven years ago, ‘text analytics’ was a term used by few, and understood by even fewer. Apart from a handful of enthusiasts and academics (who preferred the label of “natural language processing” anyway), the field was either overlooked or ignored by most people. Even the advent of “big data” – of which the vast majority was unstructured – did little to change perceptions.
But now, in these days of chatbot-fuelled AI mania, it seems everyone wants to be part of the action. The commercialisation and democratisation of hitherto academic subjects such as AI and machine learning have highlighted a need for practical skills that focus explicitly on the management of unstructured data. Career opportunities have inevitably followed, with job adverts now calling directly for skills in natural language processing and text mining. So the publication of Tom Reamy’s book “Deep Text: Using Text Analytics to Conquer Information Overload, Get Real Value from Social Media, and Add Bigger Text to Big Data” is indeed well timed.
At first blush, this book follows a conventional structure: divided into five parts, it introduces the reader to the technical basics of text analytics, and then presents practical advice on getting started in the field. This is followed by detailed descriptions of some of the fundamental principles and techniques of application development, which are in turn complemented by an analysis of these techniques applied to three popular application areas (search, info apps and social media). The book then closes with an account of the author’s perspectives on text analytics as a platform for offering enterprise-wide capability based on a principled semantic infrastructure.
The field of text analytics is not known for being blessed with an abundance of textbooks, so directly comparable predecessors for this volume are rare. The few that do exist generally either reflect a direct lineage from the academic discipline of natural language processing, or are based explicitly around a specific NLP toolkit (such as NLTK or GATE).
What makes Deep Text different is that it is neither of the above. Instead, it offers an unashamedly practitioner-oriented point of view, infused with the author’s own experiences as principal of a consulting organisation offering text analytics services. This gives the book a unique insight into the practicalities of delivering text analytics skills, services and projects within a commercial environment, with all the constraints on time, budgets and decision-making authority that that implies. The stories that the author presents (and yes, the majority are indeed presented as a historical narrative) offer a marked contrast to conventional NLP textbook content: instead of academic concepts and principles, we have empirical case studies and insights. Instead of ‘how to’ instructions based around specific tools and tasks, we have professional guidance and best practices based around enterprise-level engagements. This change of perspective, along with the author’s eclectic background and accessible writing style, combines to make Deep Text a highly readable and astute commentary on what it means to succeed as a text analytics professional in the big data era.
But no book is perfect, and at the risk of sounding churlish, here are a few of the shortcomings as I see them. First, the corollary to describing the field of text analytics through the lens of the practitioner consultant is that the world consists of a series of engagements around the selection of proprietary platforms. Granted, this is of course a key activity for many, and Tom does well in abstracting out key insights and best practices from these experiences. But proprietary platforms are just that: unique to that vendor, often idiosyncratic in nature, and the insights gained from this experience rarely have the same impact outside of that context.
Moreover, there were times reading the book when I felt that the work of the academic community had been under-represented: on several occasions I felt that Tom’s view could have been better articulated had he cited certain seminal papers or other sources of scholarly evidence. And his own background in taxonomy, while undoubtedly contributing to the overall accessibility of the work, sometimes felt slightly overplayed, perhaps at the expense of other disciplines and perspectives that all contribute to the text analytics community.
But these shortcomings are minor, and I consider the time I invested in reading all 400+ pages of this book to be well spent. If you are interested in a career in text analytics, or contemplating a switch to specialise in the analysis of unstructured data, then this book is an excellent place to start.