Book Review: Fundamentals of Predictive Text Mining 2nd ed (2015)

Book Review: Weiss S. M., Indurkhya N. and Zhang T. (2015). Fundamentals of Predictive Text Mining. Springer-Verlag, London. Second Edition.

The volume “Fundamentals of Predictive Text Mining”, 2nd ed., has nine chapters, a table of contents, a list of references, a Subject Index and an Author Index. The book also includes a Preface written by the three authors.

Summary

Abbreviations: ML = Machine Learning; NLP = Natural Language Processing; IR = Information Retrieval

1) Chapter 1, “Overview of Text Mining”, shows how models are constructed from unstructured documents and how the results of classification are projected onto new text. One of the most important concepts introduced in this chapter is “data representation”. The spreadsheet model is introduced and some of its drawbacks, namely sparseness and missing values, are explained. Different text mining tasks that rely on an ML-based predictive framework are described (“1.2 What types of problems can be solved”), namely document classification, information retrieval, document clustering and information extraction. The last section of the chapter explains why performance evaluation is important.

2) Chapter 2, “From Textual Information to Numerical Vectors”, describes how words are converted into a vector format, which is required when applying predictive methods. Words, or tokens, may be reduced to common roots by lemmatizers or stemmers. The words can then be added to a dictionary. In the vector representation, words are represented as attribute-value pairs, where the value of each attribute can be the frequency of occurrence of a specific word, weighted by a scheme such as tf-idf. The dictionary can be extended to multiword features such as phrases. Other kinds of linguistic processing can be applied, such as part-of-speech tagging (for morphological analysis), word sense disambiguation (to resolve ambiguities such as “apple” the fruit vs. “apple” the brand), parsing (for syntactic analysis), etc. In the last section of the chapter (2.12 Feature Generation), the authors point out how linguistic preprocessing can be useful to identify good features for text mining.
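
To make the vector representation concrete, here is a minimal sketch of my own (not taken from the book) that turns tokenized documents into sparse attribute-value pairs weighted by tf-idf. The function name `tfidf_vectors`, the toy documents and the particular weighting variant (raw term frequency times log(N/df)) are assumptions chosen for illustration; the book and most libraries offer several variants.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn tokenized documents into sparse tf-idf vectors (one dict per document)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        # Assumed weighting: raw term frequency * log(N / df).
        vectors.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return vectors


# Toy, already-tokenized documents (hypothetical data).
docs = [
    "apple pie recipe".split(),
    "apple releases new phone".split(),
    "new pie shop".split(),
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Words that occur in fewer documents (such as “recipe”) receive a higher weight than words shared across documents (such as “apple”), which is the intuition behind the weighting.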

3) In the third chapter, “Using Text for Prediction”, predictive text mining is described in terms of empirical analysis that focuses on word patterns. Fundamental ML methods are described, namely similarity-based methods, decision rules and trees, probabilistic methods and linear methods. Section 3.5 Evaluation Performance describes evaluation measures (such as precision and recall) and some pitfalls of these metrics. The chapter ends with a short introduction to graph models for social networks.
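
For readers unfamiliar with the evaluation measures mentioned here, the following is a small sketch of my own (not the book's code) of how precision and recall can be computed for one class from gold and predicted labels. The function name and the toy labels are hypothetical.

```python
def precision_recall(gold, predicted, positive):
    """Compute precision and recall for the class `positive`."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when nothing is predicted or nothing is relevant.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical labels for five e-mails.
gold = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "spam"]
precision, recall = precision_recall(gold, predicted, positive="spam")
print(precision, recall)
```

Precision is the share of predicted “spam” items that really are spam; recall is the share of real spam items that were found.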

4) In Chapter 4, “Information Retrieval and Text Mining”, IR is defined and described as a predictive text-mining task because the methods for retrieval can be considered variations of similarity-based nearest-neighbor methods. Different methods of measuring similarity are illustrated, including cosine similarity. Link analysis for ranking documents is then presented and discussed. The chapter ends with a list of additional evaluation cues to be taken into account when doing IR, e.g. the date of appearance of documents and users’ votes.
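
The view of retrieval as a nearest-neighbor problem can be illustrated with a short sketch of my own (not from the book): documents and the query are sparse word-weight vectors, and documents are ranked by cosine similarity to the query. The names `cosine_similarity` and `rank`, and the toy collection, are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, collection):
    """Return document names ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query, doc), name) for name, doc in collection.items()]
    return [name for score, name in sorted(scored, reverse=True)]


# Hypothetical collection with simple word counts as weights.
collection = {
    "d1": {"text": 1, "mining": 1},
    "d2": {"data": 1, "mining": 1, "tools": 1},
    "d3": {"cooking": 1},
}
query = {"text": 1, "mining": 1}
print(rank(query, collection))
```

In practice the weights would typically be tf-idf values rather than raw counts.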

5) Chapter 5, “Finding Structure in a Document Collection”, presents methods for clustering documents. Clustering is used when documents in a collection have no label indicating their content. Clustering helps sort documents into groups that implicitly share the same theme. A review of popular clustering algorithms is then presented, namely k-means, hierarchical clustering and the EM algorithm. The chapter includes a discussion on how to assign “meaning” to clusters that have been generated by algorithms. The chapter ends by emphasizing the value added by clustering techniques when performing exploratory analysis.
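
As an illustration of the simplest of these algorithms, here is a bare-bones k-means sketch of my own (not the book's pseudo-code). It assumes dense numeric vectors, squared Euclidean distance, a fixed number of iterations and initial centroids supplied by the caller; real document clustering would use high-dimensional, sparse vectors and a convergence check.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            # Squared Euclidean distance from this point to every centroid.
            distances = [
                sum((x - c) ** 2 for x, c in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D "document vectors" and initial centroids.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (9.0, 9.5)])
print(centroids)
print(clusters)
```

The algorithm only groups the vectors; deciding what each group “means” is the separate labelling problem discussed in the chapter.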

6) Chapter 6, “Looking for Information in Documents”, describes several models and learning methods that can be used for information extraction (IE). IE is defined as “a restricted form of full language understanding, where we know in advance what kind of semantic information we are looking for” (p. 119). Three tasks are examined in more detail, namely named-entity recognition (NER), co-reference resolution and relation extraction (RE). NER refers to the automatic identification of names of persons, organizations, locations, expressions of time, quantities, etc. in unstructured text, while RE refers to the detection and classification of mentions of semantic relationships. Co-reference occurs when two or more expressions in a text refer to the same person or thing; to derive the correct interpretation of a text, co-reference resolution must connect pronouns and other referring expressions to the right individuals. In this chapter the Maximum Entropy method is illustrated and discussed. The chapter ends with a list of applications based on IE in the fields of IR, commercial extraction systems, criminal justice and intelligence.

7) In Chapter 7, “Data Sources for Prediction: Databases, Hybrid Data and the Web”, the authors explore hybrid forms of text and structured numerical data, for example stock data and related newswire headlines. Prototypical examples are described, such as opinion mining, sentiment analysis and web-based XML data.

8) In Chapter 8, “Case Studies”, several scenarios are illustrated and discussed, namely: market intelligence from the web, document matching for digital libraries, help desk applications, assignment of topics to news articles, email filtering, search engines, named-entity extraction, mining of social media, and finally the creation of a customized newspaper. Each case study contains the following features: problem description, solution overview, methods and procedures, and system deployment. Applications from several fields are reviewed.

9) In the last chapter, Chapter 9, “Emerging Directions”, the authors present a number of topics that are attracting increasing interest in predictive text mining, e.g. summarization, question answering, active learning, learning with unlabeled data, and deep learning.

Evaluation

This volume is a gentle introduction to predictive text mining. The language used in the book is simple and accessible. The book summarizes the basic knowledge needed to mine unstructured textual data and make predictions on new text.

The ideal audience of the book consists of students and, more generally, beginners with some basic knowledge of information retrieval, probability theory and linear algebra. This “mathematical maturity” is a requirement that is clearly spelled out in the Preface (p. vi) and is needed to understand the formulas presented throughout the chapters (e.g. Chapters 3 and 4).

In this volume, text mining is presented as an ideal blend of NLP, ML and IR. In particular, the importance of NLP for predictive text mining is duly stressed. NLP tasks are considered important both for generating valuable features (Chapter 2) and for information extraction (Chapter 6). Chapter 6 could also be used in NLP courses. In Chapter 9, many NLP tasks (such as summarization and question answering, which are well established in the fields of NLP, computational linguistics and language technology) are referred to as emerging areas in text mining. One of the authors of the book, namely Nitin Indurkhya, is also the editor (together with Fred Damerau) of the Handbook of Natural Language Processing, 2010, 2nd ed., and this might explain the emphasis on NLP. Since I am a computational linguist, I do appreciate this emphasis, also because text mining is actually the ideal interdisciplinary meeting point of different fields, such as NLP, ML and IR.

The book has many good features. For instance:

1)  A number of pseudo-coded algorithms are provided (e.g. see Fig. 2.4 “Generating features from tokens” or Fig. 2.7 “Generating multiword features from tokens”).

2)   Important questions are addressed and discussed in dedicated sections, e.g. “3.2 How many documents are enough?” or “5.3 What do a cluster’s labels mean?”.

3)   At the end of each chapter the reader is provided with a short Summary section, which presents the main concepts introduced in the chapter, and a section called “Historical and Bibliographical Remarks”, which is very useful for getting an idea of the progress in the area.

4)   Each chapter is complemented with Questions and Exercises, which are valuable additions to the content and can be used in class as teaching material.

5)   Teaching aids are provided: “[s]lides, sample solutions to selected exercises and suggestions for using the book in courses are are [sic] available from the publisher’s companion site for this book” (p. vii).

6)   Optional software that implements many of the methods discussed in the book can also be downloaded from the data-miner website (p. vii).

There are, however, a couple of notions that I felt were missing from the book and that might be interpreted as desiderata for the next edition. The first one is the notion of “inductive bias”. What it is and how it affects the performance of different learning algorithms on the same data is important to know in ML practice. Following Mitchell’s definition, the inductive bias of a machine learning algorithm is the set of “assumptions that must be added to the observed data to transform the algorithm’s outputs into logical deductions”. Daumé puts it as a question: “in the absence of data that narrow down the relevant concept, what type of solutions are we more likely to prefer?”. Inductive bias is a difficult notion for students to understand and it would be useful to explain it in a text mining manual.

The second desideratum is comparative evaluation. The evaluation of performance is dealt with comprehensively throughout the chapters. For instance, Chapters 3, 4 and 5 consistently present evaluation for supervised ML-based methods, for IR and for clustering. However, in order to compare the performance of two or more classifiers, statistical significance tests, such as the t-test, are usually employed. This knowledge would also be useful for students to fully grasp ML practice.
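
To show the kind of material I have in mind, here is a minimal sketch (my own, not from the book) of a paired t-test statistic computed over the per-fold accuracies of two classifiers evaluated on the same cross-validation folds. The function name and the accuracy figures are hypothetical, and the test assumes that the per-fold differences are roughly normally distributed.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean of the paired differences (degrees of freedom = n - 1)."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    # Standard error of the mean difference, using the sample standard deviation.
    std_error = statistics.stdev(differences) / math.sqrt(n)
    return mean_diff / std_error


# Hypothetical accuracies of two classifiers on the same five folds.
classifier_a = [0.80, 0.82, 0.78, 0.85, 0.81]
classifier_b = [0.76, 0.80, 0.77, 0.80, 0.79]
t = paired_t_statistic(classifier_a, classifier_b)
print(round(t, 2))
```

The statistic is then compared with a critical value of the t distribution: with five folds there are 4 degrees of freedom, and the two-sided critical value at the 5% level is 2.776, so a larger absolute value of t would suggest that the difference is unlikely to be due to chance alone.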

All in all, this manual makes a good contribution not only to predictive text mining, but also to machine learning for language technology in general. It is a good read and a valuable manual in many respects.

Marina Santini

Computational linguist currently teaching Mathematics for language technologists, Machine Learning in language technology and Semantic analysis in language technology at Uppsala University (Sweden).