“Big Data” is one of the current buzzwords in the IT industry. Companies are building up huge stores of data running into terabytes and more, and data hierarchies are growing both larger and more complex. At the same time, search and categorization are expected to become faster. A single classifier can no longer process data on this scale in real time. Today’s vast data repositories, such as the web, also contain many broad domains of data that are quite distinct from each other, e.g. medicine, education, sports and politics. Each domain constitutes a subspace of the data within which the documents are similar to each other but quite distinct from the documents in another subspace. The data within these domains is frequently divided further into many subcategories, as shown in Fig. 1.
Subspace learning is a technique popular in non-text domains such as pattern recognition, where it is used to increase speed and accuracy. Subspace analysis lends itself naturally to the idea of hybrid classifiers: each subspace can be processed by the classifier best suited to the characteristics of that particular subspace. Classifier performance can also be boosted by using only a subset of the dimensions instead of the complete set of full-space feature dimensions. This work presents a novel way of partitioning the data into subspaces based on the underlying semantic content derived from the hierarchies. In this method, the information required for data partitioning is built into the vector representation, so the partitioning can be done dynamically. This new vector representation is called the Conditional Significance Vector. We also experiment with different hybrid classifier combinations and show that these combinations perform much better than single classifiers when dealing with large data.
1. Conditional Significance Vector
We start from the Significance Vector, an existing vector representation that incorporates category information. It represents the significance of the data and weighs each word according to its significance for the different topics. This representation was designed for a flat classification system, and the positioning of the categories as components of the word/document vector does not follow any specific structure. We modify the Significance Vector to represent a category hierarchy rather than a flat category structure. Consider the two-level hierarchy shown in Fig. 2, with 4 level 1 topics (main topics) and 20 level 2 topics (subtopics).
Our Conditional Significance Document Vector in this case consists of 24 components out of which the first 4 represent the 4 level 1 (main) topics and the remaining 20 represent the 20 level 2 (sub) topics. The four level 1 topics represent the four subspaces of the full data space. Within the 20 level 2 topics, the subtopics belonging to the same main topic are positioned consecutively in the vector space. This leads to a semantic division of the vector space into 4 groups, each group representing the subtopics of a specific main topic and therefore a subspace.
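The component layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_layout`, the topic names, and the even split of 5 subtopics per main topic are all assumptions made for the example (the text only states 4 main topics and 20 subtopics in total).

```python
def build_layout(hierarchy):
    """Map a two-level topic hierarchy onto vector positions.

    Returns (main_index, sub_slices, length): the component index of each
    main topic, the slice holding each main topic's subtopics, and the
    total vector length.
    """
    main_topics = list(hierarchy)
    main_index = {topic: i for i, topic in enumerate(main_topics)}
    sub_slices = {}
    offset = len(main_topics)  # subtopics start after the main-topic block
    for topic in main_topics:
        count = len(hierarchy[topic])
        # Subtopics of the same main topic occupy consecutive positions.
        sub_slices[topic] = slice(offset, offset + count)
        offset += count
    return main_index, sub_slices, offset


# Hypothetical hierarchy: 4 main topics with 5 subtopics each (assumed split).
hierarchy = {
    f"main{i}": [f"main{i}_sub{j}" for j in range(5)] for i in range(4)
}
main_index, sub_slices, length = build_layout(hierarchy)
```

With this hypothetical hierarchy the vector has 24 components, and the subtopics of the first main topic occupy positions 4 to 8. Storing one slice per main topic makes the later extraction of a subspace a single indexing operation.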
Since the document significance vector represents the significance of the document for the different categories, the category with the maximum numerical significance value is most likely to be the real category of a given document. Hence we propose the Maximum Significance Value as a means to detect the relevant subspace (level 1 topic) of a new test document. This value is extracted from the first four components of the document vector, which represent the four main topics. At the subtopic level, the significance values represent the significance of the document for the different subtopics within a given main topic. Hence this vector is called the Conditional Significance Vector.
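Subspace detection by Maximum Significance Value amounts to an argmax over the main-topic components. The sketch below assumes the 24-component layout of the running example; the function name `detect_subspace`, the tie-breaking rule, and the numbers in the sample vector are made up for illustration.

```python
def detect_subspace(doc_vector, num_main_topics=4):
    """Return the index of the main topic with the Maximum Significance Value."""
    if len(doc_vector) < num_main_topics:
        raise ValueError("vector is shorter than the main-topic block")
    main_part = doc_vector[:num_main_topics]
    # Ties resolve to the first maximal component (an arbitrary choice here).
    return max(range(num_main_topics), key=lambda i: main_part[i])


# Made-up 24-component vector: 4 main-topic values, then 20 subtopic values.
doc_vector = [0.10, 0.65, 0.05, 0.20] + [0.0] * 20
subspace = detect_subspace(doc_vector)
```

Here the second main topic (index 1) has the largest value among the first four components, so it is selected as the relevant subspace.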
2. Hybrid Parallel Classifier Architecture
Fig. 3 shows our proposed Hybrid Parallel Classifier where different classifiers operate on different portions of the input data space.
The combining classifier decides which part of the input data is applied to which base classifier. During the training phase, the training data set is divided into separate training subsets according to the level 1 topics or subspaces (4 subsets for our example in Fig. 2). In the test phase, the combining classifier chooses the relevant subspace of a test vector based on the Maximum Significance Value discussed earlier. The vector components corresponding to the subtopics of this subspace (main topic) are then extracted and given to the classifier trained on this subspace for level 2 classification of the test vector.
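The training and routing steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the architecture's actual code: the class names, the `sub_slices` argument (one slice of subtopic components per main topic), and the toy nearest-centroid base classifier are all hypothetical stand-ins, and any base classifier with `fit` and `predict` could be substituted.

```python
from collections import defaultdict


class NearestCentroid:
    """Toy base classifier (an assumption; any classifier could be used)."""

    def fit(self, vectors, labels):
        grouped = defaultdict(list)
        for vec, label in zip(vectors, labels):
            grouped[label].append(vec)
        # One centroid per label: the component-wise mean of its vectors.
        self.centroids = {
            label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()
        }
        return self

    def predict(self, vec):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(vec, centroid))
        return min(self.centroids, key=lambda lab: sq_dist(self.centroids[lab]))


class HybridParallelClassifier:
    def __init__(self, sub_slices, make_classifier=NearestCentroid):
        # sub_slices[i] is the slice of subtopic components for main topic i.
        self.sub_slices = sub_slices
        self.make_classifier = make_classifier
        self.classifiers = {}

    def detect_subspace(self, vec):
        # Maximum Significance Value over the leading main-topic components.
        return max(range(len(self.sub_slices)), key=lambda i: vec[i])

    def fit(self, vectors, main_labels, sub_labels):
        # Split the training data by level 1 topic, keeping only the
        # subtopic components that belong to that subspace.
        subsets = defaultdict(lambda: ([], []))
        for vec, main, sub in zip(vectors, main_labels, sub_labels):
            xs, ys = subsets[main]
            xs.append(vec[self.sub_slices[main]])
            ys.append(sub)
        for main, (xs, ys) in subsets.items():
            self.classifiers[main] = self.make_classifier().fit(xs, ys)
        return self

    def predict(self, vec):
        main = self.detect_subspace(vec)
        sub = self.classifiers[main].predict(vec[self.sub_slices[main]])
        return main, sub


# Tiny made-up example: 2 main topics with 2 subtopics each (6 components).
train_vectors = [
    [0.9, 0.1, 0.8, 0.2, 0.0, 0.0],
    [0.8, 0.2, 0.1, 0.9, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0, 0.7, 0.3],
    [0.2, 0.8, 0.0, 0.0, 0.2, 0.8],
]
clf = HybridParallelClassifier([slice(2, 4), slice(4, 6)])
clf.fit(train_vectors, [0, 0, 1, 1], ["a", "b", "c", "d"])
prediction = clf.predict([0.7, 0.3, 0.3, 0.7, 0.1, 0.0])
```

Because each base classifier only ever sees the components of its own subspace, it works with fewer dimensions and fewer training items than a single classifier trained on the full space. Passing a different `make_classifier` per subspace would give a mixed hybrid; using the same one everywhere corresponds to the single-type variant.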
3. Experiments & Results
We used three datasets for our experiments. The first two, Reuters Headlines and Reuters Full Text, were extracted from the well-known Reuters News Corpus and consisted of 10,000 items each, with 4 main topics and 50 subtopics. The third dataset was the Large Scale Hierarchical Text Corpus (LSHTC), which is web-based data drawn from the ODP web directory. It consisted of 4,463 items with 10 main topics and 158 subtopics. The hybrid classifier implementations for the Reuters datasets consisted of 4 classifiers, one for each of the 4 main topics, while those for the LSHTC dataset consisted of 10 classifiers, one for each of its 10 main topics.
All our experiments confirmed that the Maximum Significance Value was very effective in detecting the relevant subspace of a test document and that training separate classifiers on separate subsets of the original data enhanced overall classification accuracy. Hybrid parallel combinations of classifiers trained on different semantic subspaces offered a significant performance improvement over a single classifier learning on the full data space, and the use of Conditional Significance Vectors increased subtopic classification accuracy. The performances of the various hybrid classifiers were very close to each other, but all of them performed much better than the baseline single classifiers. The improvement in classifier accuracy was greater for the LSHTC Corpus than for the Reuters Corpus, which indicates that datasets with a larger number of categories benefit more from this architecture. This result is particularly encouraging for real-world applications, where the number of categories would be much larger than in the experimental datasets.
An unexpected result from our experiments was that Reuters Headlines performed better than Reuters Full Text for the purpose of news classification. This can be attributed to the fact that Reuters Full Text contains a lot of text that is included to make the article interesting to read. From a text processing point of view, this text acts as noise that interferes with the relevant words. Reuters Headlines, on the other hand, provide a concise summary of the news article, which improves classification accuracy.
We also implemented the hybrid classifier as a meta-classifier using the same type of classifier for all subspaces. The results of this meta-classifier were similar to those of the various hybrid combinations, suggesting that the method of classifier combination was more important than the classifiers themselves. Our architecture can thus be implemented with any available base classifier. The use of the meta-classifier also resulted in a considerable reduction in training and test times, along with an improvement in classification accuracy over the corresponding single classifier.
4. Conclusion
Our experiments confirm that the Maximum Significance Value is very effective in detecting the relevant subspace of a test document, and that training different classifiers on different subsets of the original data enhances overall classification accuracy and significantly reduces training and testing times. The speed-up achieved is very significant in all cases. In this work, we have applied our techniques to unstructured text. However, Hybrid Parallel Classifiers can be applied to many other domains such as Image Processing, Pattern Recognition and Computer Vision, where different classifiers can work on different parts of an image or pattern to improve overall recognition. Apart from image data, this technique can also be applied to image captions for image classification. Computational Biology can also benefit from this method to improve recognition within subdomains. Social Media, which has a lot of text content, can also be explored with this method for customised suggestions and marketing.