The concept of Big Data has been around for some time; John Mashey at Silicon Graphics is usually credited with coining the term in a presentation he gave in 1998. Without doubt big data is very difficult to manage, and the demand for people with data science skills never seems to slow down. However, much less attention has been paid to Big Information and the equivalent need for information scientists, a term coined in 1958 by Jason Farradane.
In early October a group of investigative journalists released the Pandora Papers. The revelations came from a very large tranche of documents: 2.94 terabytes of data in all, comprising 11.9 million records and documents dating back to the 1970s. I would recommend a very good article in Wired UK which provides a substantial amount of detail on how the information in these documents was surfaced and analysed.
Although this may seem a large collection of documents, many organisations can easily exceed these levels. This is especially the case in the pharmaceutical sector, where companies have not only very large collections of text documents but also massive collections of clinical trial data. One of my clients in the pharma sector has over 500 million documents being managed with an enterprise search application. I recently came across a very interesting paper from a team at Novartis that describes the use of a BERT derivative (CT-BERT) for named entity recognition (NER) applications with clinical trial documents.
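As a purely illustrative aside, the shape of such a transformer-based NER pipeline looks something like the sketch below. It uses a publicly available general-purpose NER model as a stand-in, since the CT-BERT weights and clinical-trial entity types belong to the Novartis work and are not reproduced here; the example sentence is invented.

```python
# Minimal sketch of BERT-based NER, in the spirit of the CT-BERT work.
# "dslim/bert-base-NER" is a general-purpose public model used as a
# stand-in; a clinical deployment would use domain-tuned weights.
from transformers import pipeline

ner = pipeline("ner",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")  # merge word pieces into entity spans

text = ("Patients received 10 mg of the study drug daily at the "
        "Boston site during the Phase II trial.")

for entity in ner(text):
    # Each result carries the entity label, the matched span and a score.
    print(f"{entity['entity_group']:8} {entity['word']!r} "
          f"(score {entity['score']:.2f})")
```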
For the last 20 years one of the constant challenges of search procurement projects has been working with clients to balance software costs, professional services costs and the investment in a search team. In an on-prem installation computational costs could pretty much be ignored, other than perhaps some additional servers for the index. Now the three dimensions have become four: the fourth is the computational cost incurred with the cloud service provider.
A recent post from the guru of search gurus, Daniel Tunkelang, sets out some of these costs and makes very informed suggestions as to where investment in computational functionality is wise and where the impact is low. The post is primarily directed at open-source solutions but it is completely relevant to commercial enterprise search situations. The difference is that with a commercial application the customer has little control over the allocation of computing resources. All the normal rules on capacity planning go out of the window with search, especially where there is a migration from on-prem to cloud, because the customer usually has no prior experience of how search applications, which answer queries with look-ups against an inverted file, differ in resource profile from enterprise applications built on databases; the sketch below illustrates the look-up pattern.
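To make the distinction concrete, here is a toy sketch of the inverted-file look-up pattern; the documents and terms are of course invented. A query is answered by fetching and intersecting posting lists rather than scanning tables, which is why search load scales so differently.

```python
# Toy inverted index: illustrates the look-up pattern that makes search
# capacity planning different from database capacity planning.
from collections import defaultdict

docs = {
    1: "clinical trial data for the study drug",
    2: "enterprise search application for trial documents",
    3: "search costs in cloud migration projects",
}

# Build: map every term to the set of documents containing it (its posting list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    # Each query term is a direct look-up; result sets are intersected.
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("trial documents"))  # {2}
print(search("search"))           # {2, 3}
```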
The vendor may have been transparent about the elements involved but will not be able to fine-tune the cloud service costs without a seriously deep dive into both the current architecture and the architecture-to-be. In my experience two of the elements that often catch organisations out are auto-suggest (because every added character is in effect a new query to run against the index, as the sketch below illustrates) and the occasions when it is necessary to undertake a partial re-index. Another factor is federated search: it may seem such a good idea to be able to search all applications, but will the budget support this use case and deliver value?
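To see why auto-suggest in particular catches people out, consider the back-of-the-envelope simulation below. It is a sketch, not a measurement of any real system: the pause and throttle values are hypothetical, but the underlying point, that an n-character term can fire up to n queries against the index unless keystrokes are throttled, holds generally.

```python
# Sketch of why auto-suggest multiplies query load: every keystroke is,
# in effect, a fresh prefix query against the index.
import time

query_count = 0

def prefix_query(prefix):
    """Stand-in for a round trip to the search index."""
    global query_count
    query_count += 1

def type_term(term, throttle_ms=0, pause_ms=80):
    """Simulate typing `term` one character at a time.

    With no throttle every character fires a query; a simple throttle only
    lets a query through when enough time has passed since the last one.
    """
    global query_count
    query_count = 0
    last_sent = float("-inf")
    for i in range(1, len(term) + 1):
        time.sleep(pause_ms / 1000)   # hypothetical typing speed
        now = time.monotonic()
        if (now - last_sent) * 1000 >= throttle_ms:
            prefix_query(term[:i])
            last_sent = now
    return query_count

print(type_term("pharmacovigilance"))                   # 17 queries, one per keystroke
print(type_term("pharmacovigilance", throttle_ms=200))  # roughly a third as many
```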
Then there is the system up-time problem. It is unusual for a search vendor to have their own cloud platform, so they will be buying capacity on AWS, Google, Azure or any number of other services. This capacity usually comes with an up-time commitment of (say) 99.7%. That may seem fine until you work through the implications of the remaining 0.3% and decide that 99.95% up-time is essential to maintain global service levels. The cloud provider can certainly meet that requirement, but the cost of the additional 0.25% might make your eyes water. Federated search again needs to be factored in, because the required search up-time may be greater than that of one of the federated applications (HR is a good example) where computational performance has never been a major requirement.
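The arithmetic is worth setting out explicitly. The short sketch below converts the percentages above into hours of downtime per year, and shows why federated search compounds the problem: if results depend on several applications and failures are independent (an assumption, and the component figures here are invented purely for illustration), the combined availability is the product of the individual figures.

```python
# Worked arithmetic for the up-time discussion above.
HOURS_PER_YEAR = 24 * 365

def downtime_hours(availability):
    """Hours of permitted downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR

print(f"99.7%  allows {downtime_hours(0.997):.1f} hours down per year")   # ~26.3
print(f"99.95% allows {downtime_hours(0.9995):.1f} hours down per year")  # ~4.4

# Federated search: if every result page depends on several applications,
# and failures are independent, availability multiplies (and so shrinks).
# The three component figures are hypothetical.
components = [0.9995, 0.997, 0.995]   # e.g. search core, HR system, CRM
combined = 1.0
for availability in components:
    combined *= availability
print(f"combined availability: {combined:.4%}")                  # ~99.15%
print(f"combined downtime:     {downtime_hours(combined):.1f} hours per year")  # ~74
```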
These cost considerations could also have an impact on the adoption of IR research developments: research papers very rarely give any consideration to the likely computational costs of implementing what they describe.