As a PhD student just reaching the end of my first year at the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey, I was delighted to receive a postgraduate bursary from the British Computer Society IRSG to support my attendance at Search Solutions 2015.
Search Solutions was an ideal opportunity to gain a broad appreciation of information retrieval techniques, being directly aligned with my PhD research on robust, scalable search of image collections. Although much of the work showcased at Search Solutions focused on text-based information retrieval, many contemporary visual search algorithms borrow heavily from techniques rooted in that field for their scalability, e.g. Bag of Words, TF-IDF and inverted indexing – and of course deep learning is now becoming important for describing documents in both the image and text domains.
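To make the borrowed machinery concrete, here is a minimal sketch of TF-IDF scoring over an inverted index, using a toy corpus invented purely for illustration (the document names, vocabulary and weighting variant are all my own assumptions, not anything presented at the event):

```python
import math
from collections import defaultdict

# Toy corpus: each "document" is a bag of words. In visual search the same
# idea applies, with quantised image features playing the role of words.
docs = {
    "d1": ["red", "flower", "garden"],
    "d2": ["red", "car", "road"],
    "d3": ["flower", "vase", "table"],
}

# Inverted index: term -> set of documents containing it, so a query only
# touches documents that share at least one term with it.
index = defaultdict(set)
for doc_id, words in docs.items():
    for word in words:
        index[word].add(doc_id)

def tf_idf(term, doc_id):
    """TF-IDF weight: raw term frequency times a smoothed inverse document frequency."""
    tf = docs[doc_id].count(term)
    idf = math.log(len(docs) / (1 + len(index[term]))) + 1
    return tf * idf

def search(query_terms):
    """Score only the candidate documents retrieved via the inverted index."""
    candidates = set().union(*(index[t] for t in query_terms if t in index))
    scores = {d: sum(tf_idf(t, d) for t in query_terms if t in index)
              for d in candidates}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

For the query `["red", "flower"]`, only `d1` contains both terms, so it ranks first; the inverted index spares us from scoring documents that share no terms with the query, which is the source of the scalability the text-retrieval field is known for.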
My work explores sketch-based image retrieval (SBIR); an emerging sub-field within visual search where a user provides a free-hand sketch as the query to search through potentially millions of database images. Although visual content is being generated at a staggering rate (Facebook 350M photos/day, Instagram 60M photos/day, YouTube 300 hours of video/day), technology for the management of visual media has not kept pace with its generation. Most multimedia search engines still rely predominantly on textual queries to search visual media, and whilst text efficiently conveys semantic concepts (e.g. find me a flower), it is neither intuitive nor concise for describing appearance (e.g. find me a flower that looks like this, or a video containing movement like this).
Sketching has been an efficient means of communication since the age of the caveman. Children can sketch without much effort before they learn a language, and humans seem to have no difficulty interpreting sketches despite their sparsity and ambiguity. On top of that, the rapid spread of touch-screen devices such as tablets and smartphones makes sketching a trivial task. In fact, several applications already utilise sketch-based search. Detexify helps LaTeX users find the code for any maths symbol by simply drawing it. Android Wear recently added an emoji recogniser, where a user can sketch a clumsy emoji and the system returns the closest matches. However, these applications are highly domain-specific and their datasets are quite small. We have yet to see a practical SBIR system that robustly addresses generic image search in a scalable manner.
Therefore the opportunity to hear about scalable search solutions in both academic and enterprise contexts was of great relevance to me, and Search Solutions has broadened my horizons in this respect. I was impressed by how fast search technologies have evolved. A few decades ago, a search engine could only handle Boolean expressions; nowadays, it can communicate with us in a far more “human” way. As demonstrated by Behshad Behzadi from Google, current search engines can deal with sophisticated questions like “How tall is the husband of Kim Kardashian?”. Not only do they have answers for most general questions about the world (by crawling a massive dataset of web pages), but they are also smart enough to answer questions within the user's context (e.g. “What is my frequent flyer number?”, by mining the user's own data, such as email stored in the cloud). Behzadi predicts that the future of the search engine is the ultimate mobile assistant. I myself imagine that the search engine could one day become a virtual friend from which users seek advice on their daily activities, e.g. “What should I wear for the party tonight?”. If a search engine could analyse life-long data sources (email, browsing history, life-logged content), it should be able to figure out the style, habits, hobbies, interests and personality of its user. Likewise, a search engine with vision and hearing (e.g. the camera, microphone, gyroscope and other sensors integrated into a mobile phone) should understand more about the context of the conversation (e.g. “Is the owner happy at the moment?”) and give advice accordingly.
Our approach to the SBIR problem so far has been from a Computer Vision (CV) perspective. We employ CV techniques to encode images into numerical feature vectors so that the distance between any two vectors in the feature space reflects the visual similarity between the two corresponding images. We attempt to address the partial deformations caused by human imperfection while sketching, and colour is also integrated into our framework as a second search modality. Our latest system achieves interactive speeds on a dataset of 12M images (and was presented at ICCV 2015). However, there are still many aspects to improve, including the need for a more precise “image2vec” encoding, more scalable indexing, and the ability for users to refine their search results.
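The core retrieval step behind this can be illustrated very simply: once every image (and the query sketch) is encoded as a feature vector, search reduces to ranking the database by distance. The filenames, three-dimensional vectors and Euclidean metric below are toy assumptions of mine; a real SBIR encoder produces much higher-dimensional descriptors and needs approximate indexing to reach interactive speeds on millions of images.

```python
import math

# Hypothetical database: each image has already been encoded into a short
# feature vector by some "image2vec" encoder (invented values for illustration).
database = {
    "sunset.jpg": [0.9, 0.1, 0.3],
    "forest.jpg": [0.1, 0.8, 0.2],
    "beach.jpg":  [0.8, 0.2, 0.4],
}

def euclidean(a, b):
    """Distance in feature space, used as a proxy for visual dissimilarity."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, k=2):
    """Rank database images by distance to the query sketch's feature vector."""
    ranked = sorted(database, key=lambda name: euclidean(query_vec, database[name]))
    return ranked[:k]
```

A query vector close to `sunset.jpg`'s descriptor, such as `[0.9, 0.1, 0.35]`, would rank that image first. The exhaustive scan shown here is exactly what scalable indexing replaces: comparing the query against all 12M vectors one by one would be far too slow.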
From the CV perspective, I wonder what the future of visual search will look like. Visual media online is growing rapidly, with Cisco predicting that by 2019 over 80% of global consumer Internet traffic will comprise visual content. I feel that search over this media will not be deemed competent until CV systems can interpret images the same way humans do. In particular, once we bridge the “semantic gap” between humans' powerful contextual language and low-level machine representations, a search engine with visual support could become a great human assistant.