|Keywords | Computer vision, retrieval|
With the rapid growth of multimedia footage, video and image indexing and retrieval methods must improve. Today’s images contain many objects with varied interactions, so a simple search can become a long task requiring several rounds of different text queries just to reach the target image. Despite the robustness of current CLIP (Contrastive Language–Image Pre-training) models, they still struggle to capture relations between objects in an image.
The temporal axis in videos adds further ambiguity to this framework: actions may require both textual and visual inputs to specify and personalize the search. State-of-the-art content-based and text-based retrieval methods mainly focus on a single cue (textual or visual) as the query, and often index instances in a corpus using a single modality (visual or audio).
In this research, the team explores the challenge of multimodal queries for search in image and video corpora. The query may include textual, visual and/or audio cues, while the search engine uses all or part of the modalities available in the raw data repository. By exploiting the inter-relations between different modalities, the researchers aim to improve current single-modality methods and to utilize all available modalities in the data, either individually or combined, providing an improved multimodal option for search engines.
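One simple way to combine several query modalities, as described above, is late fusion: score the corpus separately with each available query cue and merge the per-modality similarity scores with a weighted sum. The sketch below is illustrative only and is not the team's actual method; it assumes embeddings have already been precomputed (e.g. by encoders such as CLIP), and all names, dimensions, and weights are hypothetical.

```python
import numpy as np

def cosine_sim(q, M):
    """Cosine similarity between a query vector q and each row of matrix M."""
    q = q / np.linalg.norm(q)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ q

def fused_search(query_embs, index_embs, weights, top_k=3):
    """Late fusion over modalities: weighted sum of per-modality scores.

    query_embs / index_embs: dicts mapping a modality name ("visual",
    "audio", ...) to embeddings. A modality missing from the query
    simply contributes no score, so partial multimodal queries work.
    """
    n_items = next(iter(index_embs.values())).shape[0]
    scores = np.zeros(n_items)
    for mod, q in query_embs.items():
        if mod in index_embs:
            scores += weights.get(mod, 1.0) * cosine_sim(q, index_embs[mod])
    order = np.argsort(scores)[::-1]  # best-scoring items first
    return order[:top_k], scores

# Toy 4-dim embeddings for a 5-item corpus (illustrative only).
rng = np.random.default_rng(0)
index = {"visual": rng.normal(size=(5, 4)),
         "audio": rng.normal(size=(5, 4))}
# A multimodal query close to item 2 in both modalities.
query = {"visual": index["visual"][2] + 0.1 * rng.normal(size=4),
         "audio": index["audio"][2] + 0.1 * rng.normal(size=4)}
ranking, scores = fused_search(query, index,
                               weights={"visual": 0.6, "audio": 0.4})
print(ranking)  # item 2 should rank first
```

Late fusion is only one design choice; alternatives include early fusion (concatenating or jointly encoding the cues before scoring), which can capture cross-modal interactions that a weighted score sum misses.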
The team is interested in industrial collaboration opportunities for commercial implementations. The project is conducted in partnership with OriginAI research institute.