We present a probabilistic model for the retrieval of multimodal documents. The model is based on Bayesian decision theory and combines models for text-based search with models for visual search. The textual model, applied to the LIMSI transcripts, follows the language modelling approach to text retrieval. The visual model, a mixture of Gaussian densities, describes keyframes selected from shots. Both models have proved successful on media-specific retrieval tasks. Our contribution is the combination of both techniques in a unified model that ranks shots on automatic speech recognition (ASR) data and visual features simultaneously.
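To make the idea of a unified ranking concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the textual and visual parts of a query are independent given a shot, so their log-likelihoods can simply be added. The linear smoothing with weight `lam`, the diagonal covariances, all function names, and the toy data are illustrative assumptions that the abstract does not specify.

```python
import math

def text_log_likelihood(query_terms, shot_terms, collection_tf, collection_len, lam=0.5):
    """Log-likelihood of the query terms under a smoothed unigram language model.

    Assumption: linear interpolation between shot and collection statistics;
    the smoothing method and lam=0.5 are illustrative choices.
    """
    shot_len = len(shot_terms)
    score = 0.0
    for term in query_terms:
        p_shot = shot_terms.count(term) / shot_len if shot_len else 0.0
        p_coll = collection_tf.get(term, 0) / collection_len
        p = lam * p_shot + (1.0 - lam) * p_coll
        score += math.log(max(p, 1e-12))  # floor avoids log(0) for unseen terms
    return score

def gaussian_log_density(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (an assumption)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def visual_log_likelihood(query_vectors, mixture):
    """Log-likelihood of query feature vectors under a shot's Gaussian mixture.

    `mixture` is a list of (weight, mean, variance) components.
    """
    total = 0.0
    for x in query_vectors:
        logs = [math.log(w) + gaussian_log_density(x, mean, var) for w, mean, var in mixture]
        peak = max(logs)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(value - peak) for value in logs))
    return total

def rank_shots(query_terms, query_vectors, shots, collection_tf, collection_len):
    """Rank shots by the sum of textual and visual log-likelihoods."""
    scored = []
    for shot_id, shot in shots.items():
        score = text_log_likelihood(query_terms, shot["terms"], collection_tf, collection_len)
        score += visual_log_likelihood(query_vectors, shot["mixture"])
        scored.append((score, shot_id))
    return [shot_id for _, shot_id in sorted(scored, reverse=True)]

# Toy data, invented purely for illustration.
shots = {
    "shot_a": {"terms": ["space", "shuttle", "launch"],
               "mixture": [(1.0, [0.0, 0.0], [1.0, 1.0])]},
    "shot_b": {"terms": ["weather", "report"],
               "mixture": [(1.0, [5.0, 5.0], [1.0, 1.0])]},
}
collection_tf = {"space": 1, "shuttle": 1, "launch": 1, "weather": 1, "report": 1}
collection_len = 5

ranking = rank_shots(["shuttle"], [[0.1, -0.2]], shots, collection_tf, collection_len)
print(ranking)
```

In this toy setting the query matches `shot_a` both textually and visually, so it is ranked first. A real system would estimate the mixture parameters from keyframe features and the term statistics from the transcripts.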

N.I.S.T.
Text REtrieval Conference
Database Architectures

Westerveld, T., de Vries, A., & van Ballegooij, A. (2002). CWI at the TREC-2002 video track. In Proceedings of Text REtrieval Conference 2002 (11) (pp. 1–10). N.I.S.T.