Interactive video retrieval in the age of effective joint embedding deep models: lessons from the 11th VBS

Lokoč, Jakub; Andreadis, Stelios; Bailer, Werner; Duane, Aaron; Gurrin, Cathal; Ma, Zhixin; Messina, Nicola; Nguyen, Thao-Nhu; Peška, Ladislav; Rossetto, Luca; Sauter, Loris; Schall, Konstantin; Schoeffmann, Klaus; Khan, Omar Shahbaz; Spiess, Florian; Vadicamo, Lucia; Vrochidis, Stefanos

doi:10.1007/s00530-023-01143-5

Regular Paper
Published: 24 August 2023

Multimedia Systems, Volume 29, pages 3481–3504 (2023)

Abstract

This paper presents findings of the eleventh Video Browser Showdown competition, where sixteen teams competed in known-item and ad-hoc search tasks. Many of the teams utilized state-of-the-art video retrieval approaches that demonstrated high effectiveness in challenging search scenarios. A broad survey of all utilized approaches is presented in connection with an analysis of the performance of participating teams. Specifically, high-level performance indicators are presented together with overall statistics, followed by an in-depth analysis of the performance of selected tools that implement result set logging. The analysis reveals evidence that the CLIP model represents a versatile tool for cross-modal video retrieval when combined with interactive search capabilities. Furthermore, the analysis investigates the effect of different users and text query properties on the performance in search tasks. Finally, lessons learned from search task preparation are presented, and a new direction for ad-hoc search based tasks at Video Browser Showdown is introduced.

Data and Code availability

The data and code to reproduce the graphs and tables of Sect. 4 are available at https://github.com/mesnico/VBS22-KIS-Analysis, and those of Sect. 5 are available at https://github.com/sauterl/VBS22-AVS-Analysis.

Notes

https://cloud.google.com/vision/docs/ocr.
We only kept those teams that had not yet solved the task, i.e., the timestamp of their correct submission was later than the upper bound of the respective interval (or they did not solve the task at all).
Counting from the task start time to task end time or correct submission time, whichever comes first.
https://github.com/siret-junior/somhunter.

Acknowledgements

This work was partially funded by the EU’s Horizon 2020 research and innovation programme under the grant agreements no. 101070250 (XRECO), no. 01004152 (CALLISTO), and no. 951911 (AI4Media - A European Excellence Centre for Media, Society and Democracy), by the Swiss National Science Foundation projects “Participatory Knowledge Practices in Analog and Digital Image Archives” (contract no. 193788) and “MediaGraph” (contract no. 202125), and by the Czech Science Foundation (GAČR) project 22-21696S. Special thanks to IVIST, AVSEEKER, V-FIRST, VideoFall, VNUHCM, and UIT, who provided technical information on their systems for the related work section of this paper.

Author information

Authors and Affiliations

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Jakub Lokoč & Ladislav Peška
ISTI-CNR, Pisa, Italy
Nicola Messina & Lucia Vadicamo
Department of Mathematics and Computer Science, University of Basel, Basel, Switzerland
Loris Sauter & Florian Spiess
Joanneum Research, Graz, Austria
Werner Bailer
Dublin City University, Dublin, Ireland
Cathal Gurrin & Thao-Nhu Nguyen
IT University of Copenhagen, Copenhagen, Denmark
Aaron Duane & Omar Shahbaz Khan
Klagenfurt University, Klagenfurt, Austria
Klaus Schoeffmann
Department of Informatics, University of Zurich, Zurich, Switzerland
Luca Rossetto
Visual Computing Group, HTW Berlin, Berlin, Germany
Konstantin Schall
Information Technologies Institute (ITI), Centre for Research and Technology Hellas (CERTH), Thermi-Thessaloniki, Greece
Stelios Andreadis & Stefanos Vrochidis
School of Computing and Information Systems, Singapore Management University, Singapore, Singapore
Zhixin Ma

Contributions

Conceptualization: JL, LV, FS, WB, LS; Methodology: FS, LS, WB, JL; Data Curation: LV, NM, FS, LR, ZM, LS; Formal analysis and investigation: LV, NM, FS, LP, WB, KS, AD, LS; Writing - original draft preparation: JL, LV, NM, FS, LR, LP, WB, KS, OK, AD, LS; Software: LV, NM, FS, LR, ZM, AD, LS; Supervision: JL, LV, FS, LR.

Corresponding author

Correspondence to Jakub Lokoč.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lokoč, J., Andreadis, S., Bailer, W. et al. Interactive video retrieval in the age of effective joint embedding deep models: lessons from the 11th VBS. Multimedia Systems 29, 3481–3504 (2023). https://doi.org/10.1007/s00530-023-01143-5

Received: 09 December 2022
Accepted: 17 July 2023
Published: 24 August 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s00530-023-01143-5
