Interactive video retrieval in the age of effective joint embedding deep models: lessons from the 11th VBS

  • Regular Paper
Multimedia Systems

Abstract

This paper presents findings of the eleventh Video Browser Showdown competition, where sixteen teams competed in known-item and ad-hoc search tasks. Many of the teams utilized state-of-the-art video retrieval approaches that demonstrated high effectiveness in challenging search scenarios. In this paper, a broad survey of all utilized approaches is presented in connection with an analysis of the performance of participating teams. Specifically, high-level performance indicators are presented together with overall statistics, as well as an in-depth analysis of the performance of selected tools that implement result set logging. The analysis reveals evidence that the CLIP model represents a versatile tool for cross-modal video retrieval when combined with interactive search capabilities. Furthermore, the analysis investigates the effect of different users and text query properties on the performance in search tasks. Finally, lessons learned from search task preparation are presented, and a new direction for ad-hoc search tasks at the Video Browser Showdown is introduced.

Data and Code availability

The data and code to reproduce the graphs and tables of Sect. 4 are available at https://github.com/mesnico/VBS22-KIS-Analysis, and those of Sect. 5 are available at https://github.com/sauterl/VBS22-AVS-Analysis.

Notes

  1. https://cloud.google.com/vision/docs/ocr.

  2. We kept only those teams that had not yet solved the task, i.e., teams whose correct submission had a timestamp later than the upper bound of the respective interval (or that did not solve the task at all).

  3. Counting from the task start time to task end time or correct submission time, whichever comes first.

  4. https://github.com/siret-junior/somhunter.
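
Notes 2 and 3 describe two timing rules in words. The following is a minimal sketch of how such rules could be expressed, not the authors' analysis code; the `TeamResult` record, the function names, and the assumption that all times are seconds on a common clock are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TeamResult:
    """Hypothetical record of one team's outcome on one task (times in seconds)."""
    team: str
    correct_submission_time: Optional[float]  # None if the task was never solved


def unsolved_at(results: List[TeamResult], interval_upper_bound: float) -> List[TeamResult]:
    """Note 2: keep teams whose correct submission came after the bound, or never came."""
    return [
        r for r in results
        if r.correct_submission_time is None
        or r.correct_submission_time > interval_upper_bound
    ]


def time_spent(task_start: float, task_end: float,
               correct_submission_time: Optional[float] = None) -> float:
    """Note 3: time from task start to task end or correct submission, whichever is first."""
    if correct_submission_time is None:
        stop = task_end
    else:
        stop = min(task_end, correct_submission_time)
    return stop - task_start
```

Under these assumptions, a team that never solves a task is counted as unsolved for every interval and is charged the full task duration.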

Acknowledgements

This work was partially funded by the EU’s Horizon 2020 research and innovation programme under grant agreements no. 101070250 (XRECO), no. 01004152 (CALLISTO), and no. 951911 (AI4Media - A European Excellence Centre for Media, Society and Democracy), the Swiss National Science Foundation projects “Participatory Knowledge Practices in Analog and Digital Image Archives” (contract no. 193788) and “MediaGraph” (contract no. 202125), and the Czech Science Foundation (GAČR) project 22-21696 S. Special thanks to IVIST, AVSEEKER, V-FIRST, VideoFall, VNUHCM, and UIT, who provided technical information on their systems for the related work section of this paper.

Author information

Contributions

Conceptualization: JL, LV, FS, WB, LS; Methodology: FS, LS, WB, JL; Data Curation: LV, NM, FS, LR, ZM, LS; Formal analysis and investigation: LV, NM, FS, LP, WB, KS, AD, LS; Writing - original draft preparation: JL, LV, NM, FS, LR, LP, WB, KS, OK, AD, LS; Software: LV, NM, FS, LR, ZM, AD, LS; Supervision: JL, LV, FS, LR.

Corresponding author

Correspondence to Jakub Lokoč.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lokoč, J., Andreadis, S., Bailer, W. et al. Interactive video retrieval in the age of effective joint embedding deep models: lessons from the 11th VBS. Multimedia Systems 29, 3481–3504 (2023). https://doi.org/10.1007/s00530-023-01143-5

  • DOI: https://doi.org/10.1007/s00530-023-01143-5
