Abstract
We describe our participation in the NTCIR-14 OpenLiveQ-2 task and our post-submission investigations. Given a query and a set of questions with their answers, participants in the OpenLiveQ task were required to return a ranked list of questions that potentially match the query and satisfy the user's information need. In this paper we focus on two main investigations: (i) finding effective features that go beyond relevance alone for the task of ranking Japanese-language questions for a given query; and (ii) analyzing the nature and relationship of online and offline evaluation measures. We use the OpenLiveQ-2 dataset for our study. Our first investigation examines user log-based features (e.g., number of views, whether a question is marked as solved) and content-based features (e.g., BM25 and language model (LM) scores). Overall, we find that log-based features reflecting a question's popularity, freshness, etc., dominate question ranking, rather than content-based features measuring query-question similarity. Our second investigation finds that the offline measures correlate highly among themselves, but that the correlation between offline and online measures is quite low. This low correlation is also reflected in discrepancies between the systems' rankings for the OpenLiveQ-2 task, although the extent depends on the nature and type of the evaluation measures.
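To make the content-based side of investigation (i) concrete, the sketch below shows how an Okapi BM25 score can be computed for a tokenized query-question pair. This is a minimal illustration, not the system's actual feature-extraction code; the function name, parameter defaults (k1 = 1.2, b = 0.75), and example inputs are our own illustrative choices.

```python
import math
from collections import Counter

def bm25(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
         k1=1.2, b=0.75):
    """Okapi BM25 score of one document (question text) for a query.

    query_terms / doc_terms: tokenized query and question text.
    doc_freq: term -> number of documents in the collection containing it.
    n_docs: total number of documents in the collection.
    avg_doc_len: average document length in the collection.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):  # sketch: treats each query term once
        if tf[term] == 0:
            continue
        df = doc_freq.get(term, 0)
        # "+ 1" inside the log keeps IDF non-negative for very common terms.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf_norm = (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * tf_norm
    return score

# Example with invented collection statistics:
score = bm25(["camera", "lens"],
             ["best", "camera", "lens", "advice"],
             doc_freq={"camera": 120, "lens": 45},
             n_docs=10000, avg_doc_len=25.0)
```

In a learning-to-rank setup of the kind described above, scores like this (and their LM query-likelihood analogues) form the content-based columns of each query-question feature vector, alongside log-based columns such as view counts and a solved flag.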
Notes
- A similar pattern of results was observed using Spearman's and Kendall's Tau correlation metrics during our investigation; these results have been omitted due to space constraints. The sketch below illustrates the computation.
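For illustration, the following computes the rank correlations mentioned in this note, plus Pearson's r, over per-system score lists under an offline measure and an online measure. The scores are invented placeholders (not the paper's actual results), and scipy is assumed to be available.

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# Hypothetical per-system scores: an offline measure (e.g. nDCG) and an
# online measure (e.g. click-based credit from multileaved comparison).
offline_scores = [0.61, 0.58, 0.55, 0.52, 0.49]
online_scores = [0.34, 0.41, 0.28, 0.39, 0.30]

r, _ = pearsonr(offline_scores, online_scores)      # linear correlation
rho, _ = spearmanr(offline_scores, online_scores)   # rank correlation
tau, _ = kendalltau(offline_scores, online_scores)  # pairwise-order agreement

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```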
Acknowledgement
This research is supported by Science Foundation Ireland (SFI) as part of the ADAPT Centre at Dublin City University (Grant No. 12/CE/I2267).
About this paper
Cite this paper
Arora, P., Jones, G.J.F. (2019). Studying Online and Offline Evaluation Measures: A Case Study Based on the NTCIR-14 OpenLiveQ-2 Task. In: Kato, M., Liu, Y., Kando, N., Clarke, C. (eds.) NII Testbeds and Community for Information Access Research. NTCIR 2019. Lecture Notes in Computer Science, vol. 11966. Springer, Cham. https://doi.org/10.1007/978-3-030-36805-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36804-3
Online ISBN: 978-3-030-36805-0