Studying Online and Offline Evaluation Measures: A Case Study Based on the NTCIR-14 OpenLiveQ-2 Task

  • Conference paper
NII Testbeds and Community for Information Access Research (NTCIR 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11966)

Abstract

We describe our participation in the NTCIR-14 OpenLiveQ-2 task and our post-submission investigations. Given a query and a set of questions with their answers, participants in the OpenLiveQ task were required to return a ranked list of questions that potentially match and effectively satisfy the user’s query. In this paper we focus on two main investigations: (i) finding effective features that go beyond relevance alone for the task of ranking questions for a given query in Japanese; (ii) analyzing the nature and relationship of online and offline evaluation measures. We use the OpenLiveQ-2 dataset for our study. Our first investigation examines user log-based features (e.g., number of views, whether the question is solved) and content-based features (BM25 scores, LM scores). Overall, we find that log-based features reflecting the question’s popularity, freshness, etc., dominate question ranking, rather than content-based features measuring the similarity between query and question. Our second investigation finds that the offline measures correlate highly among themselves, but that the correlation between offline and online measures is quite low. We find that this low correlation is also reflected in discrepancies between the systems’ rankings for the OpenLiveQ-2 task, although this depends on the nature and type of the evaluation measures.

Notes

  1. https://chiebukuro.yahoo.co.jp/

  2. https://scikit-learn.org/stable/

  3. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

  4. Similar patterns of results were observed using Spearman’s and Kendall’s tau correlation metrics during our investigation; these results are omitted because of space constraints.
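
The correlation analysis referred to in notes 3 and 4 can be illustrated with a minimal sketch. It is not the authors' code: the paper points to `scipy.stats.pearsonr`, whereas the version below uses only the Python standard library so that it runs anywhere. The function names `pearson` and `kendall_tau` and all score values are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r between two equal-length lists of per-system scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs.

    Assumes there are no tied scores.
    """
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return total / (n * (n - 1) / 2)


# Made-up scores for five hypothetical systems (NOT data from the paper).
offline_a = [0.42, 0.38, 0.51, 0.47, 0.33]  # one offline measure
offline_b = [0.40, 0.36, 0.53, 0.45, 0.30]  # a second offline measure
online = [0.10, 0.25, 0.12, 0.08, 0.22]     # one online measure

print("offline vs offline:", pearson(offline_a, offline_b), kendall_tau(offline_a, offline_b))
print("offline vs online: ", pearson(offline_a, online), kendall_tau(offline_a, online))
```

With these illustrative numbers the two offline measures rank the systems identically, while the offline and online measures disagree, which is the kind of pattern the abstract describes. Pearson's r compares score values, whereas Kendall's tau compares only the induced system rankings.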

Acknowledgement

This research is supported by Science Foundation Ireland (SFI) as a part of the ADAPT Centre at Dublin City University (Grant No: 12/CE/I2267).

Author information

Corresponding author

Correspondence to Piyush Arora.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Arora, P., Jones, G.J.F. (2019). Studying Online and Offline Evaluation Measures: A Case Study Based on the NTCIR-14 OpenLiveQ-2 Task. In: Kato, M., Liu, Y., Kando, N., Clarke, C. (eds) NII Testbeds and Community for Information Access Research. NTCIR 2019. Lecture Notes in Computer Science, vol 11966. Springer, Cham. https://doi.org/10.1007/978-3-030-36805-0_6

  • DOI: https://doi.org/10.1007/978-3-030-36805-0_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-36804-3

  • Online ISBN: 978-3-030-36805-0

  • eBook Packages: Computer Science (R0)
