Standard quality criteria, derived from current NLP evaluations, for guiding evaluation design and for grounding comparability and AI compliance assessments