Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments

  • Paul Bennett,
  • Ben Carterette,
  • Olivier Chapelle,
  • Thorsten Joachims


In most information retrieval and filtering applications, assessor judgments form the cornerstone of system evaluation, and they are critical both for comparing systems and for training ranking algorithms. Other “judgments” such as clicks, relevance feedback, and ratings are also used for tuning and selecting ranking algorithms, and more broadly for user modeling, evaluating presentation techniques, and so on. However, most judgment types (including binary or graded relevance judgments, ratings, and explicit feedback) are made for an individual document, independent of other documents.

The past several years have seen growing interest in the use of relative judgments or preferences (Is document A better than document B?), diversity judgments (Is a retrieved set of results diverse?), novelty judgments (Is this document novel when added to a set?), and implicit preferences derived from clicks, all of which require considering multiple items together. At the same time, there has been a surge in the development of machine learning methods for “structured learning”. These methods can model more complex interactions in both input features and output values, and can directly optimize more complex performance measures (e.g., MAP or ROC area) or predict structured outputs such as rankings and parse trees.
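To make the notion of preference judgments concrete, the following is a minimal sketch (not drawn from any workshop paper) of learning a linear scoring function from pairwise preferences of the form “document A is better than document B”, using a hinge-style update in the spirit of pairwise rankers; the feature vectors, preference pairs, and hyperparameters are invented for illustration.

```python
# Minimal sketch: fit a linear scorer w so that preferred documents
# receive higher scores than their counterparts. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vectors for 6 documents (4 features each).
docs = rng.normal(size=(6, 4))

# Preference pairs (i, j): an assessor judged document i better than document j.
prefs = [(0, 1), (0, 2), (3, 4), (3, 5), (1, 5)]

w = np.zeros(4)
lr = 0.1
for epoch in range(100):
    for i, j in prefs:
        margin = docs[i] @ w - docs[j] @ w
        if margin < 1.0:  # hinge: update only when a preference is violated or weakly satisfied
            w += lr * (docs[i] - docs[j])

# Rank all documents by learned score; preferred documents should rank higher.
print(np.argsort(-(docs @ w)))
```

The key point of the sketch is that the training signal is relative rather than absolute: the loss depends only on score differences within a judged pair, never on a per-document relevance label.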