### Abstract

We evaluate statistical inference procedures for small-scale IR experiments that involve multiple comparisons against a baseline. These procedures adjust for multiple comparisons by ensuring that the probability of observing at least one false positive in the experiment stays below a given threshold. We use only publicly available test collections and make our software available for download. In particular, we use TREC runs and runs constructed from the Microsoft learning-to-rank (MSLR) data set. Our focus is on non-parametric statistical procedures that include the Holm-Bonferroni adjustment of permutation test p-values, the MaxT permutation test, and permutation-based closed testing. In TREC-based simulations, these procedures retain from 66% to 92% of individually significant results (i.e., those obtained without taking other comparisons into account). Similar retention rates are observed in the MSLR simulations. For the largest evaluated query set size (i.e., 6400), procedures that adjust for multiplicity find at most 5% fewer true differences than unadjusted tests. At the same time, unadjusted tests produce many more false positives.
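To make two of the named adjustments concrete, the following is a minimal Python sketch and not the authors' released software. It assumes a hypothetical score matrix with one row per query, the baseline in column 0, and one column per compared system. The function names, the paired t-statistic, the single-step form of MaxT, and the within-query shuffling scheme are all illustrative assumptions. Closed testing is not shown.

```python
import numpy as np


def holm_adjust(pvalues):
    """Holm-Bonferroni adjusted p-values (step-down Bonferroni)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value (k = 0..m-1) is multiplied by (m - k).
    scaled = (m - np.arange(m)) * p[order]
    # Enforce monotonicity, then cap at 1.
    adjusted_sorted = np.minimum(1.0, np.maximum.accumulate(scaled))
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


def _abs_t(scores):
    """Absolute paired t-statistic of each system column vs. column 0."""
    diffs = scores[:, 1:] - scores[:, [0]]
    n = diffs.shape[0]
    return np.abs(diffs.mean(axis=0)) / (diffs.std(axis=0, ddof=1) / np.sqrt(n))


def permutation_pvalues(scores, n_perm=2000, seed=0):
    """Return (unadjusted, single-step MaxT adjusted) permutation p-values.

    Assumption: under the complete null, all systems (baseline included)
    are exchangeable within a query, so each row is shuffled independently.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    observed = _abs_t(scores)
    exceed_own = np.zeros(observed.size)
    exceed_max = np.zeros(observed.size)
    for _ in range(n_perm):
        # Independent random permutation of system labels for every query.
        idx = np.argsort(rng.random(scores.shape), axis=1)
        perm_stats = _abs_t(np.take_along_axis(scores, idx, axis=1))
        exceed_own += perm_stats >= observed
        # MaxT compares each observed statistic with the permutation maximum.
        exceed_max += perm_stats.max() >= observed
    raw = (exceed_own + 1) / (n_perm + 1)
    maxt = (exceed_max + 1) / (n_perm + 1)
    return raw, maxt


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    demo = rng.normal(size=(50, 4))  # synthetic data: baseline + 3 systems
    demo[:, 1] += 1.5                # system 1 is made truly better
    raw, maxt = permutation_pvalues(demo, n_perm=1000)
    print("unadjusted:", raw)
    print("Holm      :", holm_adjust(raw))
    print("MaxT      :", maxt)
```

In this sketch the adjusted p-values are never smaller than the unadjusted ones, which is why adjusted procedures retain only a subset of the individually significant results. MaxT uses the joint permutation distribution of the statistics, whereas Holm-Bonferroni needs only the individual p-values.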

Original language | English |
---|---|
Title of host publication | SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval |
Pages | 403-412 |
Number of pages | 10 |
DOIs | https://doi.org/10.1145/2484028.2484034 |
State | Published - Sep 2 2013 |
Event | 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2013 - Dublin, Ireland; Duration: Jul 28 2013 → Aug 1 2013 |

### Publication series

Name | SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval |
---|---|

### Conference

Conference | 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2013 |
---|---|
Country | Ireland |
City | Dublin |
Period | 07/28/13 → 08/01/13 |

### Keywords

- Holm-Bonferroni
- MaxT
- Multiple comparisons
- Permutation test
- Randomization test
- Statistical significance
- T-test

### Cite this

*SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval* (pp. 403-412). (SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval). https://doi.org/10.1145/2484028.2484034