Why you should NOT use MS MARCO to evaluate semantic search

And likely not many other widely used datasets either

Thiago G. Martins
Towards Data Science
7 min read · Mar 23, 2020

--

If we want to investigate the power and limitations of semantic vectors (pre-trained or not), we should ideally prioritize datasets that are less biased towards term-matching signals. This piece shows that the MS MARCO dataset is more biased towards those signals than we expected and that the same issues are likely present in many other datasets due to similar data collection designs.
