ML Reproducibility Crisis

A study from 2016 showed that the majority of over 1,500 surveyed researchers acknowledged the existence of a reproducibility crisis across various research fields (Baker 2016). Gundersen and Kjensmo (2018) further highlight this by examining the reproducibility of research papers recently published at two of the major conference series in the field of AI, namely the International Joint Conference on Artificial Intelligence (IJCAI) and the conference of the Association for the Advancement of Artificial Intelligence (AAAI). To this end, they defined metrics for Experiment Reproducibility (R1), Data Reproducibility (R2), and Method Reproducibility (R3), and found that none of the 400 surveyed papers were fully reproducible according to those metrics.

A similar effect can be seen in the field of recommender systems: less than 50% of the algorithms for the traditional top-N recommendation problem presented at the top-level conferences IJCAI, Knowledge Discovery and Data Mining (KDD), and Special Interest Group on Information Retrieval (SIGIR) were deemed reproducible by Cremonesi and Jannach (2021).

Finally, Pham et al. (2021) highlighted researchers' unawareness of the various factors that introduce variance into their experiments by conducting two surveys. The first, a survey of researchers and practitioners, showed that 83.8% of the 901 participants were not aware of any implementation-level variance. Their accompanying literature survey further found that only 19.5±3% of research papers published in recent top software engineering, AI, and systems conferences use multiple identical training runs to quantify the variance of their results. This evidence suggests not only that researchers are often unaware of implementation-level variance affecting their results, but also that practices for ensuring reproducibility are insufficiently applied. Addressing this crisis therefore requires a concerted effort to raise awareness of these issues and to foster the adoption of more rigorous reproducibility practices.
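To illustrate the practice that Pham et al. (2021) found to be rarely applied, the following minimal sketch repeats an otherwise identical training run under several random seeds and reports the mean and standard deviation of the resulting test error. The toy linear model, the train_and_evaluate function, and all hyperparameters are hypothetical and chosen purely for illustration; they are not taken from any of the cited studies.

    import numpy as np

    # Fixed synthetic data set, shared by every run (data generation is seeded
    # separately so that only training randomness differs between runs).
    data_rng = np.random.default_rng(0)
    X = data_rng.normal(size=(200, 5))
    true_w = np.arange(1.0, 6.0)
    y = X @ true_w + data_rng.normal(scale=0.5, size=200)
    X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

    def train_and_evaluate(seed: int) -> float:
        """One hypothetical training run: linear regression fitted with SGD.
        The seed controls weight initialisation and data shuffling only."""
        rng = np.random.default_rng(seed)
        w = rng.normal(scale=0.1, size=5)            # random initialisation
        for _ in range(50):                          # epochs
            for i in rng.permutation(len(X_train)):  # per-epoch shuffling
                grad = (X_train[i] @ w - y_train[i]) * X_train[i]
                w -= 0.01 * grad                     # SGD step
        return float(np.mean((X_test @ w - y_test) ** 2))  # test MSE

    # Multiple identical runs that differ only in the random seed; the spread
    # of the scores quantifies implementation-level variance.
    scores = [train_and_evaluate(seed) for seed in range(10)]
    print(f"test MSE over {len(scores)} runs: "
          f"{np.mean(scores):.4f} ± {np.std(scores, ddof=1):.4f}")

In practice, such summary statistics over several runs (or the individual per-run scores) would accompany a reported result instead of a single number from one training run.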


Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533, no. 7604 (May): 452–454. ISSN: 1476-4687. https://doi.org/10.1038/533452a.

Gundersen, Odd Erik, and Sigbjørn Kjensmo. 2018. “State of the Art: Reproducibility in Artificial Intelligence.” Proceedings of the AAAI Conference on Artificial Intelligence 32, no. 1 (April). ISSN: 2374-3468. https://doi.org/10.1609/aaai.v32i1.11503.

Pham, Hung Viet, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. 2021. “Problems and Opportunities in Training Deep Learning Software Systems: An Analysis of Variance.” In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 771–783. ASE ’20. New York, NY, USA: Association for Computing Machinery, January. ISBN: 978-1-4503-6768-4. https://doi.org/10.1145/3324884.3416545.