INDUSTRY
INTRODUCING MLCOMMONS
By Gregg Barrett
MLCommons was officially launched on December 3rd, 2020. MLPerf, which has been absorbed into MLCommons, successfully collected input from dozens of companies and academic institutions to create the industry-standard benchmarks for machine learning.
SYNAPSE | 1ST QUARTER 2021
Where MLPerf concerned itself with the development and maintenance of machine learning benchmarks for training and inference, MLCommons has a much broader objective and aims to answer the needs of the nascent machine learning industry through open and collaborative engineering in three areas: benchmarking, datasets and best practices.
Datasets
Building machine learning systems requires data, which is not always easy to come by. In addition, an apples-to-apples comparison of one model against another requires using the same test data. The problem, however, is that most public datasets are small, legally restricted, not redistributable and not diverse. To foster machine learning innovation, MLCommons is therefore working to unite disparate companies and organisations in the creation of large, diverse and redistributable datasets, under a Creative Commons or similar license, for machine learning training and testing.
As an example, ImageNet is commonly used in computer vision; however, there is no similar large-scale dataset for speech, where there might be a need to compare speech-to-text accuracy. The first effort in this regard is the People’s Speech Dataset, which is approximately 100 times larger than earlier open alternatives and aims to be more diverse. Compiling such a dataset is not an easy task: aside from data engineering issues, there are licensing matters that need to be dealt with, and the resulting dataset needs to be state of the art.