These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
Croissant
Croissant is an open-source framework developed by MLCommons to standardise dataset descriptions, enhance data discoverability, and facilitate automated use across machine-learning tasks. Croissant ensures datasets are consistently documented by providing structured metadata schemas, improving interoperability, transparency, and ease of integration. This enables practitioners to efficiently find, evaluate, and utilise datasets, streamlining the machine learning lifecycle and supporting reproducible, data-centric AI research and applications.
Croissant is an initiative by MLCommons, a collaborative organisation dedicated to establishing industry-wide benchmarks and open standards for accelerating machine learning (ML) innovation. Recognising the challenges posed by datasets' growing volume and complexity, MLCommons introduced Croissant to enhance data accessibility, interoperability, and reproducibility across the global ML community.
Historically, machine learning researchers and developers have struggled with the lack of standardised dataset metadata, leading to significant inefficiencies. Variability in dataset documentation complicates the processes of dataset discovery, evaluation, and integration into ML workflows. This inconsistency often results in redundant work, decreased reproducibility of ML results, and barriers to accurately validating and comparing model performance.
To address these issues, MLCommons developed Croissant to provide structured schemas that clearly define and standardise dataset metadata. The tool describes essential dataset characteristics in a consistent way, including content structure, provenance, licensing, intended usage, and associated tasks. By implementing a standardised metadata framework, Croissant simplifies dataset discovery through automated search capabilities, making it significantly easier for practitioners to locate appropriate datasets for their needs.
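As a rough illustration of what a structured dataset description covering these characteristics might look like, the sketch below builds a small metadata record and checks it for completeness. The key names (`provenance`, `intended_usage`, `content_structure`), the `REQUIRED_KEYS` set, and the `missing_keys` helper are assumptions made for this example; they are not the official Croissant vocabulary, which should be taken from the Croissant specification.

```python
import json

# Hypothetical, simplified metadata record. Key names are illustrative
# and are NOT guaranteed to match the official Croissant vocabulary.
dataset_metadata = {
    "name": "example-image-dataset",
    "description": "Toy dataset used only to illustrate structured metadata.",
    "license": "CC-BY-4.0",
    "provenance": {
        "creator": "Example Lab",
        "source_url": "https://example.org/data",
    },
    "intended_usage": ["image classification"],
    "content_structure": [
        {"field": "image", "type": "bytes"},
        {"field": "label", "type": "string"},
    ],
}

# Assumed minimum set of keys for this sketch.
REQUIRED_KEYS = {"name", "description", "license", "provenance", "content_structure"}


def missing_keys(metadata: dict) -> list[str]:
    """Return the required keys that are absent from the metadata, sorted."""
    return sorted(REQUIRED_KEYS - set(metadata))


# Serialise to JSON so the record can be stored next to the data files.
serialised = json.dumps(dataset_metadata, indent=2, sort_keys=True)
print(serialised)
print("Missing keys:", missing_keys(dataset_metadata))
```

Because every record follows the same shape, a tool can check completeness mechanically instead of relying on someone reading free-form documentation.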
The primary objectives of Croissant are:
Interoperability and Accessibility: Croissant significantly enhances interoperability by enforcing consistent dataset metadata structures. Practitioners across diverse organisations, disciplines, and geographical locations can seamlessly discover and use datasets without extensive manual preprocessing or analysis.
Transparency and Reproducibility: Standardised dataset descriptions provided by Croissant allow users to quickly understand the dataset’s source, structure, and usage constraints. Improved transparency facilitates more reliable and reproducible ML research, enabling the community to verify and replicate findings effectively.
Automation and Efficiency: Croissant facilitates automated dataset retrieval, assessment, and integration into ML workflows. Structured metadata enables automated tools to validate dataset suitability, compatibility, and compliance with user-specified criteria, streamlining the ML pipeline.
Community-Driven Development: As an open-source framework, Croissant benefits from continual improvement through community contributions and collaborative development efforts. This collective approach ensures that Croissant remains responsive to evolving community needs and industry standards.
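The automation objective above can be sketched as a simple filter over structured metadata. This is a minimal example under stated assumptions: `DatasetRecord`, `find_suitable`, and the sample catalogue are hypothetical names and data invented for illustration, and they do not represent Croissant's own API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical, minimal view of a dataset's metadata."""
    name: str
    license: str
    tasks: list[str] = field(default_factory=list)


def find_suitable(records, allowed_licenses, required_task):
    """Return names of datasets with an allowed licence that list the task."""
    return [
        r.name
        for r in records
        if r.license in allowed_licenses and required_task in r.tasks
    ]


# Invented sample catalogue; none of these datasets are real.
catalogue = [
    DatasetRecord("toy-images", "CC-BY-4.0", ["image classification"]),
    DatasetRecord("toy-reviews", "proprietary", ["sentiment analysis"]),
    DatasetRecord("toy-photos", "CC0-1.0", ["image classification", "object detection"]),
    DatasetRecord("toy-speech", "CC0-1.0", ["speech recognition"]),
]

matches = find_suitable(catalogue, {"CC-BY-4.0", "CC0-1.0"}, "image classification")
print(matches)
```

The filter only works because licence and task information sit in predictable fields, which is the kind of consistency a standardised metadata format is meant to provide.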
Ultimately, Croissant represents a foundational component of MLCommons’ efforts to improve the broader machine-learning ecosystem. By standardising and enriching dataset metadata, Croissant actively contributes to developing robust, transparent, and accessible data-driven AI systems. It reflects MLCommons’ commitment to fostering open, community-led solutions that address significant barriers to adopting and ethically deploying machine learning technologies.
About the tool
Tags:
- open source
- metadata
- standard