Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Cleanlab



Cleanlab

cleanlab automatically finds and fixes errors in any ML dataset. This data-centric AI package facilitates machine learning with messy, real-world data by providing clean labels during training.

# cleanlab works with **any classifier**. Yup, you can use sklearn/PyTorch/TensorFlow/XGBoost/etc.
cl = cleanlab.classification.CleanLearning(sklearn.YourFavoriteClassifier())

# cleanlab finds data and label issues in **any dataset**... in ONE line of code!
label_issues = cl.find_label_issues(data, labels)

# cleanlab trains a robust version of your model that works more reliably with noisy data.
cl.fit(data, labels)

# cleanlab estimates the predictions you would have gotten if you had trained with *no* label issues.
cl.predict(test_data)

# A true data-centric AI package, cleanlab quantifies class-level issues and overall data quality, for any dataset.
cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)

Get started with: documentationtutorialsexamples, and blogs.

  • Learn how to run cleanlab on your own data in just 5 minutes for classification with: imagetextaudio, and tabular data.

pypi os py_versions build_status coverage docs Slack Community Twitter Cleanlab Studio


News! (2022) — cleanlab made accessible for everybody, not just ML researchers (click to learn more)
News! (2021) — cleanlab finds pervasive label errors in the most common ML datasets (click to learn more)
News! (2020) — cleanlab adds support for all OS, achieves state-of-the-art, supports co-teaching and more (click to learn more)

Release notes for past versions are available here. Details behind certain updates are explained in our blog.

So fresh, so cleanlab

cleanlab cleans your data’s labels via state-of-the-art confident learning algorithms, published in this paper and blog. See datasets cleaned with cleanlab at labelerrors.com. This package helps you find all the label issues lurking in your data and train more reliable ML models.

cleanlab is:

  1. backed by theory
    • with provable guarantees of exact noise estimation and label error finding in realistic cases with imperfect models.
  2. fast
    • Code is optimized and parallel-threaded (< 1 second to find label issues in ImageNet with pre-computed probabilities).
  3. easy-to-use
    • Find label issues or train noise-robust models in one line of code. By default, cleanlab requires no hyper-parameters.
  4. general
    • Works with any dataset and any model, e.g., TensorFlow, PyTorch, sklearn, xgboost, etc.

Examples of incorrect given labels in various image datasets found and corrected using cleanlab.

Run cleanlab

cleanlab supports Linux, macOS, and Windows and runs on Python 3.6+.

cleanlab core package components (click to learn more)

Use cleanlab with any model (TensorFlow, PyTorch, sklearn, xgboost, etc.)

All features of cleanlab work with any dataset and any model. Yes, any model: scikit-learn, PyTorch, Tensorflow, Keras, JAX, HuggingFace, MXNet, XGBoost, etc. If you use a sklearn-compatible classifier, cleanlab methods work out-of-the-box.

It’s also easy to use your favorite non-sklearn-compatible model (click to learn more)

Cool cleanlab applications

Reproducing results in Confident Learning paper (click to learn more)
cleanlab performance across 4 data distributions and 9 classifiers (click to learn more)
ML research using cleanlab (click to learn more)
cleanlab for advanced users (click to learn more)
Positive-Unlabeled learning with cleanlab (click to learn more)

Citation and related publications

cleanlab is based on peer-reviewed research. Here are the relevant papers to cite if you use this package:

Confident Learning (JAIR ’21) (click to show bibtex)
Rank Pruning (UAI ’17) (click to show bibtex)

Other resources

While this open-source library finds data issues, an interface is needed to efficiently fix these issues in your dataset. Cleanlab Studio is a no-code platform to find and fix problems in real-world ML datasets. Studio automatically runs optimized versions of the algorithms from this open-source library on top of AutoML models fit to your data, and presents detected issues in a smart data editing interface. Think of it like a data cleaning assistant that helps you quickly improve the quality of your data (via AI/automation + streamlined UX).

Join our community

License

Copyright (c) 2017-2022 Cleanlab Inc.

cleanlab is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

cleanlab is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See GNU Affero General Public LICENSE for details.

About the tool


Developing organisation(s):


Tool type(s):


Objective(s):



Type of approach:





Programming languages:



Github stars:

  • 2799

Github forks:

  • 278

Modify this tool

Use Cases

There is no use cases for this tool yet.

Would you like to submit a use case for this tool?

If you have used this tool, we would love to know more about your experience.

Add use case
catalogue Logos

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.