These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
Cleanlab
cleanlab automatically finds and fixes errors in any ML dataset. This data-centric AI package facilitates machine learning with messy, real-world data by providing clean labels during training.
# cleanlab works with **any classifier**. Yup, you can use sklearn/PyTorch/TensorFlow/XGBoost/etc.
cl = cleanlab.classification.CleanLearning(sklearn.YourFavoriteClassifier())
# cleanlab finds data and label issues in **any dataset**... in ONE line of code!
label_issues = cl.find_label_issues(data, labels)
# cleanlab trains a robust version of your model that works more reliably with noisy data.
cl.fit(data, labels)
# cleanlab estimates the predictions you would have gotten if you had trained with *no* label issues.
cl.predict(test_data)
# A true data-centric AI package, cleanlab quantifies class-level issues and overall data quality, for any dataset.
cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)
Get started with: documentation, tutorials, examples, and blogs.
- Learn how to run cleanlab on your own data in just 5 minutes for classification with: image, text, audio, and tabular data.
News! (2022) — cleanlab made accessible for everybody, not just ML researchers (click to learn more)
News! (2021) — cleanlab finds pervasive label errors in the most common ML datasets (click to learn more)
News! (2020) — cleanlab adds support for all OS, achieves state-of-the-art, supports co-teaching and more (click to learn more)
Release notes for past versions are available here. Details behind certain updates are explained in our blog.
So fresh, so cleanlab
cleanlab cleans your data’s labels via state-of-the-art confident learning algorithms, published in this paper and blog. See datasets cleaned with cleanlab at labelerrors.com. This package helps you find all the label issues lurking in your data and train more reliable ML models.
cleanlab is:
- backed by theory
- with provable guarantees of exact noise estimation and label error finding in realistic cases with imperfect models.
- fast
- Code is optimized and parallel-threaded (< 1 second to find label issues in ImageNet with pre-computed probabilities).
- easy-to-use
- Find label issues or train noise-robust models in one line of code. By default, cleanlab requires no hyper-parameters.
- general
- Works with any dataset and any model, e.g., TensorFlow, PyTorch, sklearn, xgboost, etc.
Examples of incorrect given labels in various image datasets found and corrected using cleanlab.
Run cleanlab
cleanlab supports Linux, macOS, and Windows and runs on Python 3.6+.
- Get started here! Install via
pip
orconda
as described here. - Developers who install the bleeding-edge master branch from source should refer to this master version of documentation.
cleanlab core package components (click to learn more)
Use cleanlab with any model (TensorFlow, PyTorch, sklearn, xgboost, etc.)
All features of cleanlab work with any dataset and any model. Yes, any model: scikit-learn, PyTorch, Tensorflow, Keras, JAX, HuggingFace, MXNet, XGBoost, etc. If you use a sklearn-compatible classifier, cleanlab methods work out-of-the-box.
It’s also easy to use your favorite non-sklearn-compatible model (click to learn more)
Cool cleanlab applications
Reproducing results in Confident Learning paper (click to learn more)
cleanlab performance across 4 data distributions and 9 classifiers (click to learn more)
ML research using cleanlab (click to learn more)
cleanlab for advanced users (click to learn more)
Positive-Unlabeled learning with cleanlab (click to learn more)
Citation and related publications
cleanlab is based on peer-reviewed research. Here are the relevant papers to cite if you use this package:
Confident Learning (JAIR ’21) (click to show bibtex)
Rank Pruning (UAI ’17) (click to show bibtex)
Other resources
-
NeurIPS 2021 paper: Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
-
Cleanlab Studio: No-code Data Improvement
While this open-source library finds data issues, an interface is needed to efficiently fix these issues in your dataset. Cleanlab Studio is a no-code platform to find and fix problems in real-world ML datasets. Studio automatically runs optimized versions of the algorithms from this open-source library on top of AutoML models fit to your data, and presents detected issues in a smart data editing interface. Think of it like a data cleaning assistant that helps you quickly improve the quality of your data (via AI/automation + streamlined UX).
Join our community
-
The best place to learn is our Slack community.
-
Have ideas for the future of cleanlab? How are you using cleanlab? Join the discussion and check out our active/planned Projects and what we could use your help with.
-
Interested in contributing? See the contributing guide and ideas on useful contributions.
-
Have code improvements for cleanlab? See the development guide.
-
Have an issue with cleanlab? Search existing issues or submit a new issue.
-
Need professional help with cleanlab? Join our #help Slack channel and message one of our core developers, Jonas Mueller, or schedule a meeting via email: team@cleanlab.ai
License
Copyright (c) 2017-2022 Cleanlab Inc.
cleanlab is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
cleanlab is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See GNU Affero General Public LICENSE for details.
About the tool
You can click on the links to see the associated tools
Developing organisation(s):
Tool type(s):
Objective(s):
Purpose(s):
Type of approach:
Usage rights:
Github stars:
- 2799
Github forks:
- 278
Use Cases
Would you like to submit a use case for this tool?
If you have used this tool, we would love to know more about your experience.
Add use case