Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

BigScience Catalogue of Language Data and Resources




The main goal of the catalogue is to support the creation of the BigScience dataset while adhering to the values laid out by the various data working groups: collecting diverse resources (Data Sourcing), supporting information required for open and easily usable technical infrastructure (Data Tooling), and respecting the privacy of data subjects and the rights of data owners (Data Governance).

As of 14 December 2021, the catalogue contained 192 entries with 432 different language tags (each entry can have multiple language tags). The most frequent language tags are those of the BigScience target language groups. English is the most frequent language across all entries. The most frequent varieties of Arabic are Modern Standard Arabic and Classical Arabic, the most frequent Indic languages are Hindi, Bengali, Telugu, Tamil and Urdu, and the most frequent Niger-Congo languages are Swahili, Igbo, Yoruba and isiZulu.

As a result of our efforts, we created an openly available catalogue of 192 data sources in locations around the world, with at least 10% of the entries representing each of our target languages (with the exception of programming languages). The bulk of these resources are primary sources, presenting opportunities to collect data in these languages in new contexts and topics. Together, the resource custodians cover a wide range of geographic and institutional contexts.

The form also provides a mode for a second participant to review and validate entries already submitted to the catalogue. After the validator selects an entry, the form is populated with the responses originally submitted for that entry, with a validation checkbox at the end of each section. The validator may review and edit the selections for each question and mark the section as validated. Once each section has been reviewed, the validator may save their work. Entries that have already been validated include a note to that effect and allow the validator to review either the original entry or later entries, listed by their save date.
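The validation workflow described above could be modelled roughly as follows. This is only a sketch in Python under stated assumptions, not the catalogue's actual implementation: the class names (`Version`, `Entry`), the field names and the rule that every section must be ticked before saving are hypothetical readings of the description.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Version:
    """One saved state of an entry (hypothetical structure)."""
    saved_at: datetime
    responses: dict  # section name -> answers given in that section
    validated_sections: set = field(default_factory=set)


@dataclass
class Entry:
    """A catalogue entry: the original submission plus later validated saves."""
    name: str
    versions: list = field(default_factory=list)  # versions[0] is the original

    def save_validation(self, responses, validated_sections, saved_at):
        # Assumption: saving is allowed only once every section is marked validated.
        missing = set(responses) - set(validated_sections)
        if missing:
            raise ValueError(f"sections not yet validated: {sorted(missing)}")
        self.versions.append(Version(saved_at, dict(responses), set(validated_sections)))

    @property
    def is_validated(self):
        # Assumption: any save after the original submission counts as a validation.
        return len(self.versions) > 1

    def history(self):
        # Original and later versions, listed by save date.
        return sorted(self.versions, key=lambda v: v.saved_at)


# Illustrative data only.
entry = Entry("example resource")
entry.versions.append(Version(datetime(2021, 11, 1), {"languages": ["Hindi"], "license": "unknown"}))
entry.save_validation(
    {"languages": ["Hindi", "Urdu"], "license": "CC-BY"},
    validated_sections={"languages", "license"},
    saved_at=datetime(2021, 12, 1),
)
print(entry.is_validated, [v.saved_at.date().isoformat() for v in entry.history()])
```

Keeping every save as a separate version, rather than overwriting the original, is what would let a validator compare the first submission with later edits.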

An interactive map of the world shows the number of entries submitted at various geographical levels of detail, such as region or country, for either the location of the data creators or the location of the data custodians. Both the map and a pie chart showing the proportion of entries by language may be filtered using one of the many properties produced by the form, such as the resource type, the license type, or the media type.
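The filtering and counting behind such a map and pie chart can be sketched as below. This is a minimal Python illustration, not the catalogue's own code: the field names (`custodian_country`, `languages`, `resource_type`) and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical sample entries; field names and values are illustrative only.
entries = [
    {"custodian_country": "Nigeria", "languages": ["Yoruba", "Igbo"], "resource_type": "primary"},
    {"custodian_country": "India", "languages": ["Hindi"], "resource_type": "primary"},
    {"custodian_country": "India", "languages": ["Hindi", "Urdu"], "resource_type": "processed"},
]


def filter_entries(entries, **criteria):
    """Keep only entries whose fields equal every given criterion."""
    return [e for e in entries if all(e.get(k) == v for k, v in criteria.items())]


def count_by_location(entries, location_field="custodian_country"):
    """Number of entries per location, as a map view might display."""
    return Counter(e[location_field] for e in entries)


def language_proportions(entries):
    """Share of each language tag; an entry with several tags counts once per tag."""
    counts = Counter(lang for e in entries for lang in e["languages"])
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


primary = filter_entries(entries, resource_type="primary")
print(count_by_location(primary))
print(language_proportions(primary))
```

Because each entry can carry several language tags, the proportions here are computed over tags rather than over entries; a different convention would give different shares.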

Disclaimer: the aforementioned information is based on the paper “Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources”, co-authored by Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco De Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Zeerak Talat, Daniel van Strien, Yacine Jernite. Link: https://arxiv.org/abs/2201.10066 



Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.