Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Project Moonshot



Project Moonshot

Developed by the AI Verify Foundation, project Moonshot is one of the world’s first Large Language Model (LLM) Evaluation Toolkits, designed to integrate benchmarking, red teaming, and testing baselines. Its aim it to assist multiple stakeholders including developers, compliance teams, and AI system owners in mitigating deployment risks associated with LLMs by offering a seamless way to evaluate the performance of their applications, both before and after deployment.

This allows companies to address the significant opportunities, and associated risks of generative AI, including LLM technology. Within the LLM space, companies may question, which LLMs are the most appropriate for achieving their goals, or how to ensure they are building a model which is robust and safe. Moonshot helps companies answer these questions through a comprehensive suite of benchmarking tests and scoring reports, so they can deploy responsible generative AI systems with confidence.
Five of the Project Moonshot characteristics include:

  • Benchmark to measure model safety and performance: Project Moonshot offers a list of benchmarks which are popular, including those widely discussed in the community, and those used by leaderboards such as Hugging Face, to measure your model’s performance. This provides developers with valuable insights to improve and refine the application.
  • Setting testing baselines & simple rating systems: Project Moonshot simplifies model testing with curated benchmarks scored on a graded scale. This empowers users to make informed decisions based on clear and intuitive test results, enhancing the reliability and trustworthiness of their AI models.
  • Enabling manual and automated red-teaming: Project Moonshot facilitates manual and automated red-teaming, incorporating automated attack modules based on research-backed techniques to test multiple LLM applications simultaneously.
  • Reduce testing & reporting complexity: Project Moonshot streamlines testing processes and reporting, seamlessly integrating with CI/CD pipelines for unsupervised test runs and generating shareable reports.
  • Customisation for your unique application needs: Recognising the diverse needs of different applications, Project Moonshot’s Web UI guides users to identify and run only the most relevant tests, to optimise the testing process. Users can also tailor their tests with custom datasets, to evaluate their models for their unique use cases.

Project Moonshot reflects Singapore’s commitment to address Artificial Intelligence (AI) risks through principled approaches, practical tools, and inclusive international engagement.

About the tool


Developing organisation(s):



Objective(s):


Country of origin:


Lifecycle stage(s):


Type of approach:




Target groups:




Programming languages:


Tags:

  • biases testing
  • large language model
  • open source
  • performance
  • robustness
  • model risk management
  • safety

Modify this tool

Use Cases

There is no use cases for this tool yet.

Would you like to submit a use case for this tool?

If you have used this tool, we would love to know more about your experience.

Add use case
catalogue Logos

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.