CarefulAI: Prompt-LLM Improvement Method (PLIM)
When working with large language models (LLMs), accuracy is important. However, the co-dependency between LLM outputs and the prompts that produce them is poorly understood. Existing LLM benchmarks do not capture this; they report historical accuracy scores that may not be relevant to the end user. In addition, LLMs are dynamic in practice: their behaviour changes over time, often in ways LLM providers cannot explain. Users can therefore only partially depend upon LLM benchmarks. In practice, to make LLMs fit for purpose and safe, users must continually test Prompt-LLM outputs for their specific cases, which can be time-consuming.
CarefulAI’s approach is based on the finding that serving a model with a standard set of end user-specific question-and-answer examples, validated by the end-user community (each prompt validated by a minimum of three subject matter experts or end users), reduces the time taken to obtain acceptable answers roughly tenfold. Beyond producing Prompt-LLM combinations that are deemed safe, the approach enables sector- and subject-matter prompt benchmarking against multiple models.
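As an illustration of what this benchmarking step could look like in practice, the sketch below scores several Prompt-LLM combinations against a community-validated question-and-answer set. It is a minimal, hypothetical example: the `ValidatedExample` structure, the `keyword_match` acceptance check and the dummy model are assumptions made for illustration, not CarefulAI’s implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ValidatedExample:
    """A community-validated question with an answer deemed acceptable by SMEs."""
    question: str
    acceptable_answer: str


def keyword_match(response: str, acceptable: str) -> bool:
    """Crude stand-in for whatever acceptance check the community agrees on."""
    return all(word in response.lower() for word in acceptable.lower().split())


def benchmark(models: Dict[str, Callable[[str], str]],
              examples: List[ValidatedExample]) -> Dict[str, float]:
    """Score each Prompt-LLM combination against the validated example set."""
    scores = {}
    for name, query_model in models.items():
        passed = sum(
            keyword_match(query_model(ex.question), ex.acceptable_answer)
            for ex in examples
        )
        scores[name] = passed / len(examples)
    return scores


if __name__ == "__main__":
    examples = [ValidatedExample("What should I do if I miss a dose?",
                                 "contact your clinician")]
    # A dummy callable stands in for a real Prompt-LLM call.
    models = {"demo-model": lambda q: "Please contact your clinician or pharmacist."}
    print(benchmark(models, examples))  # {'demo-model': 1.0}
```

Because the same validated set can be replayed against any model, scores of this kind allow like-for-like comparison and can be re-run whenever a provider updates a model.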
PLIM is designed to make benchmarking and continuous monitoring of LLMs safer and more fit for purpose. This is particularly important in high-risk environments, e.g. healthcare, finance, insurance and defence. Having community-based prompts to validate models as fit for purpose is safer in a world where LLMs are not static.
The PLIM method consists of question-and-answer prompts tailored to specific purposes and validated by the community the Prompt-LLM output is intended to serve. These prompts are shared widely across sector leads for validation (in a healthcare context, for example, senior clinicians, NICE and the MHRA). At least three subject matter experts independently validate each prompt, and prompts carry safety case information (in mental health, for example, phrases that would be problematic, such as suicidal ideation phrases, together with the correct responses). Synthetic prompts that mirror the validated interactions are also created to widen the test boundaries; these, too, are validated by the subject matter experts.
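For illustration only, a validated prompt record of the kind described above might be represented as follows. The field names, the three-validator threshold and the example values are assumptions drawn from this description, not from CarefulAI’s own tooling.

```python
from dataclasses import dataclass, field
from typing import List

MIN_VALIDATORS = 3  # minimum number of independent subject matter expert sign-offs


@dataclass
class ValidatedPrompt:
    question: str
    correct_response: str
    # Safety case information, e.g. phrases that must trigger a specific response.
    problematic_phrases: List[str] = field(default_factory=list)
    validators: List[str] = field(default_factory=list)  # SME identifiers
    synthetic: bool = False  # True for SME-validated synthetic variants


def is_admissible(prompt: ValidatedPrompt) -> bool:
    """A prompt only enters the benchmark set once enough distinct SMEs sign off."""
    return len(set(prompt.validators)) >= MIN_VALIDATORS


example = ValidatedPrompt(
    question="Example user message containing a suicidal ideation phrase",
    correct_response="Signpost to crisis support and escalate to a clinician.",
    problematic_phrases=["<suicidal ideation phrase>"],
    validators=["sme-01", "sme-02", "sme-03"],
)
assert is_admissible(example)
```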
Benefits:
- Effective deployment of safer LLM-based AI systems: previously, LLMs were deployed with prompts that were not maintained, so engineering teams could not deploy stable LLM-enabled services because the behaviour of those services could not be guaranteed.
- Reduced risk of incurring costs associated with Prompt-LLM service engineering: by decreasing the amount of re-engineering required.
- Improved alignment of LLM outputs with desired objectives: by having a community of subject matter experts validate prompts alongside LLM engineering teams.
Limitations:
- The effectiveness of this approach depends on the quality of the prompt safety rules and on the experience and availability of subject matter experts prepared to validate question-and-answer sets for the target markets where LLMs are to be deployed. In essence, if LLM engineering teams do not have access to subject matter experts, they are forced into a cycle of prompt re-engineering to deliver a stable Prompt-LLM combination.
- The dependency of LLM performance on prompt types is not yet well understood. The high-profile universities and institutions that publish LLM benchmarks are not set up to manage the risk their benchmarks create of overconfidence in individual models. There will therefore always be a co-dependency between LLM providers and prompt engineering.
Further links:
- Link to the full use case.
- Link to CarefulAI resources on BS 30440.
- Link to ISO/IEC 42001.
- Link to AISI safety cases.
This case study was published in collaboration with the UK Department for Science, Innovation and Technology Portfolio of AI Assurance Techniques. You can read more about the Portfolio and how you can upload your own use case here.
About the tool
Tags:
- responsible AI
- performance
- benchmarking
- llm