AI Policy Toolkit – Design and system documentation

Access detailed documentation about the methodology and underlying framework.

Motivation & Project overview

Problem statement

Governments need to assess where their AI policy landscape stands, identify policy priorities, and find concrete, comparable examples from other countries to inform their next steps. Existing diagnostic tools are often too high-level to guide implementation, too resource-intensive to complete, or insufficiently tailored to national context and capacity.

At the same time, the volume and complexity of AI policy documents have grown rapidly, making it difficult for policymakers to locate trusted, relevant information quickly.

What the AI Policy Toolkit is

The AI Policy Toolkit is an interactive platform on OECD.AI. It helps governments to:

1. Assess their current AI policy landscape and identify policy priorities.

2. Discover relevant global policy examples tailored to their priorities and context.

The Toolkit comprises three integrated components:

Component | Description
1 | Lightweight, modular questionnaire mapping existing national AI policy options and identifying priorities
2 | AI-powered module surfacing relevant policy examples based on questionnaire results, priorities, and regional context
3 | Implementation Factsheets providing structured implementation guidance [forthcoming]

The participatory process with governments and partners is summarised in the section: How the Toolkit was developed.

For more information about what the Toolkit is and is not, please review our frequently asked questions.

Target audience

Primary users include:

  • Development agencies, international partners, and research institutions supporting countries building AI governance capacity.
  • National policymakers and government officials responsible for AI strategy, regulation, or oversight.
  • OECD national contact points and government delegates engaged through OECD/GPAI processes.

Example use cases

Case 1. A Ministry of Digital Affairs from an emerging economy is updating its national AI strategy. The team uses the questionnaire to map institutional measures under the “Governance and institutions” and “Human capital and education” pillars. In one session, they find an AI strategy and national AI body in place, but no formal oversight mechanism or AI skills programme for public sector workers. These gaps feed into Component 2, where the team retrieves examples of oversight mechanisms and upskilling programmes from the same UN regional group as the user’s country. The team can reach out to the responsible organisation or the relevant national contact point for bilateral coordination. 

Case 2. A Ministry of Science and Technology seeks regional examples of building national AI compute infrastructure. The team selects the “Energy, connectivity, and compute” pillar and identifies policy priorities based on their current physical and digital infrastructure maturity. The system surfaces relevant policy examples for each policy priority. The team selects two documents for deeper exploration, using the in-platform query interface to ask about institutional mandate, funding, and reporting lines, with responses shown alongside the relevant document pages.

What this Toolkit adds

The Toolkit is:

  • Designed for all economies: it is flexible across diverse governance capacities and contexts. Advanced economies with established policy priorities can bypass the full questionnaire and access policy examples directly.
  • Lightweight and modular: the questionnaire (Component 1) is less resource-intensive than existing tools and facilitates cross-ministerial participation; pillar-based modules allow teams to target their strategic focus first.
  • Focused on implementation: beyond diagnostics, Component 2 surfaces policy options and examples tailored to capacity and priorities. Where appropriate, it also highlights the responsible organisations or contact points for bilateral coordination. Implementation Factsheets (Component 3) will extend this with structured implementation guidance once published [forthcoming].
  • Evidence-based and traceable: outputs are linked to OECD publications and official government documents from the AI Policy Navigator.

Compared with general-purpose AI tools, its AI question-answering component additionally:

  • Remains non-prescriptive: factual summaries and examples without ranking, grading, or evaluating policy effectiveness.
  • Relies on trusted policy sources.
  • Returns information at document level rather than as aggregate summaries.
  • Lets users control source selection and verify responses against originals. The source documents and AI-generated responses are displayed side by side. Responses include page links with bounding boxes that highlight exactly where the answers are grounded.

How the Toolkit was developed

The Toolkit was developed through participatory design with governments and partners across regions, not as a top-down OECD checklist. This section summarises that process. Detailed evidence is in Annex A.

  • Asynchronous written input (November-December 2025) from OECD/GPAI delegates.
  • Regional co-creation workshops (August 2025-March 2026) in Africa, Asia-Pacific, and Latin America and the Caribbean: pillar and policy-option validation, usability feedback, and rapid prototyping (including interface feedback in March 2026). Formal end-to-end beta testing was not conducted before the first release.
  • UX interviews (January-February 2026): 14 sessions with government officials and development agency representatives across four regions.

Insights from interviews and workshops are reported under the Chatham House Rule: themes may be used, but individual countries are not named. Regions and participant types may be cited.


Component 1: Identify relevant policy priorities (questionnaire)

What it is

The questionnaire helps countries take stock of their AI policy landscape and identify gaps and priority areas. Users progress through four steps:

1. Select country: the country whose policy landscape is being assessed.

2. Select pillar: one or more of the six policy pillars.

3. Answer questions: complete the questionnaire for the selected pillar(s).

4. Review result: a structured summary of existing measures and identified gaps. The summary maps which policy options within each pillar are in place, partially in place, or not in place, and produces a prioritised list of policy areas requiring further action, based on the gap pattern and, where applicable, user input on strategic priorities.

The questionnaire can be completed partially (pillar by pillar) or collaboratively across ministries, with progress saved for later.

Objectives and rationale

The questionnaire is designed to:

  • Support collaborative workflows where different ministries contribute to different pillars.
  • Provide a structured, evidence-grounded starting point for governments without a recent baseline review of their AI policy landscape.
  • Identify specific policy gaps across each of the six pillars.
  • Generate a policy profile that directs users to relevant examples in Component 2. The profile is transparent, user-editable, and non-prescriptive throughout.

Link to full questionnaire

Structure of the questionnaire

The questionnaire is organised in six pillars (Table 1). A seventh pillar on sectoral implementation is planned for future iterations, building on collaboration with OECD Directorates developing sector-specific tools (e.g. ELS’s AI Health Policy Checklist, GOV’s “Governing with AI” framework, and CFE’s smart cities tool).

Table 1. Questionnaire pillars

Pillar | Areas covered
Governance and institutions | AI strategy, dedicated bodies, institutional mandates, oversight mechanisms, and international engagement
Research, investment, and commercialisation | Public R&D funding, research infrastructure, startup support, private investment, technology transfer, and export capacity
Energy, connectivity, and compute | Compute availability and access, power supply, broadband connectivity, environmental standards, and infrastructure policy
Data governance | Data governance frameworks, open data infrastructure, interoperability, cross-border data flows, and AI-ready datasets including in local languages
Adoption and use | AI deployment support for SMEs and sectors, responsible deployment guidance, public sector AI use, transparency and accountability requirements, and impact measurement
Human capital and education | AI literacy, curriculum integration, workforce upskilling, talent management, labour market monitoring, and international mobility

Notable design decisions

Design rationales below draw on asynchronous written input (November-December 2025, from OECD/GPAI countries), regional co-creation workshops, and UX interviews (Annex A).

Why pillars rather than a principles-first questionnaire

The current structure reflects how governments plan and govern AI in practice.

The questionnaire is organised around six operational pillars rather than a principles-first layout; each pillar is tagged to the relevant OECD AI Principles directly on the platform. This structure was a deliberate design choice to signal that the platform is not intended as a compliance audit. Workshop and interview feedback consistently called for a framing that guides governments without enforcing strict compliance or ranking them against a single international standard.

This structure reflects how governments actually plan and govern AI in practice. National AI strategies are typically organised around operational pillars rather than abstract concepts. For instance, one Southeast Asian national AI strategy sets out six pillars spanning human resources, data and infrastructure, adoption, sectoral application, ethics, and R&D, while another focuses on regulation, infrastructure, human resources, sectors, and adoption. A pillar-based layout directly matches these existing ministerial remits and strategy workflows. In contrast, organising by abstract principles, such as the rule of law, transparency, and accountability, makes it unclear where specific initiatives and mechanisms should sit within a government’s division of labour.

Adopting an operational pillar approach also resolves several structural and regional challenges that emerged while testing the questionnaire. A principles-first draft created significant ambiguity about where specific questions belonged; for example, it was unclear whether regulatory sandboxes fit under governance or the policy environment, or whether open data initiatives belonged under R&D or a dedicated chapter on data. Pillars reduce this duplication and eliminate disagreement over placement. This framing was strongly supported by African workshop participants, who found that broad “inclusive ecosystem” language was much easier to operationalise when explicitly split into concrete modules like compute and connectivity, data, adoption, and R&D.

Finally, user research confirmed that mapping national plans to the OECD AI Principles was not the primary pain point for governments, as policymakers generally find that exercise straightforward. The more pressing need is operational: governments are looking for a practical framework to scan what policies are currently in place and identify what is missing.

Choice of policy options

Within each pillar, policy options represent discrete, identifiable government actions (e.g. “national AI strategy adopted”, “dedicated AI regulatory body established”, “AI skills programme for public sector workers”). Options were derived from:

  • OECD analytical work on AI policy implementation.
  • Submissions to the OECD.AI Policy Navigator across 80+ jurisdictions.
  • Insights from co-creation workshops and user interviews on common gaps and priority areas (see Annex A).

Options are factual and non-normative: the questionnaire records what is in place and what is missing, without prescribing what should be done.

Question scope and wording

Question coverage follows asynchronous feedback from GPAI/OECD member countries and regional co-creation workshops (see Annex A.2):

  • Resource constraints and policy maturity. Some questions are conditional so economies are not asked about measures that assume capacity they have not yet built (requested in Asia-Pacific and MENA workshops).
  • Definitions. Terms such as “AI research centre” or “complementary financing” are defined in the questionnaire where delegates requested clarity (notably from European written input).
  • Scope and target population. Questions state who a measure applies to where that was ambiguous (public vs private sector, developers vs deployers), reflecting European delegate feedback.

Illustrative policy options in the final questionnaire include: AI research centres aligned with national priorities; research on legal and societal aspects of AI; frictionless data sharing across government; market and regulatory barriers to cloud access for private organisations (including SMEs); shared national AI research infrastructure; and tracking of complementary financing for the national AI strategy.

Response options

The Toolkit intentionally avoids the use of numeric Likert scales, which can be difficult to interpret in this context and inadvertently suggest a scoring or ranking system.

Additionally, the design moves beyond a restrictive yes/no framework. Based on direct feedback from workshops and written delegate input, including a European delegate who noted that “governance answers need nuance beyond yes/no”, there was a clear demand to capture degrees of partial implementation. Consequently, the questionnaire uses the following response options: “No/Don’t know”, “Planned”, “In progress”, “Yes/Implemented”.
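
As a minimal sketch of how these four response options could feed the result summary described for the questionnaire (in place, partially in place, not in place), the example below groups answers by status. The mapping from response option to status, the ordering of gaps, and all function and variable names are assumptions made for illustration; the source does not publish this logic.

```python
from collections import defaultdict

# Assumed mapping from response option to summary status (illustrative only).
STATUS_BY_RESPONSE = {
    "Yes/Implemented": "in place",
    "In progress": "partially in place",
    "Planned": "partially in place",
    "No/Don't know": "not in place",
}


def build_profile(responses):
    """Group policy options by status, keeping questionnaire order.

    `responses` is a list of (policy_option, response) pairs in the order
    the questions were asked.
    """
    profile = defaultdict(list)
    for option, response in responses:
        profile[STATUS_BY_RESPONSE[response]].append(option)
    return dict(profile)


def list_gaps(responses):
    """Return options not fully in place, 'not in place' first (assumed rule)."""
    profile = build_profile(responses)
    return profile.get("not in place", []) + profile.get("partially in place", [])


answers = [
    ("National AI strategy adopted", "Yes/Implemented"),
    ("Dedicated AI regulatory body established", "In progress"),
    ("AI skills programme for public sector workers", "No/Don't know"),
]
print(list_gaps(answers))
```

Because the profile is described as user-editable and non-prescriptive, a real implementation would presumably let users reorder or remove items from such a list rather than treat it as a score.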

Question order

Questions within each pillar run from foundational measures (strategy, mandate) to more advanced ones (oversight, enforcement, evaluation). This ordering yields a useful partial profile even when not all questions are answered, and reflects sequencing commonly observed in national AI policy development among co-creation workshop regions (see Annex A.2).


Component 2: Find examples relevant to your policy priorities

What it is

An AI-powered module that helps countries discover relevant global AI policy examples aligned with national priorities and context. Users progress through three steps:

1. Select pillar: a policy pillar of interest.

2. Select policy option: the specific policy area for which examples are needed.

3. View results: context-specific policy examples from verified OECD sources.

Users may also enter Component 2 directly, without completing the questionnaire first.

Objectives and rationale

Component 2 is designed to:

  • Save policymakers time by surfacing relevant, trusted, country-level examples that would otherwise require extensive manual searching.
  • Support evidence-based decisions by anchoring outputs to trusted and verifiable sources.
  • Mitigate risks of general-purpose AI tools (reduce hallucination, improve source traceability, and flag outdated information by constraining retrieval to verified collections and supporting side-by-side verification).

Data

Sources used to retrieve policy examples

The Toolkit draws on the following document collections:

Source | Type | Approximate count | Ingestion frequency | Notes
OECD.AI Policy Navigator | Structured policy initiative records + PDFs | 2297 initiatives and 1091 documents across 80+ jurisdictions | Monthly (TBC) | Submitted by OECD national contact points; covers AI policies across six thematic pillars. Continuous integration with the Toolkit is planned.
OECD publications (oecd.org) | Research reports and publications (PDF) | 1808 reports | Monthly | Filtered by AI-related keywords; English and French
OECD Wonk (AI-related blog posts) | Short-form analysis (HTML/text) | 58 posts | Monthly |

Data maintenance

Short-term manual augmentation: ahead of the first release, a focused window ensured each policy option is supported by a minimum set of relevant examples across regions, including targeted collection and upload of missing initiatives into the Policy Navigator.

Policy Navigator: high-frequency updates. Countries can submit new policy initiatives through an updated submission interface; submissions appear immediately on the platform. OECD.AI analysts and national contact points (NCPs) review and validate entries. NCP data validation runs twice yearly.

OECD publications (oecd.org) subset: filtered on publication date and AI-related keywords; updated semi-annually by re-running the ingestion pipeline.

Quality control: the ingestion pipeline de-duplicates documents, filters corrupt files, and segments content into chunks with metadata (title, year, source type, country). See the AI component section for the ingestion-to-embedding pipeline.
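
The quality-control step can be illustrated with a small sketch. It assumes that duplicates are detected by hashing normalised text and that an empty extraction stands in for a corrupt file; both choices, and the names `fingerprint` and `clean_corpus`, are hypothetical and not taken from the Toolkit's pipeline.

```python
import hashlib


def fingerprint(text):
    """Hash whitespace-normalised, lower-cased text so trivial variants collide."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def clean_corpus(documents):
    """Drop empty (e.g. unreadable) documents and exact duplicates."""
    seen = set()
    kept = []
    for doc in documents:
        text = doc.get("text", "")
        if not text.strip():
            continue  # treated here as a corrupt or empty extraction
        key = fingerprint(text)
        if key in seen:
            continue  # same content already kept
        seen.add(key)
        kept.append(doc)
    return kept


docs = [
    {"title": "Strategy A", "year": 2024, "source_type": "navigator",
     "country": "X", "text": "National AI strategy."},
    {"title": "Strategy A (copy)", "year": 2024, "source_type": "navigator",
     "country": "X", "text": "national  AI strategy. "},
    {"title": "Broken file", "year": 2023, "source_type": "publication",
     "country": "Y", "text": ""},
]
print([d["title"] for d in clean_corpus(docs)])
```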

Notable design decisions

Country similarity and grouping

When users retrieve policy examples, results are grouped by the user’s UN regional group. Examples are prioritised from the same regional group as the user’s country. A global view (all regional groups) is also available. Users can override regional filters and explore examples from any jurisdiction.

Presentation of policy examples

Examples come from the OECD.AI Policy Navigator, which collects voluntary submissions from government contact points. The OECD Secretariat does not vet submissions for effectiveness. Before surfacing in the Toolkit, candidates pass automated retrieval and multi-judge validation, then manual review where needed. Examples are therefore presented as factual descriptions of what other countries have done, without grading, ranking, or endorsement, and without constituting official OECD recommendations.

Component 3 (Implementation Factsheets, [forthcoming]) will complement this by summarising high-level good practice for each policy priority, based on prior OECD analytical work.

Single-document retrieval

Multi-document summaries can obscure context, hinder verification, and increase the risk of irrelevant or inaccurate outputs. Drawing on the OECD AI Principles and evidence from prior user studies, Component 2 uses a two-step pattern: 1) Identify the most relevant document (user-controlled retrieval scope). 2) Query within that document, with responses shown alongside source pages for direct verification.

This design prioritises user control, document-specific context, and verifiability over convenience. This is a deliberate risk-reduction choice for high-stakes policy work.

AI component

The AI component uses a Retrieval-Augmented Generation (RAG) architecture.

1. Ingestion: documents are scraped from the OECD.AI Policy Navigator, OECD publications, and Wonk posts. Each document receives structured metadata (title, year, country, source type, policy areas, implementation details). PDFs are converted to text using PyMuPDF (selected for speed, accuracy, and active maintenance).

2. Chunking: text is segmented with spaCy’s NLP model (en_core_web_sm) at linguistic boundaries (sentences and paragraphs) to preserve grammatical coherence. This approach was selected after systematic evaluation of nine chunking methods (recursive character, token-based, semantic, page-level, and spaCy-based) on OECD policy documents.

3. Embedding: each chunk is embedded using semantic contextual embedding, which combines the chunk text with metadata prefixing (structured attributes such as theme, responsible organisations, and country). This improves retrieval accuracy for policy-domain queries beyond standard semantic similarity.

4. Retrieval and human-in-the-loop quality assurance: after metadata filtering, the embedding of the user’s query is matched against stored chunk embeddings by semantic similarity. Top candidate documents are scored through multi-judge LLM validation and, where flagged, reviewed by OECD.AI analysts (see Annex B.1). Results are ranked and surfaced within defined caps.

5. Generation: upon document selection, the system performs within-document RAG, generating a response grounded in the selected document’s chunks and displaying it alongside the relevant pages.
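
To make steps 2 and 3 more concrete, the sketch below packs whole sentences into chunks and then prefixes each chunk with structured metadata before it would be embedded. It is a simplified stand-in: a naive regular expression replaces spaCy's sentence segmentation so that the example runs without a model download, and `max_chars`, the prefix format, and the function names are assumptions rather than the Toolkit's actual settings.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (stand-in for spaCy's sentence segmentation)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_text(text, max_chars=200):
    """Pack whole sentences into chunks; never cut inside a sentence."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        for sentence in split_sentences(paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:  # a paragraph break always closes the chunk
            chunks.append(current)
            current = ""
    return chunks


def with_metadata_prefix(chunk, metadata):
    """Prefix structured attributes so they are embedded with the text."""
    header = " | ".join(f"{key}: {value}" for key, value in metadata.items())
    return f"[{header}]\n{chunk}"


sample = "First sentence. Second sentence.\n\nNew paragraph here."
for chunk in chunk_text(sample, max_chars=20):
    print(with_metadata_prefix(chunk, {"country": "X", "theme": "compute"}))
```

The string returned by `with_metadata_prefix` is what would be passed to the embedding model, so that attributes such as country and theme influence similarity alongside the chunk text.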

Pipeline overview

        
    Policy Navigator / OECD publications   
     │   
     ▼   
     Ingestion & embedding   
     │   
     ▼   
     Metadata labelling (policy priority, pillar, sector)   
     │   
     ▼   
     Hybrid retrieval (semantic similarity + metadata filters + diversity)   
     │   
     ▼   
     Multi-judge LLM validation (all candidate pairs)   
     │   
     ▼   
     Policy-team manual review & revision (where flagged)   
     │   
     ▼   
     Surfaced in Component 2 (validated, capped result sets)   
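
The hybrid retrieval stage in this overview can be sketched as a metadata filter followed by cosine similarity over chunk vectors. This is an illustration under stated assumptions: vectors are passed in as plain lists rather than produced by an embedding model, the `min_score` value is invented (the real threshold is not published), the diversity element is omitted, and the field names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, pillar, min_score=0.5, limit=50):
    """Filter chunks by pillar metadata, then keep the best chunk per document."""
    best = {}
    for chunk in chunks:
        if chunk["pillar"] != pillar:
            continue  # metadata filter
        score = cosine(query_vec, chunk["vector"])
        if score >= min_score and score > best.get(chunk["doc_id"], -1.0):
            best[chunk["doc_id"]] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


chunks = [
    {"doc_id": "A", "pillar": "Data governance", "vector": [1.0, 0.0]},
    {"doc_id": "A", "pillar": "Data governance", "vector": [0.6, 0.8]},
    {"doc_id": "B", "pillar": "Data governance", "vector": [0.0, 1.0]},
    {"doc_id": "C", "pillar": "Adoption and use", "vector": [1.0, 0.0]},
]
print(retrieve([1.0, 0.0], chunks, pillar="Data governance"))
```

Documents returned here are only candidates: in the pipeline above they would still pass through multi-judge validation and, where flagged, manual review before being surfaced.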
        

Models and providers

All OECD.AI-hosted LLM calls for the Toolkit use Microsoft Azure OpenAI Service. The in-product interface displays the model used for live response generation. The table below reflects choices of models for the first release.

Function | Model(s) | Provider
Document ingestion (text summaries) | gpt-4.1-nano | Azure OpenAI
Document ingestion (PDF vision extraction) | gpt-4o | Azure OpenAI
Chunk embeddings (retrieval index) | text-embedding-3-small | Azure OpenAI
Metadata labelling (policy priority / pillar classification) | gpt-4o-mini | Azure OpenAI
Live within-document Q&A (Component 2) | gpt-4.1-mini | Azure OpenAI
LLM judges to validate policy example retrieval (see Annex B.1) | gpt-4o-mini, gpt-5.4-mini, o4-mini | Azure OpenAI

Models are selected on performance, safety, latency, and cost efficiency within OECD.AI Azure allocations.

Prompt library to retrieve documents

Policy priorities in the questionnaire are paired with OECD.AI-maintained prompts that define each policy area in neutral, diplomatic language. These prompts are used when retrieving and validating documents for each policy priority, and when interpreting user queries in Component 2. 

Response generation

The model generates responses grounded exclusively in the selected document’s content. Input filters refuse prompt injections and queries unrelated to AI policy.

Evaluation method and criteria

Evaluation spans two strands: policy-example retrieval quality (hybrid retrieval with LLM judges and policy-team review), and live RAG behaviour (within-document answering, safety, and automated scoring) (see Annex B.2).


Planned next steps

The Toolkit is a long-term, continuously iterated digital product. Priorities below apply across components unless noted.

1. Expand policy examples in the OECD.AI Policy Navigator, and augment coverage with additional initiatives across regions and policy options. The collection should target where the Toolkit lacks sufficient examples (particularly for lower-income and non-OECD economies).

2. Reflect Navigator updates (including new submissions) in the Toolkit without delay, by automating synchronisation through the full retrieval and validation pipeline (hybrid retrieval, multi-judge validation, OECD analyst review where required, and sync to the live index).

3. Identify, explore, and potentially include additional and complementary sources of trusted AI policy data.

4. Iteratively improve the platform based on user feedback.

Annex A

Annex A.1 and A.2 provide detailed evidence supporting the section: How the Toolkit was developed. Questionnaire and pillar rationales are in Component 1.

Annex A.1 Participatory design and design decisions

Summary of participants

The Toolkit was developed through structured user research: 14 UX interviews with government officials and development agency representatives, conducted between January and February 2026.

Region | Participants | Formats
Asia-Pacific (including ASEAN) | 6 | Mix of live interviews and written responses
Latin America and the Caribbean | 5 | Mix of live interviews and written responses
Europe (national ministries, EU body, development agency) | 2 | Live interviews
Middle East | 1 | Written response

Participants spanned digital economy ministries, science and technology agencies, telecommunications regulators, competition authorities, trade ministries, and international development partners.

Key insights from user interviews

The following themes emerged consistently and informed Toolkit design:

Preference for per-document over aggregated summaries. A strong majority preferred per-document summaries that preserve source context and allow selective deep reading. A minority (including one trade-focused ministry in Latin America) preferred instant consolidated outputs; the platform accommodates this through optional direct access to policy examples.

Advanced and emerging economies have different primary needs. Emerging economies valued the questionnaire most for identifying gaps and knowing where to start. Advanced economies in Europe and Asia-Pacific sought operational detail, such as implementation mechanics, enforcement, and evaluation, rather than high-level descriptions. The Toolkit’s two entry paths (questionnaire-first vs quick retrieval of relevant policy examples) reflect this split.

Knowing where to start is a major barrier. Interviewees from Latin America and Asia-Pacific described challenges over which policy gap to tackle first. This validated the questionnaire as a structured starting point and the priority-ordering logic in its output.

Operationalisation is harder than drafting. Across regions, the hardest step was translating high-level strategy into concrete, implementable actions, particularly with limited technical, legal, and budgetary capacity. Policy assistance tools must address how countries implemented measures, not only what they did.

Context and comparability outweigh volume. Interviewees valued geographically or institutionally comparable examples over comprehensive but context-free multi-document summaries.

Sector-specific guidance is a clear gap. Participants in Asia-Pacific and MENA noted that cross-cutting guidance is relatively well served, whereas sector-specific best practices (healthcare, agriculture, education) are harder to find. This informed the planned addition of a sectoral pillar in future iterations of the Toolkit.

Trust in AI tools depends on traceability and human oversight. Many of the interviewees were already using existing AI chat interfaces to do exploratory research. For the Toolkit to be considered as a complementary tool, users preferred an interface that features clear source traceability, human verification, and support for, rather than the replacement of, professional judgment. These elements served as critical acceptance criteria for workplace AI use and high-stakes policy work. Unacceptable errors included incorrect legal or regulatory information, fabricated citations, outdated data, and biased outputs that could impact fundamental rights. These findings shaped the Toolkit’s front-end design, including the integration of a side-by-side verification panel.

Annex A.2 Regional co-creation workshops

Summary of workshops

The Toolkit was developed through regional co-creation workshops:

Workshop | Region | Date | Focus
Southeast Asia workshop 1 | Asia-Pacific | August 2025 | Content validation: pillars, policy options
Latin America and Caribbean workshop | Latin America and the Caribbean | December 2025 | Content validation: pillars, policy options
Africa workshop | Africa | March 2026 | Content validation; usability feedback from rapid prototyping
Southeast Asia workshop 2 | Asia-Pacific | March 2026 | Content validation; usability feedback from rapid prototyping

Workshops brought together government representatives, OECD national contact points, and (where applicable) regional partner organisations. Feedback covered pillar structure, policy option coverage, and platform usability and accessibility. Participants also shared an overview of regional policy challenges and the policy landscape to inform both the questionnaire and the knowledge base of the Toolkit.

Key insights from workshops

Asia-Pacific: Validated the pillar structure but identified a gap in sector-specific guidance. Feedback on Component 2 emphasised showing recent examples and indicating publication year.

Africa: Strong demand for practical, implementation-ready guidance given limited technical capacity. Participants emphasised acknowledging resource constraints in recommendations. Content validation confirmed the relevance of the six-pillar structure.

Latin America and the Caribbean: Requested the tool to guide rather than evaluate; unanswered areas from the questionnaire should not imply failure. Collaborative, iterative completion was positively received.

Annex B

Annex B.1 Retrieval validation and manual revision workflow

Retrieval rules and surfacing caps

Candidate documents are retrieved per (policy priority, region) using hybrid retrieval (metadata filtering plus semantic similarity on document chunks). Relevance is then determined in two stages. The retrieval filter is a high-recall step designed to cast a wide net across available data. Substantive relevance is then evaluated by a panel of three LLM judges, supplemented by OECD analyst review when necessary. Documents that fail this validation are excluded, even if they passed the initial retrieval screen.

Once the relevant documents are screened and approved, they are organised using a multi-tiered ranking system. Documents are first sorted by their assigned relevance band, followed by a preference for matching the user’s geographic region. From there, the system prioritises publication recency, country diversity, and initiative diversity. In cases where multiple documents meet all these criteria equally, a final tie-break is applied based on exact text similarity.
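
One way to read this multi-tiered ranking is as a greedy, lexicographic selection in which the diversity tiers count how often a country or initiative type has already been placed. The sketch below follows that reading; the band labels, field names, and the way diversity is counted are assumptions for illustration, not the Toolkit's published logic.

```python
from collections import Counter

BAND_ORDER = {"high": 0, "medium": 1}  # hypothetical relevance band labels


def rank_examples(docs, user_region):
    """Greedy lexicographic ranking following the tiers described above."""
    remaining = list(docs)
    ranked = []
    countries, initiatives = Counter(), Counter()
    while remaining:
        best = min(
            remaining,
            key=lambda d: (
                BAND_ORDER[d["band"]],              # 1. relevance band
                d["region"] != user_region,         # 2. same region first
                -d["year"],                         # 3. most recent first
                countries[d["country"]],            # 4. country diversity
                initiatives[d["initiative_type"]],  # 5. initiative diversity
                -d["similarity"],                   # 6. final tie-break
            ),
        )
        remaining.remove(best)
        ranked.append(best)
        countries[best["country"]] += 1
        initiatives[best["initiative_type"]] += 1
    return ranked


docs = [
    {"id": 1, "band": "high", "region": "Asia-Pacific", "year": 2024,
     "country": "A", "initiative_type": "strategy", "similarity": 0.8},
    {"id": 2, "band": "high", "region": "Asia-Pacific", "year": 2024,
     "country": "A", "initiative_type": "strategy", "similarity": 0.9},
    {"id": 3, "band": "high", "region": "Asia-Pacific", "year": 2024,
     "country": "B", "initiative_type": "programme", "similarity": 0.7},
    {"id": 4, "band": "high", "region": "Africa", "year": 2025,
     "country": "C", "initiative_type": "strategy", "similarity": 0.95},
    {"id": 5, "band": "medium", "region": "Asia-Pacific", "year": 2025,
     "country": "D", "initiative_type": "strategy", "similarity": 0.99},
]
print([d["id"] for d in rank_examples(docs, "Asia-Pacific")])
```

In this example document 3 is placed ahead of document 1 despite lower similarity, because country A is already represented once document 2 has been selected.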

The rules below summarise what is surfaced in Component 2 (as of the first release; thresholds may be tuned in later releases):

Rule | Objective
Initial retrieval filter | A wide embedding-similarity screen so potentially relevant documents are not discarded before review. The exact threshold is tuned internally and not published; cosine scores are easy to misread as “percent relevant.” Passing this step does not mean a document is shown to users.
Candidates considered per priority and region | Up to 50 ranked internally for validation and analyst review
Maximum surfaced – global (ALL) | Up to 10 examples per policy priority on the global view (upper limit, not a target to fill)
Maximum surfaced – regional | Up to 3 examples per policy priority, prioritising policy examples from the same UN regional group (upper limit, not a target to fill)
Cross-region fallback | When no in-region document passes validation, examples may be drawn from the global (ALL) pool only, not by mixing regions arbitrarily
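
The caps and the cross-region fallback can be sketched as follows. This is one possible reading of the rules above: the regional view is assumed to show only in-region examples when any exist, and the fallback is assumed to keep the regional cap. Neither detail is confirmed by the source, and the names are hypothetical.

```python
GLOBAL_CAP = 10    # "Up to 10 examples" on the global (ALL) view
REGIONAL_CAP = 3   # "Up to 3 examples" on the regional view


def surface(ranked_docs, user_region=None):
    """Apply the caps to an already ranked, validated list.

    `user_region=None` means the global (ALL) view.
    """
    if user_region is None:
        return ranked_docs[:GLOBAL_CAP]
    in_region = [d for d in ranked_docs if d["region"] == user_region]
    if in_region:
        return in_region[:REGIONAL_CAP]
    # Cross-region fallback: draw from the global pool as a whole
    # (assumed here to keep the regional cap).
    return ranked_docs[:REGIONAL_CAP]


pool = [
    {"id": i, "region": "Africa" if i % 2 else "Asia-Pacific"}
    for i in range(12)
]
print([d["id"] for d in surface(pool, "Africa")])
```

Both caps are upper limits, so a policy priority with a single validated example would simply return that one document.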

Multi-judge validation

Every (policy priority, retrieved document) pair that passes retrieval screening is scored by three independent LLM judges. Judges receive the policy-priority prompt, the policy option label, and the top retrieved text chunks from the document, in addition to structured document metadata.

Band | Judge agreement | Triggered action
Auto-pass | ≥ 2 of 3 judges answer Yes (confidence ≥ 0.70) | Surfaced as policy examples in the Toolkit
Flag for review | 1 or 2 of 3 judges answer Yes (confidence 0.30-0.70) | Flagged for review by OECD.AI analysts
Auto-drop | 0 of 3 judges answer Yes (confidence < 0.30) | Excluded from Toolkit results

Judge prompts are written to be conservative: a document is marked Yes only if it substantively addresses the policy priority, reducing false positives at the cost of more borderline cases sent for human review. The judges are not yet calibrated against a fixed human-labelled benchmark; calibration from policy-team review is planned for future iterations.
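
The banding rules can be sketched as a small function. The table does not state how the three judges' confidence values are combined, so the example assumes the mean confidence of the Yes votes and treats any case that is neither a clear pass nor a unanimous No as flagged; these are illustrative assumptions, and `assign_band` is a hypothetical name.

```python
def assign_band(votes):
    """Map three judge verdicts to a band.

    `votes` is a list of (answer, confidence) pairs, answer "Yes" or "No".
    """
    yes_conf = [conf for answer, conf in votes if answer == "Yes"]
    if not yes_conf:
        return "auto-drop"  # 0 of 3 judges answer Yes
    mean_conf = sum(yes_conf) / len(yes_conf)  # assumed aggregation
    if len(yes_conf) >= 2 and mean_conf >= 0.70:
        return "auto-pass"
    return "flag for review"  # borderline cases go to analysts


print(assign_band([("Yes", 0.9), ("Yes", 0.8), ("No", 0.6)]))
print(assign_band([("Yes", 0.6), ("Yes", 0.65), ("No", 0.9)]))
print(assign_band([("No", 0.9), ("No", 0.8), ("No", 0.7)]))
```

Routing every ambiguous case to review matches the conservative intent described above, at the cost of more documents reaching analysts.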

Human-in-the-loop review and calibration

Following automated retrieval and validation, OECD.AI analysts manually review any flagged documents. Reviewers can approve a flagged document, replace it with a more suitable alternative from the corpus, or remove it entirely. Once analysts sign off, the validated results synchronise with the Component 2 retrieval index, and any newly introduced documents are automatically re-scored before appearing in the user interface.

For future quality assurance and system improvement, every manual edit is strictly logged. Cases where a human decision contradicts the majority vote of the LLM judges are recorded as calibration pairs. While manual review currently serves as the final arbiter for surfacing examples, these pairs will be used in future iterations of the Toolkit to fine-tune judge prompts and adjust confidence thresholds.
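
A calibration pair, as described above, is a case where the analyst decision contradicts the judges' majority vote. The sketch below extracts such pairs from a review log. The log structure and field names are hypothetical, and treating only "approve" as a relevance judgement (with "replace" and "remove" counted as not relevant) is an assumption.

```python
def calibration_pairs(review_log):
    """Collect cases where the analyst decision contradicts the judge majority."""
    pairs = []
    for entry in review_log:
        yes_votes = entry["judge_answers"].count("Yes")
        judges_say_relevant = yes_votes >= 2  # majority of three judges
        analyst_says_relevant = entry["analyst_decision"] == "approve"
        if judges_say_relevant != analyst_says_relevant:
            pairs.append(
                {
                    "doc_id": entry["doc_id"],
                    "judge_majority": judges_say_relevant,
                    "analyst_decision": entry["analyst_decision"],
                }
            )
    return pairs


log = [
    {"doc_id": "d1", "judge_answers": ["Yes", "Yes", "No"], "analyst_decision": "remove"},
    {"doc_id": "d2", "judge_answers": ["Yes", "No", "No"], "analyst_decision": "approve"},
    {"doc_id": "d3", "judge_answers": ["Yes", "No", "No"], "analyst_decision": "remove"},
]
print([p["doc_id"] for p in calibration_pairs(log)])
```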

Annex B.2 Evaluation metrics (policy-example retrieval and live RAG)

Policy-example retrieval (LLM judges and manual revision)

Quality is measured on whether each retrieved document is relevant to the policy priority for a given region. Metrics do not assess whether rank 1 is “more correct” than rank 2 or 3 – only substantive relevance.

Metric | What it tells the public
Accepted as relevant | Share of retrieved documents judged substantively relevant to the policy priority after LLM judges and, where needed, OECD analyst review
Flagged for review | Share sent to OECD analysts because judges did not agree confidently on relevance
Human corrections | Share of documents flagged by LLM judges that OECD analysts then corrected
Excluded as not relevant | Share automatically excluded as not substantively on-topic for that policy priority
Systemic gaps (priorities without examples) | Systemic gaps are policy priorities for which the Toolkit has no policy examples yet – because the OECD document corpus contains no substantively relevant material for that priority. These priorities are not included in the Component 1 questionnaire in V1.0. The list is published and updated each release to guide Policy Navigator expansion
Regional representation | For each policy priority and pillar that has examples, which UN regional groups are represented in the approved set and which are not
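
As a rough illustration of how the share-based metrics in this table could be computed from validation records, the sketch below uses a hypothetical record layout (`band`, `accepted`, `corrected`, `region`). It assumes that a "correction" means an analyst replaced or removed a flagged document; the source does not define the calculation at this level of detail.

```python
def retrieval_metrics(records):
    """Compute the share-based metrics from per-document validation records."""
    total = len(records)
    flagged = [r for r in records if r["band"] == "flag for review"]
    return {
        "accepted_as_relevant": sum(r["accepted"] for r in records) / total,
        "flagged_for_review": len(flagged) / total,
        "human_corrections": (
            sum(r["corrected"] for r in flagged) / len(flagged) if flagged else 0.0
        ),
        "excluded_as_not_relevant": sum(r["band"] == "auto-drop" for r in records) / total,
    }


def missing_regions(records, all_regions):
    """Regional representation: regions with no accepted example."""
    present = {r["region"] for r in records if r["accepted"]}
    return sorted(set(all_regions) - present)


records = (
    [{"band": "auto-pass", "accepted": True, "corrected": False, "region": "Africa"}] * 5
    + [{"band": "flag for review", "accepted": True, "corrected": False, "region": "Asia-Pacific"}] * 2
    + [{"band": "flag for review", "accepted": False, "corrected": True, "region": "Africa"}]
    + [{"band": "auto-drop", "accepted": False, "corrected": False, "region": "Eastern Europe"}] * 2
)
print(retrieval_metrics(records))
print(missing_regions(records, ["Africa", "Asia-Pacific", "Eastern Europe"]))
```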

Live RAG pipeline (within-document Q&A)

Component 2 answers are grounded in a single selected document. Metrics are therefore chosen to suit short, single-document Q&A rather than multi-document retrieval.

Dimension | What is measured
Answer quality (RAGAS) | Faithfulness and factual correctness against the retrieved passages from the selected document. Context recall is reported only where the source has enough text for the metric to be meaningful.
Safety – harmful topics | Refusal rate for blocklisted topics
Safety – irrelevant queries | Refusal rate for off-topic queries
Safety – overrefusal | Rate at which legitimate on-topic queries (logged user questions in the overrefusal evaluation set) receive a substantive answer rather than an erroneous refusal; complements the refusal metrics above
Security – jailbreaking | Jailbreak attack detection rate
Security – prompt injection | Prompt injection detection rate
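
The refusal and overrefusal rates in this table are simple proportions over labelled evaluation sets. The sketch below shows the arithmetic; the category labels, the record layout, and the function names are hypothetical, and the RAGAS answer-quality metrics are not reproduced here.

```python
def rate(flags):
    """Share of True values in a list of booleans (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


def safety_report(results):
    """Summarise refusal behaviour per evaluation category.

    Each result has a 'category' and a boolean 'refused'. For the
    'legitimate' category the reported value is the share answered,
    matching the overrefusal row above.
    """
    by_category = {}
    for result in results:
        by_category.setdefault(result["category"], []).append(result["refused"])
    report = {}
    for category, refused in by_category.items():
        if category == "legitimate":
            report["answered_rate_legitimate"] = rate([not r for r in refused])
        else:
            report[f"refusal_rate_{category}"] = rate(refused)
    return report


results = [
    {"category": "harmful", "refused": True},
    {"category": "harmful", "refused": True},
    {"category": "off_topic", "refused": True},
    {"category": "off_topic", "refused": False},
    {"category": "legitimate", "refused": False},
    {"category": "legitimate", "refused": False},
    {"category": "legitimate", "refused": True},
    {"category": "legitimate", "refused": False},
]
print(safety_report(results))
```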