The CIDEr (Consensus-based Image Description Evaluation) metric is a way of evaluating the quality of generated textual descriptions of images. The CIDEr metric measures the similarity between a generated caption and the reference captions, and it is based on the concept of consensus: the idea that good captions should not only be similar to the reference captions in terms of word choice and grammar, but also in terms of meaning and content.
The CIDEr metric is computed as follows (a minimal code sketch is given after this list):
1. First, a set of reference captions is provided for each image. These captions serve as the ground truth for the evaluation.
2. The generated caption and each reference caption are represented as sets of n-grams (typically of length 1 to 4).
3. Each n-gram is given a TF-IDF weight: its frequency within the caption, multiplied by an inverse document frequency computed over the reference captions of the whole dataset, so that n-grams appearing in the references of only a few images carry more weight than common ones.
4. For each n-gram length, the cosine similarity between the TF-IDF vector of the generated caption and that of each reference caption is computed and averaged over the reference captions.
5. Finally, these per-length scores are averaged to produce the final CIDEr score. The widely used CIDEr-D variant additionally applies a length penalty and count clipping, and scales the result by a factor of 10.
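To make the steps above concrete, the following is a minimal, self-contained Python sketch of the computation, not the reference implementation: it assumes captions are already tokenized into lowercase word lists and omits stemming as well as the CIDEr-D refinements (length penalty, count clipping, scaling by 10).

```python
# Illustrative sketch only, not the reference CIDEr implementation.
from collections import Counter
from math import log, sqrt

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples of words) in one tokenized caption."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(u, v):
    """Cosine similarity between two sparse TF-IDF vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cider(candidates, references, max_n=4):
    """candidates: one tokenized caption per image.
    references: for each image, a list of tokenized reference captions."""
    num_images = len(references)
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: in how many images' references does each n-gram occur?
        df = Counter()
        for refs in references:
            seen = set()
            for ref in refs:
                seen.update(ngram_counts(ref, n))
            df.update(seen)

        def tfidf(tokens):
            counts = ngram_counts(tokens, n)
            length = sum(counts.values()) or 1
            # Rare n-grams (low document frequency) receive a higher IDF weight.
            return {g: (c / length) * log(num_images / max(df[g], 1.0))
                    for g, c in counts.items()}

        # Average cosine similarity to the references, then over all images.
        per_n = 0.0
        for cand, refs in zip(candidates, references):
            cand_vec = tfidf(cand)
            per_n += sum(cosine(cand_vec, tfidf(ref)) for ref in refs) / len(refs)
        score += per_n / num_images
    return score / max_n
```

Note that the IDF weights are computed over the whole evaluation corpus, so the score is only meaningful when calculated across a full set of images rather than a single caption in isolation.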
The CIDEr metric has become a standard in the field of image captioning and has been used in several benchmark datasets and competitions. It is widely adopted because it evaluates both the language and the content of generated captions, rewarding agreement with the consensus of the human references.
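In practice, evaluations typically rely on an existing implementation rather than re-deriving the metric. The snippet below is a hedged usage sketch that assumes the Cider scorer shipped with the pycocoevalcap (COCO caption evaluation) package and its dictionary-of-captions interface, which may differ between versions.

```python
# Usage sketch assuming the pycocoevalcap package; its interface may vary by
# version, and captions are normally preprocessed with the toolkit's
# PTBTokenizer (plain lowercase strings are used here for brevity).
from pycocoevalcap.cider.cider import Cider

references = {"img1": ["a dog runs across the grass", "a brown dog running in a field"],
              "img2": ["two people ride bicycles down a street"]}
candidates = {"img1": ["a dog running through a field"],
              "img2": ["people riding bikes on a road"]}

corpus_score, per_image_scores = Cider().compute_score(references, candidates)
print(f"CIDEr: {corpus_score:.3f}")
```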
Related use cases:
- 3D Hand Reconstruction via Aggregating Intra and Inter Graphs Guided by Prior Knowledge for Hand-Object Interaction Scenario (uploaded Mar 15, 2024)
- EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning (uploaded Mar 15, 2024)
- HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields (uploaded Mar 15, 2024)