Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Resaro: Evaluating the performance & robustness of an AI system for Chest X-ray (CXR) assessments

Dec 9, 2024

A healthcare organisation in Singapore engaged Resaro to evaluate the feasibility of deploying a commercial AI solution for evaluating Chest X-ray (CXR) images.

This evaluation was done within the context of a triage framework that takes a two-tiered approach. The first tier classifies CXRs as either “normal” or “abnormal”; the second categorises the “abnormal” cases as either critical, needing urgent attention, or non-critical. The main purpose of the evaluation was to assess the overall performance and robustness of the CXR AI system and to produce deployment recommendations, such as optimal thresholds for AI predictions.
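The two-tiered triage logic can be illustrated with a minimal Python sketch. It assumes the AI system emits one score for "abnormal" and one for "critical"; the use case does not say what the system outputs. The function name, the score names and the default thresholds of 0.5 are all hypothetical placeholders, not values from the evaluation.

```python
def triage(p_abnormal, p_critical, t_abnormal=0.5, t_critical=0.5):
    """Assign a CXR to a triage level from two hypothetical model scores.

    p_abnormal, p_critical: assumed model scores in [0, 1].
    t_abnormal, t_critical: placeholder thresholds (not from the source).
    """
    # Tier 1: normal vs. abnormal.
    if p_abnormal < t_abnormal:
        return "normal"
    # Tier 2: among abnormal cases, critical vs. non-critical.
    if p_critical >= t_critical:
        return "abnormal-critical"
    return "abnormal-non-critical"


if __name__ == "__main__":
    print(triage(0.1, 0.9))  # tier 1 stops here: "normal"
    print(triage(0.8, 0.9))  # "abnormal-critical"
    print(triage(0.8, 0.2))  # "abnormal-non-critical"
```

In this sketch the second tier is only consulted once a case has passed the first, so the "critical" score of a case classed as normal is ignored.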

For this evaluation, Resaro used CXR images from participating healthcare facilities in Singapore. They used statistical reweighting methods to correct for potential deviations between the evaluation dataset and the population served by the healthcare institution. The analysis assessed the AI solution’s performance on these CXR images using metrics such as sensitivity and specificity. Additionally, they identified and calculated operationally relevant metrics and thresholds to offer a well-rounded view of the model’s utility and reliability in real-world applications. To contextualise the AI’s performance, they compared its results against the diagnostic accuracy of experienced human radiologists.
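One common way to reweight an evaluation set is to give each case a weight equal to the assumed population share of its finding category divided by that category's share in the dataset, and then compute weighted sensitivity and specificity. The Python sketch below illustrates that idea only; the use case does not say which reweighting method Resaro used, and the category names, population shares and predictions are invented for illustration.

```python
from collections import Counter


def stratum_weights(strata, target_share):
    """Weight per case = assumed population share / observed dataset share."""
    counts = Counter(strata)
    n = len(strata)
    return [target_share[s] / (counts[s] / n) for s in strata]


def weighted_sens_spec(y_true, y_pred, weights):
    """Sensitivity and specificity where each case counts with its weight."""
    tp = fn = tn = fp = 0.0
    for t, p, w in zip(y_true, y_pred, weights):
        if t == 1 and p == 1:
            tp += w
        elif t == 1:
            fn += w
        elif p == 0:
            tn += w
        else:
            fp += w
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical evaluation set: 4 normal, 4 "finding_a", 2 "finding_b" cases.
strata = ["normal"] * 4 + ["finding_a"] * 4 + ["finding_b"] * 2
y_true = [0] * 4 + [1] * 6
y_pred = [0, 0, 0, 1] + [1, 1, 1, 1] + [1, 0]
# Assumed shares in the representative population (illustrative only).
target_share = {"normal": 0.80, "finding_a": 0.05, "finding_b": 0.15}

if __name__ == "__main__":
    unweighted = weighted_sens_spec(y_true, y_pred, [1.0] * len(y_true))
    weighted = weighted_sens_spec(
        y_true, y_pred, stratum_weights(strata, target_share)
    )
    print("unweighted (sens, spec):", unweighted)
    print("reweighted (sens, spec):", weighted)
```

In this toy data, the harder category ("finding_b") is assumed to be more common in the population than in the dataset, so the reweighted sensitivity (about 0.625) is lower than the unweighted one (about 0.833), while specificity is unchanged. The result is only as good as the assumed population shares.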

Resaro’s analysis also assessed fairness by investigating performance differences between subpopulations. While the primary investigation showed no bias across gender, performance differences between age groups flagged the need for further analysis of possible hidden confounders. Additionally, the team tested robustness by measuring the AI model’s performance on inputs algorithmically augmented with image corruptions encountered in typical CXR imaging.
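A subgroup comparison of this kind can be sketched in Python by computing a metric per group and looking at the largest gap. The example below uses sensitivity; the age bands, labels and predictions are made up, and the use case does not state which fairness metrics or group definitions were used.

```python
from collections import defaultdict


def sensitivity_by_group(y_true, y_pred, groups):
    """Sensitivity (true positive rate) computed separately per subgroup."""
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            true_pos[g] += int(p == 1)
    return {g: true_pos[g] / positives[g] for g in positives}


def max_gap(per_group):
    """Largest difference in the metric between any two subgroups."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical data with two illustrative age bands.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]
groups = ["under_60"] * 3 + ["60_plus"] * 3 + ["under_60", "60_plus"]

if __name__ == "__main__":
    per_group = sensitivity_by_group(y_true, y_pred, groups)
    print(per_group)
    print("largest gap:", max_gap(per_group))
```

A gap on its own does not establish bias. With small subgroups it may be noise, and it may reflect a confounder such as a different mix of findings between groups, which is why a flagged difference calls for further analysis rather than an immediate conclusion.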

This evaluation has been pivotal in building clinical confidence and fostering trust in the effectiveness of the CXR AI solution when deployed in the local setting. Resaro used the results to inform decision makers of the trade-offs involved in choosing the right operational threshold.
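The threshold trade-off can be made concrete with a small Python sketch: sweep candidate thresholds, keep those that meet a minimum sensitivity, and choose the one with the highest specificity. This selection rule is one plausible approach, assumed here for illustration; the use case does not describe how the recommended thresholds were derived, and the scores and labels below are invented.

```python
def sens_spec_at(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold means 'abnormal'."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(scores, labels, min_sensitivity):
    """Return (threshold, sensitivity, specificity) with the best specificity
    among thresholds meeting min_sensitivity, or None if none qualifies."""
    best = None
    for t in sorted(set(scores)):
        sens, spec = sens_spec_at(scores, labels, t)
        if sens >= min_sensitivity and (best is None or spec > best[2]):
            best = (t, sens, spec)
    return best


# Hypothetical model scores and ground-truth labels (1 = abnormal).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

if __name__ == "__main__":
    print(pick_threshold(scores, labels, min_sensitivity=1.0))
    print(pick_threshold(scores, labels, min_sensitivity=0.75))
```

On this toy data, requiring 100% sensitivity gives a threshold of 0.3 with 50% specificity, while relaxing the requirement to 75% sensitivity allows a threshold of 0.6 with 75% specificity. Missed abnormal cases are traded against false alarms, which is the kind of decision that has to be weighed for each triage level.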

Performance metrics reported by AI vendors (such as accuracy, sensitivity and specificity) may not fully reflect the system’s effectiveness in a local setting, as distribution shifts in patient or health data can significantly impact results. This highlights the importance of conducting independent evaluations tailored to the specific context of use.

To ensure the AI system’s feasibility for the use case of reviewing CXRs for triaging, Resaro conducted evaluations using real CXR images closely aligned with those encountered in routine operation. Comparing the AI’s performance with that of working radiologists allowed Resaro to identify any performance gains or losses from modifying the CXR review pipeline. Additionally, robustness testing focused on typical image corruptions verified by medical professionals, ensuring that the system remains reliable under realistic conditions and that any performance fluctuations are well-understood.
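A robustness test of this kind can be sketched in Python as a loop that applies each corruption to every image and compares the metric with the clean baseline. The use case does not list the corruptions that were applied, so the brightness shift and additive noise below are assumed examples. The tiny 2x2 "images" and `toy_model` are hypothetical stand-ins for real CXRs and the commercial AI system.

```python
import random


def shift_brightness(image, delta):
    """Add a constant to every pixel, clipping to the [0, 1] range."""
    return [[min(1.0, max(0.0, px + delta)) for px in row] for row in image]


def add_noise(image, sigma, seed=0):
    """Add seeded Gaussian noise to every pixel, clipping to [0, 1]."""
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def accuracy(model, images, labels):
    """Fraction of images for which the model's prediction matches the label."""
    correct = sum(1 for img, y in zip(images, labels) if model(img) == y)
    return correct / len(images)


def robustness_report(model, images, labels, corruptions):
    """Accuracy on clean images and under each named corruption."""
    report = {"clean": accuracy(model, images, labels)}
    for name, corrupt in corruptions.items():
        corrupted = [corrupt(img) for img in images]
        report[name] = accuracy(model, corrupted, labels)
    return report


def toy_model(image):
    """Hypothetical stand-in model: predicts 1 if mean intensity > 0.5."""
    pixels = [px for row in image for px in row]
    return int(sum(pixels) / len(pixels) > 0.5)


# Hypothetical 2x2 "images" with pixel values in [0, 1].
images = [[[0.2, 0.3], [0.2, 0.3]], [[0.7, 0.8], [0.7, 0.8]]]
labels = [0, 1]
corruptions = {
    "brightness+0.3": lambda img: shift_brightness(img, 0.3),
    "noise_sigma_0.05": lambda img: add_noise(img, 0.05),
}

if __name__ == "__main__":
    print(robustness_report(toy_model, images, labels, corruptions))
```

In this toy setup the brightness shift halves accuracy because the stand-in model relies on mean intensity. A per-corruption breakdown like this is what makes it possible to see which degradations matter and which could be filtered upstream.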

Benefits of using the tool in this use case

This evaluation helped to build trust in the AI triage solution, ensuring that performance remains consistent when transitioning to an AI-assisted workflow rather than relying solely on human radiologists. It enabled the organisation to set the right thresholds to maintain appropriate sensitivity and specificity across different triage levels, ensuring accurate prioritisation of patient cases. Additionally, the evaluation helped identify potential biases in the system, facilitating deeper analysis and adjustments for fair treatment across patient demographics. By revealing specific types of image corruption that can impact AI predictions, it also guides the implementation of filters in the pipeline to flag CXRs that are overly degraded. These measures collectively contribute to more reliable, equitable, and accurate AI-supported patient care.

Shortcomings of using the tool in this use case

A possible limitation of this approach lies in the assumptions made about the prevalence of various findings in a representative population, which were used to reweight the evaluation dataset. The validity of these assumptions should be reviewed as the system is rolled out at scale to more healthcare institutions in Singapore. Additionally, Resaro recommends a larger and more diverse dataset for future evaluations to identify other potential confounders in AI predictions, such as variations introduced by different X-ray machines. Lastly, Resaro proposes adopting a more fine-grained triaging approach in the future to gain a more detailed understanding of the AI system’s performance.

This case study was published in collaboration with the UK Department for Science, Innovation and Technology Portfolio of AI Assurance Techniques. You can read more about the Portfolio and how you can upload your own use case here.
