Robust Reinforcement Learning with Alternating Training of Learned Adversaries (ATLA)
This repository contains a reference implementation for alternating training of learned adversaries (ATLA) for robust reinforcement learning against adversarial attacks on state observations.
Our ATLA training procedure is somewhat analogous to “adversarial training” for supervised learning, but it is based on the state-adversarial Markov decision process (SA-MDP), which characterizes the optimal adversarial attack on RL agents. During training, we learn an adversary alongside the agent, following the optimal attack formulation. The agent must defeat this strong adversary at training time and thus becomes robust against a wide range of strong attacks at test time. Previous approaches were not based on SA-MDP and used gradient-based attack heuristics during training; these heuristics are not strong enough, so the resulting agents remain vulnerable under strong test-time attacks.
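At a high level, ATLA alternates between two PPO phases per iteration, as in the minimal sketch below (illustration only; train_ppo is a placeholder, not an API in this repository):

# A minimal, self-contained sketch of the ATLA alternating schedule
# (illustration only; not the actual implementation in this repository).
def train_ppo(policy, env, rollout_steps):
    """Placeholder for one PPO update on `policy`; details omitted."""
    pass

def atla_training(agent, adversary, env, num_iterations=100):
    for itr in range(num_iterations):
        # Phase 1: hold the adversary fixed; train the agent with PPO on
        # observations perturbed by the current adversary.
        train_ppo(agent, env, rollout_steps=2048)
        # Phase 2: hold the agent fixed; train the adversary with PPO. Its
        # reward is the negative of the agent's reward, following the SA-MDP
        # optimal attack formulation.
        train_ppo(adversary, env, rollout_steps=2048)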
Following SA-MDP, we can find the optimal adversarial attack, which achieves the lowest possible reward for a given agent and environment, by solving a transformed MDP. This is analogous to finding minimal adversarial examples (e.g., via MILP/SMT solvers) in classification problems. In the DRL setting, this transformed MDP can be solved using any DRL algorithm such as PPO. We demonstrate that attacks based on the optimal adversary framework can be significantly stronger than previously proposed strong attacks (see examples below). The optimal attack framework can be useful for evaluating the robustness of RL agents trained with defense techniques developed in the future.
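Concretely, the adversary's MDP can be viewed as a wrapper around the original environment: the adversary's action is an L-infinity bounded perturbation of the agent's observation, and its reward is the negative of the agent's reward. A simplified sketch, assuming a fixed victim policy and a Gym-style environment (illustration only, not this repository's implementation):

import gym
import numpy as np

class AdversaryEnv(gym.Env):
    # Illustrative adversary MDP: actions are L-infinity bounded observation
    # perturbations, and the adversary receives the negative agent reward.
    def __init__(self, env, victim_policy, eps):
        self.env, self.victim_policy, self.eps = env, victim_policy, eps
        self.observation_space = env.observation_space
        dim = env.observation_space.shape[0]
        self.action_space = gym.spaces.Box(low=-eps, high=eps, shape=(dim,), dtype=np.float32)

    def reset(self):
        self.obs = self.env.reset()
        return self.obs

    def step(self, perturbation):
        # The fixed victim acts on the perturbed observation.
        perturbed_obs = self.obs + np.clip(perturbation, -self.eps, self.eps)
        action = self.victim_policy(perturbed_obs)
        self.obs, reward, done, info = self.env.step(action)
        # The adversary is rewarded for reducing the agent's reward.
        return self.obs, -reward, done, info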
Details on the optimal adversarial attack to RL and the ATLA training framework can be found in our paper:
“Robust Reinforcement Learning on State Observations with Learned Optimal Adversary”,
by Huan Zhang* (UCLA), Hongge Chen* (MIT), Duane Boning (MIT), and Cho-Jui Hsieh (UCLA) (* Equal contribution)
ICLR 2021. (Paper PDF)
Our code is based on the SA-PPO robust reinforcement learning codebase: huanzhang12/SA_PPO.
Optimal Adversarial Attack and ATLA-PPO Demo
In our paper, we first show that we can learn an adversary under the optimal adversarial attack setting in SA-MDP. This allows us to obtain significantly stronger adversaries to attack RL agents: while previous strong attacks can make the agents fail to move, our learned adversary can lead the agent into moving towards the opposite direction, obtaining a large negative reward. Furthermore, training with this learned strong adversary using our ATLA framework allows the agent to be robust under strong adversarial attacks.
Setup
First, clone this repository and install the necessary Python packages:
git submodule update --init
pip install -r requirements.txt
sudo apt install parallel # Only necessary for running the optimal attack experiments.
cd src # All code files are in the src/ folder
Note that you need to install MuJoCo 1.5 first to use the OpenAI Gym environments. See here for instructions.
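A typical MuJoCo 1.5 setup on Linux looks like the following (paths and steps here are assumptions for illustration; follow the official mujoco-py instructions if your system differs):

# Illustrative setup for mujoco-py with MuJoCo 1.5 (requires a MuJoCo license key).
mkdir -p ~/.mujoco
# Unzip the downloaded mjpro150_linux.zip so that ~/.mujoco/mjpro150 exists,
# then copy your license key:
cp /path/to/mjkey.txt ~/.mujoco/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mjpro150/bin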
Pretrained agents
We release pre-trained agents for all settings evaluated in our paper. These pre-trained agents can be found in src/models/atla_release, which contains six subdirectories corresponding to the six settings evaluated in our paper. Inside each folder, you can find agent models (starting with model-) as well as the adversaries we learned for the optimal adversarial attacks (starting with attack-). We will show how to load these models in later sections. The performance of our pre-trained agents is reported below. Here we report our strongest method, ATLA-PPO (LSTM + SA-Reg), a strong baseline, SA-PPO, and vanilla PPO without robust training. We report their natural episode rewards without attack as well as episode rewards under our proposed optimal attack. For full results with more baselines, please check out our paper.
Environment | Evaluation | Vanilla PPO | SA-PPO | ATLA-PPO (LSTM + SA-Reg) |
---|---|---|---|---|
Ant-v2 | No attack | 5687.0 | 4292.1 | 5358.7 |
Ant-v2 | Strongest attack | -871.7 | 2511.0 | 3764.5 |
HalfCheetah-v2 | No attack | 7116.7 | 3631.5 | 6156.5 |
HalfCheetah-v2 | Strongest attack | -660.5 | 3027.9 | 5058.2 |
Hopper-v2 | No attack | 3167.3 | 3704.5 | 3291.2 |
Hopper-v2 | Strongest attack | 636.4 | 1076.3 | 1771.9 |
Walker2d-v2 | No attack | 4471.7 | 4486.6 | 3841.7 |
Walker2d-v2 | Strongest attack | 1085.5 | 2907.7 | 3662.9 |
Note that reinforcement learning algorithms typically have large variance across training runs. Thus, we repeatedly train each agent configuration 21 times and rank the runs by their average cumulative reward over 50 episodes under the strongest (best) attack among the 6 attacks used. The pre-trained agents are the ones with median robustness (median episode reward under the strongest attack), rather than the best ones. When comparing to our work, it is important to train each agent configuration at least 10 times and report the median agent, rather than the best one. Additionally, for the Robust Sarsa (RS) attack and the proposed optimal attack, a large number of attack hyperparameters are searched and we choose the strongest adversary among them. See the section below.
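For reference, selecting the median-robustness agent amounts to a few lines of NumPy, sketched below (the array and file name are hypothetical, used only to illustrate the ranking procedure):

import numpy as np

# rewards[i, j]: episode reward of training run i (of 21) on episode j (of 50),
# evaluated under the strongest attack found for that run.
rewards = np.load("attacked_rewards.npy")   # hypothetical file, for illustration
per_run_mean = rewards.mean(axis=1)         # average attacked reward per run
ranked = np.argsort(per_run_mean)           # rank runs from least to most robust
median_run = ranked[len(ranked) // 2]       # report the median-robustness run
print("median run:", median_run, "mean attacked reward:", per_run_mean[median_run])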
The pre-trained agents can be evaluated using test.py (see the next sections for more usage details). For example:
# Ant agents.
## Vanilla PPO:
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --deterministic
## SA-PPO:
python test.py --config-path config_ant_sappo_convex.json --load-model models/atla_release/SAPPO/model-sappo-convex-ant.model --deterministic
## Vanilla LSTM:
python test.py --config-path config_ant_vanilla_ppo_lstm.json --load-model models/atla_release/LSTM-PPO/model-lstm-ppo-ant.model --deterministic
## ATLA PPO (MLP):
python test.py --config-path config_ant_atla_ppo.json --load-model models/atla_release/ATLA-PPO/model-atla-ppo-ant.model --deterministic
## ATLA PPO (LSTM):
python test.py --config-path config_ant_atla_ppo_lstm.json --load-model models/atla_release/ATLA-LSTM-PPO/model-lstm-atla-ppo-ant.model --deterministic
## ATLA PPO (LSTM+SA Reg):
python test.py --config-path config_ant_atla_lstm_sappo.json --load-model models/atla_release/ATLA-LSTM-SAPPO/model-atla-lstm-sappo-ant.model --deterministic
Note that the --deterministic switch is important: it disables stochastic actions for evaluation. You can change ant to walker, hopper or halfcheetah in the config file names and agent model file names to try other environments.
Optimal Attack to Deep Reinforcement Learning
Train a Single Optimal Attack Adversary
To run an optimal attack, we set --mode to adv_ppo and set --ppo-lr-adam to zero. This essentially runs our ATLA training with the learning rate of the agent model set to 0, so only the adversary is learned. The learning rate of the adversary policy network can be set via --adv-ppo-lr-adam, the learning rate of the adversary value network via --adv-val-lr, the entropy regularization coefficient of the adversary via --adv-entropy-coeff, and the PPO clipping epsilon for the adversary via --adv-clip-eps.
# Note: this is for illustration only. Hyperparameters for the adversary must be chosen carefully, typically via a hyperparameter search.
python run.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --mode adv_ppo --ppo-lr-adam 0.0 --adv-ppo-lr-adam 3e-5 --adv-val-lr 3e-5 --adv-entropy-coeff 0.0 --adv-clip-eps 0.4
This will save an experiment folder at vanilla_ppo_ant/agents/YOUR_EXP_ID, where YOUR_EXP_ID is a randomly generated experiment ID, for example e908a9f3-0616-4385-a256-4cdea5640725. You can extract the best model from this folder by running:
python get_best_pickle.py vanilla_ppo_ant/agents/YOUR_EXP_ID
which will generate an adversary model best_model.YOUR_EXP_ID.model, for example best_model.e908a9f3.model.
Then you can evaluate this trained adversary by running
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --deterministic --attack-method advpolicy --attack-advpolicy-network best_model.YOUR_EXP_ID.model
Finding the Best Optimal Attack Adversary
The above command only trains and tests one adversary using one set of adversary hyperparameters. Since the learning of this optimal adversary is also an RL problem (solved using PPO), to obtain the best attack results and evaluate the true robustness of an agent model, we need to train the adversary using multiple sets of hyperparameters and take the strongest (best) adversary. We provide scripts to easily scan the hyperparameters of the adversary and run each set of hyperparameters in parallel:
cd ../configs
# This will generate 216 config files inside agent_configs_attack_ppo_ant.
# Modify attack_ppo_ant_scan.py to change the hyperparameters for the grid search.
# Typically, for a different environment you need a different set of hyperparameters for searching.
python attack_ppo_ant_scan.py
cd ../src
# This command will run 216 configurations using all available CPUs.
# You can also use the "-t" option to control the number of threads if you don't want to use all CPUs.
python run_agents.py ../configs/agent_configs_attack_ppo_ant_scan/ --out-dir-prefix=../configs/agents_attack_ppo_ant_scan > attack_ant_scan.log
To test all the optimal attack adversaries after the above training command finishes, simply run the evaluation script:
bash example_evaluate_optimal_attack.sh
Note that you will need to change the line starting with scan_exp_folder in example_evaluate_optimal_attack.sh to evaluate the learned optimal attack adversaries for another environment or for results in another folder. You need to change that line to:
scan_exp_folder <config file> <path to trained optimal attack adversaries> <path to the victim agent model> $semaphorename
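For example, to evaluate the adversaries produced by the Ant scan above against the vanilla PPO Ant agent, the line might look like the following (the adversary path is illustrative and depends on where your scan results were saved):
scan_exp_folder config_ant_vanilla_ppo.json ../configs/agents_attack_ppo_ant_scan/attack_ppo_ant/agents models/atla_release/PPO/model-ppo-ant.model $semaphorename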
This script will run the adversary evaluations in parallel (the GNU parallel tool is required), and will generate a log file attack_scan/optatk_deterministic.log containing attack results inside each experiment ID folder. After the above command finishes, you can use parse_optimal_attack_results.py to parse the logs and get the best (strongest) attack result with the lowest agent reward:
python parse_optimal_attack_results.py ../configs/agents_attack_ppo_ant_scan/attack_ppo_ant/agents
If you would like to conduct the optimal adversarial attack, it is important to use a hyperparameter search scheme as demonstrated above, since the attack itself is an RL problem and can be sensitive to hyperparameters. To evaluate the true robustness of an agent, finding the best optimal attack adversary is necessary.
Pretrained Adversaries for All Agents
We provide optimal attack adversaries for all agents we released. To test a pre-trained optimal attack adversary, run test.py with the --attack-advpolicy-network option:
# Ant Agents.
## Vanilla PPO:
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --deterministic --attack-method advpolicy --attack-advpolicy-network models/atla_release/PPO/attack-ppo-ant.model
## SA-PPO:
python test.py --config-path config_ant_sappo_convex.json --load-model models/atla_release/SAPPO/model-sappo-convex-ant.model --deterministic --attack-method advpolicy --attack-advpolicy-network models/atla_release/SAPPO/attack-sappo-convex-ant.model
## Vanilla LSTM:
python test.py --config-path config_ant_vanilla_ppo_lstm.json --load-model models/atla_release/LSTM-PPO/model-lstm-ppo-ant.model --deterministic --attack-method advpolicy --attack-advpolicy-network models/atla_release/LSTM-PPO/attack-lstm-ppo-ant.model
## ATLA PPO (MLP):
python test.py --config-path config_ant_atla_ppo.json --load-model models/atla_release/ATLA-PPO/model-atla-ppo-ant.model --deterministic --attack-method advpolicy --attack-advpolicy-network models/atla_release/ATLA-PPO/attack-atla-ppo-ant.model
## ATLA PPO (LSTM):
python test.py --config-path config_ant_atla_ppo_lstm.json --load-model models/atla_release/ATLA-LSTM-PPO/model-lstm-atla-ppo-ant.model --deterministic --attack-method advpolicy --attack-advpolicy-network models/atla_release/ATLA-LSTM-PPO/attack-lstm-atla-ppo-ant.model
## ATLA PPO (LSTM+SA Reg):
python test.py --config-path config_ant_atla_lstm_sappo.json --load-model models/atla_release/ATLA-LSTM-SAPPO/model-atla-lstm-sappo-ant.model --deterministic --attack-method advpolicy --attack-advpolicy-network models/atla_release/ATLA-LSTM-SAPPO/attack-atla-lstm-sappo-ant.model
You can change ant to walker, hopper or halfcheetah in the config file names and in the agent and adversary model file names to try other environments.
Agent Training with Learned Optimal Adversaries (our ATLA framework)
To train an agent, use run.py in the src folder and specify a configuration file path. Several configuration files are provided in the src folder, with filenames starting with config. For example:
HalfCheetah vanilla PPO (MLP) training:
python run.py --config-path config_halfcheetah_vanilla_ppo.json
HalfCheetah vanilla PPO (LSTM) training:
python run.py --config-path config_halfcheetah_vanilla_ppo_lstm.json
HalfCheetah ATLA (MLP) training:
python run.py --config-path config_halfcheetah_atla_ppo.json
HalfCheetah ATLA (LSTM) training:
python run.py --config-path config_halfcheetah_atla_ppo_lstm.json
HalfCheetah ATLA (LSTM) training with state-adversarial regularizer (this is the best method):
python run.py --config-path config_halfcheetah_atla_lstm_sappo.json
Change halfcheetah to ant, hopper or walker to run other environments.
Training results will be saved to a directory specified by the out_dir parameter in the JSON config file. For example, for ATLA (LSTM) training with the state-adversarial regularizer, it is robust_atla_ppo_lstm_halfcheetah. To allow multiple runs, each experiment is assigned a unique experiment ID (e.g., 2fd2da2c-fce2-4667-abd5-274b5579043a), which is saved as a folder under out_dir (e.g., robust_atla_ppo_lstm_halfcheetah/agents/2fd2da2c-fce2-4667-abd5-274b5579043a).
Then the agent can be evaluated using test.py. For example:
# Change the --exp-id to match the folder name in robust_atla_ppo_lstm_halfcheetah/agents/
python test.py --config-path config_halfcheetah_atla_lstm_sappo.json --exp-id YOUR_EXP_ID --deterministic
You should expect a cumulative reward (mean over 50 episodes) of over 5000 for most methods.
Agent Evaluation Under Attacks
We implemented random attacks, critic-based attacks, and our proposed Robust Sarsa (RS) and maximal action difference (MAD) attacks.
Optimal Adversarial Attack
Please see this section for more details on how to run our proposed optimal adversarial attack. This is the strongest attack so far and is strongly recommended for evaluating the robustness of RL defense algorithms.
Robust Sarsa (RS) Attack
In our Robust Sarsa (RS) attack, we first learn a robust value function for the policy under evaluation, and then attack the policy using this robust value function. The first step of the RS attack is to train the robust value function (we use the Ant environment as an example):
# Step 1:
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --sarsa-enable --sarsa-model-path sarsa_ant_vanilla.model
The above training step is usually very fast (e.g., a few minutes). The value function will be saved in sarsa_ant_vanilla.model. Then it can be used for an attack:
# Step 2:
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --attack-eps=0.15 --attack-method sarsa --attack-sarsa-network sarsa_ant_vanilla.model --deterministic
The L-infinity norm bound for the attack is set by the --attack-eps parameter (different environments need different attack epsilons; see Table 2 in our paper). The reported mean reward over 50 episodes should be less than 500 (the reward without attack is over 5000). In contrast, our ATLA-PPO (LSTM + SA-Reg) robust agent achieves a reward of over 4000 even under this specific attack:
# Train a robust value function.
python test.py --config-path config_ant_atla_lstm_sappo.json --load-model models/atla_release/ATLA-LSTM-SAPPO/model-atla-lstm-sappo-ant.model --sarsa-enable --sarsa-model-path sarsa_ant_atla_lstm_sappo.model
# Attack using the robust value function.
python test.py --config-path config_ant_atla_lstm_sappo.json --load-model models/atla_release/ATLA-LSTM-SAPPO/model-atla-lstm-sappo-ant.model --attack-eps=0.15 --attack-method sarsa --attack-sarsa-network sarsa_ant_atla_lstm_sappo.model --deterministic
The Robust Sarsa attack has two hyperparameters for robustness regularization (--sarsa-eps and --sarsa-reg) used to build the robust value function. Although the default settings generally work well, for a comprehensive robustness evaluation it is recommended to run the Robust Sarsa attack under different hyperparameters and choose the best attack (the lowest reward) as the final result. We provide a script, scan_attacks.sh, for the purpose of comprehensive adversarial evaluation:
# You need to install GNU parallel first: sudo apt install parallel
source scan_attacks.sh
# Usage: scan_attacks model_path config_path output_dir_path
scan_attacks models/atla_release/PPO/model-ppo-ant.model config_ant_vanilla_ppo.json sarsa_ant_vanilla_ppo_result
In the above example, the minimum RS attack reward (deterministic action) reported by the script should be below 300. For your convenience, the scan_attacks.sh script will also run many other attacks, including the MAD attack, critic attack, and random attack. The Robust Sarsa attack is usually the strongest one among them.
Note: the learning rate of the Sarsa model can be changed via --val-lr. The default value should be good for attacks on the provided environments (which use normalized rewards). However, if you want to use this attack on a different environment, this learning rate can be important, as the reward may be unnormalized (some environments return large rewards, so the Q values are larger and a larger --val-lr is needed). The rule of thumb is to always check the training logs of these Sarsa models and make sure the Q loss has been reduced sufficiently (close to 0) by the end of training.
Maximal Action Difference (MAD) Attack
We additionally propose a maximal action difference (MAD) attack, where we attempt to maximize the KL divergence between the original action distribution and the perturbed action distribution. It can be invoked by setting --attack-method to action. For example:
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --attack-eps=0.15 --attack-method action --deterministic
The reported mean reward over 50 episodes should be around 1500 (this attack is weaker than the Robust Sarsa attack in this case). In contrast, our ATLA-PPO (LSTM + SA-Reg) robust agent is more resistant to the MAD attack, achieving a reward of over 5000:
python test.py --config-path config_ant_atla_lstm_sappo.json --load-model models/atla_release/ATLA-LSTM-SAPPO/model-atla-lstm-sappo-ant.model --attack-eps=0.15 --attack-method action --deterministic
We additionally provide a combined RS+MAD attack, which can be invoked by setting --attack-method to sarsa+action; the combination ratio can be set via --attack-sarsa-action-ratio, a number between 0 and 1.
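For example, mirroring the Ant commands above (the ratio value here is only illustrative):
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --attack-eps=0.15 --attack-method sarsa+action --attack-sarsa-network sarsa_ant_vanilla.model --attack-sarsa-action-ratio 0.5 --deterministic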
Critic-based attacks and random attack
Critic-based attacks and random attacks can be used by setting --attack-method to critic or random, respectively. These attacks are relatively weak and not suitable for evaluating the robustness of PPO agents.
# Critic based attack (Pattanaik et al.)
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --attack-eps=0.15 --attack-method critic --deterministic
# Random attack (uniform noise)
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --attack-eps=0.15 --attack-method random --deterministic
In this case, under critic or random attack the agent reward is still over 5000, which means that these attacks are not very effective for this specific environment.
Snooping attack
In this repository, we also implement the imitation learning-based Snooping attack proposed by Inkawhich et al. In this attack, we first learn a new agent by imitating the policy under evaluation, and then use the gradient information of the new agent to attack the original policy. The first step of the Snooping attack is to train the imitation agent by observing (“snooping” on) how the original agent behaves:
# Step 1:
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --imit-enable --imit-model-path imit_ant_vanilla.model
The above training step is usually very fast (e.g., a few minutes). The new agent model will be saved in imit_ant_vanilla.model. It is then loaded to conduct the Snooping attack:
# Step 2:
python test.py --config-path config_ant_vanilla_ppo.json --load-model models/atla_release/PPO/model-ppo-ant.model --attack-eps=0.15 --attack-method action+imit --imit-model-path imit_ant_vanilla.model --deterministic
Note that the Snooping attack is a black-box attack (it does not require the gradient of the agent policy or interaction with the agent), so it is usually weaker than other white-box attacks. In the above example, it should achieve an average episode reward of roughly 3000.