HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection
Hate speech is a challenging issue plaguing online social media. While
better models for hate speech detection are continuously being developed, there
is little research on the bias and interpretability aspects of hate speech detection. In
this paper, we introduce HateXplain, the first benchmark hate speech dataset
covering multiple aspects of the issue. Each post in our dataset is annotated
from three different perspectives: the basic, commonly used 3-class
classification (i.e., hate, offensive or normal), the target community (i.e.,
the community that has been the victim of hate speech/offensive speech in the
post), and the rationales, i.e., the portions of the post on which the annotators'
labelling decision (as hate, offensive or normal) is based. We utilize existing
state-of-the-art models and observe that even models that perform very well in
classification do not score high on explainability metrics like model
plausibility and faithfulness. We also observe that models that utilize
human rationales for training perform better at reducing unintended bias
towards target communities. We have made our code and dataset public at
https://github.com/punyajoy/HateXplain
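
To make the three annotation perspectives concrete, the following is a minimal sketch of how one annotated post might be represented and aggregated. It is an illustration only: the class and field names (`Annotation`, `Post`, `label`, `targets`, `rationale`), the majority-vote rule, and the averaging of rationale masks are assumptions made for this example and are not taken from the released dataset or code. The tokens are neutral placeholders.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# The three classes named in the abstract.
LABELS = ("hate", "offensive", "normal")


@dataclass
class Annotation:
    """One annotator's view of a post (hypothetical schema)."""
    label: str            # one of LABELS
    targets: List[str]    # target communities named by this annotator
    rationale: List[int]  # 0/1 per token; 1 marks a token supporting the label


@dataclass
class Post:
    tokens: List[str]
    annotations: List[Annotation]


def majority_label(post: Post) -> Optional[str]:
    """Return the most frequent label, or None on a tie or empty input.

    Majority voting is an assumption made for this sketch.
    """
    counts = Counter(a.label for a in post.annotations).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def rationale_scores(post: Post) -> List[float]:
    """Fraction of annotators who marked each token as part of the rationale."""
    n = len(post.annotations)
    return [
        sum(a.rationale[i] for a in post.annotations) / n
        for i in range(len(post.tokens))
    ]


# Placeholder post with four tokens and three annotators.
example = Post(
    tokens=["w0", "w1", "w2", "w3"],
    annotations=[
        Annotation("offensive", ["community_a"], [0, 1, 1, 0]),
        Annotation("offensive", ["community_a"], [0, 1, 0, 0]),
        Annotation("normal", [], [0, 0, 0, 0]),
    ],
)

print(majority_label(example))
print(rationale_scores(example))
```

Per-token scores of this kind are one possible way to compare human rationales against a model's token-level attributions when assessing plausibility. The aggregation actually used for the benchmark is not specified in this abstract.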