XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

Dasol Choi; Eugenia Kim; Jae-won Noh; Sanghyun Seo; Eunmi Kim; Yunjin Park; Brigitta Jesica Kartono; Josef Pichlmeier; Helena Berndt; Sai Krishna Mendu; Glenn Johannes Tungka; Ozlem Gokcce; Suresh Gehlot; K. Pratt; Amanda Minnich; Haon Park

May 2026

Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model’s ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench, a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and two independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness are not coupled among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.
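The abstract reports the ASR-NSR trade-off as a correlation coefficient, r = -0.81, but does not say how it is computed. As a minimal sketch, assuming r is a Pearson correlation taken over per-model (ASR, NSR) pairs, the calculation could look like the following. The function name `pearson_r` and the five value pairs are hypothetical placeholders for illustration only; they are not results from the benchmark.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder scores, one entry per model (NOT benchmark data).
asr = [0.10, 0.25, 0.40, 0.55, 0.70]
nsr = [0.80, 0.70, 0.50, 0.45, 0.20]

r = pearson_r(asr, nsr)
print(f"r = {r:.2f}")
```

A strongly negative r under this reading means that models with a lower ASR also tend to have a lower NSR, which is the pattern the abstract interprets as generation failure rather than alignment.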
