XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

  • Dasol Choi ,
  • Eugenia Kim ,
  • Jae-won Noh ,
  • Sanghyun Seo ,
  • Eunmi Kim ,
  • Yunjin Park ,
  • Brigitta Jesica Kartono ,
  • Josef Pichlmeier ,
  • Helena Berndt ,
  • Sai Krishna Mendu ,
  • Glenn Johannes Tungka ,
  • Ozlem Gokcce ,
  • Suresh Gehlot ,
  • K. Pratt ,
  • Amanda Minnich ,
  • Haon Park

arXiv

Related File

Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model’s ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench. a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and dual independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness do not show a coupled relationship among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.

GitHubGitHub