ABDP: Adversarial Backdoor Detection and Purification
- Haojie Yuan,
- Bin Benjamin Zhu,
- Qi Chu,
- Tao Gong,
- Nenghai Yu,
- Dongmei Zhang
IEEE Transactions on Information Forensics and Security
In domains such as autonomous driving and healthcare, deep learning models often rely on large, diverse datasets that can inadvertently harbor backdoor attacks. In this paper, we propose ABDP, a novel post-processing defense method that removes backdoor contamination from datasets and produces clean models without relying on any pre-existing clean data. ABDP capitalizes on the intrinsic link between untargeted adversarial attacks and backdoor attacks to detect the presence of a backdoor in a trained model and to ascertain its target label. It then trains a clean model that recognizes all labels except the target label, so that poisoned data appears in-distribution while clean data of the target label appears out-of-distribution; this distinction enables the identification of backdoor-poisoned data. Finally, ABDP applies unlearning techniques to eradicate the backdoor from the model. Extensive experimental evaluation across diverse datasets and multiple backdoor attack scenarios validates the robustness and state-of-the-art performance of our approach. After cleansing with ABDP, the attack success rate of the resulting models is reduced to 1% or less, while approximately 70% or more of the clean data is retained (the true positive rate) at a false positive rate of 0.01. Notably, ABDP has no adverse impact when applied to purely clean datasets, because it detects whether a model contains a backdoor before cleansing. Our method thus achieves state-of-the-art performance in cleansing both backdoor-poisoned data and backdoored models. The code for ABDP will be made available upon publication of the paper.
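The detection step can be made concrete with a short sketch. One common reading of the link between untargeted adversarial attacks and backdoors is that, on a backdoored model, untargeted perturbations flip predictions toward the backdoor target far more often than chance, so a strong concentration of adversarial misclassifications on a single label both signals a backdoor and reveals its target. The PyTorch sketch below illustrates only this intuition; it is not the authors' implementation, and the PGD budget, `concentration_threshold`, and the toy model and loader are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def pgd_untargeted(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Untargeted PGD in an L-infinity ball: push each input away from its label."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        (grad,) = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)             # keep pixels in a valid range
    return x_adv

def suspect_backdoor(model, loader, num_classes, concentration_threshold=0.5):
    """Flag a backdoor when one label absorbs a dominant share of the untargeted
    adversarial misclassifications; that label is the suspected target."""
    model.eval()
    counts = torch.zeros(num_classes)
    for x, y in loader:
        x_adv = pgd_untargeted(model, x, y)
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        flips = pred[pred != y]  # count only successful label flips
        counts += torch.bincount(flips, minlength=num_classes).float()
    share = counts / counts.sum().clamp(min=1.0)
    target = int(share.argmax())
    return share[target].item() >= concentration_threshold, target

# Hypothetical usage with a toy model and random data:
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
loader = [(torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,)))]
flagged, suspected_target = suspect_backdoor(model, loader, num_classes=10)
```

On a genuinely clean model the flipped labels spread across many classes, so the concentration stays near chance level and no backdoor is flagged, which matches the abstract's claim that ABDP leaves purely clean datasets untouched.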
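The identification step can be sketched in the same spirit. Suppose the suspected target label is t. A model trained on all classes except t should remain confident on poisoned samples, whose underlying content comes from the non-target classes, while genuinely clean samples of class t look out-of-distribution to it. A standard OOD score such as maximum softmax probability then separates the two groups. The scoring rule and `ood_threshold` below are placeholder choices, not ABDP's actual criterion.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def split_target_labeled_samples(aux_model, x_target_labeled, ood_threshold=0.5):
    """Score samples carrying the suspected target label with an auxiliary model
    trained on every class *except* that label. A high maximum softmax probability
    means the sample looks in-distribution to the auxiliary model, i.e. its real
    content belongs to a non-target class, so it is likely poisoned; low
    confidence suggests a genuinely clean target-class sample."""
    aux_model.eval()
    probs = F.softmax(aux_model(x_target_labeled), dim=1)
    msp = probs.max(dim=1).values
    return msp >= ood_threshold  # True -> flagged as poisoned

# Hypothetical usage: the auxiliary model has 9 outputs (10 classes minus the target).
aux_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 9))
x_target_labeled = torch.rand(8, 3, 32, 32)
poisoned_mask = split_target_labeled_samples(aux_model, x_target_labeled)
```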
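For the final step the abstract names only "unlearning techniques" without detail. One common generic form is gradient ascent on the flagged poisoned samples, so the model forgets the trigger-to-target shortcut. The sketch below shows that generic form under stated assumptions; the paper's actual unlearning procedure may differ.

```python
import torch
import torch.nn.functional as F

def unlearn_backdoor(model, poisoned_x, poisoned_y, lr=1e-4, steps=5):
    """Generic unlearning placeholder: ascend the loss on flagged poisoned
    samples (triggered inputs carrying the attacker's target label) so the
    model unlearns the trigger-to-target mapping."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        opt.zero_grad()
        # Negating the loss turns gradient descent into gradient ascent.
        loss = -F.cross_entropy(model(poisoned_x), poisoned_y)
        loss.backward()
        opt.step()
    return model
```

In practice such an ascent step is typically interleaved with ordinary fine-tuning on the retained clean data so that accuracy on benign inputs is preserved while the backdoor is removed.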