Understanding the Impact of Data Noise in Federated Learning
- Jinming Hu
- Jiahao Gu
- Kenta Ploch
- Hao Wang
- Jingxian Wang
- Wentao Wu
- Qizhen Zhang
Federated learning (FL) has emerged as a popular paradigm for distributed machine learning over decentralized
data. A typical FL training task involves a fleet of client devices with private data and a centralized server for
aggregating the global model. Data generated by FL clients, e.g., smart phones, vehicles, and cameras, is prone
to noise. While the impact of data noise on centralized learning (CL) is well understood, to the best of our knowledge
there has been no systematic study of its impact on FL. In this paper, we fill this gap by presenting
an empirical investigation to provide a deeper understanding regarding the impact of data noise on FL. Our
study is enabled by DataNoiseGenerator, an open-source and extensible toolkit that we developed for the
injection of controlled data noise across five diverse data modalities: image, video, audio, text, and tabular data.
We then carry out extensive experiments based on the noisy data generated by DataNoiseGenerator, and our
experimental results reveal that FL is significantly more vulnerable to data noise than CL, in
terms of the quality of the trained ML models. This gap between FL and CL widens as the intensity of data
noise and the proportion of noisy FL clients increase. We further present a detailed analysis to diagnose the
root cause of this increased sensitivity of FL to data noise. Our analysis finds that the aggregation performed
by the FL server can amplify divergent updates from FL clients trained on noisy data, thereby hindering global
model convergence. We conclude that data quality issues are a fundamental challenge for deploying robust FL
systems and demand novel decentralized data cleaning mechanisms.
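
To make the aggregation effect concrete, the following is a minimal sketch of how a single divergent client update can shift the aggregated global model. It assumes a FedAvg-style weighted average, which the abstract does not name as the aggregation rule used in the study; the function name `fedavg`, the two-parameter models, and all numeric values are hypothetical and chosen only for illustration. It does not reproduce the paper's analysis or results.

```python
def fedavg(client_updates, client_sizes):
    """Weighted average of client parameter vectors (FedAvg-style).

    Assumption: this aggregation rule is used for illustration only;
    the source does not state which rule the study uses.
    """
    total = sum(client_sizes)
    dim = len(client_updates[0])
    return [
        sum(n * u[i] for u, n in zip(client_updates, client_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical two-parameter models from three equally sized clients.
sizes = [100, 100, 100]
clean = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# Same setup, but the third client trained on noisy data and returned
# a divergent update (values are made up).
noisy = [[1.0, 1.0], [1.0, 1.0], [-5.0, 7.0]]

global_clean = fedavg(clean, sizes)
global_noisy = fedavg(noisy, sizes)

print("all clients clean:", global_clean)
print("one noisy client: ", global_noisy)
```

In this toy setup, one divergent client out of three is enough to move the averaged model away from the point the two clean clients agree on, because a plain weighted average has no mechanism for discounting an outlying update.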