How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service
Best Paper Award
Production incidents in today's large-scale cloud services can be extremely expensive in terms of customer impact and the engineering resources required to mitigate them. Despite continuous reliability efforts, cloud services still experience severe incidents due to various root causes. Worse, many of these incidents last for a long time because existing techniques and practices fail to detect and mitigate them quickly. To better understand these problems, we carefully study hundreds of recent high-severity incidents and their postmortems in a large cloud-based service used by hundreds of millions of users. We answer: (a) why the incidents occurred and how they were resolved, (b) what gaps in current processes caused delayed responses, and (c) what automation could help make the services more resilient. Finally, we uncover interesting insights through a novel multi-dimensional analysis that correlates the different troubleshooting stages (detection, root-causing, and mitigation), and we provide guidance on how to tackle complex incidents through automation or testing at different granularities.