This article is also available in Traditional Chinese.
On October 27, 2023, Zeabur experienced a service outage. The outage affected the tpe1 cluster located in the Taipei region, resulting in a service interruption of approximately 1 hour. The cause of the outage was a mistake made by an intern, leading to the deletion of our Kubernetes cluster located in the Google Cloud Platform's Taipei data center.
Upon receiving alerts from the monitoring service, Zeabur's operations team acted swiftly and restored services for both enterprise and developer users within two hours. The scope of the outage was relatively broad: if your deployed services were in the affected region, you may have been unable to access anything you had deployed on Zeabur during this window on October 27, 2023.
We deeply apologize for the mistakes made during this incident and the resulting impact.
While deleting an obsolete test cluster, an intern who lacked operational experience, and whose permissions were not adequately constrained by internal oversight, failed to report this risky operation to supervisors and mistakenly deleted the production Google Kubernetes Engine cluster. The cluster became unavailable, and all services in the tpe1 (Taipei) region were disrupted.
The monitoring system quickly detected 502 errors across multiple services, and a large number of alarms were triggered in a short time. Investigation showed that all issues originated from the `tpe1` cluster. Within a minute, internal checks in the development team revealed that a cluster deletion had been performed: the intern had intended to delete a development cluster but, having skipped review by supervising developers and while operating continuously under a high-privilege account, had accidentally deleted the production cluster instead.
The development team immediately initiated emergency measures. Engineers contacted Google Cloud Platform support, who confirmed that a cluster deletion cannot be reversed once executed. Moreover, since the Google Kubernetes Engine backup feature had not been enabled, restoring the cluster from a backup was also impossible.
Investigation determined that the cluster's persistent disks had not been destroyed by the cluster deletion. The team therefore decided to manually create new stateful services in the backup cluster and bind them to the original disks, while stateless services were rebuilt from scratch.
Infrastructure operations engineers added compute nodes to the Taipei backup cluster, `tpe0`, and created new deployments to accommodate user services from the failed cluster. Once the cluster was ready, the team began restoring user services using automated scripts and the original persistent disks.
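The disk-rebinding step can be sketched as a pair of Kubernetes objects: a PersistentVolume pointing at the surviving GCE disk, and a claim pinned to that volume so the scheduler cannot bind it elsewhere. This is a minimal illustration under assumed names (the disk, namespace, and function names are made up, and it uses GKE's in-tree `gcePersistentDisk` volume source), not Zeabur's actual migration script:

```python
# Sketch: bind a surviving GCE persistent disk to a new stateful workload
# in the backup cluster. Disk and namespace names are illustrative.

def make_pv(disk_name: str, size_gi: int) -> dict:
    """PersistentVolume that points at a pre-existing GCE disk."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": f"pv-{disk_name}"},
        "spec": {
            "capacity": {"storage": f"{size_gi}Gi"},
            "accessModes": ["ReadWriteOnce"],
            # Retain the disk even if the claim is ever deleted again.
            "persistentVolumeReclaimPolicy": "Retain",
            "gcePersistentDisk": {"pdName": disk_name, "fsType": "ext4"},
        },
    }

def make_pvc(disk_name: str, size_gi: int, namespace: str) -> dict:
    """Claim pinned to the PV above via volumeName (no dynamic provisioning)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"data-{disk_name}", "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
            "volumeName": f"pv-{disk_name}",
            "storageClassName": "",  # empty string disables dynamic provisioning
        },
    }

pv = make_pv("user-db-disk", 10)
pvc = make_pvc("user-db-disk", 10, "restore")
print(pv["spec"]["gcePersistentDisk"]["pdName"])
```

Applying both objects lets a newly created stateful workload mount the old data without reprovisioning the disk.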
Thanks to the robustness of the existing Zeabur system and some straightforward migration scripts, the development team quickly transferred services to the backup cluster. After a manual trial migration validated the approach, services for enterprise plan users were restored first.
Once the migration of affected enterprise users' services was completed and those services were back online, the development team began addressing user tickets. Services for developer plan and free plan users were then restored gradually through a refined automated process.
Updated emergency plan: users can temporarily migrate stateless services to the Hong Kong `hkg1` cluster or the US `sfo1` cluster, while stateful services are moved to temporary clusters to safeguard the integrity of user data.
After verifying that the business systems started correctly on the backup cluster and could serve external traffic, the service interruption was declared resolved.
After the emergency concluded, the team immediately minimized access by deactivating all developers' IAM permissions. At the same time, a security inspection of all infrastructure was undertaken, which identified several potential security risks and produced an improvement plan.
Once all clusters were restored to normal, operations engineers activated the backup service for every cluster and performed a full backup. Regular backup jobs were then set up to incrementally back up user data at fixed intervals.
The team held a retrospective on the incident, analyzing the mistakes made, and reassessed and redesigned the infrastructure's high availability and security.
User service assurance: for the occasional errors that continued to appear in user services, the team established a service recovery reporting process to ensure quick responses and high availability for service recovery work orders.
Strengthen security and standardized operational training for employees and interns. Any modification to the production environment requires review by a senior engineer; once approved, the operation is performed by that senior engineer. Wherever feasible, a rollback mechanism must be available.
1. Following this incident, a complete redesign of IAM permission management was undertaken internally to ensure all developers are granted only the minimum required permissions.
2. For destructive operations (especially irreversible ones), developers must manually assume a dedicated role for resource operations. This adds an extra layer of confirmation to ensure such oversights do not recur.
3. Both the monitoring system and management team will receive alerts when sensitive operations roles are assumed, allowing immediate rectification of potential mistakes.
Protective locks are set on crucial system resources to prevent erroneous modification and deletion. Risky operations now require system administrator authentication and Multi-Factor Authentication (MFA).
Based on this experience, Zeabur's disaster recovery and response measures for single-region failures were redesigned. By backing up Kubernetes resources and user service container images, we can ensure service restoration within 15 minutes in the worst-case scenarios.
Additionally, the team decided to enhance backup levels for persistent user service data. Cross-region backups for persistent data from enterprise and developer solutions ensure Zeabur user data safety, even in the event of upstream provider failures.
The monitoring alert system will be improved to identify anomalies in infrastructure operations in real time. Operations logs from upstream systems, such as AWS CloudTrail and Google Cloud Audit Logs, will be integrated into Zeabur's internal management system. Using ClickHouse and Grafana, corresponding alert rules will be set to promptly identify risky operations.
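As a rough illustration of such alert rules, the sketch below scans ingested audit-log entries for risky method names. The field names loosely follow Google Cloud Audit Logs (`methodName`, `principalEmail`), but the flattened entries, the rule list, and the plain-Python matching are assumptions standing in for the actual ClickHouse queries and Grafana rules:

```python
# Sketch: flag risky operations in ingested audit-log entries.
# Entries are simplified/flattened; the risky-method list is illustrative.

RISKY_METHODS = {
    "google.container.v1.ClusterManager.DeleteCluster",
    "google.iam.admin.v1.IAM.SetIamPolicy",
}

def alerts_for(entries: list) -> list:
    """Return one alert line per entry whose methodName is on the risky list."""
    out = []
    for entry in entries:
        method = entry.get("methodName", "")
        if method in RISKY_METHODS:
            who = entry.get("principalEmail", "unknown")
            out.append(f"ALERT: {who} called {method}")
    return out

log = [
    {"methodName": "google.container.v1.ClusterManager.ListClusters",
     "principalEmail": "dev@example.com"},
    {"methodName": "google.container.v1.ClusterManager.DeleteCluster",
     "principalEmail": "intern@example.com"},
]
for line in alerts_for(log):
    print(line)
```

In production this matching would run continuously over the ingested log stream, so a deletion like the one in this incident would page the team the moment it happened rather than when the 502s started.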
Zeabur has always been dedicated to delivering a highly available service deployment experience. However, this incident showed we haven't met our own standards. Through this, we've recognized shortcomings in infrastructure security management and high-availability design. We are committed to learning and improving to prevent similar incidents. We deeply apologize to all users who couldn't access Zeabur services during this outage. Improvements and reflections are underway to ensure such events do not reoccur.
© 2023 Zeabur Corp.