Zeabur Taipei region outage on October 27, 2023

Incident Summary

On October 27, 2023, Zeabur experienced a service outage that affected the tpe1 cluster in the Taipei region and interrupted service for approximately 1 hour. The outage was caused by an operational mistake by an intern, which deleted our Kubernetes cluster in Google Cloud Platform's Taipei data center.

Upon receiving alerts from the monitoring service, Zeabur's operations team acted swiftly and restored services for both enterprise and developer users within two hours. The outage was broad in scope. The affected services were:

Containerized services of the free, developer, and enterprise plans deployed in the tpe1 (Taipei) region
Serverless services of the free, developer, and enterprise plans deployed in the tpe1 (Taipei) region

If your services were deployed in the affected region, you may have been unable to access them during the outage on October 27, 2023.

We deeply apologize for the mistakes made during this incident and the resulting impact.

Incident Timeline

While deleting an obsolete test cluster, an intern mistakenly deleted the production Google Kubernetes Engine cluster instead. Owing to a lack of operational experience and to gaps in our internal permission management, the intern had not reported this risky operation to supervisors beforehand. The deletion made the cluster unavailable and disrupted all services in the tpe1 (Taipei) region.

UTC 2:40 - tpe1 cluster is deleted, all services go offline.
UTC 2:41 - The service health monitoring system detects the interruption and sends an emergency alert to the project monitoring team.
UTC 2:45 - Operations staff start investigating immediately and find that the cluster is inaccessible. Further inspection identifies the cause: the cluster has been deleted, and the deletion is irreversible. The team initiates emergency response measures.
UTC 2:50 - After attempting to communicate with Google Cloud Platform, the project tech lead and operations staff try to directly restore the cluster but fail. They attempt to restore the cluster using Google Kubernetes Engine backup, but upon inspection, it's found that the backup service was not enabled for this cluster.
UTC 3:00 - Team investigations reveal that the deleted cluster does not affect the Persistent Disk in Google Cloud Platform's Compute Engine, confirming that user service data has not been lost.
UTC 3:10 - Migration of the disrupted services to the tpe0 (Taipei) backup cluster begins. The team starts re-linking user data to this backup cluster through automation scripts.
UTC 3:30 - The primary functions of the cluster are restored, user services go back online, services for enterprise plan customers are restored, and the development team starts addressing user-reported tickets.
UTC 3:48 - The team issues an outage announcement, suggesting users temporarily migrate stateless services to the hkg1 (Hong Kong) or sfo1 (North America) regions as a temporary solution.
UTC 3:56 - The team releases a service recovery statement, offering users a solution and a process for ticket submission.
UTC 6:10 - The team announces full service restoration. All tickets from developer plan customers submitted during the outage are resolved. They set up a ticket submission process for user service recovery requests and continue restoring services for users.
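
The intervals implied by the timeline above can be worked out directly. The following is a minimal sketch that uses only the timestamps listed in this report; the dictionary keys and the helper name are illustrative, not part of any Zeabur tooling.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above.
FMT = "%H:%M"
events = {
    "cluster_deleted": "2:40",
    "alert_sent": "2:41",
    "enterprise_restored": "3:30",
    "full_restoration": "6:10",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two same-day H:MM timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(events["cluster_deleted"], events["alert_sent"])
time_to_enterprise = minutes_between(events["cluster_deleted"], events["enterprise_restored"])
time_to_full = minutes_between(events["cluster_deleted"], events["full_restoration"])

print(f"detect: {time_to_detect} min")
print(f"enterprise services restored: {time_to_enterprise} min")
print(f"full restoration announced: {time_to_full} min")
```
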

Root Cause Analysis

A lack of security awareness among employees. Before executing dangerous operations, they did not promptly report to supervisory developers for review and failed to double-check the targeted resources.
In-house permission management was not granular enough. This led to employees having administrative access to the production environment clusters, without specific access controls.
The disaster recovery plan was inadequate. Kubernetes resources were not backed up automatically on a regular basis, making it impossible to rapidly restore from backups after an incident.
There was a lack of established emergency response procedures for related urgent situations, leading to longer response times.

Emergency Response

Fault Discovery

The monitoring system rapidly detected multiple 502 errors across various services, and a large number of alarms were triggered in a short time. Investigation showed that all of the issues originated from the tpe1 cluster. Subsequent checks within the development group revealed that an operation to delete the development cluster had been run about a minute earlier. Because the operation had not been reported to the supervisory developers for review in time, and because it was run under a high-privilege account, the production cluster was deleted by accident.

Fault Recovery

The development team initiated emergency measures immediately. Engineers contacted Google Cloud Platform customer service and confirmed that a deletion operation cannot be reversed once it has been executed. Moreover, because the Google Kubernetes Engine backup function had not been enabled, the cluster could not be restored from a backup.

The team's investigation determined that the cluster's persistent disks had not been destroyed by the cluster deletion. The decision was therefore made to manually create new stateful services in the backup cluster and bind them to the original disks. Stateless services were rebuilt from scratch.

Infrastructure operations engineers added computing nodes to the Taipei backup cluster, tpe0, and created new deployments to accommodate user services from the deleted cluster. Once the cluster was ready, the team started restoring user services using automated scripts and the original persistent disks.
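
To illustrate the idea of binding a new stateful service to a disk that survived the cluster deletion, here is a minimal sketch that generates a PersistentVolume and a matching PersistentVolumeClaim for a pre-existing Compute Engine persistent disk. It is not the migration script used during the incident. The project, zone, disk, namespace, and claim names are hypothetical, and the sketch assumes the Compute Engine persistent disk CSI driver (pd.csi.storage.gke.io) and an ext4 file system.

```python
import json


def manifests_for_existing_disk(project, zone, disk_name, size_gi, namespace, claim_name):
    """Build PV and PVC manifests that statically bind to an existing disk."""
    pv_name = f"restored-{disk_name}"
    storage = f"{size_gi}Gi"
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": pv_name},
        "spec": {
            "capacity": {"storage": storage},
            "accessModes": ["ReadWriteOnce"],
            # Retain keeps the underlying disk if the claim is deleted later.
            "persistentVolumeReclaimPolicy": "Retain",
            # An empty class disables dynamic provisioning for this volume.
            "storageClassName": "",
            "csi": {
                "driver": "pd.csi.storage.gke.io",  # assumed CSI driver
                "volumeHandle": f"projects/{project}/zones/{zone}/disks/{disk_name}",
                "fsType": "ext4",  # assumed file system
            },
            # Reserve the volume for exactly one claim.
            "claimRef": {"namespace": namespace, "name": claim_name},
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": claim_name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": "",
            "volumeName": pv_name,
            "resources": {"requests": {"storage": storage}},
        },
    }
    return pv, pvc


# Hypothetical values, for illustration only.
pv, pvc = manifests_for_existing_disk(
    project="example-project",
    zone="asia-east1-a",
    disk_name="user-db-disk",
    size_gi=10,
    namespace="user-services",
    claim_name="user-db-data",
)
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [pv, pvc]}, indent=2))
```

The printed JSON could be reviewed and then applied to the backup cluster. A workload that mounts the claim would then see the data from the original disk.
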

Thanks to the robustness of the existing Zeabur system and some straightforward migration scripts, the development team quickly transferred services to the backup cluster. After an experimental manual migration confirmed that the approach was feasible, services for enterprise plan users were given priority for restoration.

Once the migration of affected users' clusters was completed, the services for enterprise plan users were restored. The development team then began addressing user tickets. Services for developer plan and free plan users were restored gradually through a refined automated process.

Updated Emergency Plan: Users can temporarily migrate stateless services to the Hong Kong hkg1 cluster or the US sfo1 cluster. Stateful services, on the other hand, will be transferred to temporary clusters to safeguard the integrity of user data.

Verification

Upon verification, it was confirmed that the business system can start correctly on the backup cluster and provide services to the outside world, resolving the service interruption incident.

Handling

After the emergency concluded, the team immediately minimized access by deactivating all IAM permissions held by developers. At the same time, a security inspection of all infrastructure was carried out, which identified several potential security risks and produced an improvement plan.

Once all clusters were restored to normal, operational engineers activated the backup service for every cluster, performing a full backup. Regular backup services were initiated to incrementally back up user data at set intervals.

The team summarized and analyzed the event, reflecting and improving on the mistakes made during the incident. There was also a reassessment and redesign of the infrastructure's high availability and security.

User Service Assurance

For the occasional errors that continued to appear in user services, the team established a service recovery reporting process to ensure that recovery tickets receive a quick response:

Refer to Community > Seek Help and submit a ticket on Zeabur's official Discord Server.
The official Zeabur team will handle the ticket. For user recovery requests, we commit to completing service recovery within 10 minutes.

Subsequent Improvement Measures

Management Process

Strengthen security and standardized operational training for employees and interns. Any modifications to the production environment require review by a senior engineer. Once approved, the operation will be performed by the senior engineer. If feasible, a mechanism for restoration should always be available.

Permission Management

Following this incident, a complete redesign of IAM permission management was undertaken internally to ensure all developers are granted only the minimum required permissions.
For destructive operations (especially irreversible ones), developers must manually assume a dedicated role before operating on resources. This adds an extra layer of confirmation to ensure such oversights don't recur.
Both the monitoring system and management team will receive alerts when sensitive operations roles are assumed, allowing immediate rectification of potential mistakes.
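
The combination of manual role assumption and alerting described above can be sketched as follows. This is a simplified, in-memory illustration and not Zeabur's IAM configuration: the `ElevationBroker` class, the role name, and the 15-minute default lifetime are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ElevationBroker:
    """Grants short-lived sensitive roles and alerts on every grant."""

    notify: Callable[[str], None]  # e.g. posts to the monitoring system
    ttl_seconds: int = 900  # hypothetical default: 15 minutes
    _grants: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def assume(self, user: str, role: str, reason: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        if not reason.strip():
            raise ValueError("a reason is required to assume a sensitive role")
        self._grants[(user, role)] = now + self.ttl_seconds
        # The alert is sent at grant time, before any operation is run.
        self.notify(f"{user} assumed {role}: {reason}")

    def is_allowed(self, user: str, role: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return self._grants.get((user, role), 0.0) > now


alerts = []
broker = ElevationBroker(notify=alerts.append, ttl_seconds=60)
broker.assume("alice", "cluster-deleter", "remove obsolete test cluster", now=1000.0)

print(broker.is_allowed("alice", "cluster-deleter", now=1030.0))  # within the lifetime
print(broker.is_allowed("alice", "cluster-deleter", now=1061.0))  # after expiry
print(alerts)
```

Because the grant expires on its own, a developer cannot keep working under a high-privilege role indefinitely, which was one of the contributing factors in this incident.
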

Resource Locking

Set protective locks on crucial system resources to prevent erroneous modifications and deletions. Risky operations would require system administrator authentication and Multi-Factor Authentication (MFA).
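
As a rough illustration of a protective lock, the sketch below refuses to delete a resource unless the operator retypes its exact name, and additionally requires a verified MFA check for locked resources. The function, the exception, and the lock registry are hypothetical. In practice such locks would be enforced by the cloud provider or the infrastructure tooling rather than by application code.

```python
class ProtectedResourceError(Exception):
    """Raised when a destructive operation fails a safety check."""


# Hypothetical registry of locked (production) clusters.
LOCKED_CLUSTERS = frozenset({"tpe1", "hkg1", "sfo1"})


def guarded_delete(cluster, typed_name, mfa_verified, delete_fn, locked=LOCKED_CLUSTERS):
    """Run delete_fn(cluster) only if every safety check passes."""
    # Retyping the name catches the case where the wrong target is selected.
    if typed_name != cluster:
        raise ProtectedResourceError(
            f"confirmation {typed_name!r} does not match target {cluster!r}"
        )
    # Locked resources additionally require a fresh MFA verification.
    if cluster in locked and not mfa_verified:
        raise ProtectedResourceError(f"{cluster!r} is locked and requires MFA")
    delete_fn(cluster)


deleted = []

# An unlocked test cluster is deleted after name confirmation.
guarded_delete("test-cluster", "test-cluster", False, deleted.append)

# Targeting production while believing it is the test cluster is rejected.
try:
    guarded_delete("tpe1", "test-cluster", False, deleted.append)
except ProtectedResourceError as err:
    print(f"blocked: {err}")

print(deleted)
```
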

Disaster Recovery Enhancement

Based on this experience, Zeabur's disaster recovery and response measures for single-region failures were redesigned. By backing up Kubernetes resources and user service container images, we aim to restore service within 15 minutes even in a worst-case scenario.

Additionally, the team decided to raise the backup level for persistent user service data. Cross-region backups of persistent data for the enterprise and developer plans keep Zeabur user data safe even in the event of upstream provider failures.
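
One practical detail when backing up Kubernetes resources is that exported objects carry fields set by the original cluster, which should be removed before the object is re-created elsewhere. The sketch below shows one common way to do this. It is an assumption about how such a backup could be prepared, not a description of Zeabur's backup service, and the sample Deployment is invented for illustration.

```python
import copy

# Metadata fields populated by the API server of the original cluster.
SERVER_SET_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "generation",
    "managedFields",
    "selfLink",
)


def portable_manifest(obj: dict) -> dict:
    """Return a copy of a Kubernetes object that can be applied to another cluster."""
    clean = copy.deepcopy(obj)
    clean.pop("status", None)  # status is reported by the cluster, not declared
    metadata = clean.get("metadata", {})
    for key in SERVER_SET_METADATA:
        metadata.pop(key, None)
    return clean


# Invented example object, shaped like an exported Deployment.
exported = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        "name": "web",
        "namespace": "user-services",
        "uid": "0000-example",
        "resourceVersion": "12345",
        "creationTimestamp": "2023-10-01T00:00:00Z",
    },
    "spec": {"replicas": 2},
    "status": {"readyReplicas": 2},
}

backup = portable_manifest(exported)
print(backup)
```
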

Optimized Alerts

The monitoring alert system will be improved to identify anomalies in infrastructure operations in real time. Operations logs from upstream systems, such as AWS CloudTrail and Google Cloud Audit Logs, will be integrated into Zeabur's internal management system. Using ClickHouse and Grafana, corresponding alert rules will be set to promptly identify risky operations.
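
The kind of rule described above can be sketched in plain Python. The actual rules are intended to run in ClickHouse and Grafana, so this is only an illustration of the matching logic. The field names follow the Cloud Audit Logs entry layout (`protoPayload.methodName`, `authenticationInfo.principalEmail`, `resourceName`), while the list of risky methods and the sample entries are assumptions made for this example.

```python
# Method-name suffixes treated as risky (an assumed, non-exhaustive list).
RISKY_METHOD_SUFFIXES = ("DeleteCluster", "DeleteNodePool")


def risky_operations(entries):
    """Return one alert record for each audit log entry that matches a risky method."""
    alerts = []
    for entry in entries:
        payload = entry.get("protoPayload", {})
        method = payload.get("methodName", "")
        if method.endswith(RISKY_METHOD_SUFFIXES):
            alerts.append(
                {
                    "method": method,
                    "actor": payload.get("authenticationInfo", {}).get(
                        "principalEmail", "unknown"
                    ),
                    "resource": payload.get("resourceName", "unknown"),
                }
            )
    return alerts


# Invented sample entries, for illustration only.
sample_entries = [
    {
        "protoPayload": {
            "methodName": "google.container.v1.ClusterManager.DeleteCluster",
            "authenticationInfo": {"principalEmail": "intern@example.com"},
            "resourceName": "projects/example-project/zones/asia-east1-a/clusters/tpe1",
        }
    },
    {
        "protoPayload": {
            "methodName": "google.container.v1.ClusterManager.GetCluster",
            "authenticationInfo": {"principalEmail": "ops@example.com"},
            "resourceName": "projects/example-project/zones/asia-east1-a/clusters/tpe0",
        }
    },
]

for alert in risky_operations(sample_entries):
    print(f"ALERT: {alert['actor']} called {alert['method']} on {alert['resource']}")
```
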

Conclusion

Zeabur has always been dedicated to delivering a highly available service deployment experience. However, this incident showed that we have not met our own standards, and it exposed shortcomings in our infrastructure security management and high-availability design. We deeply apologize to all users who could not access Zeabur services during this outage. We are committed to learning from this incident, and the improvements described above are underway to ensure that such events do not recur.
