Web Server Outage — Postmortem Mock Interview at Holberton

lh1008
Jun 2, 2020
Photo by Koushik Pal on Unsplash

August 13, 1:00 AM (UTC). We received an error 500 that affected 100% of our clients' traffic through our main server. After receiving 10,000 500 errors, the load balancer redirected all traffic to our secondary server. The secondary server does not have the update-information feature enabled, so our clients were not able to send any information. The root cause of the error was a 60% disk utilization limit that the load balancer was not aware of: once the limit is reached, the server sends an alert and blocks every incoming connection.
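As a rough illustration of the kind of check that caused the block, here is a minimal Python sketch. The threshold constant, the function names, and the print-based alert are hypothetical stand-ins, not our production code; the sketch only shows a server refusing new connections once disk usage crosses a configured limit.

```python
import shutil

# Hypothetical threshold: in the incident, connections were blocked at 60%
# disk usage even though the intended limit was 85%.
DISK_USAGE_LIMIT = 0.60

def disk_usage_ratio(path="/"):
    """Return the fraction of disk space currently in use at `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def should_accept_connections(path="/"):
    """Alert and refuse new connections once usage crosses the limit."""
    ratio = disk_usage_ratio(path)
    if ratio >= DISK_USAGE_LIMIT:
        print(f"ALERT: disk usage {ratio:.0%} reached limit "
              f"{DISK_USAGE_LIMIT:.0%}; blocking connections")
        return False
    return True

if __name__ == "__main__":
    print("accepting traffic:", should_accept_connections())
```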

Timeline (UTC — Coordinated Universal Time)

  • 12:00 AM — First alert that the disk usage limit had been reached.
  • 1:00 AM — Traffic blocked by the main server; connections returned error 500.
  • 1:05 AM — Load balancer blocked traffic to the main server.
  • 1:05:30 AM — Alert sent to our internal system notifying of the 500 errors.
  • 1:35 AM — Our system administrator first saw the alert.
  • 1:45 AM — Incident Response Team (IRT) was notified, escalating the issue to our main engineer.
  • 2:30 AM — IRT found the root cause of the outage.
  • 3:00 AM — First attempt to restart the main server failed.
  • 3:20 AM — IRT sent a second failure notification.
  • 3:30 AM — Second engineer notified.
  • 3:45 AM — Third attempt returned a 200 'OK' response.
  • 3:50 AM — Server restarted.
  • 4:00 AM — Server running and traffic routed back to the main server.

Root Cause

The main cause of the server outage was an internal alert that blocked all server traffic once 60% of the disk was in use. The server was supposed to allow up to 85% disk usage before slowing traffic, but our software engineer had mistakenly forgotten to raise the server's threshold to 85%. The load balancer did not respond to the first alert either, because it was waiting for the 85% threshold to be reached, so it kept sending traffic. Since the 85% disk capacity alert never reached the load balancer, it kept sending requests to the main server until the 10,000 500-error limit was hit, at which point it redirected the traffic to the secondary server.
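To make the mismatch concrete, the sketch below simulates the behavior described above. All names and values are hypothetical stand-ins for our monitoring and load balancer settings; the point is that a server blocking at 60% while the load balancer waits for an 85% alert leaves the load balancer counting 500 errors all the way up to its 10,000-error limit.

```python
# Hypothetical illustration of the threshold mismatch described above.
SERVER_ALERT_THRESHOLD = 0.60   # what the server actually used (the mistake)
LB_EXPECTED_THRESHOLD = 0.85    # the alert level the load balancer waited for
ERROR_FAILOVER_LIMIT = 10_000   # 500 errors counted before redirecting traffic

def simulate(disk_usage, requests):
    """Count how many 500 errors clients see before failover kicks in."""
    errors = 0
    server_blocked = disk_usage >= SERVER_ALERT_THRESHOLD
    lb_sees_alert = disk_usage >= LB_EXPECTED_THRESHOLD
    for _ in range(requests):
        if lb_sees_alert or errors >= ERROR_FAILOVER_LIMIT:
            break               # load balancer finally redirects traffic
        if server_blocked:
            errors += 1         # main server answers with error 500
    return errors

if __name__ == "__main__":
    # Disk at 70%: the server blocks traffic, but the load balancer keeps
    # sending requests because it never received the 85% alert it expected.
    print(simulate(disk_usage=0.70, requests=20_000))  # -> 10000
```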

Preventive and Corrective Measures

Following the disk usage limit event, our software engineering team and the IRT took the following measures to prevent and correct the 500 errors.

  • Increased and verified the disk usage alert threshold on both the main server and the secondary server to 85% of total capacity before the alert is sent (see the configuration sketch after this list).
  • Corrected the synchronization between the server alerts and the load balancer.
  • Notified our lead software engineer.
  • Reduced the load balancer's 500-error limit for redirecting traffic from 10,000 to 400; 10,000 failed requests was an extremely high number to tolerate before switching to our secondary server.
  • Notified all of our clients of the 500 error at 5:00 AM (UTC).
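A minimal sketch of the corrected settings referenced in the list above is shown below. The dictionary keys and the checks are hypothetical, meant only to summarize the two corrected values: the 85% disk alert threshold on both servers and the 400-error failover limit.

```python
# Hypothetical summary of the corrected settings after the incident.
CORRECTED_SETTINGS = {
    "main_server_disk_alert": 0.85,       # alert at 85% disk usage, not 60%
    "secondary_server_disk_alert": 0.85,  # verified to match the main server
    "lb_error_failover_limit": 400,       # was 10,000 500 errors before failover
}

def validate(settings):
    """Basic consistency checks a team could run after deploying the change."""
    assert settings["main_server_disk_alert"] == settings["secondary_server_disk_alert"], \
        "disk alert thresholds must match on both servers"
    assert settings["lb_error_failover_limit"] <= 400, \
        "failover should trigger after at most 400 error responses"
    print("configuration looks consistent")

if __name__ == "__main__":
    validate(CORRECTED_SETTINGS)
```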

We on the Incident Response Team (IRT) are committed to SERVER_ACTIVe Co. and to doing our best to maintain our service to our clients 24/7. Please make this notification available to anyone interested.

Best Regards,

IRT
