Root Cause Analysis for Service Slowdown and Interruption
March 5, 2025
We sincerely apologize for the service issues on March 5th that affected some of our applications. We understand how much your institution relies on our services, and we regret any inconvenience this may have caused. Below, we’d like to provide details on what happened and the steps we are taking to prevent future occurrences.
At approximately 10:00 a.m. PST on March 5, 2025, we started receiving alarms indicating a rapid increase in the failure rate of students starting and completing the pre-exam steps in Respondus Monitor. However, the application servers appeared normal in terms of request processing, server load, memory usage, and thread and connection counts. Requests were being serviced without errors and within our maximum response-time target of 250 ms.
Yet while all health checks within the AWS environment indicated the servers were healthy and operating normally, monitoring services that run external to AWS began alerting on elevated response times and intermittent failures.
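To illustrate the difference between in-environment health checks and external monitoring, the sketch below shows a minimal external probe that measures end-to-end response time and flags requests that fail or exceed a 250 ms budget. The endpoint URL is a placeholder, and this is an illustration of the technique rather than our actual monitoring tooling.

```python
# Minimal external probe: measure end-to-end response time from outside AWS
# and flag slow or failed requests. The endpoint URL is a placeholder.
import time
import urllib.request

ENDPOINT = "https://example-monitor-endpoint/healthcheck"  # placeholder URL
BUDGET_SECONDS = 0.250  # response-time budget used for illustration

def probe(url: str = ENDPOINT) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            elapsed = time.monotonic() - start
            status = response.status
    except Exception as exc:  # timeouts, resets, etc. count as failures
        print(f"FAIL after {time.monotonic() - start:.3f}s: {exc}")
        return
    flag = "SLOW" if elapsed > BUDGET_SECONDS else "OK"
    print(f"{flag} {status} in {elapsed:.3f}s")

if __name__ == "__main__":
    probe()
```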
We then focused on the AWS Application Load Balancer (ALB), which distributes incoming traffic to the application servers. This service is fully managed by AWS, so we have limited insight into the health of its underlying nodes and cannot restart them ourselves. After examining the access logs for the load balancer, we saw a very high number of HTTP 460 errors, which the AWS documentation describes as follows:
"Client errors are generated when requests are malformed or incomplete. These requests were not received by the target, other than in the case where the load balancer returns an HTTP 460 error code. This count does not include any response codes generated by the targets."
Our initial thought was that a DDoS attack was flooding the load balancer with malformed packets. This was quickly ruled out because the requests associated with the errors looked normal in the access logs. Additionally, our Global Accelerator endpoints are protected by AWS Shield, which should stop network-layer attacks before they reach the load balancer. The 460 errors did, however, explain why the application servers appeared to be operating normally: a large percentage of requests was never reaching them.
Engineers at AWS said the 460 errors would also occur if the client (LockDown Browser) closed the connection before the load balancer sent its response. This seemed implausible because there hadn't been recent updates to the client applications. Moreover, such an issue would emerge gradually, over days or weeks, based on how we introduce new releases. This event escalated in minutes.
Concurrently with our investigation, we performed two rolling restarts of the application servers, which produced little improvement. Given these symptoms, we suspected the issue might be with the load balancer service itself. We decided to terminate all application servers at once (to fully scale down the load balancer) and then launch new application servers (to scale the load balancer back up, but on different nodes).
This had the immediate effect of restoring the service, and our initial theory was that the load balancer had gotten into a bad state. After further investigation, however, we determined the problem wasn't with the load balancer. The root cause was that a bandwidth limit had been reached on the elastic network interfaces attached to the application server instances, creating a bottleneck that continued to grow in the early stages of the event as failed requests were retried.
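For readers interested in how this kind of bottleneck can be detected on an instance, the sketch below reads the ENA driver's allowance-exceeded counters via ethtool; these counters increment when traffic is queued or dropped for exceeding an instance's network allowances. The interface name is an assumption, and this is offered as an illustration of the technique, not a description of our production monitoring.

```python
# Illustrative sketch: read the ENA driver's allowance-exceeded counters on an
# EC2 instance. Assumes the ENA driver is in use, where "ethtool -S" exposes
# counters such as bw_in_allowance_exceeded and bw_out_allowance_exceeded that
# increment when traffic exceeds the instance's network bandwidth allowance.
# The interface name (ens5) is an assumption.
import re
import subprocess

def ena_allowance_exceeded(interface: str = "ens5") -> dict:
    output = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    ).stdout
    counters = {}
    for line in output.splitlines():
        match = re.match(r"\s*(\w*allowance_exceeded):\s*(\d+)", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters

if __name__ == "__main__":
    for name, value in ena_allowance_exceeded().items():
        print(f"{name}: {value}")
```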
This service event primarily involved slow responses and intermittent failures. Once a student had entered an exam, the event would not have affected them until the exam was submitted on the learning system. At that point, students may have experienced delays or failures as they attempted to exit Respondus Monitor. The only complete outage occurred during the few minutes when everything was shut down to restart the entire service.
We have since performed a detailed analysis of the event and configured new alarms that trigger autoscaling of application servers before the network bandwidth limit is reached. We have also increased the minimum number of application servers, which will smooth the opening minutes of a large autoscaling event.
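In general terms, the sketch below shows how such an alarm can be wired to a scale-out policy and a higher fleet minimum with boto3. The Auto Scaling group name, threshold, adjustment sizes, and the use of the AWS/EC2 NetworkOut metric as a proxy for bandwidth pressure are illustrative assumptions, not our exact configuration.

```python
# Illustrative sketch: alarm on per-instance network throughput and scale out
# before the ENI bandwidth allowance is reached. The group name, threshold,
# and choice of NetworkOut as the proxy metric are assumptions.
import boto3

ASG_NAME = "app-servers"          # hypothetical Auto Scaling group name
THRESHOLD_BYTES = 8_000_000_000   # hypothetical per-period threshold, below the ENI limit

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Raise the floor so a surge starts from a larger fleet (hypothetical minimum).
autoscaling.update_auto_scaling_group(AutoScalingGroupName=ASG_NAME, MinSize=12)

# Step-scaling policy that adds capacity when the alarm fires.
policy_arn = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-out-on-network-pressure",
    PolicyType="StepScaling",
    AdjustmentType="ChangeInCapacity",
    StepAdjustments=[{"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 4}],
)["PolicyARN"]

# Alarm well below the bandwidth ceiling so scaling starts before throttling.
cloudwatch.put_metric_alarm(
    AlarmName=f"{ASG_NAME}-network-out-high",
    Namespace="AWS/EC2",
    MetricName="NetworkOut",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=THRESHOLD_BYTES,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy_arn],
)
```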
Finally, we want to note that the StudyMate Campus service was similarly affected during this event, as were users trying to start exams with the Chromebook version of LockDown Browser. In the latter case, the impact was due to how the Chromebook extension retrieves settings at startup.