Incident Report for EGD UI Outage on May 22, 2023
Incident Summary (All times in UTC)
At approximately 3:09pm on May 22, 2023, a routine update to Email Gateway Defense's (EGD) spam scoring logic inadvertently increased customers' email scores by 2.9 points. This caused a significant number of email messages to be blocked as false positives. The spam rule update was reverted; however, a large number of users needed to sign in to EGD to redeliver their incorrectly blocked messages.
At 4:30pm, the EGD team started to receive reports that the UI was experiencing an outage and that customers were seeing a 504 timeout error when attempting to sign in and release their blocked mail. While we investigated the incident, the CAPTCHA login step was turned on for administrators out of an abundance of caution. The investigation determined that the UI outage was due to the web servers being overloaded by the large number of users trying to sign in and resend their mail at the same time.
The EGD team tried to remediate this issue by refreshing existing web server instances and creating new ones to accommodate the much larger flow of traffic. While EGD has autoscaling settings in place to adjust automatically to changes in traffic patterns, the affected instances were incorrectly displaying as healthy and were not being refreshed as needed. By 6:17pm, the increased instance capacity for the web servers accommodated the higher traffic and the UI returned to normal.
The following day, May 23, 2023, at approximately 2:38pm, the EGD team began receiving additional reports that the UI for the new React platform was displaying timeout errors ("link to login is invalid") for certain end users. The EGD team investigated and, at 6:10pm, implemented a fix for a log-in call to the database that was causing abnormally long load times. While this additional incident was not directly related to the spam scoring update, the problem was exacerbated by the higher-than-usual number of users logging in to the new UI to remediate their email.
The EGD team is currently investigating ways to prevent any future changes to spam scoring from leading to such a high number of emails being blocked as false positives. The team is also investigating ways to adjust autoscaling logic to prevent any loss in service in the future. We apologize for the inconvenience that this has caused to our customers.
Impact Timeline (All times in UTC)
2023-05-22:
03:09pm: An update to spam scoring logic was implemented, inadvertently increasing customers' email scores and leading to a large number of false positives.
03:09pm-04:29pm: Users begin to log in to EGD at a high rate to redeliver their incorrectly blocked email messages, overloading the web servers. The CAPTCHA login step was implemented for administrators.
04:30pm: EGD team creates incident and begins investigation of UI outage.
05:02pm-05:56pm: Multiple attempts were made to refresh unhealthy web server instances and replace them with healthy ones to accommodate the higher traffic.
06:17pm: Traffic flow is back to being accommodated by the new healthy instances and the UI loads correctly for users.
2023-05-23:
02:38pm: The EGD team began receiving reports that the UI for the new React platform was displaying timeout errors ("link to login is invalid").
02:40pm: The team begins its investigation, identifies a problem with autoscaling capacity, and implements a fix.
06:10pm: A fix is implemented for a call in the log-in process that was causing abnormally long load times, and the UI begins to load as expected.
Impact Duration
Initial Issue Start: May 22, 2023 @ 3:09pm UTC
Initial Issue End: May 22, 2023 @ 6:17pm UTC
Secondary Issue Start: May 23, 2023 @ 2:38pm UTC
Secondary Issue End: May 23, 2023 @ 6:10pm UTC
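As a quick check on the length of each impact window, the durations can be computed directly from the four timestamps listed above. This is only an illustrative calculation; the helper names `parse_utc` and `duration_minutes` are made up for this example and are not part of any EGD tooling.

```python
from datetime import datetime, timezone

# Format matching the timestamps in this report, e.g. "May 22, 2023 @ 3:09pm".
# Parsing month names assumes an English (default "C") locale.
FMT = "%B %d, %Y @ %I:%M%p"


def parse_utc(stamp: str) -> datetime:
    """Parse a report timestamp and mark it as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def duration_minutes(start: str, end: str) -> int:
    """Whole minutes elapsed between two report timestamps."""
    delta = parse_utc(end) - parse_utc(start)
    return int(delta.total_seconds() // 60)


initial = duration_minutes("May 22, 2023 @ 3:09pm", "May 22, 2023 @ 6:17pm")
secondary = duration_minutes("May 23, 2023 @ 2:38pm", "May 23, 2023 @ 6:10pm")

for label, minutes in (("Initial", initial), ("Secondary", secondary)):
    print(f"{label} issue: {minutes // 60}h {minutes % 60:02d}m")
# Initial issue: 3h 08m
# Secondary issue: 3h 32m
```

Taken together, the two windows cover 6 hours 40 minutes of impact across the two days.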
Incident Analysis
EGD's Web interface interruption for customers in the US region on May 22, 2023 was due to an autoscaling failure during increased traffic load. This higher traffic load was due to an update to spam scoring logic that caused a large amount of customer mail to be incorrectly blocked. As users logged in to manage their blocked mail, the web servers became overloaded and the UI failed to load. Once instances were refreshed successfully, the UI worked as expected. However, the following day, the new React platform web servers were also overloaded due to a call in the log-in process that was creating abnormally long load times. This call combined with the increased traffic caused the new UI login flow to be interrupted with timeout errors for customers. This problem was also remediated.
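To illustrate why a uniform score increase produces false positives, the sketch below applies the 2.9-point offset from this incident to a handful of messages. Only the 2.9 figure comes from this report. The block threshold of 5.0, the sample scores, and the `is_blocked` helper are hypothetical values chosen for illustration and do not describe EGD's actual scoring configuration.

```python
SCORE_OFFSET = 2.9      # the unintended increase described in this report
BLOCK_THRESHOLD = 5.0   # hypothetical threshold, NOT EGD's real setting


def is_blocked(score: float, threshold: float = BLOCK_THRESHOLD) -> bool:
    """A message is blocked once its spam score reaches the threshold."""
    return score >= threshold


# Hypothetical scores for legitimate messages, all below the threshold.
legit_scores = [0.5, 1.8, 2.3, 3.1, 4.2]

blocked_before = sum(is_blocked(s) for s in legit_scores)
blocked_after = sum(is_blocked(s + SCORE_OFFSET) for s in legit_scores)

print(f"Blocked before update: {blocked_before} of {len(legit_scores)}")
print(f"Blocked after update:  {blocked_after} of {len(legit_scores)}")
```

Under these assumed numbers, none of the sample messages is blocked before the update and three of the five are blocked after it, because every message that previously scored within 2.9 points of the threshold is pushed over it.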
Corrective Actions
These are the areas being addressed as a result of this incident:
The EGD team is investigating the problem caused by the update to the spam scoring logic that caused many false positives.
The EGD team is investigating improvements in the incident resolution process for quicker identification of the problem and remediation planning.
The EGD team is investigating updates to the autoscaling logic for web servers to automatically accommodate higher traffic changes.
The EGD team is also investigating ways to allow users to more quickly and efficiently rescan and redeliver mail that may be blocked incorrectly due to an incident or setting update.
Next Steps for Customers
Unfortunately, because of the extremely high number of affected customers and emails, we are unable to manually rescan and redeliver mail that was incorrectly blocked due to this incident. We apologize for the inconvenience this has caused to our customers.
Please see below for the steps to allow end users to redeliver their mail. This will lessen the burden on administrators with a high number of emails to remediate.
Sign in to EGD as an administrator, go to the Users tab, and select Default Policy.
Set the setting labelled Allow end users to view and deliver blocked messages to Yes (if it is not already set to Yes) and click Save Changes.
Users will now be able to log in and deliver their incorrectly blocked mail. Note: this will potentially allow users to deliver mail that should not be delivered, so please advise users to proceed with caution. If admins are concerned about future user behavior, they can reset this setting to No after mail from the May 22, 2023 incident has been delivered.