Description
Today’s incident was a major interruption that should never have happened. We are grateful for your patience and understanding during the course of this incident. The following is a breakdown of the events that occurred during Incident #1001 and what we are doing to improve because of it:
Who was affected?
This incident affected customers who have domains pointed at our nameservers, NS1.MEDIATEMPLE.NET and NS2.MEDIATEMPLE.NET.
When did the incident occur?
This incident started on Tuesday, April 19th at 8:40 AM EDT and ended at 11:15 AM EDT.
What happened?
In short, the cluster of servers that handles your domain names crashed. These servers were operating in a degraded state and became overloaded, unable to handle normal DNS functions. When these servers stopped responding, requests to your domains went unanswered.
Why did this happen? (Technical Details)
The root cause of this incident was faulty monitoring. The connection tracking tables on several PowerDNS servers had filled up without warning, causing a string of server failures that ultimately resulted in a complete DNS interruption. Normally, our monitoring systems catch this condition and notify us before your service is affected; however, a newly deployed monitoring system was not properly configured.
How are you planning to improve? (Technical Details)
Several improvements have already been completed in response to this incident. First, monitoring has been added for the DNS servers that caused the interruptions for our customers. We have also increased the number of tracked connections our DNS servers can handle. The end result is a decreased chance of dropped requests due to the connection tracking tables being full. Lastly, we have begun monitoring these tables directly to provide an early warning system.
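For readers who run their own DNS servers, the following minimal sketch illustrates what monitoring a connection tracking table directly could look like. It assumes Linux hosts using netfilter connection tracking, where the kernel exposes the current entry count and the table limit under /proc/sys/net/netfilter/. The function names and the 80% and 95% thresholds are illustrative assumptions and do not describe the production monitoring configuration.

```python
from pathlib import Path

# Assumption: Linux with netfilter conntrack loaded, which exposes these files.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def usage_ratio(count: int, maximum: int) -> float:
    """Return the fraction of the connection tracking table in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def classify(count: int, maximum: int, warn: float = 0.80, crit: float = 0.95) -> str:
    """Map table usage to a status string; the thresholds are illustrative."""
    ratio = usage_ratio(count, maximum)
    if ratio >= crit:
        return "CRITICAL"
    if ratio >= warn:
        return "WARNING"
    return "OK"


def read_counter(path: Path) -> int:
    """Read a single integer value from a /proc file."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    try:
        count = read_counter(COUNT_PATH)
        maximum = read_counter(MAX_PATH)
    except OSError:
        # The files are absent when conntrack is not loaded or on non-Linux hosts.
        print("UNKNOWN: conntrack counters are not available on this host")
    else:
        print(f"{classify(count, maximum)}: {count}/{maximum} entries in use")
```

Run on a schedule, a check like this raises a warning while the table still has headroom, rather than after requests have started to be dropped. The table limit itself is the nf_conntrack_max kernel setting, which is the kind of value an administrator would raise to let a server track more connections.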
In closing,
We would like to sincerely apologize for the events regarding this incident. We understand that the availability of your service is of the utmost importance to you and your clients. If you would like to discuss this situation in further detail, please feel free to contact us at the number below, or via Support Request at any time.
This concludes Incident #1001. If this issue is still affecting you, please contact us at +1 647 800 5879.
—————————————————————————-
Engineers are reaching a resolution. Until the final fix is in place, some customers will see interruptions to Web, Mail, and FTP services. This is a temporary condition and is affecting only some Three Squared Hosting users. Watch this status blog for further updates.
Updates
April 19, 2011 10:16 am EDT
Please be advised that engineers are approaching a fix and that this is a temporary network issue affecting some Three Squared Hosting sites.
An ETA on the final fix is pending. Watch this status page for fresh information.
April 19, 2011 11:12 am EDT
Efforts by engineers are beginning to pay off, and we are starting to see an improvement. Many customers are now able to see their sites and get their email. Engineers are continuing to work on this issue. Watch this status page for fresh information.
April 19, 2011 12:00 pm EDT
The situation has improved dramatically, but our engineers are still actively working toward a final fix. In tandem, we’re keeping an eye on short-term network health and also drilling down to find the root cause of today’s network issues. New information will be posted here as soon as it is available. Thank you for your continued patience.
April 19, 2011 1:15 pm EDT
Engineers have mitigated the network issue that was causing service interruptions this morning. All web, email, FTP and SSH services are fully available once again.
We are in the process of a root cause analysis and will publish a full incident review as soon as possible.
If you are still experiencing any sort of service interruption, please open a support request with support@threesquaredstudios.com. Thank you for your patience in this matter.





