Reason for Outage Follow-up (8/10/11)
Dear Colo4 Customers,
Thank you for your patience and understanding with our equipment failure
this week. We apologize for the disruption to your business and the
stress and frustration that you experienced. As promised, we have
compiled this Reason for Outage report as part of our after-action
assessment.
What Happened: On Wednesday, August 10, 2011 at 11:01AM CDT, the Colo4
facility at 3000 Irving Boulevard experienced an equipment failure with
one of the automatic transfer switches (ATS) at service entrance #2,
which supports some of our long-term customers. The ATS was damaged and
would not pass either commercial or generator power, whether
automatically or in bypass mode. To restore power, we therefore had to
put a temporary replacement ATS into service.
Colo4’s standard redundant power offering has commercial power backed up
by diesel generator and UPS. Each of our six ATSs is paired with its own
generator and service entrance. The five other ATSs and service
entrances at the facility were unaffected.
The ATS failure at service entrance #2 affected customers who had single
circuit connectivity (one power supply). Customers with redundant
circuits (A/B dual power supplies) draw from two ATSs, so the B
circuit automatically handled the load. (A few
customers with A/B power experienced initial downtime due to a separate
switch that was connected to two PDUs and the same service entrance.
Power was quickly restored.)
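For customers auditing their own A/B power, the exception described
above (both feeds tracing back to the same service entrance) can be
checked mechanically. The following is a minimal Python sketch under
stated assumptions, not Colo4 tooling: the Feed record, the device names
and the inventory data are all hypothetical and would need to be
replaced with a real circuit inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feed:
    """One power feed to a device (hypothetical inventory record)."""
    device: str
    circuit: str           # "A" or "B"
    service_entrance: int  # service entrance the circuit traces back to


def shared_entrance_devices(feeds):
    """Return devices whose feeds span fewer than two service entrances."""
    entrances = {}
    for feed in feeds:
        entrances.setdefault(feed.device, set()).add(feed.service_entrance)
    return sorted(dev for dev, seen in entrances.items() if len(seen) < 2)


# Hypothetical example data, for illustration only.
feeds = [
    Feed("rack-switch-1", "A", 2),
    Feed("rack-switch-1", "B", 2),  # A and B share service entrance #2
    Feed("server-7", "A", 2),
    Feed("server-7", "B", 3),       # truly redundant
    Feed("server-9", "A", 2),       # single circuit
]

at_risk = shared_entrance_devices(feeds)
print(at_risk)
```

In this sample data, a device is flagged either because it has a single
circuit or because its A and B circuits share one service entrance; only
the device fed from two different service entrances passes.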
Response Actions: As soon as this incident occurred, we worked to
mobilize the proper professionals in our facility and extended team.
Our on-site electrical contractors and technical team worked quickly
with the general contractors and UPS contractors to assess the situation
and determine the fastest course of action to bring customers back
online.
As part of our protocol, we first conducted a thorough check of the
affected ATS as well as the supporting PDU, UPS, transformer, generator,
service entrance, HVAC, and electrical systems. It was determined that
all other equipment was functioning properly and that the failure was
limited to the ATS device. This step was important to ensure that the
problem had not affected other equipment and would not recur at other
service entrances.
It was further determined that the ATS would need extensive repairs and
that the best scenario for our customers would be to install a temporary
ATS. As the ATS changeover involved high-voltage power, it was
important that we move cautiously and deliberately to ensure the safety
of our employees, contractors and customers in the building as well as
our customers’ equipment. Safely bringing the new unit online was our
top priority.
After the temporary ATS was installed and tested, the team brought up
the HVAC, UPS and PDU individually to ensure that there was no damage to
those devices. Then, the team restored power to customer equipment.
Power was restored as of 6:31PM CDT.
The UPSs were placed in bypass mode on the diesel generator to allow the
batteries to fully charge. The transition from diesel generator to
commercial power occurred at 9:00PM CDT with no customer impact.
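For reference, the durations implied by the times in this report can be
worked out as follows. This is a small illustrative Python sketch; the
variable names are hypothetical, and it assumes all three times fall on
August 10, 2011, CDT, as stated above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %I:%M%p"

# Times taken from this report (all CDT, same day).
failure = datetime.strptime("2011-08-10 11:01AM", FMT)
restored = datetime.strptime("2011-08-10 6:31PM", FMT)
commercial = datetime.strptime("2011-08-10 9:00PM", FMT)

# Time from the ATS failure until customer power was restored.
outage = restored - failure
# Time from power restoration until the return to commercial power.
until_commercial = commercial - restored

print(outage)            # 7:30:00
print(until_commercial)  # 2:29:00
```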
Colo4 technicians worked with customers to bring online any equipment
that did not come back on when power was restored and to reset devices
where breakers had tripped during the power restoration. This process
continued throughout the evening.
Assessment: As part of our after-action assessment, the Colo4 management
team has debriefed with the on-site technical team and electrical
contractors, as well as the equipment manufacturer, UPS contractors and
general contractors, to assess the ATS failure. While an ATS failure is
rare, it is even rarer for an ATS to fail in a way that also prevents
it from going into bypass mode.
While the ATS could be repaired, we made the decision to order a new
replacement ATS. This is certainly a more expensive option, but it is
the option that best provides long-term stability for our customers.
Lessons Learned: Thankfully, we’ve experienced few issues during our 11
years in business, though any issue is one too many. As part of our
after-action review, we have made additional improvements to our
existing emergency/disaster recovery plans.
Our technical team and our HVAC, electrical and general contractors
brought exceptionally fast, sophisticated thinking and action to get our
customers back in business as quickly as possible. Working with power of
that size and scale is complex at any time, and especially under
pressure; doing so shows the level of merit, knowledge and resolve that
these individuals have. Thank you to the technical team and all our
contractors for a job well done to safely restore power for our
customers.
As part of the debrief, we checked all Colo4 network gear in both
facilities to confirm that it is on redundant power, and all of it is
connected properly.
Unfortunately, we weren’t well prepared on the customer service side.
Our customers were stressed and needed more frequent updates from us
along the way. We very much wanted to provide you with an ETA earlier.
Due to the extent and complexity of the failure, we were unable to
provide a proper ETA quickly and did not want to send out false
information or set the wrong expectation.
For any future incidents, we plan to provide process updates along the
way even if we are unable to provide an exact ETA at that moment. We
hope that this step will provide insight into the work under way during
the assessment period.
We will continue to send direct emails to affected customers and post
website status updates. As the website received heavy hits during the
incident, we are upgrading the website server to better handle requests.
Based on our web server stats for the past year, the server had
excellent capacity, but in this case, we experienced a heavier load from
our customers and our customers’ customers. We will move some equipment
to secondary offsite locations.
We’ve also set up a Twitter account, @colo4, to post future updates and
respond more quickly. As you may have noticed, we began using Twitter
that afternoon.
Next Steps: Once we receive and test the new ATS, we will schedule a
maintenance window to replace the equipment. We will provide at least
three days’ advance notice and timelines to minimize any disruption.
Thank you again for your patience and understanding. We take our
relationships very seriously and realize that you rely on us to keep
your business online. We’re sorry that our equipment failure caused
challenges this week.
Please let us know if you have any questions or need assistance.
Sincerely,
Paul Estes          Paul VanMeter
CEO                 CTO