Massive Data Center Failure This Afternoon Due to Drunk Guy

Old 07-24-2007, 11:46 PM
  #1  
Registered User
Thread Starter
 
Hollandaze's Avatar
 
Join Date: Nov 2002
Location: San Leandro, CA
Posts: 5,245
Car Info: 14 Mazda3 sGT, SOLD 12/26: 00 2.5RS Sedan
Massive Data Center Failure This Afternoon Due to Drunk Guy

http://valleywag.com/tech/breakdowns...out-282021.php

Wow. That was amusing.

Last edited by Hollandaze; 07-24-2007 at 11:51 PM.
Hollandaze is offline  
Old 07-25-2007, 02:55 AM
  #2  
03.23.67 - 06.14.13
iTrader: (3)
 
ldivinag's Avatar
 
Join Date: Nov 2002
Location: N37 39* W122 3*
Posts: 8,495
Amazing that a data center is IN SF...
ldivinag is offline  
Old 07-25-2007, 02:57 AM
  #3  
Registered User
 
wrXtian's Avatar
 
Join Date: Jun 2006
Location: four one five
Posts: 401
Car Info: 06 wrx
I wonder if that's why videobox.com isn't working...
wrXtian is offline  
Old 07-25-2007, 07:07 AM
  #4  
the artist formerly known as mcdrama
iTrader: (23)
 
mattsn0w's Avatar
 
Join Date: Apr 2004
Location: Santa Cruz Mountains, CA.
Posts: 6,428
Car Info: WRBP 2015 WRX Premium/CVT
Yes, it does suck. All the servers my employer has there happen to be in colo 4. Now I have to go up there to figure out why half the systems haven't come back online.

Yay for drunken idiots and explosions from manholes!
mattsn0w is offline  
Old 07-25-2007, 08:53 AM
  #5  
Registered User
iTrader: (6)
 
wrxguy's Avatar
 
Join Date: Apr 2004
Location: SSF
Posts: 2,615
Car Info: 04wrx
No wonder I couldn't get into craigslist yesterday.
wrxguy is offline  
Old 07-25-2007, 09:14 AM
  #6  
Yeah, You've Probably Never Heard Of Me.
iTrader: (21)
 
Krinkov's Avatar
 
Join Date: Sep 2003
Location: in a glass case of emotion.
Posts: 17,962
Car Info: 345/30/19s
Actually, the story's BS. There were widespread power outages all over SF yesterday; 10,000 homes and businesses were out for most of the day. That story has already been debunked. I can understand people outside the Bay Area falling for it, but I'm surprised anyone who lives here fell for it. Didn't you guys hear about the blackouts here yesterday?
Krinkov is offline  
Old 07-25-2007, 10:12 AM
  #7  
Registered User
iTrader: (14)
 
Egan's Avatar
 
Join Date: Nov 2001
Location: Peoples Republik of Kalifornia
Posts: 14,221
Car Info: 05 H2 SUT, 45 GPW, 10 Murano, 13 Boss 302
So why didn't their UPS and Gens kick in?
Egan is offline  
Old 07-25-2007, 11:11 AM
  #8  
Registered User
iTrader: (8)
 
mcowger's Avatar
 
Join Date: Dec 2004
Location: Seattle
Posts: 1,737
Car Info: 2009 A3 2.0T quattro
Originally Posted by Egan
So why didn't their UPS and Gens kick in?
365 Main (my colo as well) has yet to provide a root cause for us. The gensets kicked on after about 45 minutes. I was there from 2 PM until 12:30 AM yesterday fixing my ****. Seeing your entire DC (mine is about half the floor of Colo 1: roughly 600 servers and a few big storage arrays) dark and quiet sucks.

Sorry, no drunk guys anywhere, but there was a two-hour line out the door to get temp badges.

As for colos in SF: there are some MAJOR ones... Level(3) at 185 Berry St, 200 Paul St, 365 Main St... pretty big-name companies.
mcowger is offline  
Old 07-25-2007, 11:41 AM
  #9  
Registered User
iTrader: (14)
 
Egan's Avatar
 
Join Date: Nov 2001
Location: Peoples Republik of Kalifornia
Posts: 14,221
Car Info: 05 H2 SUT, 45 GPW, 10 Murano, 13 Boss 302
Originally Posted by mcowger
365 Main (my colo as well) has yet to provide a root cause for us. The gensets kicked on after about 45 minutes. I was there from 2 PM until 12:30 AM yesterday fixing my ****. Seeing your entire DC (mine is about half the floor of Colo 1: roughly 600 servers and a few big storage arrays) dark and quiet sucks.

Sorry, no drunk guys anywhere, but there was a two-hour line out the door to get temp badges.

As for colos in SF: there are some MAJOR ones... Level(3) at 185 Berry St, 200 Paul St, 365 Main St... pretty big-name companies.
45 minutes - holy crap!

Sounds like they have some major issues with their system. The UPS should have switched to battery immediately, and the ATS/STS should have started the genset at almost the same time. Besides that, they should have multiple UPS modules.

I wonder if they had a cascade failure of their UPS system.

I'm sure they have some seriously pissed off customers right now.
Egan is offline  
Old 07-25-2007, 11:58 AM
  #10  
Registered User
iTrader: (8)
 
mcowger's Avatar
 
Join Date: Dec 2004
Location: Seattle
Posts: 1,737
Car Info: 2009 A3 2.0T quattro
Originally Posted by Egan
45 minutes - holy crap!

Sounds like they have some major issues with their system. The UPS should have switched to battery immediately, and the ATS/STS should have started the genset at almost the same time. Besides that, they should have multiple UPS modules.

I wonder if they had a cascade failure of their UPS system.

I'm sure they have some seriously pissed off customers right now.

Pissed-off customers, including me.

They actually don't use batteries; they use a flywheel system (which I personally think is a better solution). The generators are spec'd to go from stop to clutch engagement in 3 seconds, and the flywheels last about 10 seconds each (10 x 2.1 MW Hitec systems). However, they had trouble getting the generators started, and that was the problem. The CPSs worked fine; the generators bit it.
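
Just to put numbers on that: the design only holds together if a genset actually starts inside the flywheel's ride-through window. A rough sketch of the arithmetic (Python; the 10-second flywheel figure, the 3-second start spec, and the 10 x 2.1 MW units are the numbers above, and the margin check itself is just my illustration, not 365 Main's spec):

Code:
# Rough sanity check of the ride-through numbers quoted above. The 10 s
# flywheel figure, the 3 s start spec, and the 10 x 2.1 MW units come from
# this post; the margin calculation itself is only illustrative arithmetic.

FLYWHEEL_RIDE_THROUGH_S = 10   # each Hitec flywheel carries its load ~10 s
GENSET_START_SPEC_S = 3        # stop to clutch engagement, per spec
UNITS = 10
UNIT_CAPACITY_MW = 2.1

def ride_through_ok(genset_start_s):
    """True if the genset picks up the load before the flywheel spins down."""
    return genset_start_s <= FLYWHEEL_RIDE_THROUGH_S

print("Total backup capacity: %.1f MW" % (UNITS * UNIT_CAPACITY_MW))
print("Margin when a genset starts on spec: %d s" % (FLYWHEEL_RIDE_THROUGH_S - GENSET_START_SPEC_S))
print("Genset starts on spec (3 s):", ride_through_ok(GENSET_START_SPEC_S))
print("Genset needs a manual start:", ride_through_ok(45 * 60))  # ~45 min, per this thread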
mcowger is offline  
Old 07-25-2007, 06:31 PM
  #11  
pwn
Registered User
iTrader: (10)
 
pwn's Avatar
 
Join Date: Mar 2006
Location: Dublin, California
Posts: 2,498
Car Info: 09 STi, 10 Cayman S
Yep, it's always either the generator or the transfer switch. You'd be surprised how often these seemingly redundant systems get caught up on something. I wonder how often 365 powers up their generators for testing...
pwn is offline  
Old 07-25-2007, 07:21 PM
  #12  
the artist formerly known as mcdrama
iTrader: (23)
 
mattsn0w's Avatar
 
Join Date: Apr 2004
Location: Santa Cruz Mountains, CA.
Posts: 6,428
Car Info: WRBP 2015 WRX Premium/CVT
Originally Posted by Krinkov
Actually, the story's BS. There were widespread power outages all over SF yesterday; 10,000 homes and businesses were out for most of the day. That story has already been debunked. I can understand people outside the Bay Area falling for it, but I'm surprised anyone who lives here fell for it. Didn't you guys hear about the blackouts here yesterday?
I know, but it sounds more entertaining.

Either way, it's their fault for not testing the equipment.


I just checked my work email and found this:


UPDATE: 5:00 P.M., Wednesday, July 25, 2007

A complete investigation of the power incident continues with several specialists and 365 Main employees working around the clock to address the incident.



Generator/Electrical Design Overview

The San Francisco facility has ten 2.1 MW back-up generators to be used in the event of a loss of utility. The electrical design is N+2, meaning 8 primary generators can successfully power the building (labeled 1-8), with 2 generators available on stand-by (labeled Back-up 1 and Back-up 2) in case there are any failures with the primary 8.



Each primary generator backs up a corresponding colocation room, with generator 1 backing up colocation room 1, generator 2 backing up colocation room 2, and so on.



Series of Electrical Events

· The following is a description of the electrical events that took place in the San Francisco facility following the power surge on July 24, 2007:

o When the initial surge was detected at 1:47 p.m., the building’s electrical system attempted to roll all colocation rooms to diesel generator power.

o Generator 1 detected a problem in its start sequence and shut itself down within 8-10 seconds. The cause of the start-up failure is still under investigation though engineers have narrowed the list of suspected components to 2-3 items. We are testing each of these suspected components to determine if service or replacement is the best option. Generator 1 was started manually by on-site engineers and reestablished stable diesel power by 2:24 p.m.

o After initial failure, Generator 1 attempted to pass its 732 kW load to Back-up 1, which also detected a problem in its start sequence. The exact cause of the Back-up 1 start sequence failure is also under investigation.

o After Generator 1 and Back-up 1 failed to carry the 732 kW, the load was transferred to Back-up 2 which correctly accepted the load as designed.

o Generator 3 started up and ran for 30 seconds before it too detected a problem in the start sequence and passed an additional 780 kW to Back-up 2 as designed.

o Generator 4 started up and ran for 2 seconds before detecting a problem in the start sequence, passing its 900 kW load on to Back-up 2. This 900kW brought the total load on Back-up 2 to over 2.4 MW, ultimately overloading the 2.1 MW Back-up 2 unit, causing it to fail. Generator 4 was manually started and brought back into operations at 2:22 p.m. Generator 4 was switched to utility operations at 7:05 a.m. on 7/25 to address an exhaust leak but is operational and available in the event of another outage.

o Generators 2, 5, 6, 7 and 8 all operated as designed and carried their respective loads appropriately.

o By 1:30 p.m. on Wednesday, July 25, after assurance from PG&E officials that utility power had been stable for at least 18 continuous hours, 365 Main placed the diesel engines back in standby and switched generators 2, 5, 6, 7, and 8 to utility power.

· Customers in colocation rooms 2, 4, 5, 6, 7 & 8 are once again powered by utility, and are backed up in an N+1 configuration with Back-up 2 generator available.

· Generators that had failed during the start-up sequence but were performing normally after manual start (1 & 3) continue to operate on diesel and will not be switched back to utility until the root causes of their respective failures are corrected.



Other Discoveries

· In addition to previously known affected colocation rooms 1, 3 and 4, we have discovered that several customers in colo room 7 were affected by a 490 millisecond outage caused when the dual power input PDUs in colo 7 experienced open circuits on both sources. A dedicated team of engineers is currently investigating the PDU issue.



Next Steps

· Determine exact cause of generator start-up failure and PDU issues through comprehensive testing methodology.

· Replacements for all suspected components have been ordered and are en route.

· Continue to run generators 1 & 3 on diesel power until automatic start-up failure root cause is corrected.

· Continue to update customers with details of the ongoing investigation.
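
The overload on Back-up 2 in that sequence is plain arithmetic once you line up the transferred loads. A rough sketch (Python; the kW figures and the 2.1 MW rating come straight from the update above, and the simple trip-when-over-capacity rule is just an assumption for illustration):

Code:
# Walk-through of the load arithmetic in the update above. The kW figures
# and the 2.1 MW (2100 kW) Back-up 2 rating come from the email; the simple
# "trip when over capacity" rule is an assumption for illustration only.

BACKUP2_CAPACITY_KW = 2100

# Loads that landed on Back-up 2, in the order the update describes.
transfers = [
    ("Generator 1's load (via failed Back-up 1)", 732),
    ("Generator 3's load", 780),
    ("Generator 4's load", 900),
]

total_kw = 0
for source, kw in transfers:
    total_kw += kw
    status = "OK" if total_kw <= BACKUP2_CAPACITY_KW else "OVERLOAD, unit trips"
    print(f"{source}: +{kw} kW -> {total_kw} kW on Back-up 2 ({status})")
# Ends at 2412 kW on a 2100 kW unit, matching the "over 2.4 MW" figure above.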
mattsn0w is offline  
Old 07-25-2007, 08:28 PM
  #13  
Registered User
iTrader: (8)
 
mcowger's Avatar
 
Join Date: Dec 2004
Location: Seattle
Posts: 1,737
Car Info: 2009 A3 2.0T quattro
Originally Posted by pwn
Yep, its always either the generator or the transfer switch. You'd be surprised how often these seemingly redundant systems get caught up on something. Wonder how often 365 powers up their generators for testing..
According to them, weekly.
mcowger is offline  
Old 07-26-2007, 11:25 AM
  #14  
03.23.67 - 06.14.13
iTrader: (3)
 
ldivinag's Avatar
 
Join Date: Nov 2002
Location: N37 39* W122 3*
Posts: 8,495
Originally Posted by mcowger
According to them, weekly.
Can you get that in writing?

So much for the five 9s???????
ldivinag is offline  
Old 07-26-2007, 11:29 AM
  #15  
Registered User
iTrader: (8)
 
mcowger's Avatar
 
Join Date: Dec 2004
Location: Seattle
Posts: 1,737
Car Info: 2009 A3 2.0T quattro
Originally Posted by ldivinag
Can you get that in writing?

So much for the five 9s???????
Oh, it is in writing.

They will be paying us lots of $$$ for blowing their SLA.
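
For context on the five-9s jab: 99.999% availability allows only about five minutes of downtime per year, so a roughly 45-minute hit blows the budget many times over on its own. A quick sketch of that arithmetic (the actual SLA terms aren't public in this thread, so this is only the generic availability math):

Code:
# Back-of-the-envelope on the "five 9s" jab. 99.999% availability allows
# roughly 5.26 minutes of downtime per year; the ~45 minutes before the
# gensets were running (per earlier in this thread) blows through that on
# its own. The actual SLA terms aren't public here, so this is only the
# generic availability arithmetic.

MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_min(availability):
    """Yearly downtime budget, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability)

for label, availability in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
    print(f"{label}: {allowed_downtime_min(availability):.2f} min/year allowed")

print("This outage alone: ~45 minutes")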
mcowger is offline  

