Data center design
Building a data center -- is your team without a coach?
04 NOV 2005 23:58 EST (04:58, GMT)
If you've been reading this blog the past two weeks, you have an idea how complex a data center project is and the range of expertise and number of difficult decisions it requires. But we've only scratched the surface. It would take books to cover all the details, and they'd be obsolete before they were printed. This is the multi-million-dollar "big game" -- probably the most critical and expensive single space your company will ever build, and the action changes constantly. A team of players has been signed. Hopefully, everyone is all-star caliber. But is information technology even there? For IT, this is not just another job -- it's the foundation of a career. They really ought to be playing quarterback, but we all know it rarely works that way.
IT is often sitting on the bench, or even left in the locker room. If they're allowed on the field, they may not be heard when they try to call a play, or they have no defense and are sacked before they can get off their first pass. But when the game is finished, and the stands are empty, it's IT that will be responsible for making the place a success, and it's IT that will be blamed for its failures. They know the playing field better than anyone else. So why are they so often on the sidelines, and what can be done about it?
It's not really that hard to understand. IT doesn't talk the same language as the rest of the team, nor does the rest of the team speak IT. Everyone knows the data center is the "heart" of modern business, but few in IT are accustomed to the professional design process that builds it. They don't communicate well with the "non-IT" world and, quite frankly, they scare "outsiders." And it's well known that most ITers tend to delve into the "nitty gritty" at the concept stage, so they're regarded as indifferent to budgets or the clock running out. Most of the team will have had experience with IT people before, much of which may have prejudiced them, so it shouldn't be surprising if they're fearful about heavy IT involvement slowing down the process. But make no mistake: The design team is also very interested in having a good project.
What everybody needs is a good coach.
But wait. There's more. The "team" of professionals includes a whole range of talents: architect, facilities, electrical engineer, mechanical engineer, structural engineer, plumbing and fire protection engineer, cable designer, general contractor, sub-contractors for every trade, and perhaps a realtor, owners rep, finance and more. In short, you have a lot of individual talent, and perhaps some "vested interests," but you may be surprised to learn there is often not much real "team play" among them in this all-important game.
The architect is the general manager, but a good coach is still needed. The data center is a unique area.
This last blog may seem self-serving, but it's really meant to serve everybody. There aren't a lot of us yet who make data center design a real specialty, who are up to date on the latest technology and techniques, and who are also able to speak the languages of everyone on the team. A handful of engineering firms have someone who can do it. There are some design-build firms that have all the talent under one roof, but they probably won't be the favorite of the architect on a multi-part project. The big hardware vendors provide consulting, but most owners find it difficult to consider them independent. And there are those few IT consultants with this kind of specialization. Again, there aren't many of us right now, but it's worth looking for someone good with so much at stake.
So let's talk about what a winning coach will do in a data center project.
First is to make sure both timetables and budgets are realistic. Designing and building a data center takes time. Period. If you try to push it or do it cheap, you'll pay in any number of ways, both initially and down the road. A major data center (10,000 sq. ft. or larger) could easily take a year and a half or more. If someone is reluctant to take on your project because the timetable is too short, I suggest you respect them. They're probably telling it like it is and are unwilling to risk your future, or their reputation, on something that will be nothing but problems. Someone with knowledge and experience is usually better heard.
Schedule is usually the biggest headache when the data center is part of a larger project, rather than a standalone facility. Even if the entire timetable is long enough, the data center has to be completely ready (and I emphasize "completely") at least two or three months before the first people move into the building. Without this time the systems can't be installed and tested the way they need to be. An experienced specialist, knowledgeable in all facets of the design and construction process, can not only make the point strongly, but can also help schedule the various facets of the job so it really can get done. They are also in a position to monitor construction, alleviate concern over things that waste time and don't yet matter, and be very vocal at every meeting about the things that do. You can't wait until the week before a deadline to blow the whistle on status. It has to be an ongoing process, and it has to be backed up with knowledge.
Where most projects go astray, get into trouble or go over budget is in design. The architect needs to know how big the space must be, and the engineers need power and heat load figures, and they need them right at the beginning. Unless you do this every day and understand the total design process, developing that information early in the project and presenting it so it's both useful and justifiable to the various disciplines is very difficult to do. And if you can't do this, one of two things will happen: Either the project will get delayed with IT getting the blame, or the individual designers will base their work on "norms" that may be totally unrealistic for your facility. When IT then rushes in with what they really need later in the game, it will involve very expensive change orders if it can be done at all. Not the best way to play the game. The old adage applies: "You can have it good, fast, cheap. Which two do you want?"
Next are three issues: design detail, best practices and balanced design. It really doesn't need to be said again that data centers are very expensive facilities. And they get outrageously expensive when change orders are necessary. Nearly all change orders can be avoided IF the project is thoroughly designed and detailed. Again, this takes time, but the level of detail is also beyond what is normal for standard construction. Someone needs to know what's necessary, and be in a position to insist on it. And even if it's detailed, and each team specialist is superb, if they're all "doing their own thing" you're not going to spend wisely. The likelihood is you'll spend beaucoup bucks on one system, like power, that is designed for ultra-reliability and redundancy, but have something else like air conditioning that is relatively under-designed and one day throws the UPS into thermal failure when it goes down. The configurations of each system should meet performance goals that are realistic for your operation, without going overboard and wasting budget somewhere that doesn't need it or under-designing through lack of understanding in someplace that does. A good coach will outline the plays for each member of the team and make sure they are executing them properly.
Next is the most challenging part of today's data center design -- planning ahead. It's unrealistic to expect team members who daily work on a wide variety of projects to keep up with everything happening in the IT world or with all the special products coming onto the market to handle the power, heat, cable management, monitoring and equipment mounting challenges. It's even more unrealistic to assume they maintain close industry contacts that keep them apprised of what's coming or that they attend all the seminars where ideas, experiences and non-proprietary information are exchanged. A good data center today should have the flexibility to absorb coming technologies for some years in the future, but that kind of planning takes intimate knowledge of IT industry directions. In addition, decisions have to be made between what really works and what's marketing hype, as well as the appropriate tactics for a particular facility. A good "coach" should be scouting the field of players, and knowledgeably assessing their strengths and weaknesses.
Last is the game itself -- construction. There isn't a single part of a data center that doesn't require some sort of special attention to construction technique. It takes someone who was involved in the design process and who also understands each system and how it should be built to assist in the bid review and contractor selection process and to make site visits and catch problems before they become hidden or impractical to correct. Again, the job of a good coach should be to recognize good players, make them better at their roles, spot the mistakes and get them fixed.
In short, a professional "coach" brings to the table not only a wealth of experience, but a good game plan and the ability to communicate it as well. Building a new data center can be a wonderful opportunity. It can also be a career disaster. It may look gorgeous on Day 1, no matter how well it's really designed and built, because it's all new, clean and organized. But if someone has to tell upper management in a few years that it needs to be replaced or renovated, there will be some tall explaining to do. A number of data centers built only a short time ago are already being replaced because they weren't designed with the future in mind and can't be realistically upgraded without seriously endangering operations. If you have any part of the responsibility, whether you're in IT, facilities or management, you have a tremendous challenge ahead of you. A good coach can make it a lot easier.
I've thoroughly enjoyed writing this blog for the last two weeks and answering some very well thought-out questions. I hope the time to read this material has been valuable, and that one day, some of us may meet first-hand.
Posted by Robert McFarlane
To gas or not to gas? That is the question!
03 NOV 2005 21:29 EST (02:29, GMT)
The question is simple, and we are asked it almost immediately by every client: "What kind of fire protection should I use in my data center? Sprinklers or gas?" Actually the question is usually phrased as: "Pre-action or FM-200?", which is not entirely proper as we'll explain, but you get the point. I've previously answered a reader question about this so there will be some repetition, but we'll go into a lot more detail about this critical decision here. It involves philosophy, regulations, corporate policy and, of course, that ever-present factor, money.
We can state the premise in two sentences. Water is intended to protect buildings and people. Inert gas is intended to protect equipment. Either one should be capable of doing both, but we'll see as we go why that's not a generally accepted position.
There is no single "right answer" regarding fire protection in a data center and, as I stated in my answer to the reader's question, I doubt you will find anyone, other than a fire suppression salesperson, who will tell you there is a "correct solution." Unpredictable things happen in fires, and if the "recommended" solution doesn't cover that particular contingency, the liabilities are just too high. What we can do is explain the different systems, discuss the factors you need to understand, investigate and consider, and give the pros and cons of the different approaches. The decision must be, sorry to say, entirely up to you. Let's start by explaining the systems, some of their history and where we are today.
In the days when large mainframe computers were the norm and virtually everything in the data center relied on one giant machine, the concern about water was enormous and definitely legitimate. These computers were usually the sole processors in data centers, and they ran on high-voltage, high-current power supplies that generated a great deal of heat for the amount of compute power (by today's standards) they offered. If water got into these big machines, the damage could be quick and enormous, and machine repair or replacement could take weeks.
With the growth of business dependency on these computers came a simultaneous paranoia about water in the data center -- or at least about water above the equipment. The largest of these mainframes ultimately utilized water cooling from pipes below the floor, but that was not likely to run into the computers. The concern about overhead water led to the development of the "pre-action" sprinkler system, and ultimately to other forms of fire suppression as well.
The concern about overhead water has not changed today. No IT department wants water pipes running over their technology, out of concern that the "running" part might become literal. This is where we can definitely make a solid recommendation. As noted above, the "pre-action, dry pipe" sprinkler system was developed specifically to alleviate this concern.
In a "dry pipe" system, the overhead pipes are charged with compressed air until fire or smoke is detected, normally by at least two sensing devices. Only then is the valve opened that allows water to enter, and in a good design there's an "override" available to even delay that. Until the valve actually opens, there is no water in the pipes and, therefore, no water over your equipment. Prior to the detection event, the air pressure is constantly monitored so that even a small leak in a pipe or head would be detected and an alarm sounded. Therefore, even after the pipes are "charged," there is still little concern about leakage. And since there is still a fair amount of air ahead of the water, most of the pipes are not literally "wet" until a sprinkler head is actually activated. But that's where concerns develop that are even bigger than having "overhead water."
Sprinkler heads open, not on the detection of smoke, but as a result of elevated temperature -- usually around a factory-set temperature of 165 degrees F, although they can be obtained with higher ratings. You'll have a pretty good fire going by the time this happens, and water damage to equipment may be the least of your concerns. Sprinklers open only where the heat of the fire actually exists, so most fires will open only two to four heads. This will stop the destruction from direct fire or heat, but smoke, which is often the nastiest contaminant, may permeate a much wider area and can be worse than direct damage from water, which is often not as bad as is generally assumed. We recommend that a data center sprinkler source be filtered to minimize contamination from any water that might get into the hardware, but we have seen computers get wet from other sources (not from fire systems, but from water leaks above, which shows poor design practice) and be dried out with a hair dryer and put back into operation within a few hours.
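To make the sequence concrete, here is a minimal Python sketch of the pre-action logic just described: water is admitted to the pipes only after two detectors alarm and no override is being held, and a head discharges only when the pipes are charged and the temperature at that head reaches its rating. The class and field names are made up for illustration, and a real releasing panel is configured by the fire protection engineer, not coded like this.

```python
from dataclasses import dataclass

HEAD_RATING_F = 165      # typical factory-set head rating cited above
DETECTORS_REQUIRED = 2   # "normally by at least two sensing devices"


@dataclass
class PreActionZone:
    """Hypothetical model of one pre-action sprinkler zone."""
    detectors_in_alarm: int = 0    # detectors currently alarming
    override_active: bool = False  # operator is holding the "override"

    def valve_open(self) -> bool:
        # Water enters the pipes only after enough detectors alarm
        # and the operator is not delaying the valve.
        return (self.detectors_in_alarm >= DETECTORS_REQUIRED
                and not self.override_active)

    def head_discharges(self, temp_at_head_f: float) -> bool:
        # A head sprays only if the pipes are charged AND the heat
        # at that particular head has reached its rating.
        return self.valve_open() and temp_at_head_f >= HEAD_RATING_F


if __name__ == "__main__":
    zone = PreActionZone(detectors_in_alarm=2)
    print("Pipes charged:", zone.valve_open())
    print("Head at 120 F discharges:", zone.head_discharges(120))
    print("Head at 170 F discharges:", zone.head_discharges(170))
```

The point of the two-stage check is that smoke detection alone never puts water on equipment; it only fills the pipes.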
The biggest vulnerability of a "dry pipe" sprinkler system in a data center is damage to a sprinkler head. If heads are exposed below a ceiling there is always the potential of breaking one while moving a ladder or handling a tool overhead. A broken head will allow the air to immediately escape, opening valves and bringing water right behind it. Data centers without ceilings (which we recommend for a number of other reasons) offer a certain advantage in this regard, in that sprinkler heads are then turned upward rather than down. This obviously makes them less vulnerable to damage. In data centers where a false ceiling is used, it is best if the heads can be flush with the surface, as they are in most finished office and corridor spaces. But since heads must be arranged in relation to the cabinet layout to be effective, it is often necessary to have them below the ceiling surface in order to achieve the necessary coverage, exposing them to damage.
All this brings us to the question most asked regarding sprinkler systems: "What is the potential for actual equipment damage if a sprinkler head were to open?" In today's world, the answer is actually "minimal," simply because of the way in which most of our equipment is mounted. Most of today's computing equipment consists of literally hundreds of small, discrete "boxes" stacked one above the other in equipment cabinets that are arranged in rows within the data center. Many of these cabinets have, or should have, solid tops to prevent hot discharge air from finding a short path back to what is supposed to be the cold air intake side of the devices. (See blog article #8.) As a result, water from sprinklers is unlikely to even get into most of the computers, let alone damage them. That obviously raises the question as to whether sprinklers will extinguish an in-cabinet computer fire at all. The answer is, they will, eventually. But the greatest server exposure can actually be from water vapor pulled into machines not even directly touched by water as their ventilating fans pull it right through. This concern is addressed by installing an "emergency power off" (EPO) shutdown that activates when the sprinkler pipes are charged. But a good design will also have an "override" button that will prevent EPO activation temporarily, since the sudden shutdown of power in a data center can actually result in more monetary loss than damage from water when an entire facility must be restarted from a "crash" status, which can take hours or even days.
So if it is unlikely that water will actually get into most of the computers, and if we can't be assured that a fire in a computer cabinet will be extinguished by water before it grows and damages even more hardware from fire, heat or smoke, what can we do? Enter "inert gas" or "clean agent" fire protection systems. These systems were originally developed to provide a machine-safe fire suppression alternative to water. Large mainframe computers, due to their enormous power supplies, had components that caught fire more often than we ever see with today's equipment. Since repair and replacement time were so significant, it became important to extinguish fires with minimal additional damage to the computers or their peripherals. A number of approaches were tried, including CO2 (carbon dioxide), which was bad for the people inside the room when the oxygen suddenly disappeared.
When DuPont developed the chemical compound trade-named "Halon" it seemed the preeminent solution had arrived. Despite its significant cost, Halon was installed almost universally in data centers because the business cost of a mainframe fire far outweighed the cost of the gas.
However, as municipalities started looking carefully at their building codes, many realized that data centers with huge amounts of power usage, and lots of people working 24/7, were installing Halon as the exclusive fire agent and not proving that it actually worked. Testing became required, which meant dumping the expensive gas while the fire department monitored the concentration with strategically placed sensors. Room after room failed, mostly due to construction errors -- another cost factor that will be discussed shortly. This required additional tests until the room passed. Businesses became distressed when tests had to be conducted as many as five, six, seven or eight times at often $50,000 or more a test. When code writers started realizing that most facilities had no secondary method if a Halon dump failed to extinguish the fire, requirements were instituted for backup Halon tanks and/or sprinklers. Some jurisdictions simply mandated sprinklers regardless, as did lots of insurance companies. Regular inspections were also required by some localities to detect leakage problems resulting from small data center modifications that rendered a previously tested Halon system unreliable. As years went by, and mainframes were no longer the primary processors, many people decided the cost was just too high and opted for pre-action sprinkler as the sole method of protection. But for those who continued to install Halon, another problem was yet to come.
As concerns were raised about the effect of various chemical compounds on the earth's ozone layer, one of the products identified as a problem was Halon. Ultimately, it was "grandfathered out" as of December 31, 1993. Existing installations could remain, but new installations were not allowed and no industrialized country could continue to manufacture it. Halon rooms still exist today, but if a facility is disbanded, the Halon system has to be "decommissioned" (a potentially dangerous job requiring professional handling). The gas can either be disposed of in an environmentally safe manner or re-sold to one member of the small group of approved users for whom there is no other viable alternative. There was, of course, great pressure for alternatives, since concern still existed about sprinklers in the data center, as well as numerous other industries.
Several products have been developed to answer both the environmental and equipment protection problems. The most often heard is "FM-200," which is actually a brand name trademarked by the originator, Great Lakes Chemical Corporation. A similar product is manufactured by DuPont. FM-200 extinguishes fire primarily by rapidly absorbing its heat. Another product, trade-named "Inergen," is a mixture of nitrogen, argon and CO2 that extinguishes fire by rapidly reducing the oxygen level to below 15%. These products are all considered both environmentally and "people" safe and discharge in exactly the same way as the familiar Halon. The construction methods necessary for their use are the same as well, including a sealed room, "clipped" ceiling panels where false ceilings are used, automatic power shutdown to air conditioners and equipment, room testing and, usually, the use of sprinklers in addition to the "clean agent" gas system. Because the quantity of gas required is volumetrically based, a tradeoff must be considered between the advantages of a high space with no ceiling and the cost of the clean agent gas in a room with a large volume. As said initially, there is no single "right answer."
Returning to our original premise, inert gas systems are designed to protect equipment. Obviously if the inert gas puts out the fire, then it should also protect the building. If it is also non-toxic (as it must be), it should protect people as well. Most fire codes don't look at it this way, however, and in most jurisdictions sprinklers are going to be required by code whether you have something more sophisticated or not. If you're in a jurisdiction where sprinklers are not also legally required, you're not necessarily home free. Check with your management and your insurer. They may insist that you put in sprinklers anyway, perhaps to avoid potential liability and possibly to reduce monetary exposure if you have a big data center with a huge equipment value. So, you will probably ask, if I'm going to have sprinklers anyway, should I also pay for a second "inert gas" system? The answer usually comes down to cost.
Room size is the first concern, because the amount of gas required is based on the volume of the space. We have indicated several times that inert gas is not inexpensive (although the newer gasses are less expensive than their Halon predecessor), so the amount required has an obvious cost implication. Ceiling height, therefore, requires a real evaluation of tradeoffs. Keeping the ceiling as low as possible obviously reduces room as well as gas volume, but it also imposes a significant limitation on the flexible installation of overhead systems such as cable tray and lighting. It is also the antithesis of what we want for good cooling. (See blog article #3.)
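The ceiling-height tradeoff is easy to put numbers on. The sketch below assumes only what is said above, that the gas quantity scales with room volume. The 10,000 sq. ft. floor area and the 9-foot and 12-foot heights are illustrative figures, not recommendations, and the actual agent quantity depends on the design concentration your fire protection engineer specifies, which is not modeled here.

```python
def room_volume_cu_ft(floor_area_sq_ft: float, ceiling_height_ft: float) -> float:
    """Protected volume of a simple rectangular room."""
    return floor_area_sq_ft * ceiling_height_ft


def added_volume_pct(baseline_height_ft: float, new_height_ft: float) -> float:
    """Percent increase in volume (and, by assumption, in gas) for a taller room."""
    return (new_height_ft / baseline_height_ft - 1.0) * 100.0


if __name__ == "__main__":
    area = 10_000          # sq. ft., illustrative
    low, high = 9.0, 12.0  # ceiling heights in feet, illustrative
    print(f"Low ceiling:  {room_volume_cu_ft(area, low):,.0f} cu. ft.")
    print(f"High ceiling: {room_volume_cu_ft(area, high):,.0f} cu. ft.")
    print(f"Extra volume to protect: {added_volume_pct(low, high):.1f}%")
```

Floor area drops out of the percentage, so the extra gas cost of a higher room depends only on the ratio of the two heights.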
But the gas system itself is not the entire cost. The requirements that an inert gas system imposes on room construction can add significantly to the total price. A data center needs to be air-tight in order to contain the gas concentration long enough to extinguish a fire. This means door gaskets, sealing all walls to both the floor and ceiling slabs and closing all wall penetrations (which should be done anyway to maximize cooling performance). It is also necessary to automatically shut down all air conditioners, close fire dampers on all air ducts entering and leaving the room and shut off all power to computer equipment before "dumping" the gas. This means a significant amount of control hardware and wiring. Lastly, the nearly explosive release of the gas creates such pressures inside the room that ceiling tiles need to be "clipped" into place so they won't blow out, thereby letting gas escape. Getting all this accomplished correctly, particularly under a construction deadline, takes a great deal of on-site supervision.
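The automatic actions that must precede a gas "dump" amount to an interlock: the release is permitted only when every prerequisite has been confirmed. Here is a rough Python sketch of that idea. The step names come from the paragraph above, but the functions are hypothetical and stand in for what is really control hardware and wiring.

```python
from enum import Enum, auto
from typing import Iterable, List


class Step(Enum):
    """Actions the text says must happen before the gas is released."""
    SHUT_DOWN_AIR_CONDITIONERS = auto()
    CLOSE_FIRE_DAMPERS = auto()
    SHUT_OFF_EQUIPMENT_POWER = auto()


def missing_steps(completed: Iterable[Step]) -> List[Step]:
    """Return the prerequisites that have not yet been confirmed."""
    done = set(completed)
    return [step for step in Step if step not in done]


def may_discharge(completed: Iterable[Step]) -> bool:
    """Permit the gas release only when nothing is outstanding."""
    return not missing_steps(completed)


if __name__ == "__main__":
    confirmed = [Step.SHUT_DOWN_AIR_CONDITIONERS, Step.CLOSE_FIRE_DAMPERS]
    print("OK to discharge:", may_discharge(confirmed))
    print("Still outstanding:", [s.name for s in missing_steps(confirmed)])
```

Each of those confirmations is a signal that has to be wired, tested and supervised, which is where the added construction cost comes from.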
Both inert gas and good pre-action sprinkler systems will sound an alarm when a condition is detected, will incorporate a graphic annunciator panel to show where the smoke has been detected and will enable you to invoke a short "override" so that gas is not "dumped" or the pipes are not charged with water before you've had the chance to solve the problem differently. The cost of the inert gas is too high to simply dump it at the first sign of a problem if there's another way to quickly resolve it. And draining back a pre-action system that has been filled with water, even if a head never activates, is probably the biggest headache in their maintenance, so it's good to avoid it if reasonably possible.
So what is the advantage of spending all this extra money on an inert gas fire suppressant? It comes down to a matter of how much the total gas system costs versus your potential damage and loss from fire, smoke and water, and your assessment of the risk of having an equipment fire in the first place. Keep in mind that almost everything in your data center today is housed in metal enclosures and, unlike the old mainframes, can be physically replaced virtually overnight. If you have a backup site, with mirrored data, your vulnerability is even further reduced. But if your risk assessment says you need all the protection you can get, an inert gas system will operate much sooner than the sprinklers since it is generally programmed to activate when two detectors alarm. Even if you invoke overrides, the gas system is nearly certain to activate long before sprinklers would open. And you should have a much smaller mess to clean up after the fire is out.
I must mention one other item regarding fire protection, just as a matter of awareness. Most cabling in a data center tends to be under the raised floor. Prior to "low-smoke" or "plenum-rated" cables, certain jurisdictions with fire paranoia (Chicago being the unquestioned leader ever since Mrs. O'Leary's cow kicked over that famous lantern) required fire protection below the floor as well as above it. With the advent of these better cables, the under-floor fire suppression requirement mostly went away. That could be changing again. New code requirements appear likely to impose yet another consideration when the 2008 National Electrical Code is released. Data cabling in an under-floor air plenum space may no longer be acceptable as "plenum-rated." It may need to be "limited combustibility" (LC) or fire protection will be required below the floor as well as above. (Note that code already requires abandoned cable to be removed.) This leads to the recommended practice of installing a permanent LC tie cable infrastructure within the data center that can virtually eliminate the need to install or remove cable, or to even go under the raised floor. Although initially expensive, if installed correctly the long-term savings and operational benefits can be significant, particularly in facilities with a great deal of change.
Posted by Robert McFarlane
Cabinets, bloody cabinets!!
02 NOV 2005 07:26 EST (12:26, GMT)
It's hard to believe there could be so many companies making so many different racks and cabinets, or that there could be so many different ways touted to address the two main concerns facing everyone today: equipment cooling and cable management. Simply because there are so very many, we will avoid referring to any by name or including links to Web sites. It would be too long a list, and everyone we miss would object to the omission. You can use search engines as well as we can.
Instead we will dwell on some fundamentals and what our firm looks for and evaluates in recommending, specifying and selecting racks and cabinets for our clients' data centers. (Racks and cabinets for IDF rooms and other purposes like audio/visual equipment have different considerations and will not be addressed here.)
First let's define our terms. A "rack" is either a two-post or four-post open-frame mounting. A "cabinet" is always a four-post concept, but with options for side panels, top panels and front and rear doors making full enclosure possible. If a four-post cabinet frame is put on the floor without enclosure panels or doors, by our definition it's a "rack." Conversely, if a product sold as a "rack" has components available to enclose it, we will regard it as a "cabinet" if those components are installed. Both racks and cabinets have EIA-standard (Electronic Industries Association) hole spacing in the vertical mounting rails, which may be in the front, or in both front and rear. As a general rule, "rack rails" are fixed, whereas "cabinet rails" are usually adjustable forward and backward within the structural frame to accommodate different mounting depths. Rail holes are usually one of two types: tapped or square-cut for use with snap-in cage nuts. Rail spacing is meant for either 19" or 23" wide EIA standard panel widths, and herein lies one of today's more interesting and obscure concerns.
ANSI/EIA Standard RS-310 specifies the minimum clear width between rails as 450 mm (17.7165"). However, there is hardware on the market today from at least two major-name vendors that specifies a clear rail width of 17.75" (450.85 mm) or more. This may not seem like much, but we've seen more than 200 cabinets in a single data center that met the specs, but the servers still wouldn't fit in. And if a cabinet bows even slightly (as most metal will tend to do) it may provide adequate spacing at the top and/or bottom but not in the middle, or vice-versa. In short, verify the mounting clearance requirements of your hardware, and specify your racks or cabinets accordingly if special tolerance requirements must be met. Otherwise the manufacturer is well within his rights to charge you for field adjustments if he can demonstrate that his mounting rails meet the published standard, which is what will naturally be assumed if you didn't specify differently.
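A quick way to see the problem is to do the unit conversion and compare. The Python sketch below uses the two figures quoted above (the 450 mm standard minimum and a 17.75" hardware requirement). The function names and the sample rail measurements are hypothetical, chosen only to show how a cabinet that bows in the middle can meet the standard and still reject the server.

```python
from typing import Iterable

MM_PER_INCH = 25.4
EIA_MIN_CLEAR_WIDTH_MM = 450.0  # minimum clear width between rails per the standard


def inches_to_mm(inches: float) -> float:
    return inches * MM_PER_INCH


def shortfall_mm(required_in: float,
                 cabinet_clear_mm: float = EIA_MIN_CLEAR_WIDTH_MM) -> float:
    """How much narrower the rail opening is than the hardware needs (0 if it fits)."""
    return max(0.0, inches_to_mm(required_in) - cabinet_clear_mm)


def fits_everywhere(measured_mm: Iterable[float], required_in: float) -> bool:
    """The hardware fits only if the narrowest measured point is wide enough."""
    return min(measured_mm) >= inches_to_mm(required_in)


if __name__ == "__main__":
    print(f"Shortfall at the standard minimum: {shortfall_mm(17.75):.2f} mm")
    # Hypothetical measurements at top, middle and bottom of a bowed cabinet.
    bowed = [451.0, 450.2, 451.0]
    print("Bowed cabinet accepts 17.75-inch hardware:", fits_everywhere(bowed, 17.75))
```

Measuring at several heights, rather than once, is what catches the bowed cabinet.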
From this point on, we'll concentrate on cabinets, since those are what we use today to mount most of our server hardware. Let's examine several of the key factors that relate to cable management and cooling concerns.
Size: This is a place where size does count! Servers have gotten smaller vertically by growing in depth. There are now "1U" servers that are 36" deep, and that doesn't include the rear connectors. A 42" deep cabinet still leaves only 6" behind equipment 36" deep, and that's only if the front mounting rails are very close to the front of the cabinet. Thankfully, most hardware has not yet reached this depth, but 36" deep cabinets are simply too limiting for much of today's technology. Even if you can't, or don't want to, install deep cabinets everywhere or reuse existing smaller cabinets, it's wise to plan your layout to accept as many deep cabinets as reasonably possible. And since most equipment loads from the front, you'll need that much aisle depth between cabinets as well (which you should have for cooling anyway). It's very difficult to install a 36" deep server in a 36" deep walking space.
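The depth arithmetic is worth running against your own hardware list. This small Python sketch reproduces the 42" cabinet and 36" server example above. The rail setback parameter is an assumption added for illustration, since the 6" figure holds only when the front rails sit at the very front of the cabinet.

```python
def rear_clearance_in(cabinet_depth_in: float,
                      equipment_depth_in: float,
                      front_rail_setback_in: float = 0.0) -> float:
    """Space left behind the equipment for connectors, cables and exhaust air.

    front_rail_setback_in is a hypothetical parameter: the distance from the
    front of the cabinet to the front mounting rails.
    """
    return cabinet_depth_in - front_rail_setback_in - equipment_depth_in


if __name__ == "__main__":
    print("42-inch cabinet, rails at front:", rear_clearance_in(42, 36), "inches")
    print("42-inch cabinet, rails set back 2 inches:", rear_clearance_in(42, 36, 2), "inches")
    print("36-inch cabinet:", rear_clearance_in(36, 36), "inches")
```

A result at or near zero means the rear connectors alone will not fit, before any cable bundle or airflow is considered.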
Cabinet height is a matter of available space and personal preference, but keep two things in mind. First, it's not easy to install or work on anything at the top of an 8-foot-high cabinet. Second, cooling at the top of any cabinet or cooling a fully loaded cabinet of any height is problematic. The common 7-foot (84") nominal height is the most available and is the best choice for the vast majority of situations. As to cabinet width, we will address that next.
Cable management: Sir Walter Scott didn't have our modern data centers in mind when he said, "Oh, what a tangled web we weave," but that part of the phrase sure fits today. The higher equipment density made possible by small form factor and blade servers brings with it a prodigious number of cables. Consider a cabinet with 42 "1U" servers, each dual-corded and dual-homed with two NICs, each with monitor connections to a KVM switch and with a fiber connection to a SAN. This might not be the norm, but there are plenty of cabinets with this kind of load and many more that come close. That's five UTP cables, two power cables and one fiber pair per device, for a total of 336 wires and cables. If you're running a permanent cable infrastructure (highly recommended) from cabinet patch panels back to your network switches and SAN, that's another 210 UTPs plus a fiber bundle. You're just not going to get all this neatly into a conventional 23" wide cabinet when the standard chassis is already more than 17" wide. And if you use the space behind the servers in a deep cabinet, you'll block the hot air exhaust from the machines. (This is why those "folding cable managers" are so bad. When they ship with the machine, I figure the manufacturer must know how much sooner it will cause you to buy a new server due to overheating.) Remember, the fans inside these small server chassis are, of necessity, very small. They run fine in "free air" conditions, but make them push against any kind of blockage and they slow down under the static pressure buildup. They are just not powerful enough to push the necessary volume of air past that virtually solid wall of wires.
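The cable count above is easy to rework for your own cabinets. Here is a small sketch of the arithmetic, using the per-server figures from the example (five UTP, two power cords, one fiber pair); the variable names are ours and purely illustrative.

```python
servers_per_cabinet = 42   # fully loaded with "1U" machines

# Per-server cabling from the example in the text
utp_per_server = 5         # network plus KVM connections
power_per_server = 2       # dual-corded
fiber_pairs_per_server = 1 # SAN connection

cables_per_server = utp_per_server + power_per_server + fiber_pairs_per_server
in_cabinet_total = servers_per_cabinet * cables_per_server

# Permanent infrastructure from the cabinet patch panels back to the switches
permanent_utp = servers_per_cabinet * utp_per_server

print(in_cabinet_total)  # 336 wires and cables inside the cabinet
print(permanent_utp)     # 210 additional UTP runs, plus the fiber bundle
```

Change the per-server numbers to match your own hardware and the totals tell you quickly whether a narrow cabinet has any hope of holding it all.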
Therefore, for any part of the data center where high densities are likely (which, by our definition, means more than 12-15 machines), we recommend wide cabinets, which generally means from 28" to 30", depending on manufacturer. (Note that the mounting rails in these cabinets are still designed for EIA standard 19" wide devices.)
But cabinet width is only part of the story. That width must be efficiently utilized in order to be worthwhile, and that means a good method of dressing the cable to each side of the equipment (cable management system) and a good way of mounting multiple power strips without blocking access to the cable. Even with the extra width, this is not so easy as it sounds. We have seen some very clever approaches to doing this, and some that show no thought at all. There are highly proprietary solutions, meaning you can't use XYZ company's "smart power strips" (sometimes also called "PDUs" or "CDUs") because they won't fit or they defeat the nice concept. This is something you should look at very carefully in selecting a cabinet and should observe in real situations, not just in pictures.
Cooling: I can't help but think of Aerosmith's recording "Big Ones" -- their blockbusters. But their performance is proven. Every cabinet vendor, however, claims they have the "big hit." Their product cools better than anyone else's. We're not going to dispute them. (What? You think we want to get sued??) But just like Aerosmith's best songs, it might not be right for every occasion, so each cooling solution might work well in one situation, but not perform well in another.
Therefore, what we are going to do is give you the principles -- some things to consider -- and unequivocally guarantee that no one, irrespective of what their marketing department dreams, is going to defy the laws of physics. It still takes a certain volume of air at some known temperature to cool a device by a given number of degrees.
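To put a number on "a certain volume of air," here is a rough sketch using a common HVAC rule of thumb for air near sea level: airflow in CFM is about 3.16 times the load in watts, divided by the air temperature rise in degrees Fahrenheit. The rule of thumb is not from this article, and the 5 kW load and 20-degree rise below are assumptions chosen only for illustration.

```python
def required_airflow_cfm(load_watts, delta_t_f):
    """Approximate airflow (CFM) needed to carry away load_watts of heat
    with a delta_t_f (deg F) temperature rise across the equipment.

    Rule of thumb for sea-level air: CFM ~= 3.16 * watts / delta_T(F).
    Altitude and air density will shift the real figure.
    """
    if delta_t_f <= 0:
        raise ValueError("temperature rise must be positive")
    return 3.16 * load_watts / delta_t_f


# Assumed example: a 5 kW cabinet with a 20 deg F rise through the servers.
print(round(required_airflow_cfm(5000, 20)))  # roughly 790 CFM

# Halving the allowable temperature rise doubles the air you must deliver.
print(round(required_airflow_cfm(5000, 10)))  # roughly 1580 CFM
```

No cabinet accessory changes this relationship. All it can do is help get that volume of air to the right place.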
The goal of every cabinet design is, or at least should be, to help deliver the necessary amount of air to every device, regardless of how high or low it's mounted in the cabinet. Every device has been designed and tested in the lab to move that air through the box in the right volume, at the right velocity and past the right components. But getting this air out of the floor and up to the full height of the cabinet in sufficient quantity and at the right temperature is not all that easy. And make no mistake: The manufacturers dump that problem right in your lap. They know their servers will work in the lab, and they'll usually tell you what it requires in some fashion or another. (Unfortunately, too few follow ASHRAE's recommendations for providing this information, but hopefully that will change in time.) If we can't meet the requirements with air directly from the floor, we may need a little help, and that's where all the special air-moving cabinet accessories come in.
Today's basic server cabinet should have at least 63% open perforated front and rear doors, a closed top plate with sealable cable pass-throughs and a base frame that fits reasonably tightly to the floor. It also helps to have interior side panels to keep air channeled through each cabinet and to avoid mixing with hot air from an adjacent cabinet. At least one manufacturer offers an insertable interior panel that can be put into channels or removed, depending on need. It's a little pricey, but it's an interesting idea.
If we're getting good air flow out of our raised floor, and we still have cooling issues in the cabinet, then one of the "air booster" solutions may help. There are basically three types: bottom front blowers that pull air from the floor and discharge it upward in front of the equipment; top-of-cabinet fans that pull air through the cabinet, either from holes in the floor or through the doors; and rear door fans that pull air through the front door and through the equipment. Let's examine each of these in a little more detail.
Bottom front blowers work very well, with three important caveats:
- They pull a fixed volume of air out of the floor, so it's critical to make sure the CRACs can supply it and that the suction doesn't deprive something else of air.
- A fixed spacing between the front door and the equipment is necessary for this solution to work right, since improper spacing affects the pressure, and variations in the surfaces disrupt what is trying to be a somewhat laminar air flow.
- The velocity has to be high enough for air to reach the top of the cabinet, but low enough to avoid air starvation to equipment at the bottom due to pressure reductions at high velocities (see blog article #3). In short, these devices are great for improving the evenness of air distribution up the height of the cabinet, but only for about 5 kW of equipment. In other words, they're really not for "extreme density"; they're to improve moderate density situations.
Top fans have been used for years and historically were the only thing available to cool an enclosure. Today you may see some significant performance figures stated for cabinets using them. Keep one thing in mind: The goal is to deliver the air needed to go through each device, not around it. Cooling the outside of a computer does not, in and of itself, cool the inside. Thankfully, virtually all blade servers today, and many other processors, come with IP-addressable internal temperature sensors so you can tell what's really happening inside the case, where it counts. When cold air is being pulled all around the equipment, it's difficult to get true temperature readings with probes placed inside the cabinet because the hot air being discharged from the servers is immediately mixed with cold air pulled from the raised floor. We're not saying top-fan cabinets don't work. They certainly can. We're just suggesting that a level of awareness accompany any testing of the performance capabilities and any interpretation of "test data." If the top fans are meant to exhaust hot air from behind the servers, rather than to pull it from the floor, a baffle should be installed to prevent also pulling cold air out of the front of the cabinet.
Rear-door fans come in many types and sizes, but they all have one purpose -- to move air through the equipment more efficiently. Recall that we said cooling requires the movement of a certain volume of air at a certain temperature in order to cool anything. We're not going to change the temperature of the under-floor air. It is what it is for some very good reasons, and it will get warmer as it exits the floor tiles and moves upward in front of the cabinet. Therefore, the one way to improve cooling is to move more of it through the equipment. (Notice, we said through the equipment, not simply through the cabinet.) There are doors that simply have fans of various sizes mounted in them. There are versions that pull the air into a chamber inside the door, in some cases engineered for an air flow gradation from bottom to top so as to compensate for the temperature change between lower and upper devices. There are some with chimneys that exhaust the hot air directly into the ceiling plenum rather than into the "hot aisle" so the hot air has no chance to bypass and mix with the incoming cold air. (See blog article #2.)
Whatever version you might consider, there are two factors to keep in mind. First, make sure the fans are actually pulling the additional air through the computers, not simply around the outside of them. This not only means putting blanking panels in unused rack spaces (see blog article #1), but also means a cabinet design that blocks the space around the front rails. Second, be cautious of fans that are too powerful and simply stuck in the doors without any real engineering behind them. Recall we mentioned the tiny fans in small form factor servers. They were designed to provide the required air flow over critical components, and they can be easily overwhelmed by the large fans in the doors, particularly by fans directly behind particular devices. If this disrupts the air flow inside the server because velocities get too high, then the door fans are actually counter-productive. The only way to really know is by checking internal temperature sensors, if they exist, or watching for unusual rates of particular component failures or data errors in devices in front of large fans. It's also possible to run air velocities so high that static is actually created by the fast moving air, but if the room humidity is well controlled, this is not a likely problem with door fan cooling. We'll encounter that concern next.
Liquid-cooled cabinets have come on the market in the last several years, and there are now at least five or six manufacturers making various versions. By the time you read this, there may be even more. The concept is irrefutable -- contain the high level of cooling inside the cabinet that produces the heat, rather than dissipating part of it into the room. In practice, however, this is not so simple a matter to accomplish. As mentioned for top-fan cabinets, simply putting a server inside a refrigerator doesn't necessarily cool the insides. The cold air has to get through the device to do the job. The best liquid-cooled cabinets do accomplish this, but they also need to avoid over-cooling the equipment. There are three reasons.
- First is the matter of condensation. If the temperature reaches the "dew point" of the air, then the otherwise desirable water vapor, which is normally in the air in the form of humidity, suddenly becomes liquid water. You do not want this happening inside your servers. And even if the internal cabinet environment is controlled to stay above the dew point, what do you think can happen when the cabinet door is opened and the warmer, more humid air in the data center suddenly enters? Instant condensation is a definite possibility if the cabinet needs to run too cold in order to perform as advertised.
- Second, if the server manufacturer publishes specs in accordance with the ASHRAE format mentioned earlier, you will see both maximum and minimum recommended operating temperatures. You don't want to effectively freeze the computers. They weren't designed to be run cryogenically -- at least not yet.
- Third, if air velocities get too high and humidity too low, the air passing rapidly over components can actually create static buildup, as mentioned above. This is much more possible in this closed environment where both temperature and humidity are usually lower than you find in the normal data center air. (See blog article #6 on grounding.) Ask any manufacturer of liquid-cooled cabinets how they have addressed each of these potential concerns, and expect good explanations. Don't simply accept "It's not a problem."
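The condensation concern in the first point can be estimated before you ever open a cabinet door. The sketch below uses the Magnus approximation for dew point, a standard meteorological formula that is not from this article; the constants are commonly published values, and the room conditions and cabinet temperature are assumptions for illustration only.

```python
import math


def dew_point_c(temp_c, rh_percent, a=17.62, b=243.12):
    """Approximate dew point (deg C) from air temperature and relative
    humidity using the Magnus formula. Reasonable for typical room
    conditions; not a substitute for a proper psychrometric chart."""
    gamma = math.log(rh_percent / 100.0) + a * temp_c / (b + temp_c)
    return b * gamma / (a - gamma)


def condensation_risk(surface_c, room_c, room_rh_percent):
    """True if a surface is at or below the dew point of the room air
    that reaches it (for example, when the cabinet door is opened)."""
    return surface_c <= dew_point_c(room_c, room_rh_percent)


# Assumed room conditions: 22 deg C at 50% relative humidity.
print(round(dew_point_c(22, 50), 1))      # about 11.1 deg C

# Hypothetical cabinet interior surfaces at 10 deg C and at 15 deg C.
print(condensation_risk(10, 22, 50))      # True: below the room dew point
print(condensation_risk(15, 22, 50))      # False
```

If a vendor's performance figures depend on internal temperatures below your room's dew point, that is exactly the question to put to them.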
The last major consideration with liquid-cooled cabinets is fundamental redundancy. If you're running a data center that justifies cabinets this sophisticated, you probably have a high level of redundancy in both your electrical and air conditioning systems. You certainly don't want to put some of your most powerful processors in a housing that is less reliable than the rest of your facility. Before purchasing one of these cabinets (and they are rather expensive), look carefully at not only the technical performance, but also at what happens when something quits and how the design has provided for concurrent maintainability (see blog article #5).
The one thing we have not discussed is the fact that liquid-cooled cabinets obviously require bringing water into your data center, which makes most IT managers cringe. All we can say is "get used to it." Water cools 3,561 times as efficiently as air. As servers get more powerful and use more energy in smaller housings, it's going to take more than just cool air to keep them running. The day is coming when you're going to have some form of coolant running directly to your servers, which could well negate all of the above discussion of special cabinet solutions. That will definitely require a complex of plumbing in your data center. It's inevitable. But until that happens, and until you're ready for it, look at the other cooling solutions that are already here. Just examine them thoughtfully and suspiciously before you buy.
Posted by Robert McFarlane
Auditing your data center
01 NOV 2005 06:26 EST (11:26, GMT)
You've been operating for a while -- maybe for years -- and lots of things have changed since your data center was built. Maybe it was done well at the time. Maybe it wasn't. Maybe something has failed and given you a wakeup call. Maybe it hasn't yet, but you're worried it will. Regardless, a lot has changed in those intervening years. Manufacturers are more open about what's coming, so we can all plan better than we could even five years ago. (If you're designing a new data center, and it's going to look a lot like the old one, but with a little more power and air conditioning, you'd better find someone who's up to date to help!)
A good data center audit will not only uncover hidden flaws, it will tell you how far you can reasonably take this facility and will provide general guidelines as to how to do it. It will be worth its weight in batteries if it saves you from a crash by identifying risks before they happen. If nobody acts on them, at least you can't be faulted for waiting until the proverbial horse is out of the barn before saying the door should be closed.
In one audit, we cautioned that the dual power feeds were from the same local electrical grid, with no generator to cover the contingency. Our report was presented to management just two days before the massive grid failure hit the Northeast, originating in that exact grid. The IT folks who had asked for the audit looked real good at that point and got their budget approved as soon as the lights came back on. Just a few days later they would have been regarded as no more than "Monday morning quarterbacks," regardless of the fact they had started the audit process weeks before. No one would have believed that kind of forethought wasn't inserted in the report after the fact!
As with everything, audits can be done well, or they can be little more than sales tools for services. Beware the "free" audit or the review that's too cheap. Doing this right can take several days on site and several more days compiling data, plus the time to write a meaningful report. The report should be tailored to your business, your facility and your particular problems and concerns, not a "canned" checklist and "file drawer" paragraphs. Ask to see a couple of reports before contracting and see if they're individual or if they look like "boilerplate" copies with different names. Data centers have a lot of things in common, but they are also each unique to the needs and circumstances of their companies. Also look at recommended solutions. Are they practical, or are they a list of "ideals" that could never be realistically implemented? You're going to pay for this study one way or another. You should get your money's worth in real advice.
A good audit will involve IT, facilities, a local electrician and your electrical and mechanical engineers if you have professionals you usually work with. Some engineering firms have the specialists on staff to conduct a good data center audit. Most do not. In any case, you'll get a better job in less time if the people most familiar with your installation are fully involved.
Posted by Robert McFarlane
Grounding -- the 'black art'
31 OCT 2005 07:48 EST (12:48, GMT)
We all know that grounding (or "earthing" as the Europeans call it) is a necessity. It's required by electrical codes; it's required by equipment manufacturers; and we all know it would be "good practice" even if it wasn't required. But exactly how to do it has probably been the subject of more debate and difference of opinion than any other aspect of the infrastructure. "Isolated grounds" are still called for by many people, even though they are actually counter-productive in the data center. And top-name manufacturers have even been known to stipulate grounding methods in their installation specifications that are just plain illegal and unbelievably dangerous. Why is it that this fundamental and seemingly straightforward subject is so misunderstood?
It's misunderstood because there are so many different reasons for doing it, each with its own set of concerns, considerations and installation methods. It's also misunderstood because the problems that can occur when it's done wrong are essentially invisible, difficult to comprehend, often without a good explanation and hard to track down when they happen.
Most professionals deal with only one or two types of grounding in their careers. The majority don't necessarily know that the communications industry has its own set of requirements, and don't realize that, while there are similarities, what is fine in one field doesn't always do the job in another. Let's identify some of these grounding specialties and what they're for, then pull the concepts together to get a better understanding of the principles of telecommunications grounding.
Electrical safety grounds: Probably the most fundamental of all grounds, these are required by code to protect people from injury in the event of a short or "fault" that puts current onto an equipment housing. That's why the "U-ground" pin is found on lots of appliances. One of the power wires, called the neutral (white conductor), is also grounded, but if something goes amiss with it, the "U-ground" keeps you safe. It's really bad to cut it off or to use a three-pin adapter in a two-pin socket without actually grounding the green wire or ground lug. (Appliances like power tools that just have a standard two-blade plug are "double insulated" to make sure a fault doesn't electrify the part you're holding. Because they use special construction, the manual will tell you not to disassemble it yourself.) The building power ground goes to an "earth terminal," is bonded to building steel and is also carried to every electrical panel in the building. Code requires a building safety ground to have a ground resistance of 25 Ohms or less. (It takes special equipment and techniques to measure this.) Keep this figure in mind.
Lightning grounds: These are designed to conduct lightning strikes directly to ground so they don't damage the building or its electrical systems, or injure people. Spiked rods on top of the building (called "air terminals") are the most commonly recognized form of protection, although not necessarily the best. But whatever technique is used, the intent is to carry the lightning strike to earth through the building steel or through wires run down the outside of the structure to rods driven into the ground. These ground rods are also bonded to the main electrical ground, as is the building steel. Lightning, by its nature, includes a large high frequency component. (If you studied mathematics, you will recall the Fourier Series, which defines the attributes of a sharply rising pulse, and understand why.) Therefore, it doesn't bend corners very well. All lightning wires are run with long radius bends -- no right angles. Keep this in mind as well for later in our discussion.
RF shielding and grounding: Radio frequencies are very high (though not as complex as lightning) and therefore have very short wavelengths. Despite the experience we have daily with cell phone dead zones, RF tends to find its way into everything, especially where it is not wanted. The only way to stop RFI (radio frequency interference) is with a virtually continuous grounded shield -- often called a "brute force ground." This might be thought of as the opposite of an isolated ground. Commonly seen in broadcasting, this type of grounding is achieved by making sure all metal parts are solidly bonded together -- essentially grounded everywhere. If you have, or have ever seen, an RF shielded cabinet, you may have noticed that the doors close against hundreds of small, spring bronze fingers or against some sort of metallic braid that forms a continuous electrical connection around the entire door edge. (These cabinets are sometimes used to meet FCC regulations for RF emission from equipment and are usually labeled as such.) Keep this concept in mind as well as we proceed.
Electro-static grounds: After the mandatory electrical safety ground, this is what we want in our data centers. It's the reason we wear (or should wear) wrist straps when we work on micro-electronics and why we use anti-static floor tiles in data centers instead of carpet. Static discharge is just a personal lightning bolt. It's obviously much lower in power than nature's cloud-borne version, but it's exactly the same phenomenon -- a build-up of free electrons that suddenly finds a path to something with fewer electrons -- usually the earth, or "ground" -- and very rapidly discharges those electrons to equalize the balance. The problem is, it may find its ground path right through our sensitive and expensive hardware, where even a minute discharge, if it doesn't actually damage something, can cause data errors and even memory loss. And the smaller and faster our hardware becomes, the more vulnerable it is to static problems, either airborne or arriving as power line anomalies when our UPS is in bypass.
What we want to accomplish with an electro-static ground is not all that different from lightning protection; we want to draw those electrons away from anything important and get them to ground as quickly and as completely as we can. Recall that we said lightning, or any static discharge, is very high-frequency energy. We also said RFI, which is also high frequency, is best dealt with by grounding everything to everything. Recall also, probably from high school science, that electricity always seeks the path of least resistance. These three concepts should help us understand the requirements of the Joint TIA/EIA/ANSI Standard J-STD-607-A, "Commercial Building Grounding (Earthing) and Bonding Requirements for Telecommunications" (ANSI/J-STD-607-A-2002), and the concept of "equal potential grounds" that we try to achieve in a data center telecommunications environment.
If everything is well bonded to a robust and virtually omnipotent grounding system, that's the path any static discharges are going to take if the system leads back to the main building ground through a very low impedance path. This includes nearly all the stuff that might get onto your grounds from outside sources. I say "nearly all" because a sufficiently powerful lightning strike is going to go where it darn well pleases, perhaps even taking a hunk off the building in the process. As we well know, nature is more powerful than our abilities to fend her off, and once in a while she outdoes us. This is why we need good lightning protection on our building, as well as a top quality surge protector on our power system. We're now getting beyond the scope of this article, but some good information can be found here.
There are two main things we're trying to accomplish: provide a very low impedance path to ground from everything metallic in our data center; and avoid creating "circulating ground currents" in the process. Let's take these one at a time. They're really not that difficult.
Impedance is the electrical term we give to resistance when we're not dealing with direct current (DC). I'll use the proper term "impedance" in this article, but if you're more comfortable thinking of "resistance," that's fine. A low-impedance path is created in three ways: large copper conductors; short wire lengths; and clean, solid connections. The principles are simple. Longer paths require larger conductors, and good connections require the proper hardware, strongly pressure-connected to surfaces that have been well cleaned beforehand. There are many products for doing this. One of the best sources of both information and products on this subject can be found at Panduit.com. There are also some excellent seminars and courses you can attend; see Lyncole and MikeHolt.com.
There are two characteristics specific to the particular type of electrical energy we are dealing with, and these both go back to one concept we mentioned earlier in this article -- namely, static discharge is, by nature, a high frequency phenomenon. The two characteristics are: static energy tends to travel on the surface of the wire, rather than through it ("skin effect"); and it does not like to turn sharp corners. This is why we use stranded copper wire for most grounding and bonding connections, and why we should never make sharp bends in ground wires. They should always curve smoothly with a minimum bend radius of 8 inches. Stranded conductors provide more surface area than solid conductors for the same gauge of wire, and curves keep the energy in the wire, rather than letting it bleed off into the air or to some other metal from the corner of a sharp bend. Unfortunately, the reason for radiused bends is very difficult for most electricians to grasp, and it takes virtually constant supervision to achieve a proper installation.
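To get a feel for how strongly frequency drives "skin effect," here is a sketch using the standard textbook skin-depth formula for a good conductor. The formula and the copper resistivity value are general physics references, not figures from this article, and the 1 MHz figure is simply an assumed stand-in for the high-frequency content of a discharge.

```python
import math

MU_0 = 4 * math.pi * 1e-7       # permeability of free space, H/m
COPPER_RESISTIVITY = 1.68e-8    # ohm-metres, typical value near room temperature


def skin_depth_m(frequency_hz, resistivity=COPPER_RESISTIVITY, mu_r=1.0):
    """Depth (metres) at which current density falls to about 37% of its
    surface value: delta = sqrt(rho / (pi * f * mu_0 * mu_r))."""
    return math.sqrt(resistivity / (math.pi * frequency_hz * MU_0 * mu_r))


# Power-line frequency: current uses essentially the whole conductor.
print(round(skin_depth_m(60) * 1000, 1))    # about 8.4 mm

# Assumed 1 MHz component of a static or lightning discharge:
# current crowds into a layer thinner than a sheet of paper.
print(round(skin_depth_m(1e6) * 1e6))       # about 65 micrometres
```

At power frequencies the whole cross-section of the wire carries current, but at the frequencies involved in a discharge only a very thin outer layer does, which is why surface area gets so much attention in ground conductor selection.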
Circulating ground currents create their own electrical noise, so are to be avoided. In principle, they're easy to stop. Just keep everything at the same electrical potential or voltage. Current will only flow between two points that have a difference of potential. (Recall how static discharge occurs.) If we ground everything together with heavy wires, then everything should be at "equal potential" and no current will flow. Not surprisingly, this is called an "equal potential ground" and is exactly what J-STD-607-A is trying to achieve. The difficulty is doing it in a practical way. It's unrealistic to weld everything in the building, or even in just the data center, together with heavy copper bars. We need to use practical wire sizes and attach them the right way, and at the best places, to everything in the room and then run those wires the shortest reasonable distances to solid ground bars. We also need to get all of our grounding bars connected together with heavy gauge wires so they are at essentially the same potential and then get them run to the primary building ground -- the same point to which the building electrical service is connected -- so that everything stays at the same electrical level. This is where the "art" of grounding design comes in.
It should by now be obvious why "isolated grounds" have no place in the data center. The minute a metal chassis is screwed into a metal cabinet, another ground path is established -- and not a very good one either. Each piece of equipment does the same thing, until there are multiple ground paths, none of them very low-impedance, all running through small-gauge wires and ending up at the building ground via different paths of all different lengths. The result is a poor static ground and loads of circulating currents due to the many different electrical levels that result. It's a waste of money on something that will be counter-productive in the end.
We must also talk about the business of connecting to building ground. This is a safety issue, absolutely required by code. A good telecommunications ground can be built as a "separate system" all the way to the electrical vault, although it should really be bonded to building steel and local electrical panels at various places along the way. It can even have its own set of ground rods if that becomes necessary to approach the lower 5-Ohm ground resistance recommended for telecommunications services. But these ground rods had better be bonded to the main electrical ground for the building. If you have a vendor who tells you they require a "separate ground" connected only to its own ground rods, tell them to consult a qualified engineer or code authority. God forbid there should ever be something called a "ground fault" in your incoming, high-voltage, building electrical service. The soil resistance between the separated grounds will result in a huge voltage difference if a "fault" occurs, and the resulting current will instantly boil the earth. The force of the explosion could put the basement slab on the second floor, and the resulting power surge on your "separate ground" could fry everything, and everybody, that's in contact with a grounded device. In short, this is not a wise approach.
There's one more factor we will mention, but not try to explain because it's really the province of the electrical engineer to determine. This is the code requirement for a "neutral bond" on the secondary ("load") side of a transformer. The code defines a transformer, such as is often found in a large PDU and a full-time UPS, as a "separately derived source." This means that a neutral-to-ground bond is required. How this is connected to the telecommunications static ground is sometimes a little tricky and can require some analysis as well as a thorough understanding of equal potential grounds in general and the UPS and PDU designs in particular. We have often found ourselves advising the electrical engineer on this issue at the same time we provide advice regarding the telecom ground.


We should not close this discussion without at least mentioning the "ultimate" in telecommunications grounding practice -- the "PANI" ground. This approach actually divides the ground bar into four sectors identified as "producers," "surge arrestors," "non-isolated" and "isolated" ground networks (PANI). This is an even more exacting method of ensuring that ground currents flow within the ground bar in a way that further avoids ground current interaction. PANI grounds are used in major telecommunications carrier installations and are often required by the military. The photographs show a superb PANI ground installation. If you look closely, however, you may notice a couple of connections made after the fact by unknowledgeable electricians who must have thought that the care taken in the original installation was by someone far too anal-retentive. The electrical trades just don't understand telecom grounding.
In short, good data center grounding requires understanding, careful planning (as does any technical design), proper execution and good supervision. It is not inexpensive, but it could easily make the difference between reliably functioning equipment and never-ending data errors and failures. Take your choice.
Posted by Robert McFarlane
Concurrent maintainability -- your best insurance policy
28 OCT 2005 08:37 EDT (12:37, GMT)
Someone walks into your data center. On the way through, they open the door of a big PDU and throw the main breaker, or go up to an air conditioner and turn it off. Calmly, you stroll over to the affected unit, restore it and go about your daily business. Is this how you would react if something like this occurred in your data center?
If not, you're unprepared for the kinds of things that can happen any day, at any time. Worse yet, you're unable to properly maintain your facility so as to minimize the chance of unplanned failures. It might be that something simply died. It might be that someone was meddling where they shouldn't. If you take testing and preparedness seriously, it might even be a consultant or compliance officer carrying out a random verification of sustainability. (Yes, that is really done some places.) The cause is immaterial. It's the effect that counts.
All equipment fails. It's not a matter of "if," but "when." And all of your infrastructure is doomed to early -- and even to recurrent -- failure if it can't be easily and regularly maintained. Service contracts are an important and very necessary first step. When properly written, they guarantee not only timely repair when something goes AWOL, but a regular inspection and preventive maintenance schedule as well. Equipment needs to be accessible for this to occur -- proper service clearances, not blocked by stored hardware, legal distances in front of energized parts, etc. But if there's anything that can't be shut down for hours or perhaps even days without jeopardizing your operation, there's no way to properly maintain it, with or without a service contract, and you're walking a very thin tightrope without a net.
The principle here is something we call "concurrent maintainability." Very simply, it means that anything in your data center can be shut down and kept down for a period of time, without directly affecting ongoing processing. Obviously, during this period you will lose all or part of your redundancy in some part of your installation, so maintenance shutdowns bring with them a certain level of exposure. But if everything has been well maintained, the statistical chance of a simultaneous failure practically drops off the charts. Further, to actually bring down a facility that is designed and installed for concurrent maintainability most likely takes not two, but at least three sequential events. This is the same scenario pilots are taught about that causes unintentional contact with the ground. One or two things won't do it. It always takes at least three failures and/or mistakes.
In the case of a data center a catastrophe might require:
- the initial maintenance shutdown,
- a second UPS failure, major power failure or the like and
- failure of the bypass, generators or whatever third level of protection is built into the system being maintained.
And, in most cases, a human error will contribute to the sequence. (See my first blog article at the bottom of this page.)
Not every data center justifies total "Tier 4" redundancy, but there are very few businesses today that can survive a very long outage in their processing. If the thought of a single item shutdown gives you nightmares, perhaps you should be showing this article to your management and suggesting that some investment in a little more robustness might be a worthwhile business decision.
Posted by Robert McFarlane
UPS -- it's NOT uninterruptible
27 OCT 2005 07:36 EDT (11:36, GMT)
One of the most deceptive designations of all time is "uninterruptible power supply." The false sense of security this name implies has trapped many uninitiated. Why? Just two reasons: poor design configuration by the engineer or sales rep and lack of understanding on your part.
When everything works right, the UPS really does live up to its name. But when it doesn't, it can be a solid barrier between your equipment and perfectly functioning building or generator power. Try explaining that one to management! The lights are on, but your data center is down because that expensive UPS you wanted is out of commission. Have your resume ready!
In simplest terms, a UPS converts incoming AC power to DC through a rectifier, then back to AC through an inverter. The DC power also keeps the batteries charged. If incoming power fails, the batteries start discharging into the inverter to keep power flowing.
There are two basic types of UPS. (We will not try to deal with flywheels or other more esoteric devices in this limited space.)
- "Full-time" UPS, in which the equipment is continuously powered from the re-created alternating current from the UPS. This is also known as a "double conversion" UPS.
- "Line-interactive" UPS, in which the equipment is actually running from normal building power, with some filtering, until power fails, whereupon the load is quickly switched to the actual UPS and the batteries start to drain. The prices of these units are kept down by making the rectifier large enough to charge the batteries, but not big enough for the full equipment load.
With full-time UPS, the equipment never sees the outage -- not even a ripple. Line-interactive UPS takes two or three power cycles (about 1/30 second) to switch to battery support -- a time short enough for equipment power supplies to maintain the computers. They "see" the interruption, but it rarely affects them.
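As a rough check on the switching time quoted above, the conversion from power cycles to seconds can be sketched as follows. The 60 Hz default is an assumption (North American power); on a 50 Hz system the same number of cycles takes a little longer.

```python
def transfer_time_seconds(cycles: float, line_frequency_hz: float = 60.0) -> float:
    """Time taken by a given number of AC cycles at the given line frequency."""
    return cycles / line_frequency_hz


if __name__ == "__main__":
    # Two or three cycles is the switching range mentioned in the text.
    for cycles in (2, 3):
        milliseconds = transfer_time_seconds(cycles) * 1000
        print(f"{cycles} cycles at 60 Hz is about {milliseconds:.1f} ms")
```

Two cycles at 60 Hz is the "about 1/30 second" figure; three cycles is closer to 1/20 second.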
Most of what follows assumes full-time UPS designs, since that is what is generally used in full-blown data centers. We must also caution about the use of line-interactive UPS with generators. Private generator power is not as well-stabilized as commercial power feeds. It's a lot better than darkness, but it can fool line-interactive UPS units into thinking power is being randomly restored and interrupted, causing the UPS to keep switching back and forth. As will be seen from what follows, this can ruin batteries, as well as the UPS, and the multiple switchovers can also be more than your hardware can tolerate.
So let's look at what can go wrong, and actually make the UPS "interruptible."
Batteries: Most UPSs today use "sealed-cell" batteries, properly known as VRLA (valve-regulated lead acid). These can be used in a normal, occupied environment because they don't emit explosive hydrogen gas like flooded lead acid "wet cells" do. Any battery can fail, but VRLA cells have a much shorter service life: a 5-10-year warranty as opposed to a 20-25-year one for wet cells. But that's if they're not used. If you're in an area that experiences multiple, short duration power losses, VRLA batteries have been known to fail in as little as a year. And, of course, failures occur most often when they're put under load; in other words, when there's a power failure and they're most needed. Since battery cells are connected in series, like those little Christmas lights, if one cell fails "open" (the usual failure mode), battery power stops and your UPS is dead -- immediately! Remedy? Dual or multiple battery strings and either automatic or regular battery testing.
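To illustrate why a second battery string helps so much, here is a minimal sketch of the series-string arithmetic. It assumes cells fail independently, that each string alone can carry the load, and it uses a made-up per-cell failure probability and cell count. Real batteries share heat, age and charging conditions, so treat the numbers as illustration only.

```python
def string_survival(cell_failure_prob: float, cells_in_series: int) -> float:
    """Probability that one series string delivers power: every cell must work."""
    return (1.0 - cell_failure_prob) ** cells_in_series


def bank_survival(cell_failure_prob: float, cells_in_series: int,
                  parallel_strings: int) -> float:
    """Probability that at least one of several parallel strings still works."""
    string_fails = 1.0 - string_survival(cell_failure_prob, cells_in_series)
    return 1.0 - string_fails ** parallel_strings


if __name__ == "__main__":
    # Hypothetical values: 40 cells per string, 0.1% chance a cell fails open.
    for strings in (1, 2):
        print(strings, "string(s):", bank_survival(0.001, 40, strings))
```

The point of the sketch is the shape of the result: one long series string multiplies the risk of every cell together, while a second string only fails the bank if both strings fail at once.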
Bypass: Virtually all UPSs have internal maintenance bypass. This allows a technician to work on the insides safely. It's also supposed to click in automatically when the batteries run out, go bad or some other UPS failure occurs. But most UPSs have components -- usually input or output transformers -- that are outside the "bypass" chain. These things don't fail often, but when they do, you're dead in the water. Power is coming into your building, but it can't get past your UPS. Very embarrassing. Hard to explain. In one case, we saw a transformer literally go up in flames and fry not only itself, but the UPS innards as well. We've also seen instances where the internal bypass failed and there was no way to manually operate it.
There are only three ways around these situations:
- Run and hide. (Not a good career choice.)
- Get an electrician to wire around the UPS. (Time-consuming.)
- Install "full wrap-around bypass." (Initially higher cost, but safer.)
The latter is always our choice, but we often have to fight for it against "statistical failure" data and "value engineering" pressures. If you ever experience one of these failures first-hand, statistics become meaningless and the "savings value" plummets to zero. We would never advise a client to install a UPS without full wrap-around bypass.
Redundancy: This is a large topic and more complex than we can cover thoroughly in this forum. Suffice it to say that there are many approaches to UPS redundancy, all with differing levels of protection. Maximum reliability is achieved with a fully redundant "2N" design, with each UPS running at less than 50% load and static transfer switches to shift load within a few power cycles of a module failure. This is obviously also the most expensive and is not justified for everyone. Every step below this carries an increased risk -- sometimes very small and sometimes significant -- and the specifics of equipment selection and connection can make major differences in even the "ultimate design" performance. For example, with any redundant design, one of the most important things to verify is how the UPS responds to an instantaneous doubling of power draw ("step function"), since that is exactly what will happen if a module fails. With primary-side static transfer switches, it's important to look at how current rise is controlled, since the sudden current change created by switching can cause something called "saturation" in downstream transformers, resulting in unacceptable waveform distortion. There are many dozens more things to consider in arriving at the most realistic, cost-justifiable UPS for your needs.
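The "less than 50% load" rule for a 2N design, and the load doubling that follows a module failure, can be sketched as below. The function name and the kW figures are hypothetical, and the sketch assumes two identical modules sharing the load equally.

```python
def check_2n(load_kw: float, module_capacity_kw: float) -> dict:
    """Per-module loading for a two-module 2N system, before and after a failure."""
    normal_fraction = (load_kw / 2.0) / module_capacity_kw
    # If one module fails, the survivor picks up the whole load at once.
    failover_fraction = load_kw / module_capacity_kw
    return {
        "normal_fraction": normal_fraction,
        "failover_fraction": failover_fraction,
        "survives_module_loss": failover_fraction <= 1.0,
    }


if __name__ == "__main__":
    print(check_2n(load_kw=80.0, module_capacity_kw=100.0))
    print(check_2n(load_kw=120.0, module_capacity_kw=100.0))
```

In the second call each module runs at 60% in normal operation, which looks comfortable, but the survivor would be asked for 120% of its rating after a failure. This checks capacity only; how the UPS behaves during the step itself still has to come from the vendor's data.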
Air conditioning and battery duration: Let's say it bluntly. There's no value in having four hours of battery if your hardware (including your UPS) is going to go into thermal shutdown in 10 minutes due to lack of air conditioning. Unless you have a backup generator, and your total air conditioning plant is properly connected to it and has been thoroughly tested in a real "pull the plug" commissioning process, most everything you have is going to be down in less than 30 minutes anyway. Big blade centers may make it only a minute or two without air, and some of the newest hardware can be down in seconds. If you have a generator, 15-30 minutes of battery should be more than enough. And if it doesn't start, it's still probably enough since you won't have air conditioning without it. The only exception may be IDF rooms with small stackable network switches for VoIP phones. If the heat rise in the room is slow and you can keep things cool by opening the door, and if you can keep the central phone and network switches running by shutting everything else down, then as much as four hours of UPS might be considered for those devices alone in order to keep the phones working as long as possible.
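The battery-versus-heat argument reduces to taking the smaller of two numbers. A minimal sketch, with hypothetical function names and the four-hour and 10-minute figures from the paragraph above:

```python
def usable_battery_minutes(battery_minutes: float,
                           minutes_to_thermal_shutdown: float) -> float:
    """Battery time you can actually use when cooling is lost with the power."""
    return min(battery_minutes, minutes_to_thermal_shutdown)


def wasted_battery_minutes(battery_minutes: float,
                           minutes_to_thermal_shutdown: float) -> float:
    """Battery time paid for but never used because hardware overheats first."""
    return battery_minutes - usable_battery_minutes(
        battery_minutes, minutes_to_thermal_shutdown)


if __name__ == "__main__":
    # Four hours of battery, thermal shutdown after 10 minutes without cooling.
    print(usable_battery_minutes(240, 10), wasted_battery_minutes(240, 10))
```

With those inputs, 230 of the 240 battery minutes buy nothing. The minutes-to-shutdown figure has to come from your own room and hardware; the sketch does not estimate it.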
In short, UPS is expensive. Question everything. Examine each potential failure scenario, and evaluate the cost of remedy against the potential cost to your business. Ask each vendor what to ask their competitors and insist on thorough explanations, from both the sales reps and your engineer. If they seem unsure, or if it sounds like doubletalk or obfuscation, dig deeper. You don't need to be an engineer to understand the operational tradeoffs. There's too much money and business risk involved to take anyone's word at face value.
Posted by Robert McFarlane
Let's add an air conditioner
26 OCT 2005 07:26 EDT (11:26, GMT)
It happens all the time. Chances are, it's happened to you. Equipment is added. It starts getting hot. You call Facilities. They bring in another big air conditioner. They put it wherever it will fit. It's disruptive. It costs a lot of money. And the hot equipment is still hot!
In yesterday's blog we gave the first steps to getting the most out of what you already have, perhaps avoiding, or at least postponing, the need to bring in more cooling. But if that still doesn't do it, you're going to need another remedy.
In the old mainframe days, we could just pressurize the floor and the air would come up through the openings right under the equipment. There weren't very many holes, so the air pretty much had to go through them. And the hardware was designed so an opening in the right place would push cold air right into the box, exactly where the manufacturer wanted it. Those days are gone. It's much more complicated now. We're trying to push a lot of air through a lot of perforated tiles, and it just doesn't work as nicely as we think it should. Why?
The air coming out of a computer room air conditioner (CRAC unit) moves fast. Basic physics (Bernoulli's law) tells us that the faster it moves, the lower the pressure -- the very pressure needed to push it through the tiles. Most people are surprised to learn that, even 8-12 feet in front of the CRAC, the pressure may actually be negative, pulling air back down through the tiles rather than pushing it up. The Computational Fluid Dynamics (CFD) illustration below shows that happening (image is from Innovative Research's "TileFlow" CFD program). Look for the red arrows pointing downward close to the CRAC that hasn't been turned off (opposite the one with the red "X"). You definitely do not want your highest heat hardware closest to your CRAC units. That's the place for patch racks and other relatively benign equipment.

There is a company called Technology Connection that advertises that they have developed analytical methods and products that enable them to reduce that velocity, re-channel the air and equalize the pressure. We have discussed this with them, and their reasoning seems scientifically sound, but we have yet to make a first-hand evaluation of their results. We can only mention them without comments, pro or con.
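The velocity-versus-pressure effect described above can be put into rough numbers. The sketch below is a simplification, not a substitute for CFD: it ignores friction and turbulence, works in SI units, and the air density, total pressure and air speeds are all assumed values chosen only to show the trend.

```python
AIR_DENSITY_KG_M3 = 1.2  # assumed: roughly room-temperature air at sea level


def velocity_pressure_pa(speed_m_s: float,
                         density: float = AIR_DENSITY_KG_M3) -> float:
    """Dynamic (velocity) pressure, 0.5 * rho * v^2, in pascals."""
    return 0.5 * density * speed_m_s ** 2


def static_pressure_pa(total_pressure_pa: float, speed_m_s: float) -> float:
    """Static pressure left over to push air up through the tiles."""
    return total_pressure_pa - velocity_pressure_pa(speed_m_s)


if __name__ == "__main__":
    total = 25.0  # hypothetical under-floor total pressure relative to the room, Pa
    for speed in (10.0, 5.0, 2.0):  # hypothetical air speeds, m/s
        print(f"{speed} m/s -> static {static_pressure_pa(total, speed):.1f} Pa")
```

With these assumed inputs the fastest air leaves a negative static pressure, which is the condition that pulls room air down through the tiles, while slower air farther from the unit leaves a positive one.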
Second, we now know that air discharges from CRACs in what are known as "plumes." These are like horizontal columns of air that spread out a little as they get farther away, but not really a whole lot -- especially if there is another CRAC blowing the same direction nearby. In that case the two air plumes create a barrier where they come together, keeping the air from one CRAC from mixing with or supplementing the air from the other. The CFD model illustrates this effect. (It also illustrates the pressure increase in the middle when two opposing air streams collide.) So adding a CRAC without carefully considering where it is going and what else is around may actually reduce the air flow to the places you need it most.

Worse yet is putting one CRAC at a 90-degree angle to another. This often occurs at the corner of a room, especially when there isn't room to put a unit anywhere else. You can easily demonstrate this effect with two garden hoses. Just shoot one stream at the other and see what happens where they meet. Air is a "fluid," so it obeys the laws of fluid dynamics. If one CRAC blows the air from another CRAC away from where you need it, it's counterproductive. If there's no other choice, at least space them as far from each other as possible.
So what can you do? First, make sure you really need more cooling capacity (see my previous blog entry). Look for openings that waste air. Look for blockages under the floor. Measure your total load (your UPS should tell you) and compare it with your air conditioning capacity. As a quick estimate, figure that 80% of CRAC tonnage is available for equipment cooling. (The rest goes for humidification, normal losses and inefficiencies.) 1 kilowatt (kW) = 0.283 tons.
EXAMPLE: For a 100 kW load, you need 28.3 tons of cooling. Multiply this by 1.25 (the inverse of the 80% derating) to get 35.4 tons. If you have four 15-ton air conditioners -- and assume 25% redundancy -- you should have 3 x 15 or 45 tons of constant cooling capacity, which is about 27% more than you're supposed to need. So if things are getting too hot, you should be looking at less expensive solutions before you just roll in another CRAC. (This assumes, of course, that you don't have miles of windows adding lots of solar load or some hidden problem with outside air infiltration.) In short, your problem shouldn't be capacity. You're just not getting the air where it's needed. It may take expert analysis to solve this, or to determine whether there's a solution at all based on how your room is laid out, but there are several things you can look for first.
- Are your high heat loads located where you have the best cooling and the most spare capacity? It's easier to move the cabinets than to install a CRAC.
- Are the dampers wide open on all the perforated tiles, including those where there are low heat loads? (This is bad.) Are there any perforated tiles in "hot" aisles? (This is worse.) Try closing down tiles where less air is needed and closing them completely where no air is needed. Get them out of hot aisles!! More air will be available where you really need it.
- Are CRAC units pulling cold air back into the returns? If you have enough ceiling height, try putting extension ducts on top of them so they draw mainly hot air from closer to the ceiling. You can even extend duct work all the way down hot aisles to ensure that only hot air gets back to the units. Again, this is a lot cheaper than more air conditioners, and in many cases will end up working better.
- If your cabinet manufacturer makes fan-powered doors or in-cabinet blowers that add onto your cabinets, try them. See if they help. They're not all equal. One type may do a better job than another. You can't always believe the marketing claims, but it's worth a try. Just one caution; blowers that suck the air out of the floor may cool the cabinets they're in very well, but may also steal too much air from other cabinets in the process. Don't solve one problem by creating another.
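The capacity arithmetic from the EXAMPLE above can be sketched in a few lines. The 0.283 tons-per-kW factor and the 80% usable fraction are the rule-of-thumb figures quoted in the text; the function names are hypothetical, and the default of one redundant unit is an assumption that matches the four-unit example.

```python
KW_TO_TONS = 0.283       # rule-of-thumb conversion quoted in the text
USABLE_FRACTION = 0.80   # share of CRAC tonnage available for equipment cooling


def required_tons(load_kw: float) -> float:
    """CRAC tonnage needed for a load, after derating for losses."""
    return load_kw * KW_TO_TONS / USABLE_FRACTION


def available_tons(unit_count: int, tons_per_unit: float,
                   redundant_units: int = 1) -> float:
    """Tonnage you can count on with the redundant units held in reserve."""
    return (unit_count - redundant_units) * tons_per_unit


def headroom_percent(load_kw: float, unit_count: int, tons_per_unit: float,
                     redundant_units: int = 1) -> float:
    """How far available capacity exceeds (or falls short of) the requirement."""
    needed = required_tons(load_kw)
    have = available_tons(unit_count, tons_per_unit, redundant_units)
    return (have / needed - 1.0) * 100.0


if __name__ == "__main__":
    print(f"needed:    {required_tons(100):.1f} tons")
    print(f"available: {available_tons(4, 15):.0f} tons")
    print(f"headroom:  {headroom_percent(100, 4, 15):.0f}%")
```

A positive headroom figure with a room that is still running hot points toward an air distribution problem rather than a capacity problem, which is what the checklist above is for.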
If none of this works, or can't be done, and you really need additional cooling, consider some of the newer, localized or "spot cooling" solutions. There are several approaches. We can't go into the pros and cons of each in this space, but here is a quick summary. The order has nothing to do with how good they may be, either in general or for your particular needs.
- Water-cooled cabinets may solve individual problems. If you need more than a few, however, you're probably better off with another solution. (See Knürr, Sanmina-SCI's Ecobay, RIMatrix5, APC and Liebert XDF.)
- "Heat containment rooms" are areas within your data center. They consist of two rows of cabinets, back-to-back, totally closed-in with a ceiling, end walls and their own power and air conditioning. They completely isolate the hot air from the cold, making them very efficient, and they definitely work. They can be made pretty much as long as you want, but it's best to decide on the size you will grow to when you put one in. They can be expanded, but it can be problematic. CAUTION: The center "hot aisle" of this design is definitely hot! Be prepared for objections from your technicians when they need to work inside. (See APC's Hot Aisle Containment.)
- Flexible "spot cooling" enables you to add cold air to the cold aisle at just the cabinets that need it and to easily add or move cooling if your hardware locations change. It requires fundamental infrastructure (piping and power), but it's no more difficult or disruptive to install than adding one CRAC. Once in place you have modular growth as well as significant flexibility for a number of cabinets from a single installation. Note, however, that you don't put this in for one or two cabinets. It's designed for a significant load, like from four to 20 cabinets. (See Liebert's XD line.)
Whatever solution you may choose, it's best to locate these specialized systems in a designated area of the data center. It's more cost effective, easier and less disruptive to install and manage if it's not scattered all over. And if you still think another CRAC is your best answer (and in some cases it may be), be sure a thorough, knowledgeable analysis has been done -- including CFD modeling -- before you spend that money.
Posted by Bob Konigsberg
Block those holes!
25 OCT 2005 00:16 EDT (04:16, GMT)
Where does all that air go? One thing's for sure -- in most data centers much of it never makes it to the equipment it's supposed to cool. Lots of cold air leaks out of a multitude of openings in the floor tiles, doing virtually nothing. And a lot more disappears right in front of the cabinets after it gets out of the floor. Air conditioning is expensive, and that's a lot of wasted energy and a pile of wasted money, to say nothing of the shorter life you get from equipment that overheats.
It wasn't so critical a few years ago. Energy was cheaper and heat loads weren't as high. But with fuel costs going through the roof and heaters being shipped to data centers disguised as computers, we now have to make things a lot more efficient. The fundamentals are actually easier than you might think. In fact, basic remedies are downright simple, and pretty darn cheap compared with installing more refrigeration.
In most data centers, 25% or more of the cold air is probably being lost. There are two major places to look: your raised floor and your equipment cabinets. Let's start with the raised floor.
The biggest holes are usually the ones the cable comes through (although we've seen entire floor tiles removed, which is just complete foolishness). It used to be standard practice to just cut a 6- or 8-inch square hole, or even larger, no matter how many or how few wires needed to go through it. At one time, when mainframes used those huge "bus and tag" cables, large openings were needed to pass the oversized connectors. And since those holes were usually under equipment that was cooled from below anyway, it really didn't matter. Not so today. RJ-45s, and even the largest power plugs, will go through a much smaller hole. But an amazing amount of air will still leak through that opening, around the spaces that aren't filled with wires. Those holes have got to be sealed. There are two ways: Make some kind of seal yourself -- out of Masonite and duct tape or some such contrivance -- or use a commercial product made for the job that makes it easy to add or remove cables in the future. Two such products are the KoldLok Brush Grommet and the Sub-Zero Pillow. Take your choice. The Pillow will seal most holes more completely, is less expensive, easier to install and adapts to a wide variety of opening sizes. The Brush Grommet comes in only a few sizes, stops most of the air but not all and can be a little pricey, but it's a lot neater, and no one can remove it and forget to put it back.
Next, look for all those places where pipes, conduits or anything else penetrates the floor. Unlike cables, which are subject to change, these things aren't going anywhere. Seal them with Fire Stop Putty or any good caulking that won't dry up and shrink. If they're too big, the fire stop manufacturers make products to go behind the putty (CableOrganizer.com, NelsonFireStop.com and a host of others). Just don't use fiberglass, mineral wool or any other product that can flake off and get into the air going to your equipment.
Now look all around the room where tiles have been cut to the walls or air conditioners or anything else. A good quality, closed-cell weather stripping will usually seal all these openings. Lastly, look for tiles that don't seat tightly. Some air will leak through the seams between the floor tiles. That's inevitable unless the installation has been made with special products and techniques that fully seal these joints, which is highly unlikely in a data center. But the amount of leakage in a normal, well-installed floor is tolerable IF you have sealed all the other holes. If the floor is older, it may be necessary to have a raised floor contractor come in to re-level the tiles and get them as well aligned and seated as possible. After equipment is in place, however, there can only be a certain amount of improvement. Tiles trapped under equipment racks can't be moved or re-aligned, so they will determine how well adjacent tiles can be aligned. But every little bit helps.
Now let's get to the easiest, most overlooked and usually most effective way to improve cooling in the whole data center: unused panel spaces in cabinets. We must assume that your layout conforms with the accepted "hot aisle/cold aisle" approach, with cabinets oriented "front-to-front" and "back-to-back." If not, there aren't many things you can do to help except to re-orient your cabinets and change your whole layout, which is obviously not easy. But if your installation is "hot/cold aisle," you just MUST close those unused panel spaces. If you don't, the air you manage to push through your perforated tiles gets up to the first unused panel space and just flows right through the cabinet to the back. It's called "bypass air," and it does two really bad things. First, it starves all the equipment above the opening of cold air. There's always a temperature gradient from bottom-to-top that makes the upper equipment run hotter than that closer to the base of the cabinet, but if most of the cold air has escaped through the cabinet before it even gets to the top, that upper hardware is going to run much hotter and will have a much shorter life. Second, the cold air bypassing through the cabinet mixes with the hot air that must return to the air conditioners, cooling it down. That's the air that tells CRACs how much new cold air to put out. If the return air is already cooled down somewhat, it fools the air conditioners into thinking everything is fine, so they stop working so hard. The result? Less cooling to the hardware, higher temperatures, shorter life and some strange cycling of the air conditioners that can also upset the humidity control.
And there's another factor. (Who said this was easy?) Not only can cold air bypass from front to back, but hot air can bypass from back to front. Since warm air rises naturally, this just worsens a bad situation by delivering even warmer air to the upper computers. In short, you're engaging in "computer euthanasia" simply by leaving these openings. Is it any wonder that the servers toward the tops of the cabinets statistically have a higher failure and error rate than those at the bottom? Load cabinets from bottom to top, and then close all the remaining spaces with blank panels. If you make a lot of changes, or you can't get people to pick up a screwdriver to replace the panels, several manufacturers now make "snap-in" panels. IBM and SMC make them, too, if you can ever locate them on their Web sites. There are probably others, and we know of several cabinet manufacturers who are planning to come out with them. Snap-ins are a little more expensive, but there's simply no excuse for not putting them back in when a change is made.
Posted by Robert McFarlane
Where do failures originate?
24 OCT 2005 07:00 EDT (11:00, GMT)
The famous Pogo line applies: "We have met the enemy and he is us."
Sad, but true, the vast majority of failures in data centers are caused, triggered or exacerbated by human error. Like it or not, the hardware and software are a lot more reliable than we are.
We'll never completely eliminate our human failings, but we can sure do some things to make ourselves less vulnerable. Here are a few. You can undoubtedly add to the list yourself, especially after giving some deep thought to the next instance or past occurrence or even to those conditions you've noticed, but haven't acted on yet.
Logic! Nothing traps us into inadvertent mistakes like things that aren't what they seem. Here are some cases of what should be, but too often isn't:
- Circuit breaker order in the panels clearly related to cabinet rows.
- A and B circuits in each cabinet on the same breaker number in each pair of PDUs (assuming proper design for dual-corded hardware).
- Receptacles exactly "mirrored" inside cabinets.
- Dual-corded circuits run from "paired" PDUs. (All devices on PDU A are also on PDU B. NOT Device 1 from PDUs A and B, and Device 2 from PDUs B and C, etc. This is nearly impossible to keep track of and makes power balancing a nightmare with a built-in overload trap if a PDU fails.)
It may not be possible to achieve all this in an existing data center, but anytime you need rewiring or must add circuits, it's an opportunity to take a step forward instead of exacerbating the problem.
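The "paired PDU" rule and the overload trap it avoids can be checked with a short script. This is only a sketch: the device names, PDU names and kW figures are made up, and it assumes each dual-corded device splits its draw evenly across its two cords and shifts the full draw to the surviving cord when one PDU fails.

```python
from collections import defaultdict


def unpaired_devices(feeds: dict, pairs: set) -> list:
    """Devices whose two cords do not land on one designated PDU pair."""
    return sorted(name for name, (a, b) in feeds.items()
                  if frozenset((a, b)) not in pairs)


def load_after_failure(feeds: dict, device_kw: dict, failed_pdu: str) -> dict:
    """Per-PDU load in kW once failed_pdu has dropped out."""
    load = defaultdict(float)
    for name, cords in feeds.items():
        survivors = [pdu for pdu in cords if pdu != failed_pdu]
        for pdu in survivors:
            # A device left with one live cord puts its whole draw on it.
            load[pdu] += device_kw[name] / len(survivors)
    return dict(load)


if __name__ == "__main__":
    feeds = {"device1": ("A", "B"), "device2": ("B", "C")}  # hypothetical wiring
    device_kw = {"device1": 4.0, "device2": 2.0}             # hypothetical loads
    pairs = {frozenset(("A", "B"))}                          # A and B are the pair
    print(unpaired_devices(feeds, pairs))
    print(load_after_failure(feeds, device_kw, failed_pdu="A"))
```

Comparing each surviving PDU's figure against its rating shows where the hidden overload would occur. With clean pairing the check is trivial, because each PDU's worst case is simply the combined load of its pair.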
Labels! Even if you can't get everything physically organized, you can still label it clearly. The best job I've ever seen used color-coded labels for every PDU, with character sizes you could read across the room and corresponding color labels on every circuit. Likewise, clear, meaningful and organized cabinet and patch panel labels make everyone's life easier. Avoid relying on the "one guy" who knows where everything is.
Another great approach is labeling each tile row at the upper wall or ceiling so any location can be identified by an alpha-numeric grid identification that can be seen from anywhere in the room. It makes it faster to find things, above or below the floor, even for the inexperienced.
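One way to generate such grid labels is sketched below. The scheme itself (letters for tile columns, numbers for tile rows, counted from one corner of the room) is an assumption made for illustration; any consistent convention posted on the walls works just as well.

```python
import string


def column_letters(column_index: int) -> str:
    """Zero-based column number to letters: 0 -> A, 25 -> Z, 26 -> AA."""
    letters = ""
    remaining = column_index + 1
    while remaining > 0:
        remaining, offset = divmod(remaining - 1, 26)
        letters = string.ascii_uppercase[offset] + letters
    return letters


def tile_label(column_index: int, row_index: int) -> str:
    """Grid label for a floor tile, such as 'C7', from zero-based indexes."""
    return f"{column_letters(column_index)}{row_index + 1}"


if __name__ == "__main__":
    # Labels to post along one wall of a hypothetical 30-tile-wide room.
    print([column_letters(i) for i in range(30)])
    print(tile_label(2, 6))
```

The same labels can then be recorded against each cabinet, PDU and under-floor junction, so a location written in a trouble ticket means the same thing to everyone.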
And don't forget about machine labels! "Cutesy" idents can be fun for the programmers, but they don't tell people much about what each machine does, what goes with what or who's responsible for it. If the apps guys insist on clever server names, add another to the tag that's more descriptive. And absolutely include the name and number/e-mail of whoever is knowledgeable about and responsible for each device.
Visibility! Something else to consider is clear Plexiglas panels at key locations in your raised floor. Put them over CRAC unit thermometers and valves, ground bars and anyplace else something could go wrong, but go unnoticed. Just don't put excessive weight on them. They're not as strong as regular panels.
In short, make your data center an easy place for everyone to work in. If you're doing your job well, there's no need to withhold key information as a form of job protection, and you certainly shouldn't let anyone else do it either. If you're not there and a problem occurs, you'll be found out real fast, with a different end result than the one you were planning on.
Posted by Robert McFarlane