The IT world is known for its many legends, myths and acronyms. One of the most frequently mentioned terms in the industry is “nines”, which is often used to express reliability standards. The nines are a way of informing consumers about the availability of a service. Carrier grade, for example, is a standard within the telco world that defines performance and reliability. The nines help consumers estimate probable uptime and, conversely, the probability of failure and the reliability of a product. By definition, reliability is about whether a system performs correctly, while availability is about whether a system is ready for use.
IT equipment manufacturers provide estimates of product reliability in the form of MTBF and MTTR. Mean Time Between Failures (MTBF) is the average time a module operates between one failure and the next. Mean Time To Recovery (MTTR) is an estimate of the average time required to repair or recover a module. We should bear in mind that these estimates are provided by manufacturers, so we shouldn’t treat them as hard and fast standards. Manufacturers may follow different quality assurance procedures to ensure that products meet specific standards.
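The two figures are commonly combined into a steady-state availability estimate, availability = MTBF / (MTBF + MTTR). The following is a minimal sketch of that calculation; the function name and the sample figures are illustrative assumptions, not values published by any manufacturer.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    # Fraction of time the module is up: uptime / (uptime + repair time).
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical module: fails on average every 10,000 hours, takes 4 hours to repair.
estimate = availability(10_000, 4)
print(f"Estimated availability: {estimate:.4%}")
```

Because the result depends on both inputs, a module with a modest MTBF can still reach a high availability figure if it can be repaired or replaced quickly.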
In the industry, the nines are usually defined as follows:
- 1 nine – 90% = Downtime of 36.5 days per year
- 2 nines – 99% = Downtime of 3.65 days per year
- 3 nines – 99.9% = Downtime of 8.76 hours per year
- 4 nines – 99.99% = Downtime of 52 minutes per year
- 5 nines – 99.999% = Downtime of 5 minutes per year
- 6 nines – 99.9999% = Downtime of 31 seconds per year
As we can see, the nines are simply percentages that define the availability of a module or service.
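The downtime figures in the list follow directly from the percentages. As a rough sketch, assuming a non-leap year of 365 days, they can be reproduced like this (the function name is our own):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Return the yearly downtime allowed by the given number of nines."""
    # 1 nine -> 10% unavailable, 2 nines -> 1% unavailable, and so on.
    unavailability = 10 ** -nines
    return SECONDS_PER_YEAR * unavailability


for n in range(1, 7):
    seconds = downtime_seconds_per_year(n)
    print(f"{n} nines: {seconds / 3600:.4f} hours ({seconds:.1f} seconds) per year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why moving from three nines to five is far harder than the small change in percentage suggests.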
These days, telco networks and data centers are required to stay operational 24 x 7 x 365. The market and customers expect this level of availability. Integrators and manufacturers need to make sure they can offer acceptable levels of availability and reliability to meet that demand. Manufacturers always strive to create modules and resources with the highest standard of availability. To do this, they need boundaries, targets, measures and guides to work to, and the “nines” fit these objectives quite well.
In reality, the existence of cloud services doesn’t increase availability requirements per se, nor does it add to the pressure on manufacturers and service providers to ensure reliability. The SLA (Service Level Agreement) of Amazon EC2 specifies 99.95% availability over the trailing 365-day period. This means Amazon could suffer roughly 4.4 hours of outage per year (0.05% of 8,760 hours), spread over a few days, and still comply with the SLA.
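A commitment of this kind can be turned into a downtime budget and compared against recorded outages. The sketch below assumes a fixed 365-day window and a plain list of outage durations; the function names and the sample outages are hypothetical, and a real SLA may define downtime differently.

```python
from datetime import timedelta

WINDOW = timedelta(days=365)  # assumed measurement window


def downtime_budget(sla_percent: float, window: timedelta = WINDOW) -> timedelta:
    """Return the total downtime an SLA percentage allows within the window."""
    return window * (1 - sla_percent / 100)


def meets_sla(outages: list[timedelta], sla_percent: float) -> bool:
    """Return True if the summed outages stay within the downtime budget."""
    return sum(outages, timedelta()) <= downtime_budget(sla_percent)


# Hypothetical outages spread over a few days of the year.
outages = [timedelta(hours=2), timedelta(minutes=45), timedelta(hours=1)]
print("Budget:", downtime_budget(99.95))
print("Within SLA:", meets_sla(outages, 99.95))
```

Several short outages count against the same budget as one long one, so a provider can fail customers on multiple days and still remain within its commitment.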
This may not be a problem for an individual who rarely purchases products from Amazon, but because the company serves millions of consumers, this level of outage could affect many people. Web server and cloud clients should also consider whether their providers offer enough nines. A “2-nine” reliability standard may sound like a good thing, but it is a bad thing if our website goes down for a total of nearly 4 days each year.
