To make sense of reliability in the IT industry, we first need a good understanding of resilience. Resilience is considered a key factor in infrastructure design, and manufacturers use it to impress their customers, often implying that resilience itself protects them. Unfortunately, this isn’t the case. Resilience isn’t about protecting the services provided to consumers; it is really a measure of how an infrastructure fails.
The Oxford English Dictionary defines resilience as the capacity to recover quickly from difficulties. If we have a group of servers running at 90 percent load all the time, eventually a component will fail due to wear, accumulated dust and generated heat. Assume that failure hits a hard drive holding an important operating system or hypervisor partition. This is where resilience comes into play: manufacturers need to provide guidance on the proper level of utilization.
There should be a guide to the lifecycle and performance we can expect from a component under specific conditions. For example, we should have a warning about the maximum temperature a hard disk drive can tolerate over a given period. That parameter indicates the resilience of the component. We may say that resilience has a limit, and we will eventually lose the hard drive to excessive use and accumulated heat.
However, multiple hard drives could host the hypervisor or operating system partition; if one fails, the others simply take the load. This arrangement improves resiliency, but those drives are subject to the same factors, such as wear and overheating. The logical outcome is that if the whole server goes down, nothing remains to continue providing the service. If a hypervisor is lost completely, all of its VMs are lost with it, and data loss may follow.
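The intuition above can be put into rough numbers. The sketch below, using made-up per-drive availability figures and the simplifying assumption that drives fail independently (which understates risk, since shared heat and wear correlate failures), shows why mirroring drives improves the odds that at least one stays up, while doing nothing for a whole-server failure:

```python
def survival_probability(per_drive_availability: float, drives: int) -> float:
    """Probability that at least one of `drives` identical, independent
    drives is operational: 1 minus the chance that all fail at once."""
    all_failed = (1.0 - per_drive_availability) ** drives
    return 1.0 - all_failed

# Illustrative figures, not vendor data:
single = survival_probability(0.99, 1)    # one drive: 0.99
mirrored = survival_probability(0.99, 2)  # mirrored pair: ~0.9999
```

Doubling the drives turns a 1-in-100 outage chance into roughly 1-in-10,000, but only for drive failures; a lost server still takes everything down.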
Resilience is tied to specific characteristics created and maintained by manufacturers. Components have finite resiliency; nothing lasts forever. Clearly, resilience isn’t a direct measure of availability; it is more closely related to reliability. Availability, in turn, is subject to reliability. So if we want a system with a high degree of availability, we should take into account the reliability of each chosen system and component. Redundancy is considered the answer for mitigating less-than-satisfactory reliability.
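The link between reliability and availability is commonly expressed as A = MTBF / (MTBF + MTTR). A minimal sketch, using hypothetical figures (a drive rated at 100,000 hours mean time between failures and an assumed 8-hour mean time to repair):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair (MTTR): fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical component, not a real datasheet value:
a = availability(100_000, 8)  # ~0.99992, i.e. roughly "four nines"
```

The formula makes the dependency explicit: a more reliable component (higher MTBF) or a faster repair path (lower MTTR) both push availability up.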
Solution architects who design and build IT infrastructure should consider whether the existing system is resilient enough. They should check every level of the stack: applications, hypervisors, operating systems, hardware and the data center as a whole. Beyond that, they should determine whether resiliency extends across servers, storage and network. A wide range of systems and technologies must stay available for billions of users around the world, and all heavy-traffic websites are held to high resiliency standards to ensure consumers get what they need and want.