A single point of failure (SPOF) describes a system vulnerability in the form of a single component. If the component fails, the entire system fails. What are the different types of SPOF and how can you minimise the risk of SPOFs occurring?

What is a single point of failure?

A single point of failure (SPOF) describes a type of vulnerability within a system. A SPOF exists when the malfunction of a single component causes the failure of the entire system. Several 'failure modes' exist, which can be broadly divided into three categories:

  1. Achilles’ heel or 'weakest link in the chain': The loss of one component leads to a sudden loss of function of the entire system.
  2. Chain reaction or 'domino effect': The failure of one component causes the successive failure of other components, leading to the failure of the entire system.
  3. Bottleneck: A component acts as a limiting factor of the overall system. If the limiting component is impaired, the performance of the system is reduced to the capacity of the component.
Note

A single point of failure isn’t necessarily a technical component. One of the most frequent causes is human error.

Where do single points of failure occur most often?

SPOFs are common in complex systems whose interconnected components span multiple layers. Depending on the structure of the system, the failure of one critical component can bring down the whole system. That critical component is the single point of failure.

The complexity of a multi-layered system can make it difficult to detect SPOFs. The interactions between individual components are hard to trace, so faults and issues are hard to spot. In principle, every non-redundant component that is critical for operation should be treated as a single point of failure.
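
The rule of thumb above can be turned into a small check. The following is a minimal Python sketch, assuming the system can be described as a directed dependency graph. The topology, the node names and both function names are made up for illustration. It removes one component at a time and tests whether a request can still get from the client to the storage layer.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Return True if goal can be reached from start while skipping the removed node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(graph, start, goal):
    """List every intermediate node whose loss cuts all paths from start to goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    return sorted(
        node
        for node in nodes - {start, goal}
        if not reachable(graph, start, goal, removed=node)
    )


# Hypothetical topology: two load balancers and two app servers, but only one database.
topology = {
    "client": ["lb1", "lb2"],
    "lb1": ["app1", "app2"],
    "lb2": ["app1", "app2"],
    "app1": ["db"],
    "app2": ["db"],
    "db": ["storage"],
}

print(single_points_of_failure(topology, "client", "storage"))  # prints ['db']
```

In this made-up topology, only the database is non-redundant, so it's the only component reported. Real systems have far more hidden dependencies than such a graph captures, so treat the result as a starting point for the analysis.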

Take the human body, for example. We have only one heart and one brain; these critical organs are not designed redundantly. If one of them fails, the entire body fails. The heart and brain are SPOFs. By contrast, eyes, ears, lungs, and kidneys are duplicated. If necessary, the body compensates for the failure of one and continues operating at reduced efficiency.

In a data centre, all components critical to operation are potential SPOFs. Therefore, servers are usually equipped with redundant connections to the power grid and network. Mass storage is provided redundantly via RAID or similar technologies. The aim is to ensure the system continues to operate should a single, critical component fail.

Tip

Not sure what a server is? Check out our article that explains what a server is.

What are some classic SPOF examples?

There are many different types of single points of failure (SPOFs). After all, SPOFs don’t just affect information systems. Let’s take a look at some examples.

Death Star destroyed by single point of failure

In the popular 'Star Wars' movies, a single point of failure leads to the destruction of the dreaded 'Death Star'. A single proton torpedo fired by the protagonist hits a critical spot on the reactor. The explosion causes a catastrophic chain reaction that destroys the entire Death Star.

Suez Canal paralysed by single point of failure

In 2021, the container ship 'Ever Given' got stuck in the Suez Canal. The ship ran aground in a critical section where the canal consists of a single waterway. The blockage paralysed shipping traffic along the entire canal. The single point of failure was the non-redundant waterway.

Boeing 737 MAX crashed by SPOF

In 2018 and 2019, two crashes of the 'Boeing 737 MAX' aircraft caused the loss of hundreds of lives. The cause of the crashes was a single sensor feeding erroneous data. Based on that data, the automatic flight control system behaved incorrectly and ultimately brought down the planes. Several errors came together, but the single point of failure was the sensor.

High-availability systems taken offline by SPOF

Even systems designed for high availability aren’t fully protected from SPOFs. In recent years, major cloud services have repeatedly experienced serious failures. In most cases, the single point of failure was human. Incorrect configuration data can quickly paralyse an entire production system, even if its components are designed redundantly.

DNS as single point of failure in computing systems

Your device is connected to Wi-Fi, but the web browser isn’t working. Then the clock starts automatically adjusting the time. Sound familiar? It’s enough to make you tear your hair out, but the answer is simple:

Quote

'It’s always DNS.' – Source: https://talesofatech.com/2017/03/rule-1-its-always-dns/

The catchphrase 'It’s always DNS' sounds like a joke but is a serious description of the error potential of the Domain Name System (DNS). After all, when DNS servers don’t answer, websites and services can fail in a variety of ways. The effect is similar to having your connection to the internet cut, although packet traffic between IP addresses still works in this case.

DNS errors usually occur on the user side if the DNS servers stored in the system are not accessible. It’s therefore best practice to store two name server addresses. If the first server is unavailable, the second is used. This creates redundancy and removes the single point of failure.

Often, both DNS servers belong to the same organisation. If one of them fails, there’s a high probability that the other is also affected. To be safe, you can store the addresses of two name servers from different organisations. A popular combination is 1.1.1.1 from Cloudflare and 9.9.9.9 from Quad9 as primary and secondary DNS servers.
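
The fallback logic itself is simple. Here’s a minimal Python sketch of the idea, not of how any particular operating system implements it. The function `resolve_with_fallback` and the `lookup` callable are hypothetical. In practice, `lookup` would wrap a real DNS client, whereas a stand-in is used here so the example runs without network access.

```python
def resolve_with_fallback(hostname, resolvers, lookup):
    """Try each resolver in order and return the first successful answer.

    `lookup(hostname, resolver)` is a hypothetical callable supplied by the
    caller. It should return an IP address or raise OSError on failure.
    """
    errors = []
    for resolver in resolvers:
        try:
            return lookup(hostname, resolver)
        except OSError as exc:
            errors.append(f"{resolver}: {exc}")
    raise OSError("all resolvers failed: " + "; ".join(errors))


def fake_lookup(hostname, resolver):
    """Stand-in lookup that simulates the first resolver being unreachable."""
    if resolver == "1.1.1.1":
        raise OSError("timed out")
    return "192.0.2.10"  # documentation-range address, not a real answer


# Resolvers operated by two different organisations.
print(resolve_with_fallback("example.com", ["1.1.1.1", "9.9.9.9"], fake_lookup))
```

The request only fails if every resolver in the list fails. That’s exactly why the two addresses should not depend on the same organisation.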

Java logging library as single point of failure

At the end of 2021, a large number of Java-based web services were affected by the Log4Shell security vulnerability. The single point of failure was a Java logging library called Log4j. In the worst case, an attack led to the takeover of the entire vulnerable system.

How to avoid SPOFs?

Generally, prevention is the best strategy to avoid SPOFs. By definition, a single point of failure leads to the loss of function of the entire system. Once that happens, it’s often too late, and limiting the damage may be your only option.

That’s why preventive measures and planning for emergencies are a better strategy. You can play through credible failure scenarios and analyse risks and possible protective measures. Different types of single points of failure can be prevented by certain features in a system:

| System feature | Protects against | Description | Example |
| --- | --- | --- | --- |
| Redundancy | Achilles’ heel, bottleneck | System can continue to operate without performance degradation in the event of failure | Multiple DNS servers stored in network device |
| Diversity | Chain reaction | Lowers risk of redundant components being affected by the same failure | Linux computers not vulnerable to Windows Trojans |
| Distribution | Chain reaction, Achilles’ heel, bottleneck | Lowers risk of redundant components being affected by the same failure | Head of state doesn’t travel on the same plane as their deputy |
| Isolation | Chain reaction | Disrupts domino effect | Fuse protects power grid from overload |
| Buffer | Bottleneck | Absorbs load peaks occurring before bottlenecks | Queue in front of check-in counter at airport |
| Graceful degradation | Achilles’ heel, chain reaction | Allows continued operation of the system without catastrophic results if individual components fail | When losing one eye, vision is not entirely lost but depth perception is disrupted |

Well-prepared, critical systems are subjected to continuous monitoring to detect errors as early as possible and correct them if necessary.

Minimise single points of failure through redundancy

One recommendation to counteract SPOFs is to build redundancies. Several instances of a critical component (e.g., power supply, network connection, DNS server) are operated in parallel. If one fails, the system continues to operate without loss of performance.

Redundancy also prevents many SPOFs on the software side. One example is the popular microservice architecture as opposed to the software monolith. A system of microservices is decoupled and less complex, making it more robust against SPOFs. And since microservices are usually launched as containers, it’s easier to build redundancies.

But how exactly does redundancy protect a system? Let’s illustrate with a simple reliability estimate. It relies on the same product rule for independent probabilities as 'Lusser’s law', which applies the rule to components connected in series. Here’s a thought experiment:

Assume a system has two independent, parallel connections to a power supply. Let us further assume that the probability of each connection failing within a given period is 1 percent. Then the probability of complete failure of the power link can be calculated as the product of the probabilities:

  1. Probability of failure of a single instance:

1% = 1 / 100 = 1 / 10^2 = 0.01

  2. Probability of both instances failing in the same period:

1% * 1% = (1 / 10^2)^2 = 1 / 10^4 = 0.0001

As you can see, the probability of total failure isn’t halved when running two instances but reduced by two orders of magnitude. That’s a considerable improvement. With three instances running in parallel, the probability drops to one in a million under the same assumptions, making a failure of the entire system very unlikely.
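
The calculation generalises to any number of instances. Here’s a minimal Python sketch, assuming every instance fails independently and with the same probability. The function name is made up for illustration.

```python
def parallel_failure_probability(p_single, instances):
    """Probability that all independent, identical instances fail in the same period."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return p_single ** instances


# 1 percent failure probability per instance, as in the thought experiment.
for n in (1, 2, 3):
    print(n, f"{parallel_failure_probability(0.01, n):.6f}")
```

Each additional instance multiplies the failure probability by another factor of 0.01, which is where the two orders of magnitude per instance come from.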

Unfortunately, redundancy is no panacea. It protects a system from SPOFs only under certain assumptions. Most importantly, the probability of failure of one instance must be independent of the probability of failure of the redundant instance(s). That’s not the case where a failure is caused by an external event. If a data centre is on fire, redundant components fail together.
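
The effect of such a shared cause can be sketched with a deliberately simplified model. It assumes that a single external event (with probability `p_common`, a made-up parameter) takes out all instances at once, and that the instances otherwise fail independently. Neither the model nor the numbers come from a real system.

```python
def failure_probability_with_common_cause(p_single, instances, p_common):
    """Simplified model: a shared event fails everything, otherwise instances fail independently."""
    independent = p_single ** instances
    return p_common + (1.0 - p_common) * independent


# Assumed values: 1 percent per instance, 0.1 percent for the shared event.
for n in (1, 2, 3):
    print(n, f"{failure_probability_with_common_cause(0.01, n, 0.001):.6f}")
```

With these assumed values, going from two to three instances barely changes the result, because the shared event dominates. Extra instances only help against failures that are actually independent.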

In addition to redundancy of deployed components, distribution of certain components is critical to mitigate SPOFs. Geographic distribution of data storage and computing infrastructure protects against environmental disasters. Further, it pays to strive for some heterogeneity or diversity of critical system components. Diversity reduces the probability of redundant instances failing together.

Let’s illustrate the advantage of diversity using the example of cybersecurity. Imagine a data centre with redundant load balancers of the exact same design. A security vulnerability in one of the load balancers is also present in the redundant instances. In the worst case, an attack will paralyse all instances. By using different models, the overall system stands a better chance of continuing to operate at reduced performance.

Other strategies to minimise SPOF

Redundancy isn’t always sufficient to prevent SPOFs. And some components cannot be designed redundantly. When creating redundancy isn’t an option, other strategies come into play.

The 'defence in depth' approach is well known from cybersecurity. This involves drawing multiple, independent rings of protection around a system. These must be breached one after another to bring about system failure. As a result, the likelihood of the entire system failing because of a single component is lower.

With respect to digital systems, special programming languages with built-in fault tolerance exist. The best-known example is the 'Erlang' language developed at the end of the 1980s. Together with the associated runtime environment, the language is suitable for creating highly available, fault-tolerant systems.

The global triumph of the World Wide Web and the spread of web development presented a new challenge. Programmers were forced to develop websites that would work on a variety of devices. The basic approach used in this process is known as 'graceful degradation'. If a browser or device doesn’t support a particular technology on a page, the page is rendered as well as possible. This is a 'fail-soft' approach:

| System status | Description |
| --- | --- |
| go | System operates safely and correctly |
| fail-operational | System operates fault-tolerantly without performance degradation |
| fail-soft | System operation ensured, but performance reduced |
| fail-safe | No operation possible, system security still guaranteed |
| fail-unsafe | Unpredictable system behaviour |
| fail-badly | Predictably poor to catastrophic system behaviour |
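
To make the 'fail-soft' status more tangible, here’s a minimal Python sketch of graceful degradation. All function names and the `supports_canvas` flag are hypothetical. The code tries a rich rendering first and falls back to a plain one instead of failing outright.

```python
def render_chart(data, supports_canvas):
    """Hypothetical rich renderer that only works when the client supports canvas."""
    if not supports_canvas:
        raise RuntimeError("canvas not supported")
    return f"<canvas data-points='{len(data)}'></canvas>"


def render_table(data):
    """Plain fallback that works everywhere, at reduced comfort."""
    rows = "".join(f"<tr><td>{value}</td></tr>" for value in data)
    return f"<table>{rows}</table>"


def render(data, supports_canvas):
    """Fail-soft: prefer the rich view, degrade to the table instead of failing."""
    try:
        return "go", render_chart(data, supports_canvas)
    except RuntimeError:
        return "fail-soft", render_table(data)


print(render([1, 2, 3], supports_canvas=True)[0])   # prints go
print(render([1, 2, 3], supports_canvas=False)[0])  # prints fail-soft
```

The caller always receives usable output along with the status it was produced in, so the reduced mode can be logged or shown to the user.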

How to find a SPOF in your IT?

Don’t wait until the system fails to identify its single points of failure. You’ll want to proceed proactively as part of a risk management strategy. Analyses from reliability engineering, such as fault tree or event tree analysis, are used for this purpose. Failure Mode and Effects Analysis (FMEA) is used to identify the most critical sources of failure.

The general approach to identifying single points of failure in enterprise IT is to perform a risk assessment of the three main dimensions:

  • Hardware
  • Software/services/provider
  • Personnel

First, create a SPOF analysis checklist to show the general areas for further analysis. Then perform a detailed SPOF analysis of the individual areas, looking for the following:

  • Unmonitored devices in the network
  • Non-redundant software or hardware systems
  • Staff and providers who cannot be replaced in an emergency
  • Any data not included in backups

For each system component, the negative effect of failure is analysed. Furthermore, the approximate probability of a catastrophic failure is estimated. The results are incorporated into an overarching 'disaster recovery' plan to ensure data centre security.
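
A rough prioritisation along these lines can be sketched in a few lines of Python. The inventory, the 1-to-5 impact scale and the scoring formula (probability multiplied by impact) are assumptions for illustration, not a prescribed method.

```python
def rank_spof_candidates(components):
    """Return non-redundant components sorted by estimated risk, highest first."""
    candidates = [c for c in components if c["instances"] < 2]
    return sorted(
        candidates,
        key=lambda c: c["failure_probability"] * c["impact"],
        reverse=True,
    )


# Hypothetical inventory: the probabilities are rough estimates for the review period,
# impact ranges from 1 (minor) to 5 (catastrophic).
inventory = [
    {"name": "core switch", "instances": 1, "failure_probability": 0.02, "impact": 5},
    {"name": "web server", "instances": 3, "failure_probability": 0.05, "impact": 3},
    {"name": "backup admin", "instances": 1, "failure_probability": 0.10, "impact": 2},
    {"name": "database", "instances": 1, "failure_probability": 0.01, "impact": 5},
]

for component in rank_spof_candidates(inventory):
    score = component["failure_probability"] * component["impact"]
    print(f"{component['name']}: {score:.2f}")
```

The inventory deliberately mixes hardware, software and staff, since all three dimensions can contain SPOFs. Redundant components such as the web servers are filtered out before ranking.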

As a basic measure to avoid SPOFs, redundancy of data storage and computing power should be ensured at three levels:

  • At the server level through redundant hardware components
  • At the system level through clustering, i.e. the use of multiple servers
  • At the data centre level through geographically distributed operating sites