RapidIO Connections Newsletter - Summer 2005

Executive Director’s Perspective Transitions: RapidIO® technology forms the critical foundation for new and emerging applications in the midst of a world of change.
Design It Fault-Tolerant Systems and RapidIO®
Insights What are some of the benefits to having an open standard?
Member Connections Debugging/Analyzing Serial Rapid IO High-Speed Serial Lanes by Chris Shelsky, Project Manager, Nexus Technology, Inc.

RapidIO and Linux by Matt Porter of MontaVista Software, Inc.
RapidIO Product News New RapidIO-based solutions continue to debut in multiple market segments from silicon to board-level products.
The RapidIO Trade Association at Work Munich and Boston Developer Summits, Freescale Technology Forum, Supercomm
In the Spotlight The RapidIO Trade Association and standard continue to be sought after news in the industry.
Where to Network Visit with RapidIO Trade Association members, learn about products and see live demonstrations.
RapidIO Hall of Fame "Did You Know?" email campaign a success!
Changes Some parting thoughts from Dan Bouvier as he steps down from the RapidIO Steering Committee.
RapidIO Reflections: Significant interest in RapidIO technology from a wide cross section of customers

Design It:

Fault-Tolerant Systems and RapidIO®

At one time fault tolerance was considered the domain of the lunatic fringe: telecom central office applications, the space shuttle, and so on. Today, the requirement for high availability is mainstream. It’s easy to see why. When one adds up all of the network elements that a data packet traverses from source to destination, the count commonly reaches 20 or more hardware network elements. While the trend is toward a flatter network, significant hierarchy still exists. The problem with this hierarchy is that it adversely affects reliability. Let’s look at a simple example:

  • Our example network element has an availability probability of 0.99. A network made up of two of these elements has an availability of 0.99² = 0.9801. A network made up of 20 of these elements has an availability of 0.99²⁰ ≈ 0.82, which most users would consider blatantly unacceptable.
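The arithmetic in this example can be reproduced with a minimal sketch. It assumes the elements fail independently and that the network is up only when every element in the path is up; the function name is illustrative and not taken from the article.

```python
def series_availability(element_availability: float, count: int) -> float:
    """Availability of `count` independent elements that must all be up."""
    return element_availability ** count


# The article's example: elements with 0.99 availability chained in series.
for n in (1, 2, 20):
    print(f"{n:2d} elements: {series_availability(0.99, n):.4f}")
```

Raising the per-element figure is the only lever in this model, which is why the per-element target ends up so demanding.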

This fact means that each individual network element must achieve sufficiently high availability so that the product of the availabilities yields an acceptable result at the network level. The generally accepted target has become ‘five nines,’ or 0.99999 availability. It is important to note that this availability is measured at the application level. This level of availability equates to about 5 min of outage downtime per system per year. Let’s examine a typical analysis of fault tolerant network element failures as shown in Figure 1 below. The systems analyzed in the figure achieve five nines of availability.
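The conversion from an availability figure to yearly downtime can be sketched as follows. It assumes a 365-day year and treats availability as the fraction of time the system is in service; the function and constant names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of outage per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = 0.99999
print(f"One element at five nines: {downtime_minutes_per_year(five_nines):.2f} min/year")

# Twenty such elements in series, assuming independent failures.
path = five_nines ** 20
print(f"20-element path: availability {path:.5f}, "
      f"{downtime_minutes_per_year(path):.0f} min/year")
```

Under these assumptions a single five-nines element is down for a little over 5 min per year, and a 20-element path built from such elements still stays above 0.9998.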


Figure 1: Fault proportions for software, hardware and procedures

The data for this article was obtained from the US Federal Communications Commission (FCC) ARMIS database by the author. The FCC is an independent U.S. government agency that regulates interstate and international communications by radio, television, wire, satellite, and cable. Access to its ARMIS database is publicly available on the Internet (www.fcc.gov). The raw data used is available in the database, although there are no public reports that match the information presented directly here. The analysis was performed independently by the author who consolidated a range of widely distributed data from the database for the purpose of illustration. It is possible, however, to commission private reports from the FCC that detail data for specific vendors.

The largest proportion of failures is software, for which redundancy is difficult and costly to design. The second largest element is procedural errors. This class of failure is the result of maintenance people making a wrong decision and causing a network outage as a result. The smallest portion of the failure budget is indeed the hardware. Clearly, in order to achieve five nines overall, the hardware needs to be at least an order of magnitude better.

The impact of these outages in terms of outage downtime is affected by the time to repair a given outage. Catastrophic software failures can often be recovered by a system restart. This results in an outage of the order of 2-3 min. However, an outage that requires human intervention to repair the problem can have a significantly longer mean time to repair. Therefore the frequency of hardware outages must be reduced even further to lessen the impact of the statistically longer mean time to repair.
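The relationship between repair time and tolerable outage frequency can be illustrated with a small sketch. The numbers are assumptions chosen for illustration only: a 1 min yearly downtime budget per failure class, a 2.5 min restart (the midpoint of the 2-3 min range above), and a hypothetical 4 hour repair when a person has to intervene.

```python
def allowed_outages_per_year(budget_minutes: float, mttr_minutes: float) -> float:
    """Outages per year that fit in a downtime budget at a given mean time to repair."""
    return budget_minutes / mttr_minutes


BUDGET_MIN = 1.0            # hypothetical yearly budget for one failure class
RESTART_MTTR_MIN = 2.5      # automatic restart, midpoint of the 2-3 min range
MANUAL_MTTR_MIN = 4 * 60.0  # hypothetical repair needing human intervention

for label, mttr in (("restart", RESTART_MTTR_MIN), ("manual repair", MANUAL_MTTR_MIN)):
    rate = allowed_outages_per_year(BUDGET_MIN, mttr)
    print(f"{label}: {rate:.4f} outages/year, one every {1 / rate:.1f} years")
```

With the same budget, the failure class with the longer repair time has to occur roughly a hundred times less often under these assumed values.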

The outage budget can be broken down further into planned and unplanned outages. Some systems are capable of in-service upgrades, but this is rare; it is analogous to replacing the engine on an aircraft while it is in flight. If we allocate 3 min of outage downtime to planned outages, then the remaining 2 min can be absorbed by unplanned outage downtime per system per year. Because hardware is the smallest portion of the failure budget, its share is only a fraction of those 2 min: on the order of 20 s per system per year, which equates to ‘seven nines’ or 99.99994% uptime. This is extremely challenging to achieve. The system designers who deliver this level of product know exactly how hard this is.
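As a cross-check on the figures in this budget, the sketch below converts between yearly downtime and uptime percentage. It assumes a 365-day year, and the function names are illustrative. Under that assumption, 2 min of downtime corresponds to about 99.99962% uptime, while 99.99994% uptime corresponds to roughly 19 s per year, so the quoted percentage matches a budget of tens of seconds rather than the full 2 min.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds, ignoring leap years


def uptime_percent(downtime_seconds: float) -> float:
    """Uptime percentage for a given number of outage seconds per year."""
    return 100.0 * (1.0 - downtime_seconds / SECONDS_PER_YEAR)


def downtime_seconds(uptime_pct: float) -> float:
    """Outage seconds per year implied by an uptime percentage."""
    return (1.0 - uptime_pct / 100.0) * SECONDS_PER_YEAR


total_budget_s = 5 * 60    # five nines, rounded to 5 min as in the article
planned_s = 3 * 60         # allocation to planned outages
unplanned_s = total_budget_s - planned_s

print(f"Unplanned budget: {unplanned_s} s -> {uptime_percent(unplanned_s):.5f}% uptime")
print(f"99.99994% uptime -> {downtime_seconds(99.99994):.1f} s of downtime per year")
```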

Design engineers have traditionally relied on proprietary technology to achieve this kind of performance. However, as companies look to rely increasingly on off-the-shelf technology, industry standards need to incorporate these features. Nowhere is this more important than in the arena of system interconnect. The interconnect architecture can become the weakest link, and also forms the backbone of the system’s fault tolerance architecture.

The need for high availability has expanded to include storage systems, servers, network routers and switches.

RapidIO® was developed from its inception with high-availability applications in mind. This philosophy permeates the architecture and the specifications. However, there are many legitimate architectures for achieving fault tolerance. The different architectures trade off the level of hardware redundancy, system cost, complexity and availability. So the interconnect architecture should not assume any one fault tolerance architecture. It should provide the necessary hooks so that any of these fault tolerance architectures can be supported.

Excerpted from RapidIO: The Embedded System Interconnect by Sam Fuller, John Wiley & Sons, Ltd., 2005, Chapter 13, by Victor Menasce of AMCC.