At Heathrow Airport outside London, more than 600 flights were disrupted or cancelled, and 42,000 pieces of luggage were temporarily lost. In Washington, D.C., a computer operated by the National Security Agency was offline for three days. In Panama, two dozen patients died after accidentally receiving an overdose of gamma radiation during cancer treatment. Ariane 5, a $7 billion rocket built by the European Space Agency to carry satellites into orbit, exploded less than a minute into its maiden voyage.
What do all of these events have in common? Software bugs and crashes.
With 190 million lines of code, Cisco IOS XE, like any other large software stack, can never be crash-proof. But the software engineering team within Cisco Enterprise Networking has developed techniques to dramatically limit the impact of software crashes. Those techniques, written into IOS XE code, add tremendous resilience to every Cisco enterprise networking device.
When Cisco IOS was first developed, it was a monolithic operating system. Any fault in any module, including upgrades to different versions, could cause the software to crash. It could then take minutes, hours, or even longer to restart Cisco routers and switches.
Moving from IOS to Cisco IOS XE, Cisco developers strived to make sure that the user experience was the same while adding techniques to improve the fault isolation of processes running within the system. As a complete networking software stack running on a Linux kernel, IOS XE was designed with separate fault domains so that a fault in one part of the system did not take the rest of the system down. This is demonstrated in systems with separate line cards and forwarding engines such as the Cisco ASR 1000 Series Aggregation Services Routers and the Cisco 8000 Series Customer Edge Routers. The line cards, route processors, and forwarding processors can be reloaded and upgraded independently without an entire system reload. Today, if a Cisco product running IOS XE suffers a crash, the system does not go down because the faults are isolated to specific domains.
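To make the idea of separate fault domains concrete, here is a minimal sketch in Python. It is not IOS XE code: the component names and the `run_all` and `reload_component` helpers are hypothetical, and ordinary operating system processes stand in for fault domains. It only illustrates the general principle that a crash in one isolated process leaves the others running and that the failed part can be restarted on its own.

```python
import subprocess
import sys

# Hypothetical components. Each one runs as its own OS process,
# which serves as its fault domain in this sketch.
COMPONENTS = {
    "line_card": "print('line_card running')",
    "forwarding_engine": "raise RuntimeError('simulated fault')",
    "route_processor": "print('route_processor running')",
}


def run_component(code):
    """Run one component in a separate Python process; return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    return result.returncode


def run_all(components):
    """Start every component. A crash in one does not affect the others."""
    status = {}
    for name, code in components.items():
        status[name] = "up" if run_component(code) == 0 else "crashed"
    return status


def reload_component(name, new_code):
    """Restart a single component without touching the rest of the system."""
    return "up" if run_component(new_code) == 0 else "crashed"


if __name__ == "__main__":
    status = run_all(COMPONENTS)
    print(status)
    # Reload only the failed component, here with corrected code.
    for name, state in status.items():
        if state == "crashed":
            status[name] = reload_component(name, f"print('{name} running')")
    print(status)
```

In this sketch the simulated fault in `forwarding_engine` is confined to its own process, so the other two components still report `up`, and only the failed component is reloaded.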
In the latest version of IOS XE, software resiliency is being increased by shrinking each fault domain to a single process. This is achieved by creating a process runtime architecture that uses three software techniques: work units, transactions, and persistence.
With IOS XE, in the event of a crash or a version upgrade, processes continue operating as if the restart didn't occur. One of the key foundations is that all processes in the system are designed to operate on discrete and independent work units. Crashes