Animats 4 days ago

Tandem was interesting. They had a lot of good ideas, many of them unusual today.

* Databases reside on raw disks. There is no file system underneath the databases. If you want a flat file, it has to be in the database. Why? Because databases can be built with good reliability properties and made distributed and redundant.

* Processes can be moved from one machine to another. Much like the Xen hypervisor, which was a high point in that sort of thing.

* Hardware must have built-in fault detection. Everything had ECC, parity, or duplication. It's OK to fail, but not to make mistakes. IBM mainframes still have this, but few microprocessors do, even though the necessary transistors would not be a high cost today. (It's still hard to get ECC RAM on the desktop, even.)

* Most things are transactions. All persistent state is in the database. Think REST with CGI programs, but more efficient. That's what makes this work. A transaction either runs to successful completion, or fails and has no lasting effect. Database transactions roll back on failures.
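
A minimal sketch of that transaction discipline, with Python's sqlite3 standing in for the database (the schema and names are illustrative, not Tandem's API):

    import sqlite3

    def transfer(db: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
        # All persistent state lives in the database. "with db" opens a
        # transaction that commits on success and rolls back on any
        # exception, so the transfer either completes fully or has no
        # lasting effect.
        with db:
            db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                       (amount, src))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                       (amount, dst))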

The Tandem concept lived on through several changes of ownership and hardware. Unfortunately, it ended up at HP in the Itanium era, where it seems to have died off.

It's a good architecture. The back ends of banks still look much like that, because that's where the money is. But not many programmers think that way.

  • sillywalk 4 days ago

    > Databases reside on raw disks. There is no file system underneath the databases.

    The terminology of "filesystem" here is confusing. The original database system was/is called Enscribe, and was/is similar to VMS Record Management Services - it had various structured file types, in addition to unstructured unix/dos/windows stream-of-bytes "flat" files. Around 1987 Tandem added NonStop SQL files. They're all accessed through a path of the form Volume.SubVolume.Filename, but depending on the file type, there are different things you can do with them.

    > If you want a flat file, it has to be in the database.

    You could create unstructured files as well.

    > Processes can be moved from one machine to another

    Critical system processes are process-pairs, where a Primary process does the work, but sends checkpoint messages to a Backup process on another processor. If the Primary process fails, the Backup process transparently takes over and becomes the Primary. Any messages to the process-pair are automatically re-routed.
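
    A rough sketch of that checkpointing pattern (illustrative Python with made-up message framing; not Tandem's actual protocol):

        import pickle
        import socket

        def primary_loop(requests, backup: socket.socket, state: dict) -> None:
            # The Primary does the work, but sends the resulting state to the
            # Backup on another processor before acknowledging each request.
            for req in requests:
                key, value = req                    # hypothetical request shape
                state[key] = value                  # the actual work
                blob = pickle.dumps(state)
                backup.sendall(len(blob).to_bytes(4, "big") + blob)  # checkpoint
                # ...reply to the client only once the Backup has the state...

        def backup_loop(conn: socket.socket, state: dict) -> None:
            # The Backup applies checkpoints until the Primary disappears,
            # then transparently takes over as the new Primary.
            while True:
                header = conn.recv(4)
                if not header:                      # connection gone: Primary failed
                    break
                blob = conn.recv(int.from_bytes(header, "big"))  # sketch: assumes one full read
                state.clear()
                state.update(pickle.loads(blob))
            # ...promote: messages to the pair are re-routed here...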

    > Unfortunately, it ended up at HP in the Itanium era, where it seems to have died off.

    It did get ported to Xeon processors around 10 years ago, and is still around. Unlike OpenVMS, HPE still works on it, though I don't think there is even a link to it on the HPE website*. It still runs on (standard?) HPE x86 servers connected to HPE servers running Linux to provide storage/networking/etc. Apparently it's also supported running under VMware of some kind.

    * Something something GreenLake?

    • Animats 4 days ago

      > Critical system processes are process-pairs, where a Primary process does the work, but sends checkpoint messages to a Backup process on another processor. If the Primary process fails, the Backup process transparently takes over and becomes the Primary. Any messages to the process-pair are automatically re-routed.

      Right. Process migration was possible, but you're correct that it didn't work like Xen.

      > It still runs on (standard?) HPE x86 servers connected to HPE servers running Linux to provide storage/networking/etc.

      HPE is apparently still selling some of this gear. But it looks like all that stuff transitions to "mature support" at the end of 2025.[1] "Standard support for Integrity servers will end December 31, 2025. Beyond Standard support, HPE Services may provide HPE Mature Hardware Onsite Support, Service dependent on HW spares availability." The end is near.

      [1] https://www.hpe.com/psnow/doc/4aa3-9071enw?jumpid=in_hpesite...

      • sillywalk 4 days ago

        It looks like that Mature Support stuff is all for Integrity (i.e. Itanium) servers. As long as HPE still makes x86 servers for Linux/Windows, I assume NonStop can tag along.

        • Animats 4 days ago

          Right, that's just the Itanium machines. I'm not current on HP buzzwords.

          The HP NonStop systems, Xeon versions, are here.[1] The not-very-informative white paper is here.[2] Not much about how they do it, especially since they talk about running "modern" software like Java and Apache.

          [1] https://www.hpe.com/us/en/compute/nonstop-servers.html

          [2] https://www.hpe.com/psnow/doc/4aa6-5326enw?jumpid=in_pdfview...

          • lazide 3 days ago

            As a side point - that is some amazing lock-in.

            • MichaelZuo 3 days ago

              They were pretty much the only game in town, other than IBM and smaller mainframe vendors, if you wanted actual written, binding guarantees of performance with penalty clauses (e.g. real consequences for system failure, such as being credited back X million dollars after Y failure).

              At least that's what I heard pre-HP acquisition. So it's not 'amazing lock-in', just that, if you didn't want a mainframe and needed such guarantees, there was literally no other choice.

              • lazide 3 days ago

                Notably, that is amazing lock-in. What else would it look like?

                • MichaelZuo 3 days ago

                  Well, if price/performance alone is enough to qualify… viz. IBM, then the moment another mainframe vendor decided to undercut them by, say, 20%, the lock-in would evaporate. Of course no mainframe vendor would likely do so, but the latent possibility is always there.

                  Facebook is an example of 'amazing lock-in', where it's not theoretically possible for any potential competitor to just negate it with the stroke of a pen.

                  • lazide 2 days ago

                    The reason they are locked in is because they are the only game in town for this use case, done this way. That’s why I’m saying it, yeah?

                    It isn’t a price point thing.

                    • MichaelZuo 2 days ago

                      In that sense IBM offers a better ‘game’ in every way, but at 10x the price point… because they are playing a different, more advanced, ‘game’ that so happens to include Tandem’s ‘game’ as a subset.

                      • lazide a day ago

                        That just means they’re locking in a different segment.

                        Do you think folks locked into MS Access are the same people locked into Oracle databases?

  • adastra22 3 days ago

    > Unfortunately, it ended up at HP in the Itanium era, where it seems to have died off.

    My dad continues to maintain NonStop systems under the umbrella of DXC. (Which is a spinoff of HP? Or something? Idk the details.) He worked at Tandem back in the day, and has stayed with it ever since. I think he'd love to retire, but he never ends up as part of the layoffs that get sweet severance packages, because he's literally irreplaceable.

    The whole stack got moved to run on top of Linux, IIRC, with all these features being emulated. It still exists though, for the handful of customers that use it.

    • Sylamore 3 days ago

      Kinda the other way around: the NonStop kernel can present a Guardian personality or an OSS (Open System Services) Unix-like personality. The OSS layer basically runs on top of the NSK/Guardian native layer but allows you to compile most Linux software.

      • adastra22 3 days ago

        No, I meant the other way around. I don’t know to what degree it ever got released, but he spent years getting it to work on “commodity” mainframe hardware running Linux, as HP wanted to get out of the business of maintaining special equipment and OS just for this customer.

  • kev009 4 days ago

    Yes, IBM mainframes employ analogous concepts to all of this, which may be one of many reasons they haven't disappeared. A lot of it was built up over time, whereas Tandem started from the HA specification, so the concepts and marketing are clearer.

    Stratus was another interesting HA vendor, particularly the earlier VOS systems as their modern systems are a bit more pedestrian. http://www.teamfoster.com/stratus-computer

    • sillywalk 4 days ago

      I present to you "Commercial Fault Tolerance: A Tale of Two Systems" [2004][0] - a paper comparing the similarities and differences in approach to reliability/availability/integrity between Tandem NonStop and IBM mainframe systems,

      and the book "Reliable Computer Systems - Design and Evaluation"[1] which has general info on reliability, and specific looks at IBM Mainframe, Tandem, and Stratus, plus AT&T switches and spaceflight computers.

      [0] https://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers...

      [1] https://archive.org/download/reliablecomputer00siew/reliable...

    • mech422 4 days ago

      Yeah - Stratus rocked :-) The 'big battle' used to be between NonStop's more 'software-based' fault tolerance vs. Stratus's fully hardware-level high availability. I used to love demoing our Stratus systems to clients and letting them pull boards while the machine was running... Just don't pull 2 next to each other :-)

      Also, I think Stratus was the first (only?) computer IBM re-badged at the time - IBM sold Stratus machines as the System/88, IIRC

  • spockz 4 days ago

    Not to take away from your main point: the only reason it is hard to get ECC in a desktop is that it's used for customer segmentation, not because it is technically hard or because it would drive up the actual cost of the hardware.

    • sitkack 4 days ago

      ECC should be mandatory in consumer CPUs and memory. In the future, its absence will be seen like cars with fins and no seatbelts.

      • Animats 4 days ago

        I have a desktop where the CPU, OS and motherboard all support it. But ECC memory was hard to find. Memory with useless LEDs, though, is easily available.

        • spockz 3 days ago

          That is because it doesn't make sense to produce a product that cannot be used at all. It just doesn't work in consumer boards due to lack of support in consumer CPUs. Again, due to artificial customer segmentation.

          • c0balt 3 days ago

            Most Ryzen CPUs have supported some ECC RAM for multiple years now. The HEDT platforms, like Threadripper, did too. It just hasn't really been advertised much because most consumers don't appear to be willing to pay the higher cost.

      • PhilipRoman 3 days ago

        Ok, I'll bite - what tangible benefit would ECC give to the average consumer? I'd wager in the real world 1000x more data loss/corruption happens due to HDD/SSD failure with no backups.

        Personally I genuinely don't care about ECC ram and I would not pay more than $10 additional price to get it.

        • adastra22 3 days ago

          Most users experience data loss due to memory errors these days. They just might not attribute it to cosmic rays. It's kinda hard to tell bit-flip data loss apart from intermittent hardware failure. It can be just as catastrophic though, if the flip hits a critical bit of information and ends up corrupting the disk entirely.

        • immibis 3 days ago

          My Threadripper 7000 system with ECC DDR5 and MCE logging reports a corrected bit error every few hours, but I've got no idea if that's normal. I assume it was a tradeoff for memory density.

          • MichaelZuo 3 days ago

            This. Memory densities are so high nowadays that it's almost guaranteed a new computer bought in 2024 will hard fault with actual consequences (crashing, corrupted data, etc.) at least once a year due to lack of ECC.

  • mannyv 3 days ago

    Oracle has had raw disk support for a long time. I'm pretty sure it's the last 'mainstream' database that does.

redbluff 3 days ago

As someone who has worked on NonStops for 35 years (and still counting!), it's nice to see them get a mention on here. I even have two at home: a K2000 (MIPS) machine from the '90s and an Itanium server from the mid '10s. I am pretty sure the suburb's lights dim when I fire them up :).

It's an interesting machine architecture to work on, especially the "Guardian 90" personality, and it's quite amazing that you can run late-'70s programs, written for a CPU built from TTL logic, without recompilation on a MIPS, Itanium or x86 CPU; not all of them, mind you, and not if they were natively compiled. The note on Stratus was quite interesting; for a long time the only real direct competitor NonStop had was Stratus. The other thing that makes these systems interesting is that they have a Unix-like personality called "OSS" that allows you to run quite a few POSIX-style Unix programs.

My favourite NonStop story: in the big LA earthquake (89?) a friend of mine was working at a POS processor. When they returned to the building, the Tandem machine was lying on its side, unplugged and still operating (these machines had their own battery backup). They righted it, plugged everything back in, and the machine continued operating as though nothing had happened. The fact that pretty much all the network comms were down made this kind of a moot point, but it was fascinating nonetheless. Pulling a CPU board, network board, disc controller or disc - all doable with no impact to transaction flow. The discs themselves were both mirrored and shadowed, which back in the day made these systems very expensive.

macintux 4 days ago

10 years ago I used Jim Gray's piece about Tandem fault tolerance in a talk about Erlang at Midwest.io (RIP, was a great conference).

https://youtu.be/E18shi1qIHU

Because it's a small world, a former Tandem employee was attending the talk. Unfortunately it's been long enough that I don't remember much of our conversation, but it was impressive to hear how they moved a computer between data centers; IIRC, they simply turned it off, and when they powered it back on, the CPU resumed precisely where it had been executing before.

(I have no idea how they handled the system clock.)

Jim Gray's paper:

https://jimgray.azurewebsites.net/papers/TandemTR86.2_FaultT...

sillywalk 4 days ago

I'm still hoping to find a more detailed article about modern x86-64 NonStop, complete with Mackie Diagrams.

The last one I can find is for the NonStop Advanced Architecture (on Itanium), with ServerNet. I gather that this was replaced with the NonStop Multicore Architecture (also on Itanium), with InfiniBand, and I assume the x86-64 version is basically the same architecture, but in pseudo big-endian.

  • hi-v-rocknroll 3 days ago

    A hypervisor (software) approach is one way to accomplish this far more cheaply, configurably, and reusably than relying on dedicated hardware. VMware's x86_64 fault-tolerance feature runs 2 VMs on different hosts using the lockstep method. If either fails, the hypervisor moves the (V)IP over to the survivor with ARP and spawns another VM to replace it. More often than not, it's a way to run a critical machine that cannot accept any downtime and cannot otherwise be (re)engineered in a conventional HA manner with other building blocks. In general, one should avoid doing this and instead prefer always-consistent quorum two-phase-commit transactions at the cost of availability or throughput, or eventual consistency through gossip updates at the cost of inconsistency and potential data loss.
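
    A minimal sketch of the two-phase-commit pattern mentioned above (illustrative Python; toy in-process participants stand in for real networked resource managers):

        class Participant:
            """Toy resource manager: stages changes, then commits or aborts."""
            def __init__(self):
                self.committed, self.staged = {}, {}

            def prepare(self, txn: dict) -> bool:
                # Phase 1 vote; a real participant would write a durable
                # log record and take locks before voting yes.
                self.staged = dict(txn)
                return True

            def commit(self) -> None:
                # Phase 2: make the staged changes permanent.
                self.committed.update(self.staged)
                self.staged = {}

            def abort(self) -> None:
                # Phase 2: drop staged changes; no lasting effect.
                self.staged = {}

        def two_phase_commit(txn: dict, participants: list) -> bool:
            # Commit only on a unanimous yes vote; otherwise abort everywhere.
            # This trades availability (one "no" or timeout blocks the txn)
            # for consistency, as described above.
            if all(p.prepare(txn) for p in participants):
                for p in participants:
                    p.commit()
                return True
            for p in participants:
                p.abort()
            return False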

  • adastra22 3 days ago

    What do you want to know?

    • sillywalk 3 days ago

      What has changed since Itanium? What counts as a logical NonStop CPU now? As I (mis?)understand it, under Itanium a physical server blade was called a slice. It had multiple CPU sockets (called Processing Elements), and memory on it was partitioned with MMU mapping and Itanium security keys so each Processing Element could only access a portion of it. All IO on a Processing Element went out over ServerNet (or InfiniBand) to a pair of Logical Sync Units, and was checked/compared against IO from another Processing Element running the same code on a different physical server blade. The 2 (or 3) Processing Elements combined to form a single logical CPU. I wonder if this is still the case? I believe there was a follow-on (I assume when Itanium went multi-core) called the NonStop Multicore Architecture, but I haven't found a paper on it.
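
      If that description still holds, the Logical Sync Units are essentially majority-voting over replicated IO. A toy sketch of that comparison step (my reading of the architecture as described above, not HPE's implementation):

          from collections import Counter

          def compare_io(io_outputs: list[bytes]) -> bytes:
              # Each replica of a logical CPU proposes what should be the same
              # IO operation. Issue it only if a majority agree; with 2 replicas
              # that means both, with 3 it out-votes a single faulty element.
              value, n = Counter(io_outputs).most_common(1)[0]
              if n > len(io_outputs) // 2:
                  return value
              raise RuntimeError("replicas disagree - halt this logical CPU")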

      Also, I'm curious how the Disk Process fits in with Storage Clustered IO Modules (CLIMs). Does a CLIM just act as a raw disk, with the Disk Process talking to it the way it would talk to a locally attached disk? Or is there more integration with the CLIM - like a portion of the Disk Process being ported to Linux, or Enscribe being ported to run on the CLIMs?

      The same question applies to how the networking CLIMs fit in.

lostemptations5 3 days ago

So if Tandem is so out of favour these days, what do people and organizations use instead? AWS availability zones, etc.?

vivzkestrel 3 days ago

Completely unrelated to the topic, but I wanted to point it out: there is an accessibility issue with this page. The up and down arrow keys do not scroll the page on Firefox 131.0.2 on an M1 Mac.

hi-v-rocknroll 3 days ago

Stanford's Forsythe DC had a Tandem mainframe just inside the main floor area. It was a short beast standing on its own about 1.5m / 4' tall, and not in a 19" rack.