Latest revision as of 14:40, 29 July 2020
This document is from the WELL Gopher Server, gopher.well.sf.ca.us
For information about the WELL gopher, send e-mail to [email protected]
Feb 1994
---------------------------------------------------------------------------
This document is the text and Notes for Chapter One of _Digital Woes: Why We Should Not Depend on Software_, by Lauren Ruth Wiener, Addison-Wesley 1993, ISBN # 0-201-62609-8.

From DIGITAL WOES: Why We Should Not Depend on Software
Copyright 1993 BY LAUREN RUTH WIENER <[email protected]>
Book published by Addison-Wesley 1993, ISBN # 0-201-62609-8

Chapter 1: Attack of the Killer Software (abbreviated)

I'm sitting in an airplane, looking out the window at the tops of fluffy white clouds. Once upon a time, this sight was not vouchsafed to humans. It comes to me courtesy of the commercial airline industry, one of the twentieth century's impressive achievements. Computers and software have contributed a lot toward this experience: In the cockpit, the pilot is using more software right now than I use in a year. Software helps determine our position, speed, route, and altitude; keeps the plane in balance as fuel is consumed; interprets sensor readings and displays their values for the pilot; manages certain aspects of the pilot's communications; translates some of the pilot's gestures on the controls into movements of the wing and tail surfaces; raises or lowers the landing gear. The pilot is following a route, a path through the three-dimensional air space that blankets North America eight miles thick. A lot of other airplanes are buzzing around up here with us, and a collision would be calamitous; this airplane alone has four hundred passengers. The air traffic controllers depend on software to assign our path through airspace. The transponder on our airplane broadcasts its identification, and near an airport its signal is translated into a little tag on a radar screen that includes our ID and altitude.
Altitude appears as a number because the screen is two-dimensional and cannot reflect altitude directly. The software has to move the tag around on the screen in ways that reflect the airplane's movements through the air in two dimensions, but the air traffic controllers have to reconstruct three-dimensional reality in their heads, quickly and coolly, using the altitude numbers. The system involves a lot of pieces--transponders, radar, radar screens, air traffic controllers, a computer--separated widely in space. Some of them are moving all the time. The action on the radar screen must keep up with the action in the air. This is a complex problem. Knowing this, I am grateful for our safe progress through the sky today, as I enjoy the sunny cloud tops. The airplane is holding up pretty well, too. Computers and software were used extensively to design it, and to design the process by which it was manufactured. Figuring out how to make something like a jet is an underappreciated problem. You have to design a machine that can fly, and also one that can be manufactured and maintained. Sky and runway, the world is a rough place, and hundreds of thousands of parts may need replacing. You also have to design the process that produces those parts, and that will get them where they are needed. An enterprise such as Boeing's is an enormous consumer of software, computers, and programmers. Then we have amenities. Meals, including the kosher one for seat 3B and the vegetarian ones for 12A and 22E and F, just like it said in the database. For each of us, our own personal copy of Wings & Things, the in-flight magazine. Bland and predictable it may be, but it took quite a system to get it there, and software again played an extensive role. Articles had to be commissioned and written; the faxes flew. It was laid out nicely on a Macintosh screen, using an expensive page-layout software product.
The underpaid talent who performed this task is now listening to a compact disc through wireless headphones while making the daily backup diskette. None of us passengers would even be here, of course, without the airline reservation system. The network of computers linking travel agents with airlines is a Byzantine example of economic cooperation and competition in uneasy truce. The economic tension between travel agents seeking good deals and airlines seeking to maximize profits has led to amazing software wars. Anyway, it got us here, and I paid $379, and the guy over there pecking at his PowerBook paid $723. The two women in front of me are taking a trip that includes this Saturday night, so they paid $119 each. Software wars make for Byzantine price-setting mechanisms, it seems. To make our reservations, we all used the phone. You lift the receiver off the hook and get a dial tone. Press eleven buttons, and one phone rings out of 151 million. Just the person you want to speak to is on the other end (or her answering machine or voice mail, but let's not get into that). An amazingly complex, richly connected net of switches opens and closes just for you, and it is quick about it--another impressive achievement of the century, whose most recent frill appears on the seatback in front of me, the AirFone. Using the infrastructure of the cellular phone system, you can now send and receive phone calls on an airplane. No computer invented the cute spelling of AirFone, but the embedded capital letter is brought to you by computer programmers, anyway. Sometimes in a program, a programmer wants to name something--a variable, say--to suggest what it's being used for. For example, a commodities tracking program might have a variable called "the price of eggs in China." Computers want these things to be typed without spaces in them; compilers are fussy that way and must be catered to.
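As a small illustration of that constraint, here is a sketch in Python of a few ways a multi-word phrase can be squeezed into a single, space-free name. The function names are invented for this example, and nothing here is specific to any particular compiler.

```python
import re


def split_words(phrase: str) -> list[str]:
    """Break a phrase into its words, ignoring spaces and punctuation."""
    return re.findall(r"[A-Za-z0-9]+", phrase)


def smear(phrase: str) -> str:
    """Just delete the spaces: legal, but hard on the eyes."""
    return "".join(split_words(phrase))


def with_underscores(phrase: str) -> str:
    """Join the words with underscores."""
    return "_".join(split_words(phrase))


def first_letters(phrase: str) -> str:
    """Keep only the first letter of each word, lowercased."""
    return "".join(word[0].lower() for word in split_words(phrase))


def embedded_capitals(phrase: str) -> str:
    """Capitalize the first letter of every word after the first."""
    words = split_words(phrase)
    return words[0] + "".join(w[0].upper() + w[1:] for w in words[1:])


name = "the price of eggs in China"
print(smear(name))              # thepriceofeggsinChina
print(with_underscores(name))   # the_price_of_eggs_in_China
print(first_letters(name))      # tpoeic
print(embedded_capitals(name))  # thePriceOfEggsInChina
```

All four results are acceptable to a machine; they differ only in how kind they are to the next human reader.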
But programmers would like to perceive the individual words, not an undifferentiated smear of letters. This problem gets solved several ways, and it is a commentary on the essential humanity of programmers that it's a matter of taste and, occasionally, pseudoreligious dispute. Folks programming in C tend to use underscores, thus: the_price_of_eggs_in_China. It gets awfully long to type. To shorten it, some programmers will make mysterious secret names using rules they invent, such as the first letter of each word: tpoeic. This gets them enthusiastically loathed by anyone who has to come along later and figure out how their programs work. Another approach embeds capital letters: thePriceOfEggsInChina. Marketing types think this looks stylish and name the products that way. Software is infiltrating on all fronts. We're descending into Portland now--the view out the window goes woolly gray with the famous Portland clouds, and then I see cars whizzing below us on Marine Drive. These days, they whiz along aided by a wide variety of software: computerized braking, fuel injection, suspension, cruise control, transmission, four-wheel steering, and maybe even navigation systems use software. It astonishes me that software has so thoroughly colonized the car with practically no public discussion. Software makes it cheaper to manufacture items--cars, for example--because it eliminates many specific little pieces that must be machined to precise tolerances. But what a decision to leave to the manufacturers! Below us now I see the runway. On our behalf, a guardian angel peers into the radar screen, while another in the tower communicates with our pilot over the radio. The complex, software-intensive system has worked again--thanks, everyone. We are down. I am home. It is raining. THIRTEEN TALES OF DIGITAL WOE This book is about how things can go wrong. In the next few pages, I'll tell you thirteen stories of things going wrong. 
Sometimes the outcome is comic; sometimes it's tragic; sometimes it's both. It isn't that the people who design, build, and program computers are any less careful, competent, or conscientious than the rest of us. But they're only human, and digital technology is unforgiving of human limitations. So many details must be tracked! Even the tiniest error can have an enormous effect. Of course the stuff is tested, but the sad truth is that a properly thorough job of testing a software program could take decades . . . centuries . . . sometimes even millennia. Frankly, developing software is not the easiest way to make money. A careful job is expensive, and even the most careful process can leave that tiny, disastrous error. So even after the product is finished and for sale, developers issue constant upgrades to correct some of the mistakes they've been hearing about. Unfortunately, sometimes the upgrade is late. Failures happen all the time. The consequences of failure depend on what we were using the flaky machine for in the first place. We put computers into all kinds of systems nowadays. Here are some of the things we choose to risk:
* reputations,
* large sums of money,
* democracy,
* human lives,
* the ecosystem that sustains us all.
And yet, some of these systems don't really benefit us much. Some of them are solutions to the wrong problem. The truth is that digital technology is brittle. It tends to break under any but stable, predictable conditions--and those are just what we cannot provide. Life is frequently--emphatically--unpredictable. You can't think of everything. 1. Tiny Errors Can Have Large Effects On July 22, 1962, a program with a tiny omission in an equation cost U.S. taxpayers $18.5 million when an Atlas-Agena rocket was destroyed in error.(1) The rocket carried Mariner I, built to explore Venus. The equation was used by the computerized guidance system.
It was missing a bar: a horizontal stroke over a symbol that signified the use of a set of averaged values, instead of raw data. The missing bar led the computer to decide that the rocket was behaving erratically, although it was not. When the computer tried to correct a situation that required no correction, it caused actual erratic behavior and the rocket was blown up to save the community of Cocoa Beach. (This unhappy duty falls on the shoulders of an unsung hero called the range safety officer, and we are all glad he's there.) Mariner I, all systems functioning perfectly, surprised the denizens of the Atlantic Ocean instead of the Venusian. 2. Thorough Testing Takes Too Long Because such tiny errors can have such large effects, even the best efforts can miss something that will cause a problem. In late June and early July of 1991, a series of outages affected telephone users in Los Angeles, San Francisco, Washington, D.C., Virginia, W. Virginia, Baltimore, and Greensboro, N.C. The problems were caused by a telephone switching program written by a company called DSC Communications. Their call-routing software had several million lines of programming code. (Printed at 60 lines to a page and 500 pages to a volume, one million lines equals about 33 volumes of bedtime reading.) They tested the program for 13 weeks, and it worked. Then they made a tiny change--only three lines of the millions were altered in any way. They didn't feel they needed to go through the entire 13 weeks of testing again. They knew what that change did, and they were confident that it did nothing else.(2) And presumably, the customer wanted it now. So they sent off their several-million-line program that differed from the tested version by only three lines, and it crashed. Repeatedly. Sic transit software. 3. 
Developing Software Is Not the Easiest Way to Make Money Sometimes it's the software, and sometimes it's the process itself that is buggy -- developing software can be a nightmare even for someone who has succeeded at it before. Mitch Kapor is a gentleman whose name never appears in the software press without the words "industry veteran" in front of it. It's a fine title, and he wears it well: Mitch Kapor is the fellow who wrote the first spreadsheet for the IBM PC. He founded Lotus Development Corporation and made a fortune on Lotus 1-2-3. Then he left Lotus "because the company had gotten too big."(3) In early 1988, he started another company called On Technology and began work on an ambitious project to make personal computers easier to use. He got some venture capital, hired a bunch of bright young programmers and a former Lotus associate to oversee their work, and spent $300,000 a month for thirteen months. When it became obvious that they were years away from a product, Mr. Kapor scaled back his ambitions considerably, starting development instead on a nice little product called On Location. On Location is not a major paradigm shift in personal computing, but it is handy. It would have been hard to write this book without it; it provides me with meaningful access to over twelve megabytes of information. To show you what I mean, Figure 1-1 shows a little bit of my computer's desktop.(4) Each one of those little pictures represents a file; the text underneath is the name of the file. They all look the same, don't they? Yet each contains all kinds of different information. If I am searching for a snippet about Mitch Kapor, for example, how do I know where to look? The answer is that I use this product. It allows me to search through the whole computer for any word or words, and it will tell me which files contain those words.
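The idea of such a search can be sketched in a few lines of Python. This is only an illustration of the general technique, not a description of how On Location itself works; the function name, the file names, and the file contents below are all invented.

```python
import re


def files_containing(files: dict[str, str], query: str) -> list[str]:
    """Return the names of the files whose text contains every query word.

    `files` maps a file name to its text (a stand-in for a real disk).
    Matching is on whole words and ignores upper/lower case.
    """
    wanted = {word.lower() for word in query.split()}
    hits = []
    for name, text in files.items():
        words_in_file = set(re.findall(r"[a-z0-9']+", text.lower()))
        if wanted <= words_in_file:  # every wanted word appears in the file
            hits.append(name)
    return sorted(hits)


# Hypothetical files, made up for the example.
desktop = {
    "notes-1": "Mitch Kapor founded Lotus Development Corporation.",
    "notes-2": "The Patriot missile defense system had a timing bug.",
    "notes-3": "Kapor later started a company called On Technology.",
}

print(files_containing(desktop, "Kapor"))        # ['notes-1', 'notes-3']
print(files_containing(desktop, "Mitch Kapor"))  # ['notes-1']
```

A product that has to search a whole disk quickly would presumably build an index ahead of time rather than rereading every file for each query; the sketch trades that speed for brevity.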
Some of the files on the list still turn out to be irrelevant, but the haystack in which I search for my needle of information is now much, much smaller. This is handy, but it isn't earthshaking; it's only a modest application of computer technology. And if anyone was in a position to appreciate how long development of this product was going to take, it ought to have been Mr. Kapor. He started development in April 1989 and expected to ship the product in November. The target date for shipping was revised three times; the third time involved a full-blown management crisis, with the head of engineering storming out the door. The product finally shipped at the end of the following February, but only by throwing four more bodies at it. Several people quit. Twice they changed product direction, and twice they abandoned a feature they had planned. A seven-month schedule stretched to eleven months and involved an extra visit to the venture capitalists. Software development projects are notorious for cost overruns, missed schedules, and products that do less than originally specified. A lot of corporate ships have foundered on the rocks. Their captains can feel a bit better now, though, because they're in excellent company. 4. Even a Careful Process Can Leave a Problem Bugs are troublesome, but so is removing them. The process can leave detritus that will cause serious problems when the system is in use. On July 20, 1969, in the critical final seconds of landing the first manned spacecraft on the moon, Neil Armstrong was distracted by just such leftover detritus--two "computer alarms."(5) Apollo 11 was a software-intensive undertaking for its time, and it suffered its share of development problems. To debug the software running on the onboard computer, programmers inserted extra bits of computer code which they called "alarms" (nowadays they'd be "debugging aids") to help them determine what happened inside the computer when their programs misbehaved.
As preparations for launch approached maximum intensity, a programmer happened to mention the computer alarms to the fellow who programmed the simulations that trained the mission controllers. The alarms had never been intended to come to the attention of anyone other than the programmers, and the mission controllers had never heard of them. "We had gone through years of working out how in the world to fly that mission in excruciating detail, every kind of failure condition, and never, ever, did I even know those alarms existed," said Bill Tindall, in charge of Mission Techniques. Nevertheless, the Apollo personnel had an understandable passion for thoroughness. Even though the alarms could not reasonably be expected to occur during an actual mission, the mission controllers were promptly given simulations that included them. This turned out to be fortunate; sometimes the backup works. The onboard computer had several functions. Its primary function was to help land the lunar module on the moon, but it also helped it meet and dock with the command module in lunar orbit after leaving the moon. Obviously, it was not going to perform both functions at once, so original procedures called for flipping a switch to disable the rendezvous radar during descent. However, about a month before launch, it was decided to leave the switch in a different setting to allow the rendezvous radar to monitor the location of the command module during the descent. The programmers felt it would be safer for the crew if the rendezvous software could take over immediately, in case the landing had to be aborted for any reason. But one change to a complex, delicately balanced system leads to others. In this case, it led to too many others, too close to the launch date. When the extent of the changes became apparent, the software engineers decided to return to the original procedures. 
But the appropriate software changes had already been loaded into the lunar module's computer, and it was a ticklish job to back them out. They decided instead to withhold the radar data from the rendezvous software, figuring that therefore it wouldn't track the command module during descent. It seemed like the simplest solution. But computers don't know that no angle has both a sine and cosine of 0. As the lunar module approached the surface of the moon, the computer gamely attempted the impossible task of tracking the command module with mathematically impossible data and landing the lunar module at the same time. Both tasks proved to call for more processing than it could perform simultaneously, so it issued an alarm indicating an overflow. Moments before the historic landing, a twenty-six-year-old mission controller had to decide whether to abort the mission. It was a tough call--some alarms indicated a serious problem, others could safely be ignored, but other factors complicated the picture. The mission controller had nineteen seconds to think it over before deciding to continue. Then a new alarm occurred and he had to make the decision all over again. Meanwhile, during crucial moments in the lunar lander, the astronauts were distracted from seeing that their chosen landing site was strewn with boulders. With twenty-four seconds of fuel, they were maneuvering around rocks. They landed with no margin for error. 5. Sometimes the Upgrade Is Late On February 25, 1991, an Iraqi Scud missile killed twenty-eight American soldiers and wounded ninety-eight others in a barracks near Dhahran, Saudi Arabia. The missile might have been intercepted had it not been for a bug in the software running on the Patriot missile defense system.(6) The bug was in the software that acquired the target. Here's how the system is supposed to work: 1. The Patriot scans the sky with its five thousand radar elements until it detects a possible target.
Preliminary radar returns indicate how far away the target is. 2. Subsequent radar data indicate how fast the target is going. 3. Now the Patriot must determine its direction. The Patriot starts by scanning the whole sky, but it can scan more accurately and sensitively if it can concentrate on just a small portion, called the tracking window. Now it needs that improvement. It calculates where the Scud is likely to be next, using calculations that depend (unsurprisingly) on being ultraprecise. Then it draws the tracking window--a rectangle around the key portion of the sky--and scans for the Scud. If it sees the Scud, it has acquired the target, and can track its progress. If it does not see the Scud, it concludes that the original blip was not a legitimate target after all. It returns to scanning the sky. The equation used to draw the tracking window was generating an error of one 10-millionth of a second every tenth of a second. Over time, this error accumulated to the point where the tracking window could no longer be drawn accurately, causing real targets to be dismissed as spurious blips. When the machine was restarted, the value was reinitialized to zero. The error started out insignificant and gradually began to grow again. Original U. S. Army specifications called for a system that would shut down daily, at least, for maintenance or redeployment elsewhere. The Army originally did not plan to run the system continuously for days; it was only supposed to run for fourteen hours at a stretch, and after fourteen hours the error was still insignificant. But successful systems nearly always find unanticipated uses, and by February 25th, the Patriot missile defense installation near Dhahran had been running continuously for a hundred hours--five days. Its timing had drifted by 36-hundredths of a second, a significantly large error. The bug was noticed almost as soon as the Gulf War began.
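A back-of-the-envelope sketch in Python shows how such a small error adds up. It assumes the error is added once per clock tick and that the clock ticks ten times per second; the tick rate is an assumption, chosen because it is the rate at which the quoted error size reproduces the quoted drift of 36-hundredths of a second after a hundred hours. The names are invented for the example.

```python
from fractions import Fraction

# One 10-millionth of a second of error per tick (figure quoted in the text).
ERROR_PER_TICK = Fraction(1, 10_000_000)

# ASSUMPTION: the clock ticks ten times per second.
TICKS_PER_SECOND = 10


def drift_seconds(hours_running: int) -> Fraction:
    """Accumulated timing error after running continuously for the given hours."""
    ticks = hours_running * 3600 * TICKS_PER_SECOND
    return ticks * ERROR_PER_TICK


print(float(drift_seconds(14)))   # 0.0504  (the planned fourteen-hour stretch)
print(float(drift_seconds(100)))  # 0.36    (a hundred hours without a restart)
```

Exact fractions are used instead of floating-point numbers so that the sketch does not add rounding errors of its own to the one it is trying to illustrate.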
By February 25th, it had actually been fixed, but the programmers at Raytheon also wanted to fix other bugs deemed more critical. By the time all the bugs had been fixed--
--and a new version of the software had been copied onto tape,
--and the tape had been sent to Ft. McGuire Air Force Base,
--and then flown to Riyadh,
--and then trucked to Dhahran,
--and then loaded into the Patriot installation--
--well, by that time it was February 26th, and the dead were already dead, and the war was just about over.
6. We Risk Our Reputations In the summer of 1991, a company called National Data Retrieval of Norcross, Georgia, sent a representative to Norwich, Vermont, looking for names of people who were delinquent on their property taxes. National Data Retrieval wanted this information to sell to TRW, a large credit-reporting agency. The town clerk showed the representative the town's receipt book. Because of a misunderstanding, the representative copied down all the names in it--all the taxpayers of Norwich. Back in Georgia, the names were keypunched in and supplied to TRW, which then began to report: "delinquent on his/her taxes" in response to every single query regarding a Norwich property owner.(7) Credit information is not routinely sent to those most keenly affected, such as these maligned property owners. So the information spread from computer to computer, trickling into many tiny rivulets of the Great Data Stream. The town clerk began receiving a series of suspiciously similar inquiries, asking for confirmation of imaginary tax delinquencies. It did not take her long to trace these queries to TRW. After a mere week or so of phone calls, and only one story planted in the local newspaper, someone at TRW undertook to correct their records. Now suppose that TRW promptly and faithfully does so. The barn door swings slowly shut. In the meantime, how many computers have queried the credit status of how many Norwich residents?
Applications for loans, credit card transactions, even actions taken months previously can spark such queries. Due to one error, other computers have already received the false reports of tax delinquencies. TRW may correct its own records, but no Proclamation of Invalidity will be sent to those other computers. Probably, no one even knows where the data went. Though officially dead, the zombie information stalks the data subjects, besmirching their data shadows and planting time bombs in their lives, maybe forever. 7. We Risk Financial Disaster On Wednesday, November 20, 1985, a bug cost the Bank of New York $5 million when the software used to track government securities transactions from the Federal Reserve suddenly began to write new information on top of old.(8) The event occurred inside the memory of a computer; the effect was as if the (digital) clerk failed to skip to a new line before writing down each transaction in an endless ledger. New transaction information was lost in the digital equivalent of one big, inky blotch. The Fed debited the bank for each transaction, but the Bank of New York could not tell who owed it how much for which securities. After ninety minutes they managed to shut off the spigot of incoming transactions, by which time the Bank of New York owed the Federal Reserve $32 billion it could not collect from others. A valiant effort by all concerned got them up to a debt of only $23.6 billion by the end of the business day, whereupon a lot of people probably phoned home to say: "Honey, I won't be home for dinner tonight... Well, uh, probably really late..." Pledging all its assets as collateral, the Bank of New York borrowed $23.6 billion from the Fed overnight and paid $5 million in interest for the privilege. By Friday, the database was restored, but the bank also paid with intangibles: for a while, it lost the confidence of investors.
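The clerk-and-ledger image can be turned into a toy Python sketch. It illustrates only the analogy of failing to skip to a new line; it makes no claim about how the bank's actual software was written, and the class, its flag, and the sample transactions are all invented.

```python
class Ledger:
    """A toy ledger: a list of lines plus a marker for where to write next."""

    def __init__(self, skip_to_new_line: bool = True):
        self.lines: list[str] = []
        self.next_line = 0
        # Hypothetical switch: False reproduces the forgetful clerk.
        self.skip_to_new_line = skip_to_new_line

    def record(self, transaction: str) -> None:
        if self.next_line == len(self.lines):
            self.lines.append(transaction)            # write on a fresh line
        else:
            self.lines[self.next_line] = transaction  # write on top of old data
        if self.skip_to_new_line:
            self.next_line += 1


careful = Ledger()
forgetful = Ledger(skip_to_new_line=False)
for t in ["sold 10 to A", "sold 20 to B", "sold 30 to C"]:
    careful.record(t)
    forgetful.record(t)

print(careful.lines)    # ['sold 10 to A', 'sold 20 to B', 'sold 30 to C']
print(forgetful.lines)  # ['sold 30 to C']
```

The forgetful ledger accepts every transaction without complaint, which is part of what makes this kind of failure expensive: nothing looks wrong until someone tries to read the records back.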
Another consequence, however slight, is that an unknown number of econometric models received incorrect data for a couple of days, thereby possibly skewing whatever decisions were based on them. 8. We Risk Democracy In the spring of 1992, the Liberal Party of Nova Scotia, Canada, held a convention. They used a computerized telephone voting system to allow any convention delegate with a touch-tone phone to vote from home by dialing the telephone number for the candidate of his or her choice.(9) (Those without touch-tone phones could go to any of several locations where banks of phones were set up.) All registered Liberals received a PIN which, when entered, verified that they were entitled to vote. A thank-you message verified that their votes had been recorded. Maritime Tel & Tel, the local telephone company, persuaded the convention organizers that a backup voting system using paper ballots was unnecessary. After all, they handled hundreds of thousands of calls a day. What could go wrong? Everything. The software turned out to be too slow to handle the volume of calls, so many votes were not recorded. In the ensuing confusion, voting was suspended and resumed, then the deadline was extended--twice. Some people reported that their PINs were rejected. Others were able to vote more than once. Adding the final touch to this election-day chaos, a kid with a scanner called up the Canadian Broadcasting Corporation and announced that he had recorded a cellular telephone conversation between the telephone company and the party, giving the results so far. Representatives of the CBC, uncertain whether this was a hoax, discussed whether to air his story with an executive producer--also over a cellular telephone. When the kid called back with a recording of _that_ conversation, the CBC decided to run the story. Needless to say, this did not improve matters. A week or so later, the dust settled and the Liberal Party decided to try again. 
This time they required the telephone company to post a $350,000 performance bond. They also made available a backup system that allowed people to vote with paper ballots. The backup system turned out to be unnecessary--the second time around, voting by phone worked fine. 9. We Risk Death In the spring and summer of 1986, two cancer patients in Galveston, Texas died from radiation therapy received from the Therac-25, a computer-controlled radiation therapy machine manufactured by Atomic Energy of Canada, Ltd.(10) AECL was hardly a fly-by-night outfit--it was a crown corporation of the dominion of Canada, charged with managing nuclear energy for the nation. A machine such as the Therac-25 can deliver two different kinds of radiation: electrons or X-rays. To deliver electrons, the target area on the patient's body is irradiated directly with a beam of electrons of relatively low intensity. This works well for cancerous areas on or near the surface of the body, such as skin cancer. For cancers of internal organs, buried under healthy flesh, a shield of tungsten is placed between the patient and the beam. An electron beam one hundred times more intense bombards the tungsten, which absorbs the electrons and emits X-rays from its other side. These X-rays pass part of the way through the patient to strike the internal cancers. What you want to avoid is the hundred-times-too-strong electron beam accidentally striking the patient directly, without the intervening tungsten shield. This unsafe condition must be forestalled. But it was not--under these circumstances: The operator selected X-rays as the desired procedure, the tungsten shield interposed itself, and the software prepared to send the high-intensity electron beam. Then the operator realized that she had made a mistake: electrons, not X-rays, were the required therapy. She changed the selection to electrons and pressed the button to start treatment. 
The shield moved out of the way, but the software had not yet changed from the high- to the low-intensity beam setting before it received the signal to start. Events happened in the wrong order. Previous radiation therapy machines included mechanical interlocks--when the tungsten target moved out of the way of the beam, it physically moved some component of a circuit, opening a switch and preventing the high-intensity beam from turning on. On the Therac-25, the target sensor went from the tungsten directly to the computer. Both the target position and the beam intensity were directly and only under software control. And the software had a bug. As if in a bad science fiction movie, the Therac-25 printed "Malfunction 54" on the operator's terminal as it gave each man a painful and lethal radiation overdose. One died within a month of the mishap; the other became paralyzed, then comatose, before he finally died several months later. These were not isolated events. Of eleven Therac-25 machines in use, four displayed these or similar problems. Over a two- to three-year period, three other serious incidents involving the software occurred in clinics in Marietta, Georgia; Yakima, Washington; and Ontario, Canada. 10. We Risk the Earth's Ability to Sustain Life It's not surprising that erroneous data can cause problems, but correct data doesn't guarantee that there will be _no_ problems. As anyone can tell you who has spent time in public policy think tanks, computer models can be created to provide any answer you want. One way to do it is to specify ahead of time the answers you do not want. In the 1970s and 1980s, NASA satellites orbiting the earth observed low ozone readings. The readings were so low that the software for processing satellite results rejected them as errors.(11) Checking to determine whether a value is in an expected range is a common form of sanity check to include in a program.
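A range check of this kind takes only a few lines. The Python sketch below is a generic illustration, not NASA's processing software; the function name, the limits, and the readings are invented numbers in no particular unit.

```python
def check_range(values, low, high):
    """Split values into those inside [low, high] and those outside it.

    Out-of-range values are returned for a person to review rather than
    being thrown away.
    """
    accepted, flagged = [], []
    for value in values:
        if low <= value <= high:
            accepted.append(value)
        else:
            flagged.append(value)
    return accepted, flagged


# Invented readings and limits, purely for illustration.
readings = [310, 295, 140, 305, 125]
accepted, flagged = check_range(readings, low=200, high=500)

print(accepted)  # [310, 295, 305]
print(flagged)   # [140, 125]
```

Whether the flagged list is examined or quietly discarded is a decision made by people, not by the check itself. The sketch keeps the rejected values so that the question can at least be asked.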
Such a check would be useful in a grading program, for example: if student grades are expected to be within the range of 0.0 to 4.0, inclusive, then checking for that range can help you find places where someone made a mistake entering a grade. It's easy to incorporate a sanity check for a grade, because those limits are set by people, using terms that are straightforward and unambiguous to a computer--real numbers. People sure don't set the ozone levels--well, not directly. As it turns out, we do, in a way, but that was precisely the question NASA was investigating, and they weren't prepared to believe the answer. In 1986, a team of earthbound British scientists reported the decline in ozone levels. NASA reexamined its old data and confirmed their findings. The world may have missed a chance to get a jump on the ozone problem by a decade or more--without an independent source for data, it is risky to reject a reading because it doesn't meet your preconceptions. (On the other hand, maybe we just missed an additional decade's worth of argument.) 11. We May Not Gain Much In February 1990, an article appeared describing a seeming reversal of progress: the Washington State ferry system announced that it planned to replace the electronic control systems of the large, Issaquah-class ferries with pneumatic controls.(12) Ferries with electronic controls had rammed the dock, left before being told to do so, or unexpectedly shifted from forward to reverse. The folks in charge had had enough. Washington State Ferries is the largest ferry transportation system in the United States; thousands of people in western Washington live on the Olympic Peninsula or the beautiful islands across Puget Sound from Seattle, and take the ferries daily to and from work. Under the circumstances, Washington State responsibly decided it did not need to run a poorly controlled experiment with the latest technology. 
Older pneumatic control systems, which require a physical connection from the control cabinet to the propellors and engine governors, had been doing the job before, and they'd been more reliable.

12. We May Be Solving the Wrong Problem

In the early 1980s, General Motors embarked upon an enormous investment in automation. In 1985, it opened its showcase: the new Hamtramck factory in Detroit, Michigan, had 50 automatic guided vehicles (AGVs) to ferry parts around the plant, and 260 robots to weld and paint.(13) It turned out not to be such a hot idea.

"...Almost a year after it was opened, all this high technology was still so unreliable that the plant was producing only about half the 60 cars per hour that it was supposed to make...

"...The production lines ground to a halt for hours while technicians tried to debug software. When they did work, the robots often began dismembering each other, smashing cars, spraying paint everywhere, or even fitting the wrong equipment.... AGVs ... sometimes simply refused to move."

In his headlong rush to beat Japan to the twenty-first century, GM chairman Roger Smith failed to notice that GM's biggest problems lay not with its production processes, but with the way it treated its employees. A thoughtful look at his Japanese competitor revealed that the training, management, and motivation of workers were the source of their successes, not high technology. Not only was the technology expensive and unreliable, it was a solution to the wrong problem.

13. Life Is Unpredictable

Despite President Reagan's pledge to get government off their backs, the folks living near March Air Force Base in California had to tolerate his interference with their garage door openers from January 1981 to December 1988.(14) Air Force One had some powerful and inadequately shielded electronics. The consequences to nearby communities were never adequately explored, so when Ronald Reagan rode into town, his neighbors always knew.

UNRELIABLE SYSTEMS, UNMET DESIRES

As a society, we have our strengths. Most houses have electricity and indoor plumbing, most roads are in pretty good shape, and the bridges ordinarily don't fall. (Well, there was that one on the Connecticut Turnpike.)

People sometimes suggest that the problems suffered by digital systems are so extensive because we have been building them for so short a time. We've been building physical systems such as roads and bridges for centuries, the argument runs. When we first started, doubtless people experienced this same level of failure and frustration.

This theory is hard to test, if only because it's impossible to remember a time when we didn't know _something_ about, for example, building bridges. But we do have a good historical and photographic record of the process of learning to build bridges from concrete, chiefly as documented in the life, work, and passion of a Swiss engineer named Robert Maillart (1872-1940).(15) In fact, the first bridges to be built of this new material developed cracks after a few years, but they didn't come crashing down. By the end of his life, Maillart had learned how to engineer such bridges reliably. Many of them are still in use in Switzerland today. His later bridges are not only sturdy and reliable, they are graceful and beautiful as well.

The software industry hasn't fared so well. Before the end of the 1990s, we will be able to celebrate the golden anniversary of developing commercial software. In 1968, at a NATO Conference on Software Engineering, software professionals coined the term "the Software Crisis" to describe the difficulties they were having in building reliable systems.(16) Since then, a lot of books and papers have been written about it, and a lot of seminars with titles such as "Managing Complexity" have been held. The crisis itself will soon be celebrating its silver anniversary.

This isn't the way we thought it was going to be.
We once had a far more optimistic view of our ability to build and maintain complex systems. This view is nicely illustrated by two science fiction stories written about sixty years ago. They purport to describe two opposite events--the computer that crashes and the one that doesn't. Yet they both describe systems far more ambitious than any we could now hope to build, maintained for far longer than we could ever dream of. And as we shall see, they have other things in common as well.

The Machines Will Take Care of Us

The first story, "Twilight," was written by the patron saint of modern science fiction, John W. Campbell, Jr. It was first published in 1936. In "Twilight," a time traveler visits a city somewhere on earth in the far future. He finds no one--the population of the earth is now considerably diminished, and whole cities have been abandoned. Nevertheless, in an inspirational display of system robustness and reliability, the machines are up and running:

"I don't know how long that city had been deserted. Some of the men from the other cities said it was a hundred and fifty thousand years.... The taxi machine was in perfect condition, functioned at once. It was clean, and the city was clean and orderly. I saw a restaurant and I was hungry...."(17)

The protagonist eats the millennia-old food, which is still wholesome, and cruises around the city in the taxi. In true Campbell fashion, he stops next at a subterranean level to watch the machinery:

"The entire lower block of the city was given over to the machines. Thousands. But most of them seemed idle, or, at most, running under light load. I recognized a telephone apparatus, and not a single signal came through. There was no life in the city. Yet when I pressed a little stud beside the screen on one side of the room, the machine began working instantly."

(By the way, that phone system has an excellent user interface.
Campbell's hero knows exactly which button starts the system, despite missing millions of years of cultural history. Likewise, will those engineers who worked on the taxi service please call their offices? Raytheon, Boeing, the Bank of New York...they'd all like to talk to you.)

But Campbell's man doesn't stay forever down among the machines:

"Finally I went up to the top of the city, the upper level. It was a paradise. There were shrubs and trees and parks, glowing in the soft light that they had learned to make in the very air. They had learned it five million years or more before. Two million years ago they forgot. But the machines didn't, and they were still making it."(18)

It should be evident by now that a system that can operate for five million years, maintaining itself without human help for two million years, is simply miraculous. But it's a fruitless miracle. The machines still function, but the people can hardly manage to. They are declining, energies sapped, vision spent, victims of their success. "The men knew how to die, and be dead, but the machines didn't," Campbell wrote sadly.

The Machines Won't Take Care of Us

In 1928, E. M. Forster, a writer of deeper insight, wrote a wonderful tale of system breakdown called "The Machine Stops." The story depicts a society in which each person lives in a small underground room. All needs and wants are furnished by the Machine, so there is no need ever to leave one's room. The Machine provides ventilation, plumbing, food and drink, movable furniture, music, and literature. Secondhand, machine-mediated experiences of all kinds are universally available through a worldwide multimedia communication network. Automated manufacturing and transportation systems provide a stunning array of commodities. This has gone on for so long that people remember no other life; they are utterly dependent; even breakdowns have been repaired automatically by the mending apparatus.
However, one day the Machine begins to disfigure its musical renditions with "curious gasping sighs." Soon thereafter, the fruit gets moldy; the bath water starts to stink, the poetry suffers from defective rhymes. One day beds fail to appear when sleepy people summon them. Finally:

"It became difficult to read. A blight entered the atmosphere and dulled its luminosity. At times Vashti could scarcely see across her room. The air, too, was foul. Loud were the complaints, impotent the remedies, heroic the tone of the lecturer as he cried, `Courage, courage! What matter so long as the Machine goes on? To it the darkness and the light are one.' And though things improved again after a time, the old brilliancy was never recaptured, and humanity never recovered from its entrance into twilight."(19)

Ultimately, the Machine breaks down completely, and with it, the entire society on which it is based. But on the earth's surface, homeless, half-barbaric rebels still live. Humanity is not wholly lost, after all.

Forster's story may seem to take a more modern, skeptical view of technology, but both stories assume a degree of reliability and robustness far beyond anything we can seriously imagine achieving. We build systems representing only a tiny fraction of this size and complexity, and they break all the time.

Either Way, We Won't Like It

The point is not, however, that Forster or Campbell were lousy futurists. Everyone is a lousy futurist; Real Life(TM) is too chaotic, complex, and rich in detail to predict. These writers' concern is not accurate prediction, but the human soul; their stories are about our primal yearning to be cared for. The temptation to make machines our caretakers is a modern form of a basic human desire. These stories warn of the consequences of succumbing to this temptation: we lose touch with our natures. Our bodies continue living, but our souls die.

The scenarios are far-fetched, but the warning isn't.
The urge to let the machines take care of us is still with us; software, we feel, can do it. Software is flexible, it responds to us, it adapts to the situation. The digital systems we are now building really were unimaginably complex just a decade ago. Some of them perform functions that have never before been performed, because they could be accomplished no other way. With the advent of digital systems, we seem at last to be on the verge of building machines big and complicated and smart enough to take care of things.

This is an illusion. Digital systems are capable of a lot of flexibility. Many are even capable of reasonable robustness. They can add a lot to our lives. But they are not 100 percent reliable, nor will they become so in the foreseeable future.

Of course, perfection isn't always required. It's nice to get a weather report, even if you know you can't count on it. But nothing less than perfection will do for running a nuclear power plant; the consequences of even a small failure could be just too disastrous. Of all the software we rely on daily, none of it is bug-free.

It's natural to want the machines to take care of us. But it isn't wise. As we'll see in the next chapter, it is not in the nature of software to be bug-free.

From DIGITAL WOES: Why We Should Not Depend on Software
Notes to Chapter 1 (abbreviated)
Copyright 1993 BY LAUREN RUTH WIENER

NOTES

1 Ceruzzi, Paul. Beyond the Limits: Flight Enters the Computer Age. Cambridge, MA: MIT Press, 1989, pp. 20-23.

2 Andrews, Edmund L. "String of Phone Failures Perplexes Companies and U.S. Investigators." New York Times, July 3, 1991. "Theories Narrowed in Phone Inquiry." New York Times, July 4, 1991, p. 10. Markoff, John. "Small Company Scrutinized in U.S. Phone Breakdowns." New York Times, July 5, 1991, p. C7. Andrews, E. "Computer Maker Says Flaw in Software Caused Phone Disruptions." New York Times, July 10, 1991, p. A10. Rankin, Robert E. "Telephone Failures Alarming." Oregonian, July 11, 1991, p. A13. Science News, "Phone glitches and software bugs," Aug. 24, 1991, p. 127. Also, comp.risks, 12:2, 5, 6, and more.

3 Carroll, Paul B. "Painful Birth: Creating New Software Was Agonizing Task for Mitch Kapor Firm." The Wall Street Journal, May 11, 1990, pp. A1, A5.

4 What I see when I turn on my computer (a Macintosh, as you may have guessed) and open a few windows. Many of you have already guessed that my desktop is showing me comp.risks archives. Thank you again, Peter G. Neumann, for the incomparable service this forum provides.

5 Murray, Charles, and Catherine Bly Cox. Apollo: The Race to the Moon. New York: Simon and Schuster, 1989, pp. 344-55. The quote is from p. 344. The story and additional analysis can be found in Ceruzzi, op. cit. pp. 212-218.

6 Hughes, David. "Tracking Software Error Likely Reason Patriot Battery Failed to Engage Scud," Aviation Week and Space Technology, June 10, 1991, pp. 25-6.

7 Schwartz, John. "Consumer Enemy No. 1" and "The Whistle-Blower Who Set TRW Straight." Newsweek, Oct. 28, 1991, pp. 42 and 47. Miller, Michael W. "Credit-Report Firms Face Greater Pressure; Ask Norwich, Vt., Why."
The Wall Street Journal, Sept. 23, 1991, pp. A1 and A5. Also reported in comp.risks, 12:14, Aug. 19, 1991.

8 Berry, John M. "Computer Snarled N.Y. Bank: $32 Billion Overdraft Resulted From Snafu," Washington Post, Dec. 13, 1985, p. D7, as reported in comp.risks, 1:31, Dec. 19, 1985. Zweig, Phillip L. and Allanna Sullivan. "A Computer Snafu Snarls the Handling of Treasury Issues." Wall Street Journal, Nov. 25, 1985, reprinted in Software Engineering Notes, 11:1, Jan. 1986, pp. 3-4. Also Hopcroft, John E. and Dean B. Krafft. "Toward better computer science," IEEE Spectrum, Dec. 1987, pp. 58-60.

9 comp.risks 13:56 and 58, Jun. 9 and 15, 1992. Items contributed by Daniel McKay, to whom thanks is due for his thorough reporting and thoughtful analysis.

10 Jacky, Jonathan. "Programmed for Disaster: Software Errors That Imperil Lives." The Sciences, Sept/Oct. 1989, pp. 22ff; also "Inside Risks: Risks in Medical Electronics." Communications of the ACM, 33:12, December, 1990, p. 136; also personal communication, Seattle, WA, Jan. 14, 1991. An excellent and thorough technical report covering all aspects of the subject is: Leveson, Nancy G., and Clark S. Turner. An Investigation of the Therac-25 Accidents. Univ. of Washington Technical Report #92-11-05 (also UCI TR #92-108), Nov. 1992.

11 Forester, Tom, and Perry Morrison. Computer Ethics: Cautionary Tales and Ethical Dilemmas in Computing. Cambridge, MA: MIT Press, 1990, p. 75. Also, New York Times Science section, July 29, 1986, p. C1.

12 Fitzgerald, Karen. "Faults and Failures: Ferry Electronics Out of Control." IEEE Spectrum, 27:2, Feb. 1990, p. 54.

13 "When GM's robots ran amok." The Economist, Aug. 10, 1991, pp. 64-65.

14 Forester and Morrison, op. cit. p. 73.

15 See Billington, David P. Robert Maillart's Bridges: The Art of Engineering. Princeton, N.J.: Princeton University Press, 1989.

16 Dijkstra, Edsger W. "Programming Considered as a Human Activity" in Classics in Software Engineering, Edward Yourdon, ed.
New York: Yourdon Press, 1979, pp. 39.

17 "Twilight" John W. Campbell, copyright 1934 by Street and Smith Publications, Inc. First published in February 1935 Astounding Stories, and reprinted in The Best of John W. Campbell. Garden City, N.Y.: Nelson Doubleday, Inc. 1976, pp. 28-29.

18 ibid.

19 Forster, E. M. "The Machine Stops." From The Eternal Moment and Other Stories, Orlando, Fla.: Harcourt Brace Jovanovich, Inc., 1928.