A BZ Media Publication
Best Practices: Unit Testing
VOLUME 5 • JANUARY 2008 • $8.95 • www.stpmag.com
The Future of Software Testing...
FutureTest 2008 — A BZ Media Event
February 26–27, 2008 • New York Hilton, New York City, NY
Save $300 with our Super-Early Bird discount: register online by January 11 at www.futuretest.net.
Stretch your mind at FutureTest 2008, an intense two-day conference for executive and senior-level managers involved with software testing and quality assurance. Our nine visionary keynotes and two hard-hitting panel discussions will inform you, challenge you and inspire you:
• Brian Behlendorf on Open Source
• Rex Black on ROI
• Jeff Feldstein on Test Teams
• Robert Martin on Craftsmanship
• Gary McGraw on Security
• Alan Page on Centers of Excellence
• Robert Sabourin on Just-In-Time Testing
• Joel Spolsky on Software Testing
• Tony Wasserman on the Future
• Alan Zeichick moderates two panel discussions, on Test Tools and on the Application Life Cycle
VOLUME 5 • ISSUE 1 • JANUARY 2008
A BZ Media Publication

Contents

COVER STORY
12  Lights, Camera, ALM 2.0!
    When it comes to life cycle management, with its newly automatic synchronization of metadata and development artifacts, ALM 2.0 is already a star—and the tester is the director. Roll 'em!  By Brian Carroll

18  You Can Gauge Performance Without Requirements
    Hang onto your Web app's users by analyzing their acceptable wait limits. Here's how.  By Alexander Podelko

25  Let Not Your Project Become A Tragedy of Errors
    To bypass critical errors in testing, learn from the flight control industry and mind your out-of-range values.  By Yogananda Jeppu and Ambalal Patel

30  Offshore Playbook: Quality Audit 101
    Your company could be looking overseas to cut costs and deadlines. Put the odds in your favor with a quality audit, so that your team's best practices will be on the agenda—wherever you test.  By Steve Rabin and John Bradway

Departments
7  • Editorial  Why we use software to test software.
8  • Contributors  Get to know this month's experts and the best practices they preach.
9  • Feedback  It's your chance to tell us where to go.
10 • Out of the Box  New products for testers.
35 • Best Practices  Will unit testing go mainstream? Based on my editorial experience, never!  By Geoff Koch
38 • Future Test  Sync up development with quality assurance to boost productivity.  By Adam Kolawa
VOLUME 5 • ISSUE 1 • JANUARY 2008

EDITORIAL
Editor: Edward J. Correia, +1-631-421-4158 x100, ecorreia@bzmedia.com
Editorial Director: Alan Zeichick, +1-650-359-4763, alan@bzmedia.com
Copy Editor: Laurie O'Connell, loconnell@bzmedia.com
Contributing Editor: Geoff Koch, koch.geoff@gmail.com

ART & PRODUCTION
Art Director: LuAnn T. Palazzo, lpalazzo@bzmedia.com
Art/Production Assistant: Erin Broadhurst, ebroadhurst@bzmedia.com

SALES & MARKETING
Publisher: Ted Bahr, +1-631-421-4158 x101, ted@bzmedia.com
Associate Publisher: David Karp, +1-631-421-4158 x102, dkarp@bzmedia.com
List Services: Lisa Fiske, +1-631-479-2977, lfiske@bzmedia.com
Advertising Traffic: Phyllis Oakes, +1-631-421-4158 x115, poakes@bzmedia.com
Reprints: Lisa Abelson, +1-516-379-7097, labelson@bzmedia.com
Director of Marketing: Marilyn Daly, +1-631-421-4158 x118, mdaly@bzmedia.com
Accounting: Viena Ludewig, +1-631-421-4158 x110, vludewig@bzmedia.com

READER SERVICE
Director of Circulation: Agnes Vanek, +1-631-443-4158, avanek@bzmedia.com
Customer Service/Subscriptions: +1-847-763-9692, stpmag@halldata.com

Cover Illustration by P. Avlen

President: Ted Bahr
Executive Vice President: Alan Zeichick

BZ Media LLC, 7 High Street, Suite 407, Huntington, NY 11743, +1-631-421-4158, fax +1-631-421-4130, www.bzmedia.com, info@bzmedia.com

Software Test & Performance (ISSN #1548-3460) is published monthly by BZ Media LLC, 7 High Street, Suite 407, Huntington, NY 11743. Periodicals postage paid at Huntington, NY and additional offices. Software Test & Performance is a registered trademark of BZ Media LLC. All contents copyrighted 2008 BZ Media LLC. All rights reserved. The price of a one-year subscription is US $49.95, $69.95 in Canada, $99.95 elsewhere. POSTMASTER: Send changes of address to Software Test & Performance, PO Box 2169, Skokie, IL 60076. Software Test & Performance Subscriber Services may be reached at stpmag@halldata.com or by calling +1-847-763-9692.
Ed Notes
Using Software To Test Software
By Edward J. Correia

Happy New Year! As we enter 2008, it's time once again to contemplate our accomplishments of the year gone by, to consider our goals for the year ahead, and for some, to ponder the many paradoxes that vex our existence.

Why, for instance, do we build software to test other software? This question has never before occurred to me, nor does it parallel such mysteries as people who are financially wealthy but short on values. But it does bear some discussion.

The idea was brought to me by testing consultant Elfriede Dustin, who credits a conference-goer with reminding her of a concept she had pondered many times before. So why is it that we develop software to test software?

The practice of automating software testing itself involves a software development life cycle, complete with its own set of requirements, a design and the actual development and testing. And as Dustin points out, the major advances in testing tools since the 1990s include the ability to recognize object properties beyond their x,y coordinates. This has made automation more viable because scripts can be more useful and less fragile.

The open source community also has emerged in the last 15 years as a prolific source of high-quality test automation tools. As evidence, consider the FitNesse (fitnesse.org) acceptance testing framework, the Watir (wtr.rubyforge.org) Ruby-based automated browser testing libraries, and Python and Perl.

Just last month, this magazine ran an excellent tutorial on building your own XML-based test automation framework. Other open source test automation frameworks are available, such as STAF/STAX, which provides useful services you don't have to build from scratch.

Why shouldn't software be used to test software? The whole issue reminds me of an absurdity I came upon while working as a support technician for Windows magazine in the 1990s. We were gathering requirements for the editorial and production network from the staff, which insisted on using Windows-based hardware and software to publish the magazine. "We're a magazine about Windows, and we need to be published on Windows" was the philosophy.

And while that kind of idealism might have looked good on the pages of the magazine, the state of desktop publishing on the Windows platform at the time was immature, to be generous. Their operating system was Windows for Workgroups, the first such installation in the company.

My belief at the time was the same as it is today. The magazine should have used the best tool available at the time, regardless of its content. The parent company, now called CMP Media, made its bones publishing dozens of periodicals, which at one time all used a mainframe-style publishing system called Atex. And not a single one of its publications was about Atex.

Why? Because one has nothing to do with the other. "We want to eat our own dog food," they might have said. And for Windows magazine to use Macintosh computers was unthinkable. "Ridiculous," I would have said (and probably did, in private). I lost that battle and would ultimately not be involved in the deployment. It was just as well, because the team struggled mightily.

Software is very good at automating things. So when automated testing is the need, why not use the best tool for the job? For the practice of automating software testing, the best tool happens to be more software. Sometimes the best tool is staring you right in the face.
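For readers who want to see the idea in miniature, the following is a small, self-contained example of software testing software, in the spirit of the open source frameworks mentioned above. The business rule and its values are invented for illustration; it simply shows an automated check that a person would otherwise repeat by hand.

import unittest

def adjudicate_claim(amount, deductible):
    # Toy business rule (invented): the payable amount is whatever exceeds
    # the deductible, and never less than zero.
    return max(amount - deductible, 0)

class AdjudicationTest(unittest.TestCase):
    def test_claim_above_deductible(self):
        self.assertEqual(adjudicate_claim(500, 100), 400)

    def test_claim_below_deductible_pays_nothing(self):
        self.assertEqual(adjudicate_claim(50, 100), 0)

if __name__ == "__main__":
    unittest.main()

Run it once or run it a thousand times after every build; the software never gets bored checking the software.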
Contributors BRIAN CARROLL leads the Application Lifecycle Framework project at Eclipse. His article, which begins on page 12, delves into ALF and its implementation of ALM 2.0, pertaining to the tester’s role. Brian has been developing software professionally for 40 years, the last 25 of which have been focused on developer tools and infrastructure. Brian has been a popular speaker and project leader at GUIDE, the OMG and Eclipse. He co-authored the original OASIS Web Services Distributed Management (WSDM) specification. He’s also a Fellow at Serena Software, which develops and markets ALM solutions.
ALEXANDER PODELKO's engagements as an application performance expert have included Aetna, Hyperion and Intel. He is currently serving as a consulting member of the technical staff at Oracle, part of Performance Engineering for the Hyperion line of products, which Oracle acquired in March. Alexander's article, which begins on page 18, draws from his experience in the roles of performance tester, performance analyst, performance architect and performance engineer to explain how a tester may establish performance requirements in the absence of documented ones.
YOGANANDA JEPPU and AMBALAL PATEL are scientists at IFCS Aeronautical Development Agency in Bangalore, India. Beginning on page 25, the colleagues take a Shakespearean approach to reduction of defects to help keep your project from becoming a Tragedy of Errors. Yogananda has many published works, mostly relating to real-time systems test methodologies for performance and quality assurance in aeronautics-industry control systems. Ambalal, in his current post since 2000, holds degrees in mechanical engineering and instrumentation and a Ph.D. in fuzzy control systems from the Indian Institute of Technology, Kharagpur.
STEVE RABIN has more than 20 years of experience designing and delivering enterprise software. He’s currently serving as CTO of software and Internet venture capital firm Insight Venture Partners. He and JOHN BRADWAY, development manager at project management software maker Primavera Systems, explain how organizations can test effectively offshore by keeping the foreign processes in sync with those at home. Their article begins on page 30. TO CONTACT AN AUTHOR, please send e-mail to feedback@bzmedia.com.
Feedback
DEVELOP AND TEST SIMULTANEOUSLY
I would like to respond to Prakash Sodhani's article "Navigating Without Requirements" (Software Test & Performance magazine, Dec. 2007) to suggest a much simpler and more effective way for testers to test applications being developed using the agile approach: namely, be an active participating member of the development team and translate the requirements into test cases at the exact same time as the developers are developing, using the exact same conversations with the customers. The developers will be able to fix their bugs immediately instead of waiting weeks to be told what the bugs are. Just because test cases are traditionally developed after the code is complete
does not mean that this is the only or best way to develop test cases. What better way to build quality in than to develop the tests in parallel instead of afterward? Another advantage of the tester being actively involved from the beginning is that they will sometimes spot issues with the requirements in the discussions with the customer that the developers miss.
Steven Gordon, Ph.D.
Scottsdale, AZ
SOFTWARE FOR THE TESTING OF SOFTWARE? “The Software Testing Paradox” by Edward J. Correia (Test & QA Report, Dec. 4, 2007) is an interesting article. I was not so startled by the question “Why develop software to test software?” We developed a software tool to inject messages into our system under test, and I do not think this is uncommon in other companies/projects. And as for the hardware world, think of an oscilloscope; Isn’t that a piece of hardware used to test hardware? And further, think of a hardware production line, and consider custom-built hardware measuring tools, to check that the item is within spec. Mike Arnold Cambridge, MA
SOLD OUT BY OUTSOURCING
Regarding Edward J. Correia's "Offshoring Strategy: Trust, but Verify" (Test & QA Report, Nov. 20, 2007), cancel me. I'm not the CEO selling out his country—I'm the EE who got sold out. Outsourcing only benefits the business execs, not the people who made the business. Outsource to China. They are still
Communist. They can still grab and nationalize all our fabs. We don’t actually “own” anything in China. Now if only they could outsource feeble-brained editors. That would be justice. Bill Baka Via e-mail
THE HUMAN ADVANTAGE
This is in response to Edward J. Correia's article "The Software Testing Paradox." There is a simple answer to this question "Why do we build software to use for testing other software?" We use software (automated testing tools) to test software because it is impossible for people to approach effective testing manually. Manual testing requires lots of time and resources, and is unreliable and can be very tedious and tiresome. Automated testing has the advantage of being very effective at repetitive tasks. Many of the techniques we use to test software are mathematical in nature and can be easily automated, making the task of test design simpler and more efficient. However, with all their advantages, automated testing tools cannot replace humans when it comes to the creative process and methods such as exploratory testing. We need software to test software simply because, once it is written properly, it increases the amount of testing we can do with the limited amount of resources we have. Comparing hardware testing to software testing is an invalid comparison. Hardware adheres to fixed rules (laws of physics, etc.). If you think about it, a simulator is a combination of both hardware and software. Therefore, you're testing hardware with hardware tools, just as you test software with software tools.
Dale L. Perry
Tampa, FL
SOA’S BRAVE NEW WORLD? Regarding “Changing Tires on a Moving Car” (Test & QA Report, Nov. 27, 2007), capture and playback strategies have been the basis of performance testing tools for years. There is nothing new here. The only difference is you are now capturing and playing back a different stream— SOAP messages instead of HTML, for example. Rational and Mercury (and others) have had this technology in the market for years. They have also had the ability to vary the data to make it more realistic. Just playing hundreds of copies of exactly the same message is hardly a good way to validate an increased load/what-if scenario. The statement “not having to understand the semantics of correctness” is just crap. Of course you do. If someone has changed the code you have to know if your test failed because it was because the code had changed and it was now correctly doing something different, or the code had changed and the test has caught an incorrect behavior. Why is it every time anyone puts SOA in a sentence, we are supposedly transported to some brave new world? Mark McLaughlin Sydney, NSW, Australia FEEDBACK: Letters should include the writer’s name, city, state, company affiliation, e-mail address and daytime phone number. Send your thoughts to feedback@bzmedia.com. Letters become the property of BZ Media and may be edited for space and style. www.stpmag.com •
Out of the Box
Hoping to Stir The Heart of the Lonely Exception Hunter Red Gate, which specializes in tools for Microsoft developer technologies, in December released Exception Hunter, a US$295-per-user analysis tool for .NET that it claims can predict exceptions likely to be thrown by methods before they’re part of a finished application. “Until now, developers have had to wait until an error happened to find out which method throws which exception,” said Exception Hunter lead developer Bart Read. “Exception Hunter finds and reports exceptions early in the development process, so they don’t cause downstream problems for developers and end users.” The tool works by permitting the developer or tester to select an assem-
bly to analyze, drilling down to the method level and reviewing a list of the different types of exceptions that can potentially be thrown. With this information, the developer can then write the appropriate exception handlers. “It’s all about an ounce of prevention, rather than the frustrating cure of
patching after application development," said Read. Exception Hunter 1.0 is available now for Windows 2000 and later (including Vista) and requires the .NET Framework 2.0 and IE 6 or later. It also couples with the code editing capabilities of Visual Studio 2002 or later.
Exception Hunter from Red Gate can identify exceptions in .NET code before the app is finished, allowing handlers to be written.
Coverity's Prevent SQS Now Detects Race Conditions as Cores Continue to Multiply
With a totally rewritten user interface, Coverity's flagship Prevent SQS static code-analysis tool now can detect race conditions in software written for multicore and multiprocessor systems.
Released in December, the latest version of the scanning tool for C/C++ and Java can now detect the three major types of code flaws: race conditions, deadlocks and thread blocks, the company claims.
Race conditions occur when multiple threads access the same shared data without suitable locks in place to prevent accidental overwrite, loss or corruption. Deadlocks are defined as two or more threads that chase each other's locks in a circular chain, which usually leads to a system freeze or crash. Performance-sapping thread blocks happen when one thread calls a long-running operation and blocks other threads from making any progress.
As the number of processor cores continues to rise in target systems, these hard-to-detect and impossible-to-predict problems will become an increasingly common issue for testers. "Race conditions are particularly difficult for developers because they are hard to test for [and] nearly impossible to replicate," said Coverity CTO Ben Chelf in a news release. "The consequences of a race condition in the field can be disastrous." In some instances, they have even been blamed for deaths and widespread catastrophes.
Prevent SQS is available now for all major platforms and compilers; pricing is based on project size.
Coverity's Prevent SQS identifies race conditions, now through a new UI.
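Prevent SQS finds these defects by static analysis of C/C++ and Java source; the sketch below is unrelated to Coverity's tooling and only illustrates the class of bug it reports: two threads updating shared data without a lock, and the conventional fix.

import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    # Race condition: read-modify-write on shared data with no lock,
    # so concurrent updates can be lost.
    global counter
    for _ in range(n):
        counter += 1

def safe_increment(n):
    # Fixed version: the lock makes each read-modify-write atomic.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def run(worker):
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

if __name__ == "__main__":
    print("unsafe:", run(unsafe_increment))  # may come up short of 400000
    print("safe:  ", run(safe_increment))    # always 400000

The point of a tool like Prevent SQS is that it flags the unsafe pattern from the source code alone, without having to catch the lost update in a test run.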
Understand Your Role as a Tester When ALM Is On the Scene
Here’s Your Close-Up: Get Ready for ALM 2.0
By Brian Carroll
If your organization is seeking an application life cycle management tool, you'd do well to adopt one that supports ALM
2.0, which improves on prior versions by automatically synchronizing the metadata and development artifacts among tools, while still allowing each tool to manage much of its own data storage. Automatic synchronizations could include, for example, adjusting the test plan to reflect changes in the status of a requirement or an issue, or launching a series of activities to deploy a build into staging or production once that deployment is approved. Think of the automation of ALM as replacing the “cut and paste” that occurs today to keep development tools in sync, though the newer spec adds a lot more than that. ALM 2.0 helps all the stakeholders in the software development process communicate and collaborate more efficiently. ALM frameworks provide the glue that integrates stakeholders and their development tools across the life cycle and opens opportunities for process improvement and quality improvement through the use of better tools. For more on ALM 2.0, see the sidebar “ALM’s Second Coming.”
Eclipse Already There
The Eclipse development community is often quick to exploit new specifications or development ideas. For example, the Eclipse Application Lifecycle Framework (ALF) project (www.eclipse.org/alf) already implements an open source ALM 2.0 framework. Though still in the incubation phase, ALF provides event-driven tool integration and orchestration using standards such as Web services and BPEL, the OASIS Web Services Business Process Execution Language.
Brian Carroll is lead developer on the Eclipse Application Lifecycle Framework (ALF) project and a Serena Fellow.
ALF also provides a single sign-on capability that allows a user’s identity to be passed securely to all the tools that participate in an orchestration. This article explains how ALM systems—using ALF as a model—can give development teams, and specifically testers, the tools they need to perform most effectively. To provide development teams with flexibility and responsiveness, ALF employs the notion of Event Driven Architecture (EDA), which allows
• tools to emit an event when something changes within a tool that may affect related data in another tool (for example, once a build has been completed or a bug has been approved to be fixed in the current release). The events go through the ALF Event Manager, which filters the events of interest and routes them to the appropriate BPEL process. A BPEL process is like a flowchart that indicates the sequence and data to be passed among tools to keep them in sync. ALF also contains a standards-
based identity capability (using WSSecurity, WS-Trust, WS-Federation and SAML Assertions) that conveys the identity of the user making the change that triggered an event to be passed through the BPEL process to all the relevant tools. As a result, all the tools are updated to reflect the change. ALF also provides common services that make gluing the tools together easier. These services include engines to send e-mail notices to team members when certain events occur, logging services to create an audit trail of the development activities, and relationship services to establish relationships among artifacts stored within different tools. Finally, ALF provides a set of best practices for tool integration and vocabularies. Vocabularies define the core data structures, events and services that tools should expose in different stages of the life cycle. An ALF vocabulary describes the essential events and services for configuration management, requirements, testing, etc. Vocabularies make it simpler to define orchestrations and substitute development tools with minimal disruption to the tool integrations. However, ALF also will work if tools do not support the vocabularies. Based on standards and implemented as open source, ALF eliminates the nightmare that happens when an organization upgrades a tool in its development stack and the proprietary point-to-point integrations break. This can often bring development to a grinding halt. Not only is the underlying source code of the interoperability platform available to the shop, but the logic of the integrations is expressed in the high-level standard workflow language of BPEL. The development team (most likely the configuration management team) can make the fixes themselves without having to wait for the vendor to get a patch into the next maintenance release cycle.
ALM on Testing Another positive byproduct of ALM 2.0 is that it will make testers’ lives eas-
www.stpmag.com •
13
ALM 2 IN ACTION
DIRECTOR'S NOTES
1. If your shop is planning an ALM initiative, make sure the QA team participates in the design of the ALM solution and the process it automates. Otherwise, you could end up propagating a flawed process.
2. ALM 2.0 can automate propagating changes among tools and improve the efficiency of the development process, but it doesn't distinguish between changes that can be automated and those that need human review. The QA team must track which propagations make sense.
3. Because ALM 2.0 makes switching tools easier, consider incorporating sets of great quality tools, such as security vulnerability scanners and generic code quality tools, that you may not have had the resources to get previously.
4. Test the process as well as the app. Like the mason who builds walls using a crooked level, a flawed integration quietly spreads errors. Make sure the integrations work as planned and that your orchestrations behave as expected, giving special attention to error and edge conditions; for example, test that the system propagates changes (such as a deleted requirement) the way it's supposed to.
5. Learn SOA concepts and the technologies that provide SOA capabilities. Start by learning BPEL. Being able to conceive and implement the "killer integration" for your development practice will make your job easier and garner recognition.
6. Start by incorporating simple links among tools, then incorporate full-blown PPM to improve measurement of, and visibility into, the development process.
7. Consider Complex Event Processing to monitor the quality of the development process, such as code changes and test execution, and the like.
8. Don't get stuck in a mix of proprietary frameworks, and be wary of vendors that try to lock you into their platform. Ask yourself whether an integration would survive a switch of tools.
ier. With the new spec, tool integration is based on the development process, such that updates to project requirements are synchronized across all the tools used in the development cycle. So when a requirement is deferred to a later release, the test management tool will have been automatically notified to disable tests related to that require-
14
• Software Test & Performance
ment. The expected improvement in system quality resulting from that synchronization is what sells the technology to development managers. ALM 2.0 can eliminate the rote work needed to synchronize data across tools, but tool integrations are generally driven by mechanical transformation rules and don’t yet understand the semantics (or meaning) of development artifacts. Perhaps such understanding will come in ALM 3.0. For now, the current spec
links are established, but if there are changes to the wording of a requirement that subtly change its meaning, the automated transformation rules won’t catch that change. So the human intelligence of a QA team is still needed to interpret a requirement and determine which tests are appropriate to verify the requirement has been satisfied (Director’s Note 1). Certain changes in requirements or bug reports will require changes in JANUARY 2008
ALM 2 IN ACTION
application code and the tests themselves. Improved communication between tools—and therefore all the roles in development—may make metadata more accessible to testers, but crafting a flexible test and identifying edge conditions to be tested still requires the skills of a journeyman tester. However, with ALM 2.0, the QA department will have more time to design and craft tests, and spend less time on creating custom automations and reporting on the tests. ALM integrations will perform those tasks (Director’s Note 2). ALM also impacts how integrations among tools are expressed. Today, scripting with command-line interfaces is the common way to tie together a sequence of tools. However, it’s difficult to integrate tools that run on different platforms (for example, Linux or the mainframe), or expose their interfaces over the Web rather than the command line. With ALM 2.0, BPEL can sequence and integrate tools that run on different platforms and that don’t expose command-line interfaces at all. And the orchestration can be expressed using a BPML or BPEL editor rather than as script code using a text editor.
Unit and Component Testing The development of unit tests will be largely unaffected by ALM 2.0. However, unit tests will be incorporated into much larger sequences of testing as part of the continuous integration movement. Why run unit tests continuously when you can’t incorporate code scanning, security scanning and other types of quality assurance into the process? An early ALF prototype incorporated security scanning with traditional test management, and combined the reporting from both into a single “Deploy to Production” issue and report. It has long been a strange irony of our industry that the automated test suite is run manually. With ALM 2.0, the automated test suite can run automatically after, say, a successful build (Director’s Note 3).
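A minimal sketch of that last point follows. The build hook and the choice of test runner are assumptions made for illustration; in an ALM 2.0 shop the same trigger would be expressed as an orchestration rather than a local script.

import subprocess

def on_build_completed(build_id: str, succeeded: bool) -> None:
    """Kick off the automated suite only when the build succeeds."""
    if not succeeded:
        print(f"Build {build_id} failed; skipping test run")
        return
    # Run the project's automated tests; pytest is just an example runner here.
    result = subprocess.run(["pytest", "--quiet"], capture_output=True, text=True)
    status = "PASSED" if result.returncode == 0 else "FAILED"
    print(f"Build {build_id}: automated suite {status}")

if __name__ == "__main__":
    on_build_completed("2008.01.15-3", succeeded=True)

Once the trigger is an event rather than a person, code scanning and security scanning can be chained onto the same "build completed" signal.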
Integration vs. Application Testing
While the focus of testing will be on the application delivered to the end user, the development processes and its integrations will become more inclusive and, therefore, will also need to be tested (Director's Note 4).
A
LM 2.0: THE SECOND COMING The following is an excerpt from Carey Schwaber’s August 2006 Forrester report that introduced the term ALM 2.0. Tomorrow’s ALM is a platform for the coordination and management of development activities, not a collection of life-cycle tools with locked-in and limited ALM features.These platforms are the result of purposeful design rather than rework following acquisitions. The architectural ingredients of ALM 2.0 are: · Practitioner tools assembled out of plug-ins. An à la carte approach to product packaging provides customers with simpler, cheaper tools. Far from being a pipe dream, this approach is a reality today. IBM has done the most to exploit this concept, currently providing many different grades of development and modeling tools that are all available as perspectives in Eclipse, as well as the ability to install only selected features packs in each of these tools. This approach has not yet been successfully applied outside of development and modeling tools. For example, today, customers must choose between defect management that’s too tightly coupled with test management and software configuration management (SCM) or defect management in a stand-alone tool. · Common services available across practitioner tools. Vendors are identifying features that should be available from within multiple practitioner tools—notably, collaboration, workflow, security and reporting and analytics—and driving them into the ALM platform. Telelogic has started to make progress on this front for administrative functionality like licensing and installation. Microsoft has gone even further: Visual Studio Team System leverages SharePoint Server for collaboration and Active Directory for authentication, and because it uses SQL Server as its data store, it can leverage SQL Server Analysis Services and SQL Server Report Builder for reporting and analytics. · Repository neutrality. At the center of ALM 2.0 sits not one repository, but many. Instead of requiring use of the vendor’s SCM solution for storage of all life cycle assets, tomorrow’s ALM will be truly repository-neutral, with close to functional parity, no matter where assets reside. IBM, for example, has announced that in coming years, its ALM solution will integrate with a wide variety of repositories, including open source version-control tools like Concurrent Versions System (CVS) and Subversion. This will drive down ALM implementation costs by removing the need to migrate assets—a major obstacle for many shops—and will bring development on mainframe, midrange, and distributed platforms into the same ALM fold. · Use of open integration standards. Two means of integration—use of Web services APIs and use of industry standards for integration—will ease and deepen integration between a single vendor’s tools, as well as between its tools and third-party tools. Many vendors still don’t offer Web-services-based APIs, but this will change with time. In addition, new standards for life-cycle integration, including Eclipse projects like the Test and Performance Tools Project (TPTP) and Mylar, promise to simplify tools integration. One case in point: SPI Dynamics reports that its integration with IBM Rational ClearQuest Test Manager took one-third of the time it would have taken if both tools didn’t leverage TPTP. · Microprocesses and macroprocesses governed by externalized workflow. The ability to create and manage executable application development process descriptions is one of the big wins for ALM 2.0. 
When processes are stored in readable formats like XML files, they can be versioned, audited and reported upon.This facilitates incremental process improvement efforts and the application of common process components across otherwise discrete processes. For example, Microsoft Visual Studio Team System process templates are implemented in XML and contain work-item-type definitions, permissions, project structure, a project portal and a version control structure. There is no solution on the market that possesses all of these characteristics, but this is the direction in which most vendors are moving. However, it will be at least two years before any vendor offers a solution that truly fulfills the vision of ALM 2.0. Source: Forrester Research, Inc.
Use Complex Event Processing
Prior to ALM 2.0, logged information was scattered in silos maintained by each tool. The ALM 2.0 framework can create a central log of activities that can be fertile ground for insights that can improve your processes. Tools exist that use Complex Event Processing to look for patterns in event logs. For example, you may be able to identify a pattern in which changes made by a certain developer or using a certain tool led to a failure when running automated tests. Analyzing the logs for such patterns may lead to insight that can improve the overall process quality (Director's Note 7).
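Dedicated Complex Event Processing engines are far more capable, but a toy sketch of the kind of pattern hunting described above might look like this; the log format and the sample entries are invented.

from collections import Counter

# Each entry: (timestamp, user, event_type) from a central ALM activity log.
log = [
    (1, "alice", "commit"), (2, "ci", "tests.passed"),
    (3, "bob",   "commit"), (4, "ci", "tests.failed"),
    (5, "bob",   "commit"), (6, "ci", "tests.failed"),
    (7, "alice", "commit"), (8, "ci", "tests.passed"),
]

def failures_following_commits(entries):
    """Count, per user, how often a commit is immediately followed by a test failure."""
    failures = Counter()
    for (_, user, kind), (_, _, next_kind) in zip(entries, entries[1:]):
        if kind == "commit" and next_kind == "tests.failed":
            failures[user] += 1
    return failures

print(failures_following_commits(log))  # Counter({'bob': 2})

The same idea scales up to correlating tool versions, branches or requirement changes with downstream failures.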
New Skills Needed SOA is perhaps the most commonly flaunted buzzword in the industry today. And for good reason: Beyond the hype, it’s clear that SOA provides a platform-independent way to integrate different software systems that previously were difficult or impossible to integrate. Those systems could be enduser applications or even systems of development and test tools. ALF uses Web services as the primary interface description and communication mechanism. That means that the WSDL describes the events and services of tool interfaces, and BPEL is used to describe process automations. And why not? Software development has been building SOA-based systems for the business; why not apply SOA to improve the way software development operates? Shouldn’t the shoemaker’s children have shoes? (Director’s Note 5)
Management Visibility
ALM 2.0's automation capabilities provide the potential for providing QA and management with far greater visibility into the development process. Orchestrations can gather data from tools and refine them for consumption by Project Portfolio Management (PPM) tools for presentation by QA and development managers (Director's Note 6).
Easier Tool Changes
ALF vocabularies provide another dimension to the flexibility that ALM 2.0 provides. The idea is that switching tools today can be a painful experience—history can get lost, not all data can be migrated, and integrations will have to be redone. ALF vocabularies are developed in an open and transparent process by sets of companies agreeing on the essential data that each tool exposes, clearly defining the meaning of that data and segregating essential, common data from tool-specific data. Integrations can then be expressed as a core using the common data elements and extended to leverage any tool-specific data. Such an approach increases the plug-ability of tools. Switching tools then involves only changes to the tool-specific data, and then only if that data is consumed by other tools (Director's Note 8).
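As a loose illustration of the core-plus-extension idea (the field names below are invented, not taken from a published ALF vocabulary), an integration written against the core keeps working even when the tool-specific portion changes:

from dataclasses import dataclass, field

@dataclass
class DefectCore:
    # Essential fields every defect tool is expected to expose.
    id: str
    summary: str
    status: str
    linked_requirement: str

@dataclass
class DefectRecord:
    core: DefectCore
    # Tool-specific data travels separately, so orchestrations written against
    # the core keep working when the defect tracker is swapped out.
    tool_specific: dict = field(default_factory=dict)

bug = DefectRecord(
    DefectCore(id="BUG-7", summary="Timeout on save", status="open", linked_requirement="REQ-42"),
    tool_specific={"legacy_priority_code": "P2a"},
)
print(bug.core.status, bug.tool_specific)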
Get Started Now
•
Open source ALM 2.0 is available today in the Eclipse ALF project. It provides infrastructure for event routing, guidance for process automation using the BPEL engine of your choice and authentication, as well as conveying identity among tools integrated with Web services. It also provides some vocabularies—and you can participate in developing new vocabularies for the tools you use, or try your hand at extending some existing ones. Download ALF from www.eclipse.org/alf and get a head start on improving your development organization’s quality and productivity today. ý
Open source ALM 2.0 is available today in the Eclipse ALF project.
• presentation by QA and development managers (Director’s Note 6).
Use Complex Event Processing Prior to ALM 2.0, logged information
JANUARY 2008
By Alexander Podelko
Alexander Podelko is a software consultant currently engaged by Oracle.
Defining performance requirements is an important part of system design
and development. If there are no written performance requirements, it means that they exist in the heads of stakeholders, but nobody bothered to write them down and make sure that everybody agrees on them. Exactly what is specified may vary significantly depending on the system and environment, but all requirements should be quantitative and measurable. Performance requirements are the main input for performance testing (where they are verified), as well as capacity planning and production monitoring. There are several classes of performance requirements. Most traditional are response time (how fast the system can handle individual requests) and throughput (how many requests the system can handle). All classes are vital: Good throughput with a long response time often is unacceptable, as is good response time with low throughput. Response time (in the case of interactive work) or
processing time (in the case of batch jobs or scheduled activities) defines how fast requests should be processed. Acceptable response times should be defined in each particular case. A time of 30 minutes could be excellent for a big batch job, but absolutely unacceptable for accessing a Web page in a customer portal. Response time depends on workload, so you must define conditions under which specific response times should be achieved; for example, a single user, average load or peak load. Significant research has been done to define what the response time should be for interactive systems, mainly from two points of view: what response time is necessary to achieve optimal user’s performance (for tasks like entering orders) and what response time is necessary to avoid Web site abandonment (for the Internet). Most researchers agreed that for most interactive applications, there is no point in making the response time faster than one to two seconds, and it’s helpful to provide an indicator (like a progress bar) if it takes more than
eight to 10 seconds. The service/stored procedure response-time requirement should be determined by its share in the end-to-end performance budget. In this way, the worst-possible combination of all required services, middleware and presentation layer overheads will provide the required time. For example, with a Web page with 10 drop-down boxes calling 10 separate services, the response time objective for each service may be 0.2 seconds to get three seconds average response time (leaving one second for network, presentation and rendering). Response times for each individual transaction vary, so use some aggregate values when specifying performance requirements, such as averages or percentiles (for example, 90 percent of response times are less than X). Maximum/timeout times should be provided also, as necessary. For batch jobs, remember to specify all schedule-related information, including frequency (how often the job will be run), time window, dependency on other jobs and dependent jobs (and their respective time windows to see how changes in one job may impact others).
Throughput is the rate at which incoming requests are completed. Throughput defines the load on the system and is measured in operations per time period. It may be the number of transactions per second or the number of adjudicated claims per hour. Defining throughput may be pretty straightforward for a system doing the same type of business operations all the time, processing orders or printing reports. It may be more difficult for systems with complex workloads: The ratio of different types of requests can change with the time and season. It's also important to observe how throughput varies with
time. For example, throughput can be defined for a typical hour, peak hour and non-peak hour for each particular kind of load. In some cases, you’ll need to further detail what the load is hourby-hour. The number of users doesn’t, by itself, define throughput. Without defining what each user is doing and how intensely (i.e., throughput for one user), the number of users doesn’t make much sense as a measure of load. For example, if 500 users are each running one short query each minute, we have throughput of 30,000 queries per hour. If the same 500 users are running the same queries, but only one query per hour, the throughput is 500 queries per hour. So there may be the same 500 users, but a 60X difference between loads (and at least the same difference in hardware requirements for the application—probably more, considering that not many systems achieve linear scalability).
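The arithmetic behind the two examples above is worth making explicit; the little sketch below simply restates the per-service budget and the 60X load spread in code.

# Per-service response-time budget for the 10-service page example above.
page_budget = 3.0      # seconds end to end
overhead = 1.0         # reserved for network, page assembly and rendering (as assumed above)
services_on_page = 10
per_service_budget = (page_budget - overhead) / services_on_page
print(f"Each service must answer in {per_service_budget:.1f} s")   # 0.2 s

def hourly_throughput(users: int, requests_per_user_per_hour: float) -> float:
    # Load is the number of users multiplied by how intensely each one works.
    return users * requests_per_user_per_hour

busy = hourly_throughput(500, 60)   # one short query per minute
light = hourly_throughput(500, 1)   # one query per hour
print(busy, light, busy / light)    # 30000.0 500.0 60.0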
Response Times: Review of Research As long ago as 1968, Robert B. Miller’s paper “Response Time in ManComputer Conversational Transactions” described three threshold levels of human attention1. J. Nielsen believes that Miller’s guidelines are fundamental for human-computer interaction, so they are still valid and not likely to change with whatever technology comes next 2. These three thresholds are: • Users view response time as instantaneous (0.1-0.2 second): They feel that they directly manipulate objects in the user interface; for example, the time from the moment the user selects a column in a table until that column highlights or the time between typing a symbol and its appearance on the screen. Miller reported that threshold as 0.1 seconds. According to P. Bickford, 0.2 second forms the mental boundary between events that seem to happen together and those that appear as echoes of each other 3. Although it’s a quite important threshold, it’s often beyond the reach of application developers. That kind of interaction is provided by operating system, browser or interface libraries, and usually happens on the client side without interaction with servers (except for dumb terminals, that is rather an excep-
tion for business systems today). • Users feel they are interacting freely with the information (1-5 seconds): They notice the delay, but feel the computer is “working” on the command. The user’s flow of thought stays uninterrupted. Miller reported this threshold as one second. Using the research that was available to them, several authors recommended that the computer should respond to users within two seconds 1, 4, 5. Another research team reported that with most data entry tasks, there was no advantage of having response times faster than one second, and found a linear decrease in productivity with slower
• response times (from one to five seconds)6. With problem-solving tasks, which are more like Web interaction tasks, they found no reliable effect on user productivity up to a five-second delay. The complexity of the user interface and the number of elements on the screen both impact thresholds. Back in 1960s through 1980s, the terminal interface was rather simple, and a typical task was data entry, often one element at a time. Most earlier researchers reported that one to two seconds was the threshold to keep maximal productivity. Modern complex user interfaces with many elements may have higher response times without adversely impacting user productivity. According
to Scott Barber, even users who are accustomed to a sub-second response time on a client/server system are happy with a three-second response time from a Web-based application7. P. Sevcik identified two key factors impacting this threshold8: the number of elements viewed and the repetitiveness of the task. The number of elements viewed is the number of items, fields, paragraphs etc. that the user looks at. The amount of time the user is willing to wait appears to be a function of the perceived complexity of the request. Users also interact with applications at a certain pace depending on how repetitive each task is. Some are highly repetitive; others require the user to think and make choices before proceeding to the next screen. The more repetitive the task, the better the expected response time. That is the threshold that gives us response-time usability goals for most user-interactive applications. Response times above this threshold degrade productivity. Exact numbers depend on many difficult-to-formalize factors, such as the number and types of elements viewed or repetitiveness of the task, but a goal of three to five seconds is reasonable for most typical business applications. • Users are focused on the dialog (8+ seconds): They keep their attention on the task. Miller reported this threshold as 10 seconds. Anything slower needs a proper user interface (for example, a percent-done indicator as well as a clear way for the user to interrupt the operation). Users will probably need to reorient themselves when they return to the task after a delay above this threshold, so productivity suffers.
A Closer Look At User Reactions
Peter Bickford investigated user reactions when, after 27 almost instantaneous responses, there was a two-minute wait loop for the 28th time for the same operation. It took only 8.5 seconds for half the subjects to either walk out or hit the reboot. Switching to a watch cursor during the wait delayed the subject's departure for about 20 seconds. An animated watch cursor was good for more than a minute, and a progress bar kept users waiting until the end.
Bickford’s results were widely used for setting response times requirements for Web applications. C. Loosley, for example, wrote, “In 1997, Peter Bickford’s landmark paper, ‘Worth the Wait?’ reported research in which half the users abandoned Web pages after a wait of 8.5 seconds. Bickford’s paper was quoted whenever Web site performance was discussed, and the ‘eight-second rule’ soon took on a life of its own as a universal rule of Web site design.” A. Bouch attempted to identify how long users would wait for pages to load 10. Users were presented with Web pages that had predetermined delays ranging from two to 73 seconds. While performing the task, users rated the latency (delay) for each page they accessed as high, average or poor. Latency was defined as the delay between a request for a Web page and the moment when the page was fully rendered. The Bouch team reported the following ratings: Good Up to 5 seconds Average From 6 to 10 seconds Poor More than 10 seconds In a second study, when users experienced a page-loading delay that was unacceptable, they pressed a button labeled “Increase Quality.” The overall average time before pressing the “Increase Quality” button was 8.6 seconds. In a third study, the Web pages loaded incrementally with the banner first, text next and graphics last. Under these conditions, users were much more tolerant of longer latencies. The test subjects rated the delay as “good” with latencies up to 39 seconds, and “poor” for those more than 56 seconds. This is the threshold that gives us response-time usability requirements for most user-interactive applications. Response times above this threshold cause users to lose focus and lead to frustration. Exact numbers vary significantly depending on the interface used, but it looks like response time should not be more than eight to 10 seconds in most cases. Still, the threshold shouldn’t be applied blindly; in many cases, significantly higher response times may be acceptable when appropriate user interface is
implemented to alleviate the problem.
Not-So-Traditional Performance Requirements While they’re considered traditional and absolutely necessary for some kind of systems and environments, some requirements are often missed
•
• or not elaborated enough for interactive distributed systems. Concurrency is the number of simultaneous users or threads. It’s important: Connected but inactive users still hold some resources. For example, the requirement may be to support up to 300 active users, but the terminology used to describe the number of users is somewhat vague. Typically, three metrics are used: • Total or named users. All registered or potential users. This is a metric of data the system works with. It also indicates the upper potential limit of concurrency. • Active or concurrent users. Users logged in at a specific moment of time. This is the real measure of concurrency in the sense it’s used here. • Really concurrent. Users actually running requests at the same time. While that metric looks appealing and is used quite often, it’s almost impossible to measure and rather confusing: the number of “really concurrent” requests depends on the processing time for this request. For example, let’s assume that we got a requirement to support up to 20 “concurrent” users. If one request takes 10 seconds, 20 “concurrent” requests mean throughput of 120
requests per minute. But here we get an absurd situation that if we improve processing time from 10 to one second and keep the same throughput, we miss our requirement because we have only two “concurrent” users. To support 20 “concurrent” users with a one-second response time, you really need to increase throughput 10 times to 1,200 requests per minute. It’s important to understand what users you’re discussing: The difference between each of these three metrics for some systems may be drastic. Of course, it depends heavily on the nature of the system.
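The example relies on the standard relationship between throughput, response time and concurrency (often quoted as Little's Law): average concurrency equals arrival rate multiplied by time in the system. A quick check of the numbers above:

def concurrent_requests(throughput_per_min: float, response_time_sec: float) -> float:
    # Little's Law: average concurrency = arrival rate x time in system.
    return (throughput_per_min / 60.0) * response_time_sec

print(concurrent_requests(120, 10))   # 20 "really concurrent" requests
print(concurrent_requests(120, 1))    # 2, after a 10x speedup at the same load
print(concurrent_requests(1200, 1))   # 20 again only if throughput also rises 10x

That is exactly why "really concurrent" users make a slippery requirement: the number moves whenever processing time moves.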
Performance and Resource Utilization
The number of online users (the number of parallel session) looks like the best metric for concurrency (complementing throughput and response time requirements). Finding the number of concurrent users for a new system can be tricky, but information about real usage of similar systems can help to make the first estimate. Resources. The amount of available hardware resources is usually a variable at the beginning of the design process. The main groups of resources are CPU, I/O, memory and network. When resource requirements are measured as resource utilization, it’s related to a particular hardware configuration. It’s a good metric when the hardware the system will run on is known. Often such requirements are a part of a generic policy; for example, that CPU utilization should be below 70 percent. Such requirements won’t be very useful if the system deploys on different hardware configurations, and especially for “off-the-shelf” software. When specified in absolute values, like the number of instructions to execute or the number of I/O per transaction (as sometimes used, for example, for modeling), it may be considered as a performance metric of the software itself, without binding it to a particular hardware configuration. In the mainframe world, MIPS was often used as a metric for CPU consumption, but I’m not aware of such a www.stpmag.com •
widely used metric in the distributed systems world. The importance of resource-related requirements will increase again with the trends of virtualization and serviceoriented architectures. When you depart from the “server(s) per application” model, it becomes difficult to specify requirements as resource utilization, as each application will add only incrementally to resource utilization for each service used. Scalability is a system’s ability to meet the performance requirements as the demand increases (usually by adding hardware). Scalability requirements may include demand projections such as an increasing number of users, transaction volumes, data sizes or adding new workloads. From a performance requirements perspective, scalability means that you should specify performance requirements not only for one configuration point, but as a function, for example, of load or data. For example, the requirement may be to support throughput increase from five to 10 transactions per second over the next two years, with response time degradation not more than 10 percent. Most scalability requirements I’ve seen look like “to support throughput increase from five to 10 transactions per second over next two years without response time degradation”—that’s possible only with addition of hardware resources. Other contexts. It’s very difficult to consider performance (and, therefore, performance requirements) without context. It depends, for example, on hardware resources provided, the volume of data operated on and the functionality included in the system. So if any of that information is known, it should be specified in the requirements. While the hardware configuration may be determined during the design stage, the volume of data to keep is usually determined by the business and should be specified.
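One way to make such a scalability clause testable is to compare measured response times before and after the projected load increase. The sketch below is only an illustration, and the measurements in it are invented.

def scalability_ok(baseline_rt: float, scaled_rt: float, max_degradation: float = 0.10) -> bool:
    """True if response time grew by no more than the allowed fraction."""
    return scaled_rt <= baseline_rt * (1 + max_degradation)

# Hypothetical measurements at 5 and then 10 transactions per second.
print(scalability_ok(baseline_rt=2.0, scaled_rt=2.1))  # True: 5 percent degradation
print(scalability_ok(baseline_rt=2.0, scaled_rt=2.4))  # False: 20 percent degradation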
The Difference Between Goals And Requirements
•
One issue, as Barber notes, is goals versus requirements11. Most response time “requirements” (and sometimes other kinds of performance requirements) are goals (and sometimes even dreams), not requirements: something that we want to achieve, but missing them won’t necessarily prevent deploying the system. You may have both goals and requirements for each of the performance metrics, but for some metrics/systems ,they are so close that from the practical point of view, you can use one. Still, in many cases, especially for response times, there’s a big difference between goals and requirements (the point when stakeholders agree that the system can’t go into production with such performance). For many interactive Web applications, response time goals are two to five seconds, and requirements may be somewhere between eight seconds and one minute. One approach may be to define both goals and requirements. The problem? Requirements are very difficult to get. Even if stakeholders can define performance requirements, quite often go/no-go decisions are based not on the real requirements, but rather on second-tier goals. In addition, using multiple performance metrics that only together provide the full picture can complicate your process. For example, you may state that you have a 10-second requirement and you took 15 seconds under full load. But what if you know that this full load is the high load on the busiest day of year, that the max load for other days falls below 10 seconds, and you see that it is CPU-constrained and may be fixed by a hardware upgrade? Real response time requirements are so environment- and businessdependent that for many applications, it’s cruel to force people to make hard decisions in advance for each possible
•
combination of circumstances. Instead, specify goals (making sure that they make sense) and only then, if they’re not met, make the decision about what to do with all the information available.
Knowing What Metrics to Use

Another question is how to specify response time requirements or goals. For example, such metrics as average, max, different kinds of percentiles and median can be used. Percentiles are more typical in SLAs (service-level agreements). For example, "99.5 percent of all transactions should have a response time less than five seconds." While that may be sufficient for most systems, it doesn't answer all questions. What happens with the remaining 0.5 percent? Does this 0.5 percent of transactions finish in six to seven seconds or do all of them time out? You may need to specify a combination of requirements: for example, 80 percent below four seconds, 99.5 percent below six seconds, 99.99 percent below 15 seconds (especially if we know that the difference in performance is defined by the distribution of underlying data). Other examples may be average four seconds and max 12 seconds, or average four seconds and 99 percent below 10 seconds. Things get more complicated when there are many different types of transactions, but a combination of percentile-based performance and availability metrics usually works fine for interactive systems. While more sophisticated metrics may be necessary for some systems, in most cases sophistication can make the process overcomplicated and difficult to analyze.

There are efforts to make an objective user-satisfaction metric. One is the Application Performance Index (www.Apdex.org). Apdex is a single metric of user satisfaction with the performance of enterprise applications. The Apdex metric is a number between 0 and 1, where 0 means that no users were satisfied, and 1 means all users were satisfied. The approach introduces three groups of users: satisfied, tolerating and frustrated. Two major parameters are introduced: the threshold response time between satisfied and tolerating users, T, and the threshold between tolerating and frustrated users, F [12]. There probably is a relationship between T and the response time goal and between F and the response time requirement.

Where Do Performance Requirements Come From?

If you look at performance requirements from another point of view, you can classify them into business, usability and technological requirements.

Business requirements come directly from the business and may be captured very early in the project life cycle, before design starts. For a requirement such as "A customer representative should enter 20 requests per hour, and the system should support up to 1,000 customer representatives," requests should be processed in five minutes on average, throughput would be up to 20,000 requests per hour, and there could be up to 1,000 parallel sessions. The main trap here is to immediately link business requirements to a specific design, technology or usability requirement, thus limiting the number of available design choices. If we consider a Web system, for example, it's probably possible to squeeze all the information into a single page or have a sequence of two dozen screens. All information can be saved at once, or each page of these two dozen can be saved separately. We have the same business requirements, but response times per page and the number of pages per hour would be different.

Usability requirements (mainly related to response times) also figure into the performance equation. Many researchers agree that users lose focus if response times are more than eight to 10 seconds, and that response times should be two to five seconds for maximum productivity. These usability considerations may influence design choices (such as using several Web pages instead of one). In some cases, usability requirements are linked closely to business requirements; for example, make sure that your system's response times aren't worse than response times of similar or competitor systems.

The third category, technological requirements, comes from chosen design and used technology. Some technological requirements may be known from the beginning if some design elements are used, but others are derived from business and usability requirements throughout the design process and depend on the chosen design. For example, if we need to call 10 Web services sequentially to show the Web page with a three-second response time, the sum of response times of each Web service, the time to create the Web page, transfer it through the network and render it in a browser should be below three seconds. That may be translated into response-time requirements of 200-250 milliseconds for each Web service. The more we know, the more accurately we can apportion overall response time to Web services. Another example of technological requirements can be found in the resource consumption requirements. In its simplest form, CPU and memory utilization should be below 70 percent for the chosen hardware configuration.

Business requirements should be elaborated during design and development, and merge together with usability and technological requirements into the final performance requirements, which can be verified during testing and monitored in production. The main reason that we separate these categories is to understand where the requirement comes from: Is it a fundamental business requirement or a result of a design decision that may be changed if necessary?

Determining specific performance requirements is another large topic that is difficult to formalize. Consider the approach suggested by Sevcik for finding T, the threshold between satisfied and tolerating users. T is the main parameter of the Apdex (Application Performance Index) methodology, providing a single metric of user satisfaction with the performance of enterprise applications. Sevcik defined 10 different methods (see Table 1). The idea is to use several (say, three) of these methods for the same system. If all come to approximately the same number, they give us T.

TABLE 1: THE SEVCIK METHODS
1. Default value (the Apdex methodology suggests 4 seconds)
2. Empirical data
3. User behavior model (number of elements viewed/task repetitiveness)
4. Outside references
5. Observing the user
6. Controlled performance experiment
7. Best time multiple
8. Find frustration threshold F first and calculate T from F (the Apdex methodology assumes that F = 4T)
9. Interview stakeholders
10. Mathematical inflection point

While the approach was developed for production monitoring, there is definitely a strong correlation between T and the response time
goal (having all users satisfied sounds like a pretty good goal) and between F and the response time requirement. So the approach probably can be used for getting response time requirements with minimal modifications. While some specific assumptions like four seconds for the default or the F = 4T relationship may be up for argument, the approach itself conveys the important message that there are many ways to determine a specific performance requirement, which, for validation purposes, is best derived from several sources. Depending on your system, you can determine which methods from the above list (or maybe some others) are applicable, calculate the metrics and determine your requirements.
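To make the arithmetic concrete, here is a minimal Python sketch (not from the article) that checks a set of measured response times against the layered percentile requirement used as an example earlier and computes an Apdex score for a threshold T with F = 4T. The sample data and thresholds are illustrative.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile (sufficient for this illustration)."""
    ordered = sorted(values)
    k = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[k - 1]

def meets_layered_requirement(times_s):
    """Example layered requirement: 80% below 4 s, 99.5% below 6 s, 99.99% below 15 s."""
    return (percentile(times_s, 80) <= 4.0 and
            percentile(times_s, 99.5) <= 6.0 and
            percentile(times_s, 99.99) <= 15.0)

def apdex(times_s, t=4.0):
    """Apdex score: satisfied <= T, tolerating <= F = 4T, frustrated above F."""
    f = 4.0 * t
    satisfied = sum(1 for x in times_s if x <= t)
    tolerating = sum(1 for x in times_s if t < x <= f)
    return (satisfied + tolerating / 2.0) / len(times_s)

samples = [1.2, 2.2, 2.8, 2.9, 3.1, 3.4, 3.6, 3.9, 5.1, 17.0]
print(meets_layered_requirement(samples))  # False: the 17.0 s outlier breaks the 99.5 percent level
print(round(apdex(samples, t=4.0), 2))     # 0.85 (8 satisfied, 1 tolerating, 1 frustrated)
```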
Requirements Verification: Performance vs. Bugs
Requirement verification presents another subtle issue: how to differentiate performance issues from functional bugs exposed under load. Often, additional investigation is required before you can determine the cause of your observed results. Small anomalies from expected behavior are often signs of bigger problems, and you should at least try to figure out why you get them. When 99 percent of your response times are three to five seconds (with the requirement of five seconds) and 1 percent of your response times are five to eight seconds, it usually isn't a problem. But it probably should be investigated if 1 percent fail or have strangely high response times (for example, more than 30 seconds, with 99 percent at three to five seconds) in an unrestricted, isolated test environment. This isn't due to some kind of artificial requirement, but is an indication of an anomaly in system behavior or test configuration. This situation often is analyzed from a requirements point of view, but it shouldn't be, at least until the reasons for that behavior
become clear. These two situations look similar, but are completely different in nature:
1.) The system misses a requirement, but results are consistent: This is a business decision, such as a cost vs. response time tradeoff.
2.) Results aren't consistent (while requirements can even be met): This may indicate a problem, but its scale isn't clear until investigated.
Unfortunately, this view is rarely shared by development teams too eager to finish the project, move it into production, and move on to the next project. Most developers aren't very excited by the prospect of debugging code for small memory leaks or hunting for a rare error that's difficult to reproduce. So the development team becomes very creative in finding "explanations." For example, growing memory use and periodic long-running transactions in Java are often explained as a garbage collection issue. That's false in most cases. Even in the few instances when it is true, it makes sense to tune garbage collection and prove that the problem is gone. Teams can also make fatal assumptions, such as thinking all is fine when the requirements stipulate that 99 percent of transactions should be below X seconds, and less than 1 percent of transactions fail in testing. Well, it doesn't look fine to me. It may be acceptable in production over time, considering network and hardware failures, OS crashes, etc. But if the performance test was run in a controlled environment and no hardware/OS failures were observed, it may be a bug. For example, it could be a functional problem for some combination of data. When some transactions fail under load or have very long response times in the controlled environment and you don't know why, you've got one or more problems. When you have an unknown problem, why not trace it down and fix it in the controlled environment? What if
these few failed transactions are a view page for your largest customer, and you won't be able to create an order until it's fixed? In functional testing, as soon as you find a problem, you usually can figure out how serious it is. This isn't the case for performance testing: Usually you have no idea what caused the observed symptoms or how serious it is, and quite often the original explanations turn out to be wrong. Michael Bolton described the situation concisely [13]: As Richard Feynman said in his appendix to the Rogers Commission Report on the Challenger space shuttle accident, when something is not what the design expected, it's a warning that something is wrong. "The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way." When a system is in an unpredicted state, it's also in an unpredictable state.
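A hedged sketch of the distinction drawn here: check the stated requirement, but separately flag inconsistent results (failed transactions or isolated very long response times in a controlled environment), since those indicate an unknown problem rather than a business tradeoff. All thresholds and sample data are illustrative.

```python
def review_run(times_s, failed, requirement_s=5.0, outlier_s=30.0):
    """Separate 'requirement missed' (a business decision) from
    'inconsistent results' (an unknown problem to investigate)."""
    total = len(times_s) + failed
    within = sum(1 for x in times_s if x <= requirement_s)
    findings = []
    if within / total < 0.99:                # e.g., 99 percent under 5 s
        findings.append("requirement missed: business decision needed")
    if failed:
        findings.append(f"{failed} failed transactions in a controlled run: investigate")
    outliers = [x for x in times_s if x > outlier_s]
    if outliers:
        findings.append(f"{len(outliers)} responses over {outlier_s} s: investigate")
    return findings or ["results consistent and within requirement"]

# 95 quick responses, a few slower ones, one 31.5 s outlier and one failure
print(review_run([3.5] * 95 + [3.1, 3.4, 4.2, 4.8, 31.5], failed=1))
```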
Raising Performance Consciousness

We need to specify performance requirements at the beginning of any project for design and development (and, of course, reuse them during performance testing and production monitoring). While performance requirements are often not perfect, forcing stakeholders just to think about performance increases the chances of project success. What exactly should be specified—goal vs. requirements (or both), average vs. X percentile vs. Apdex, etc.—depends on the system and environment, but all requirements should be both quantitative and measurable. Making requirements too complicated may hurt here. You need to find meaningful goals/requirements, not invent something just to satisfy a bureaucratic process. If you define a performance goal as a point of reference, you can use it throughout the whole development cycle and testing process, tracking your progress from a performance engineering viewpoint. Tracing this metric in production will give you valuable feedback that can be used for future system releases. ý

REFERENCES
1. Miller, R. B. Response Time in User-system Conversational Transactions. In Proceedings of the AFIPS Fall Joint Computer Conference, 33, 1968.
Drink Deep or Taste Not The Reliability Pool
By Yogananda Jeppu and Ambalal Patel
If Shakespeare were to write a play about software development and testing activities, he would perhaps call it "The Tragedy of Errors." And with his use of the language, he would no doubt receive poor marks for usability. Nevertheless, his "Comedy of Errors" offers several parallels to the Tragedy of Errors that is many of today's software development projects. That's true at least in
the field of safety-critical development, where we do most of our work.
Flight Control Software

"Prove it before these varlets here, thou honourable man; prove it" may sound dramatic, but those words from Shakespeare's "Measure for Measure" are a fact of life for the tester and the project team. Proving the safety-critical flight control software on various platforms is a taxing task for the whole project team. The safety-critical nature of the software necessitates a strict discipline in the software development process. Automated code generators, test case generators and various low-level tests using automated tools verify the software and ensure an error-free development. This ends the software verification task, which is often carried out by an independent team. The process of validating the software and the system demands an end-to-end test against the specifications. In a flight control application, validating the quadruplex redundancy built into the computer and the embedded software requires a specialized setup to generate typical flight scenarios. The complete aircraft hardware is placed in a large room connected to custom hardware that emulates the flight conditions. This setup is normally called the Iron Bird. What follows is a narration of our experiences with automatic testing on this platform as we offer you a look into the facts of the process. What should be the automatic pass/fail criteria or bounds for the results in this noisy (both electronically and audibly) environment? "To Pass or to Fail: That Is the Question." We're sure you'll be quoting us after we bid adieu! The test procedures detailed here can't be found in any textbook. They were developed by following our instincts and the knowledge of our system. The market is full of automated tools, but most fall short. Sometimes the only solution is built by hand.

Yogananda Jeppu and Ambalal Patel are scientists at IFCS Aeronautical Development Agency in Bangalore, India.

FIG. 1: TYPICAL IRON BIRD SETUP (block diagram: Flight Dynamic Simulator; Engineering Test Stand with input/output rack and signal processing; actual actuators and hydraulic systems; Digital Flight Control Computer with Control Laws and Airdata; Avionics and Pilot Station)
FIG. 2: AN EVALUATION TOOL (block schematic: Input Generator, Tolerance Bound Generator, Control Law & Airdata Software with Tolerance Processing, Events, Output)
The Iron Bird

Friar John, go hence; Get me an iron crow, and bring it straight unto my cell. – "Romeo and Juliet"

A setup like Iron Bird isn't unique to the aerospace industry. We've visited automobile plants that use similar setups for their traction units. What these and other such systems all have in common is that the various embedded controllers test actual scenarios in real time. A typical Iron Bird setup consists of actual aircraft actuators with their associated electrical and hydraulic setup (see Figure 1). This equipment produces audible noise, which is easily controlled by earplugs. The actuators are connected to an Engineering Test Stand (ETS). The ETS has all the electronics to emulate a four-channel sensor environment and is electronically noisy. This noise—the equivalent of what remains after being reduced to whatever extent is possible technologically—is the more dangerous type to applications. However, it can't be completely removed and must therefore be factored in and accommodated. The Flight Dynamic Simulator (FDS) is where the aircraft is simulated in software. This unit generates the sensor signals as seen during a typical flight. The Avionics and Pilot Station enables the engineer to verify that the messages displayed to the pilot appear correctly. It also enables the pilot to fly the simulator while the test engineer creates in-flight failure scenarios by plucking wires and switching off systems. This is the final frontier for the Digital Flight Control Computer (DFCC). The two critical software components, namely Control Laws and the Airdata system, get thoroughly validated in the presence of simulated errors in this platform. Any bugs travelling out of this setup will be caught flying by the pilot.
The Aircraft Is Unstable!

The embedded controller is a software component that takes in the aircraft sensor inputs and generates actuator commands to stabilize the aircraft and enable flight. This so-called fly-by-wire technology is seen in action in all modern fighter aircraft and the jumbo jets that ferry us around the world today. The Airdata system computes aircraft speed and altitude, and the controller uses this information to provide uniform operation over the entire flight envelope. Now these software components are thoroughly tested at various levels. But this has to be proved in an actual environment with induced errors. What if the wire gets cut? Can your software handle it? These questions have to be answered satisfactorily. This is more like browbeating the software—and it really gets a good beating at Iron Bird.

Many tests are carried out at this platform and classified into static and dynamic tests. We're restricting our scope to static testing for the discussion here. Thousands of static tests are conducted on this platform to validate the embedded controller. This isn't possible without automation. We give the computer the pass/fail criteria by generating automatic error bounds specific to each test case. Any numerical value of the output variables falling out of these numerical limits will cause a test to fail. The error bounds are generated taking into consideration the electronic noise in the sensor inputs and the actuator outputs. Here we consider all the noise in the electronic circuits from the sensor until the point where the software takes over at the input end. We also take into account the noise in electronics from where the software generates the digital commands until the actuator begins to drive. The noise here is from the sensor electronics, the signal conditioners, the linear/rotary variable differential transformers, analog-to-digital converters and digital-to-analog converters. The effects of hardware characteristics such as offset, gain and biases come into the picture here. If we inject a rate signal of 10.0 degrees per second, we're likely to get, say, 9.123, 11.843 or any random value between these two bounds. The noise is also dependent on the amplitude of the signal. Higher amplitude yields more problems. The pass/fail criteria should take into consideration all these noise factors and the fact that the embedded controller is going to generate some noise of its own due to processing. It's essential that the width of the bounds be optimal to catch bugs. This requires a tool tailor-made for the application.

FIG. 3: THE THREE WITCHES (The various processes act on the three witches as they fly through the tube. When they come out, they are of different sizes. The three witches represent the limits Upper, Nominal and Lower for the input and output signals.)

A Comparison Tool

A specialized tool has been developed for the test activity. We call it the Evaluation Tool, or EVTOOL for short. It's a simple name for complex software. The EVTOOL block schematic is shown in Figure 2. An input generator generates a set of values for all the sensors based on the defined test case. The sensors include body rates and accelerations, static and dynamic pressures, pilot inputs, etc. These input signals are constants, since we're considering static tests. A set of inputs would be to set pitch rate to 10 deg/s, yaw rate to 0.0 deg/s and normal acceleration to 3g, etc. The set of inputs, selected for a test, is injected into a tolerance bound generator. Based on the particular hardware characteristics and the sensor noise, bias offset and gain, three values are generated representing the upper, nominal and lower limits of the sensor output. The three values of each sensor variable are injected into the Control Law and Airdata module, which includes an algorithm provided by the designers. The Control Law and Airdata module defines the embedded controller functionality in the form of various paths interconnected, with signals getting added, subtracted, multiplied or divided, as the case may be. Each path has controller elements such as saturation limits, nonlinear blocks, gains and switches. Since we're testing the system in a static mode, we don't take into account the filters and rate limiters. They're considered as unity gains for constant inputs. Neither will we go into Laplace transforms, etc. Event triggers such as aircraft switch inputs are injected separately. The Control Law and Airdata system outputs are three values of the output variables that define the pass/fail bound. When the test case is executed on the Iron Bird, it should lie within the upper and lower bounds of these output values for the test case to pass. Any anomaly is automatically declared as a fail requiring further analysis.
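The general idea can be sketched as follows in Python; this is not EVTOOL itself, and the noise, bias and gain-tolerance figures are invented for illustration.

```python
def sensor_bounds(nominal, noise=0.0, bias=0.0, gain_tol=0.0):
    """Return (lower, nominal, upper) for one injected sensor value,
    allowing for additive noise and bias plus a relative gain tolerance."""
    spread = noise + abs(bias) + abs(nominal) * gain_tol
    return (nominal - spread, nominal, nominal + spread)

def passes(measured, bounds):
    """Automatic pass/fail check of an Iron Bird result against the bounds."""
    lower, _, upper = bounds
    return lower <= measured <= upper

# A 10.0 deg/s pitch-rate injection with invented hardware figures:
pitch_rate = sensor_bounds(10.0, noise=1.5, bias=0.2, gain_tol=0.02)
print(pitch_rate)                                             # (8.1, 10.0, 11.9)
print(passes(9.123, pitch_rate), passes(11.843, pitch_rate))  # True True
```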
The Three Witches

The Tolerance Bound Generator generates three values of the input signal. Let's say we want to test the controller with a pitch rate input of 10 deg/s. The output of the module would be something like 10.897, 10.0 and 9.456. Imagine these as the three witches in "Macbeth" who are flying through the Airdata System and Control Law modules (as in a game of Quidditch). Now imagine the Control Law and Airdata module as a factory that bangs on the items passing through it or stretches them with tongs. As the witches fly through the long tube, they're battered around based on specific logic, and come out in a different size (see Figure 3). It's possible that the lower limit would come out as the upper limit. We'll discuss this in detail as we go along.

Do the Math

A drop of water in the breaking gulf / And take unmingled that same drop again / Without addition or diminishing. —"Comedy of Errors"

In the Control Law, two signals can add up or subtract, as shown in Figure 4. The output is equal to the sum of signal A and signal B. The bounds for the output signals are computed from the bounds of the signals A and B. For example, say that signal A has three witches (components): aL the lower bound, a0 the nominal value and aU the upper bound. Similarly, the signal B has its bounds defined by [bL, b0, bU]. The output Y with its three components is computed by the following equations:

Y = [yL, y0, yU] = [(aL + bL), (a0 + b0), (aU + bU)] for addition Y = A + B
Y = [yL, y0, yU] = [(aL - bU), (a0 - b0), (aU - bL)] for subtraction Y = A - B

FIG. 4: ADD OR SUBTRACT SIGNALS (Signal A and Signal B summed to give Output Y = A+B)

Please note here that for the difference of two signals (A-B), the upper limit of signal B is subtracted from the lower limit of signal A to give the lower limit of the output signal. These worst-case bounds defined above, however, may give you a wider estimate of the bound, thus passing a case that should have failed. This occurs even more frequently in cases where random noise is present. We find that using a Root Sum Square (RSS) representation helps in these cases. For the case Y = A+B, where the variables are defined as above, the RSS bounds are defined as:

y0 = a0 + b0
yL = y0 - sqrt((a0 - aL)^2 + (b0 - bL)^2)
yU = y0 + sqrt((aU - a0)^2 + (bU - b0)^2)

In case the signals are subtracted, Y = (A-B), you can change the sign of b to -b and interchange the upper and lower limits of B. A "wide band" or "narrow band" is the question; you'll have to decide which formula to use for your specific problem.

Two signals can divide or multiply; for example, Y = A/B or Y = A*B. In cases like this, a Kronecker tensor product is computed. Leopold Kronecker incidentally believed that "God made the integers, all else is the work of man." The product, a human work, is basically a combination of the upper, lower and nominal values of signals A and B, as given below:

Z = A ⊗ B = [aLbL, aLb0, aLbU, a0bL, a0b0, a0bU, aUbL, aUb0, aUbU]
Z = A ⊗ (1/B) = [aL/bL, aL/b0, aL/bU, a0/bL, a0/b0, a0/bU, aU/bL, aU/b0, aU/bU]

The bounds of the output are now defined as the minimum, nominal and maximum of the product computed above:

Y = [yL, y0, yU] = [min(Z), {a0b0 or a0/b0}, max(Z)]

This is the absolute worst-case tolerance bound, which is quite wide.
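For readers who prefer code to notation, here is a minimal Python sketch of the bound arithmetic above: worst-case and RSS bounds for sums and differences, and the nine-combination (Kronecker-style) bound for products and quotients. Signals are (lower, nominal, upper) triples as in the text; the function names are ours, not EVTOOL's.

```python
import math

def add_worst(a, b):
    """Worst-case bounds for Y = A + B."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def sub_worst(a, b):
    """Worst-case bounds for Y = A - B (note the limit interchange on B)."""
    return (a[0] - b[2], a[1] - b[1], a[2] - b[0])

def add_rss(a, b):
    """Tighter RSS bounds for Y = A + B."""
    y0 = a[1] + b[1]
    return (y0 - math.hypot(a[1] - a[0], b[1] - b[0]),
            y0,
            y0 + math.hypot(a[2] - a[1], b[2] - b[1]))

def mul_bounds(a, b, divide=False):
    """Worst-case bounds for Y = A*B or Y = A/B via all nine combinations."""
    combine = (lambda x, y: x / y) if divide else (lambda x, y: x * y)
    z = [combine(x, y) for x in a for y in b]
    return (min(z), combine(a[1], b[1]), max(z))

A, B = (9.5, 10.0, 10.5), (1.8, 2.0, 2.2)
print(add_worst(A, B))   # (11.3, 12.0, 12.7)
print(add_rss(A, B))     # narrower than the worst case
print(mul_bounds(A, B))  # (17.1, 20.0, 23.1)
```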
Signal and Gain

The multiplication of a gain with a signal can be considered a multiplication of two signals (see Figure 5). This gives a wider bound. An RSS approach gives a better result in such cases. Let the signal X be defined by the three components, and the gain G be similarly defined by the three components. The bounds for the output are then defined as shown in Formula 1.

FORMULA 1: OUTPUT BOUNDS
y0 = x0 * g0
yL = min {zL1, zL2, zU1, zU2}
yU = max {zL1, zL2, zU1, zU2}
where
zL1 = y0 + sqrt((x0 * (g0 - gL))^2 + (g0 * (x0 - xL))^2)
zL2 = y0 - sqrt((x0 * (g0 - gL))^2 + (g0 * (x0 - xL))^2)
zU1 = y0 + sqrt((x0 * (gU - g0))^2 + (g0 * (xU - x0))^2)
zU2 = y0 - sqrt((x0 * (gU - g0))^2 + (g0 * (xU - x0))^2)

FIG. 5: SIGNAL MULTIPLICATION (Signal X multiplied by gain G to give Output Y = X*G)
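The same idea applied to Formula 1, transcribed directly into Python. Variable names mirror the formula; the numeric example is invented.

```python
import math

def gain_bounds(x, g):
    """RSS bounds for Y = X*G, per Formula 1.
    x and g are (lower, nominal, upper) triples."""
    xL, x0, xU = x
    gL, g0, gU = g
    y0 = x0 * g0
    rL = math.hypot(x0 * (g0 - gL), g0 * (x0 - xL))
    rU = math.hypot(x0 * (gU - g0), g0 * (xU - x0))
    candidates = [y0 + rL, y0 - rL, y0 + rU, y0 - rU]   # zL1, zL2, zU1, zU2
    return (min(candidates), y0, max(candidates))

print(gain_bounds((9.5, 10.0, 10.5), (1.9, 2.0, 2.1)))
```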
Linear Interpolation

Nonlinear blocks are normally specified in a Control Law as two sets of variables; say, X and Y. Each of these is a vector of points, {x1, x2, x3, ...} and {y1, y2, y3, ...}, known as the breakpoints of the nonlinearities. These nonlinear blocks can form a positive slope or a negative slope, as shown in Figure 6. The output is dependent on the characteristics of this slope. In case of a positive slope, you'll observe that the lower limit of the output corresponds to the lower limit of the input. But this isn't true for a negative slope, where the upper limit of the output corresponds to the lower limit of the input. Notice how the limits were interchanged, as indicated earlier. Notice also that the slopes shown in the figures are linear. There could be two or three different slopes in the nonlinear block (see Figure 7). In such cases, the components would have different ranges for the output. A large slope for the upper limit of the input could increase the upper limit of the output drastically. This would widen the bounds or reduce it, changing the shapes of the witches as they fly through the tube.
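A small sketch of pushing the three components through a breakpoint table: interpolate each component and take the minimum and maximum, which re-orders the limits automatically for negative slopes. The breakpoints are invented, and the sketch assumes the block is monotonic over the input range; the non-monotonic case discussed later needs the apex treatment and is not handled here.

```python
def interpolate(x, xs, ys):
    """Clamped linear interpolation over breakpoints xs (ascending) / ys."""
    if x <= xs[0]:
        return ys[0]
    for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x <= x2:
            return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    return ys[-1]

def nonlinear_block_bounds(inp, xs, ys):
    """Propagate (lower, nominal, upper) through a monotonic breakpoint table."""
    outs = [interpolate(v, xs, ys) for v in inp]
    return (min(outs), outs[1], max(outs))

# Illustrative negative-slope block: the output falls as the input rises,
# so the upper output limit comes from the lower input limit.
xs, ys = [0.0, 5.0, 10.0], [4.0, 2.0, 1.0]
print(nonlinear_block_bounds((9.456, 10.0, 10.897), xs, ys))
```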
SHAKESPEAREAN TRAGEDY

If Shakespeare had written a play about software development and testing, the story might open like this:

ACT I, SCENE 1: A Day in the Software House

Narrator: Diseased Nature oftentimes breaks forth: In strange eruptions. – "Henry IV"
Certification Agent: All is not well; I doubt some foul play. – "Hamlet"
Project Manager: Find out the cause of this effect, Or rather say, the cause of this defect, For this effect defective comes by cause. – "Hamlet" Prove it before these varlets here, thou honourable man; prove it – "Measure for Measure"
Test Lead: I will prove it legitimate, sir, upon the oaths of judgment and reason. – "Twelfth Night"
Tester A: O hateful error, melancholy's child, Why dost thou show to the apt thoughts of men: The things that are not? – "Julius Caesar"
Tester B: When sorrows come, they come not single spies, But in battalions. – "Hamlet"
Tester C: The nature of bad news infects the teller. – "Antony and Cleopatra"
Whispers in Coffee Room: What's this, what's this? Is it her fault or mine? The tempter, or the tempted, who sins most, ha? – "Measure for Measure"
Project Manager: Condemn the fault, and not the actor of it? – "Measure for Measure"
Test Lead: But since correction lieth in those hands, Which made the fault that we cannot correct... – "Richard II"
Test Team: Unarm, Eros; the long day's task is done, And we must sleep. – "Antony and Cleopatra"

Just don't get caught sleeping on the job, or your project might end up like the Ariane 5's Flight 501.
Two-Dimensional Lookups

The "G" gain defined in the "Signal and Gain" section above could be a constant or an output of a two-dimensional lookup table similar to Table 1. These feedback gains are used in aircraft to ensure performance over the flight envelope. The gains are "scheduled" based on the speed and altitude. Refer to "Thou Shalt Experiment With Thy Software" in the June 2007 issue of Software Test & Performance magazine for an article on the testing of these gain tables.

TABLE 1: A TABLE OF GAINS
Altitude/Speed   100m    1000m   5000m   10000m
0.0              2.354   4.363   3.456   3.567
0.1              3.235   5.347   4.575   3.567
0.2              4.354   6.474   5.374   3.879

The two input signals to the lookup table would be altitude and speed, each signal having its three components. The gain thus computed would also have a bound. In these cases we normally compute the gains for all combinations of the two signals.

Consider two signals A and B as inputs to the lookup table. A = [aL, a0, aU] and B = [bL, b0, bU] are defined with their tolerance bounds. The interpolated gain output Y = [yL, y0, yU] with tolerance bounds is computed as described below. We consider the set of all combinations of inputs with their three components as below:

C = {(aL, bL), (aL, b0), (aL, bU), (a0, bL), (a0, b0), (a0, bU), (aU, bL), (aU, b0), (aU, bU)}

We then compute the gain output using the lookup table for all these combinations of inputs as:

Z = {z1, z2, z3, z4, z5, z6, z7, z8, z9}

The nominal value is z5, corresponding to the input combination (a0, b0). The upper and lower bounds for the computed gains are then given by taking the maximum and minimum of Z as:

Y = [yL, y0, yU] = [min(Z), z5, max(Z)]
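As an illustration, the following sketch computes the scheduled-gain bounds from Table 1 using bilinear interpolation over all nine (speed, altitude) combinations. The altitude and speed bounds are invented; only the table values come from the article.

```python
def interp1(x, xs, ys):
    """Clamped linear interpolation."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def gain_lookup(speed, altitude, speeds, altitudes, table):
    """Bilinear lookup: interpolate each speed row along altitude, then along speed."""
    row_vals = [interp1(altitude, altitudes, row) for row in table]
    return interp1(speed, speeds, row_vals)

def scheduled_gain_bounds(speed_b, alt_b, speeds, altitudes, table):
    """Evaluate the gain for all nine (speed, altitude) combinations; take min/nominal/max."""
    z = [gain_lookup(s, a, speeds, altitudes, table) for s in speed_b for a in alt_b]
    nominal = gain_lookup(speed_b[1], alt_b[1], speeds, altitudes, table)
    return (min(z), nominal, max(z))

speeds = [0.0, 0.1, 0.2]
altitudes = [100.0, 1000.0, 5000.0, 10000.0]
table = [[2.354, 4.363, 3.456, 3.567],
         [3.235, 5.347, 4.575, 3.567],
         [4.354, 6.474, 5.374, 3.879]]
print(scheduled_gain_bounds((0.09, 0.10, 0.11), (2800.0, 3000.0, 3200.0),
                            speeds, altitudes, table))
```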
Non-Monotonic Nonlinearities
We have to take a little care when special nonlinear blocks are present in the path. These nonlinearities can exist as shown in Figure 8. In the convex nonlinearity (a), we see that the output bounds are not around the nominal. In cases like this, (continued on page 37)
When Your Company Resorts to Sending Its Testing Overseas, a Quality Audit Can Help Ensure a Winning Season
By Steve Rabin and John Bradway
Cutting costs is the goal of every company. And software testing is a practice that has been ripe for the offshore harvest. Given the inherent risks of using offshore
resources for testing and adapting offshore teams to agile practices, it’s important to ensure that the best practices of the onshore team are being replicated elsewhere. Through an analysis of the approach used by one company—Primavera Systems—to align software quality efforts around the globe, you can learn how to perform a quality audit that is equally applicable to teams inside the four walls of an organization. We considered Primavera a good case study because it has internal development centers in the U.S. (Bala Cynwyd, Pa. and San Francisco), Israel and London, as well as offshore development/QA centers in India and Eastern Europe. The idea behind a quality audit is to perform a systematic examination of the practices a team uses to build and validate software. This is important, as issues of quality are represented across the software development life cycle. The audit aims not to uncover software defects, but to understand how well a team comprehends and executes the defined quality practices. The quality audit can be used to assess if a process is working and if things are being done the way they’re supposed to be. The audit is also an excellent way of measuring the effectiveness of implemented procedures. Management can use audit results as a tool to identify weaknesses, risks and areas of improvement. The audit is as much about people as it is about the procedures in place. The more team members understand their roles and how that relates to quality, the more likely the team will grasp and adhere to the defined practices. The ultimate objective, of course, is to deliver high-quality software (as defined by the organization). Quality audits are typically performed at regular intervals. The initial audit develops a quality procedure baseline and determines areas that need remediation. Subsequent audits are monitoring-oriented to help teams identify and address gaps in the quality process. Remediation involves articulating effective actions that address deficiencies in the daily process of conducting quality practices. A quality audit can be aided by the use of software tools but, as stated above, it’s as much about people mentoring and teaching as anything else. Any team, offshore or otherwise, will best respond to an approach that focuses on helping individuals meet expectations and generate improvement. Steve Rabin is CTO at Insight Venture Partners. John Bradway is development manager at Primavera Systems.
The audit as described throughout this document consists of five phases:
• Pre-assessment planning. This phase includes setting expectations, creating a timeline and getting executive sponsorship for the project audit. The deliverable for this phase is agreement with and buy-in of the audit process and a follow-up commitment for improvement. Typically this is capped by a meeting with the audit sponsor and key stakeholders on the audit process and objectives.
• Data gathering. This phase involves developing interview questions and surveys and gathering all documentation (bug reports, test cases) prior to the interview process.
• Assessment. The assessment phase involves conducting interviews and developing preliminary findings. Confidentiality is crucial, and team members must clearly understand the process. A meeting explaining the process and the reasons behind it should be held with the entire team.
• Post audit. After reviewing documents and interview notes, the analyzed information is synthesized into a list of findings and prioritized remediation steps.
• Presentation of findings with sponsor and team. Findings are presented and agreement is reached on highest-priority improvement areas.

In Primavera's case, the quality audit and the entire software development quality management life cycle are tied to the ISO 9001:2000 standard. There are a variety of reasons for this, as explained below.
Aligning Scrum With ISO 9001

Primavera Systems, which has practiced Scrum since June 2003, found itself with an interesting dilemma in the beginning of 2005. Increasingly, prospective customers were inquiring about the ISO certification level of Primavera's development processes. In fact, quality auditing is an important element in ISO's quality system standard. The most recent ISO standard has shifted the focus of audits from procedural adherence only to measuring the effectiveness of quality practices to delivered results. While this makes sense, implementing and assessing the usefulness of ISO in an agile environment is a challenge. After all, the Agile Manifesto declares "working software over comprehensive documentation" and "people and interactions over process and tools." Many
agile thought leaders don’t consider ISO a priority. Primavera addressed this issue internally by focusing the team on common core objectives and the firm’s mission to deliver the best possible quality software. The firm sells software to highly regulated markets, so the need to support customer requirements, including ISO, is an important consideration. Bottom line is that the quality audit made sense and provided the opportunity to better align the on- and offshore teams. As a first step, Primavera engaged an outside consultant to assess the current software development life cycle and information technology procedures as they relate to alignment with ISO 9001:2000 standards. Associated with this was the delivery of a gap analysis between the current SDLC and the standards. The results were a little surprising. While Primavera wasn’t producing all of the documentation to meet the letter of the law as it relates to the ISO standard, existing processes were highly aligned with the spirit and intent of the standard. The consultant felt that by organizing many of the existing artifacts into a quality management system and creating a limited amount of additional documentation, the team could safely declare, “We’re aligned with the ISO 9001:2000 standard” and provide enough documentation to back this up. An important side benefit was the ability to use the assembled information as a valuable reference for developers and a great resource for helping new employees ramp up. The documents, created by the consultant, described key processes used in the software development life cycle: software testing, configuration management, defect tracking, internationalization, product maintenance, requirements management and
release management. In other words, the documentation described Primavera’s agile methods across development domains. During the same time period, Primavera was engaged in creating two offshore development Scrum teams in Bangalore. This emerging documentation, which articulated development processes, proved to be a highly useful resource when ramping up the offshore team. Since the offshore team was unfamiliar with Scrum methodologies in general and Primavera’s practices in particular, being pointed to relevant sections of the documentation helped orient them and made them productive more quickly. It also eased communication, since everyone was using the same practices, metrics and terminology.
Preparing for the Audit

The quality policy described in the quality management system called for annual audits of all quality processes, on- and offshore. Regardless of the documentation, development management considered it important to ensure that the quality practices being used offshore duplicated those being used at home. Better to address quality issues before there's an actual problem. Given the number of development locations, the audit team needed to develop a repeatable process for the project audit. For example, what exactly did the team want to measure, what data was needed, how would the data be collected and what rules would be used to generate the metrics? Also, would it be necessary to normalize the data? The team developed a checklist to qualitatively measure how well the offshore teams were conforming to agile practices. Audit criteria were selected from various sections of the quality manual along with input from members representing multiple domains. This resulted in a 43-point audit covering requirements management, design and implementation, configuration management, testing, defect tracking and release management.
The team decided it was more important to audit the quality practices themselves vs. the quality of the required artifacts. While both are important, the quality of the artifacts can be improved through training and education, but if the processes aren’t in place, not much can be done. This approach is less threatening to the teams being audited if the quality of the work is not the primary focus. The audit is meant to be foremost a fact-finding and learning experience. With a baseline in place, the ongoing review process can look at the quality of the artifacts and determine ways to help the team make improvements. The team was most interested in objectively looking for evidence that the quality process was being followed. Throughout the process, the team made sure to consult with management, initially to acquire and maintain leadership buy-in. Meetings were held with development managers and executives, both on- and offshore, to discuss the audit and its goals and to solicit input. The team discovered that the word audit evoked some uncomfortable responses in offshore teams. Because of this, it was important to set the context for the audit as an exercise in self-improvement and as a retrospective tool to help drive positive change. Positioning the audit in this light helped ensure the team’s cooperation and active participation. Since agile methods generally provide a high degree of transparency and are constantly being refined, the teams were comfortable with (and appreciated) the focus. With management support and the offshore team’s understanding of the context of the audit, the audit team began gathering and examining some of the related artifacts, documents that had been delivered by the offshore team over time. The idea was to look at existing requirements and attempt to trace them through the quality process. In total, this involved looking for the code, test plans, automation and test results— both automated and manual, related to
the delivered requirement. There were several reasons for doing this. First, it was valuable to gain an understanding of how transparent the process was, based on the exchange of materials from teams distributed around the world. This was also a learning exercise to become familiar with the team’s work processes and communication style. The team was able to identify a variety of specific issues that required detailed examination during the on-site audit.
The Audit Checklist And Evaluation Scheme

The audit checklist was developed to measure how effectively a team has implemented and follows the quality guidelines. There are a number of ways Primavera used the results of the audit, so the checklist had to be synced to the objectives. This included:
• To baseline the current state of development within an organization prior to introducing Primavera's development process.
• To monitor the progress of adoption of Primavera's documented Quality Management System and identify areas where the implemented practices were satisfactory as well as areas that could be improved.

The criteria used in the audit are grouped under the following high-level headings:
• Requirements management
• Design and implementation
• Configuration management
• Testing
• Defect tracking
• Release management

While the overall goal of conducting a project audit is to provide a qualitative view of the suitability of and adherence to the development process, a quantitative view is also necessary. For example, the auditor conducting the evaluation may find it useful to flag certain criteria or results that he feels the need to emphasize. Similarly, the team being audited may have certain concerns that they want to have the auditor scrutinize. The Primavera audit team adopted the following scheme for measuring each of the 43 audit points (a small scoring sketch follows the checklist below):
• 1.00 if the process fully satisfies the criterion
• 0.75 if the testing process largely satisfies the criterion
• 0.50 if the testing process partially satisfies the criterion
• 0 if the testing process does not satisfy the criterion

This scheme must be used with some caution and is provided only as a guideline. The reason for this is simply that auditors interpret things differently, making it difficult to determine the precise meaning of any particular score. A degree of subjectivity occurs during the audit process that must be taken into account along with the objective measures.

The audit criteria are listed below. This is not the full, detailed spreadsheet used by auditors on-site, but rather the higher-level categories. Keep in mind that this checklist was developed to address the specific needs of Primavera and the objectives of the audit. The terms used below along with the referenced tools are also specific to Primavera's use of agile practices.

Requirements management
• Primavera PM used to track requirements
• Collaboration with Product Owner
• Requirements estimated using accepted techniques (Ideal Team Days)
• Sprint 0 to refine estimates
• Sprint -1 to determine initial requirement estimates

Design and implementation
• Feature specifications (created as design documents and updated "as built" when requirement is completed)
• Code tested and passes unit tests prior to check-in
• Formal code reviews requested by programming manager for appropriately complex areas
• Peer or buddy review of code for code check-ins
• Requesting schema changes through schema change process
• Technical designs where appropriate
• Designs and coding consider internationalization
• Online help and printed manuals updated for new requirements

Configuration management
• Builds automated
• Builds automatically deployed twice daily
• Automated process replicates all client server, Web and Group Server code to all four ClearCase servers
• After a compile of the merged code, JUnit and FitNesse tests are run. In the event of a failure, an e-mail notification is sent and the merge is rejected and deferred to the next day.
• FitNesse tests run automatically with daily builds

Testing
• Acceptance tests cover requirements and are automated
• Test procedures documented in Mercury Quality Center
• Developer unit tests written
• Automated Silk tests run to validate code integration
• Test cases and test results can be traced to requirements
• Internationalization cases considered during testing
• Performance testing
• Test results records stored
• Test cases peer reviewed
• Active system tests conducted

Defect tracking
• In-process defects "scrummed" appropriately
• Defect reporting using Mercury Quality Center
• Defect threshold counts used for sprint entry/exit criteria

Release management
• Sprint review meetings held
• Release management team reviews
• Sprint retrospectives
• Sprint closeout processes (backlogs updated)
• Tracking progress with burn-downs
• Attend Scrum Master meetings
• Daily team meetings
• Product Owner sets sprint priorities
• Co-located teams
• Obstacles removed daily
• Sprint planning meetings
• Task granularity (e.g., no more than 16 hrs.)
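As promised above, here is a small Python sketch of how the 1.00/0.75/0.50/0 scores can be rolled up per category, with the optional weighting mentioned later under lessons learned. The categories, weights and individual scores shown are invented examples.

```python
ALLOWED_SCORES = {1.00, 0.75, 0.50, 0.0}

def category_score(scores, weight=1.0):
    """Average the per-criterion scores for one category, optionally weighted."""
    assert all(s in ALLOWED_SCORES for s in scores), "use the 1.00/0.75/0.50/0 scale"
    return weight * sum(scores) / len(scores)

audit = {
    "Requirements management": [1.00, 0.75, 1.00, 0.50, 1.00],
    "Testing":                 [1.00, 1.00, 0.75, 0.75, 0.50, 1.00],
    "Defect tracking":         [1.00, 1.00, 0.75],
}
for category, scores in audit.items():
    print(f"{category}: {category_score(scores):.2f}")
```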
Performing the Audit

It's worth noting that, except for the largest organizations, it's not necessary or desirable to employ specific auditors. It's best to choose auditors from within the team and across disciplines. In fact, there are good arguments for cycling the auditing team so many resources get the opportunity to interact with their peers and view the quality process from a different perspective. Auditor training is recommended so that the results can be as normalized as possible.
By choosing a broad spectrum of auditors from all levels of the organization, you'll ensure that everyone has commitment to the Quality System and gain a better understanding of the overall management process. It's also a good way for people from various areas to gain an understanding of how their department fits into the organization. The audit can be done formally or informally, with every resource or selected resources. The format is based on the objectives, and in Primavera's case, the approach chosen was informal. Four developers, two business analysts and the entire QA staff participated in the first offshore audit. The team met with these individuals over a two-day period focused on looking at individual processes in addition to how requirements traced through the different artifacts. The mantra for the audit was not only "tell me," but "show me." Listening to individuals describe the process and what was done is important, but so is viewing the actual artifacts. Since a quality process is interconnected across the development life cycle, time was spent looking at related, adjacent links to other areas of the process. It was important to set and then manage expectations when reviewing people's work. Reminding everyone that the focus was on improvement helped us achieve a balanced auditor/auditee environment. Making sure that all participants gain value from the experience is also important. Everyone involved learned something new about the processes in use, the rationale behind them and how providing traceability adds value. Being able to take a feature and trace it back through test results, automation, unit tests and requirements makes obtaining a detailed understanding of a feature much easier than walking in cold. Teams shouldn't expect to achieve a perfect score, so this is another area that requires expectation management. In Primavera's case, the teams fared very well. Most of the shortcomings were not unexpected, since the on- and offshore teams worked closely together on a daily basis. Interestingly, several valuable insights into other areas that needed improvement were exposed. This was unexpected, but a
good benefit. At the conclusion of the audit, a meeting was held with the entire staff to review the preliminary results of the audit prior to presenting the results to management. In each audit, each line item that did not receive a 1 was analyzed, and a series of next steps was discussed. A final remediation plan was developed and approved based on the conclusions drawn from analyzing all of the results.
Wrapping Up the Audit

Audit findings need to be documented and problems reported for further action. A date should be established for the correction, and the next audit should ensure that issues were remedied. The audit documentation doesn't need to be complicated, but should include the audit plan, the audit notes and the audit report. The notes are the items the auditor wrote down during the audit, and can include specific findings, responses to questions, key documents reviewed and comments. The audit report is the "official" document used to report the findings of the audit. A template for this document should be prepared by the audit team, amended as necessary and consistently used by all auditors. The document should include details of the audit, date, auditors' names and findings. Once agreed to, the audit report should include the remediation plan.

The process audit is concerned with both the validity and overall reliability of the process. For example, does the process consistently produce the intended results? It's important to identify non-value-added steps that the team may have added to the process. Once identified, this needs to be documented. The team should demonstrate the ability to perform defined practices consistently, and it should be noted when this is not the case.

Bear in mind that the audit has two modes of operation: appraisal and analysis. Both of these modes involve a combination of subjective and objective influences. Appraisal generally involves resource issues, while analysis is more procedure-oriented. Identifying issues across both of these spectrums is an important audit skill and needs to be taken into account and documented. After all, the remediation plan involves changes to both how people act and what the procedure accomplishes.

The results of the audit along with associated documentation were added to Primavera's quality management repository. A meeting was held to discuss the results and remediation plan with the same team that met to kick off the audit process; this included stakeholders and management. As in the on-site review, each area that didn't score a 1 was discussed.
Lessons Learned

Setting the stage for the audit as an in-depth retrospective aligned with the agile goal of continuous improvement helped Primavera secure individual buy-in for the audit. Since it seemed that people weren't prepared for the "show me" mentality employed during the interview process, better communication and expectation setting with the team makes sense. Discussing the style to be used during the audit also helps make those involved more comfortable.

The scoring system is a work in process and will be improved over time. It remains subjective, which is acceptable, but the addition of a weighting system is under consideration. The use of weighting for areas that are fundamental to the development process will provide better results. Since each area of the quality management process isn't equal in importance, weighting will expose those areas of concern more visibly.

The bottom line is that the entire development organization now realizes the value this kind of audit can bring to the goal of continuous improvement. ý
Best Practices

Unit Testing Tools Unlikely To Transform Life or Industry
Geoff Koch

Count me among those skeptical that unit testing will ever go mainstream. Granted, I don't write code. However, I do churn out a goodly amount of tech-related prose. And like most science and technology journalists I know, I more or less loathe the process of revising and editing what I write.

Why is this relevant? Because the wrestling that journalists do with paragraphs, sentences and individual words is the rough analog to unit testing in software development.

It's been said that all journalists are frustrated novelists. While I don't have an unfinished manuscript in my desk drawer, I am familiar with traditional lore about great writers. Namely, most of them spend the lion's share of their time redoing or outright ripping up their drafts and starting anew. Here's former U.S. Poet Laureate Billy Collins on the subject, from his poem "Royal Aristocrat."

"I was a single monkey / trying to type out the opening lines of my own Hamlet / but often doing nothing more than ironing pieces of paper in the platen / then wrinkling them into balls to flick into the wicker basket."

But folks, journalism ain't poetry or any other type of literature, though I did have a professor at Stanford who argued as much. (No doubt he had that unfinished manuscript in his desk drawer.) Journalism is much more akin to the process of writing code than to anything in the so-called higher literary arts. A journalistic article is quick, approximate and often destined to have a shelf life—particularly when the subject is technology—measured in weeks or even days rather than years. Generally, the ideal of continuous backspacing, tweaking and polishing is forever yielding to the interrupt-driven work and home lives that most of us lead. And, software traditionalists be damned, I suspect the same can be said of most code that's written in today's fast-changing, forever-iterative world of development.

At the risk of being too solipsistic, dear reader, you should know that most months I find myself scrambling to file this column with some semblance of respect for the deadline. (Though for this particular screed, I failed miserably.) Then, a day or two after dashing this off to the Long Island–based editors, I invariably think of several ways in which I could have more adroitly made my point. I'm guessing the feeling is quite close to those experienced by the vast majority of developers when the code base they've labored to contribute to finally goes from beta into production.

What Would Alberto Do?

While reporting this column, I heard from many people with unit testing experience, and almost all of them made the same two points. The first is that unit testing undeniably makes sense. It's cheaper and more efficient, the pros say, to have developers do their own testing and to automate the process of running a steadily growing library of these tests. The second is that the ranks of developers and teams that actually practice unit testing are modest indeed, if not vanishingly small.

In his Jan. 24, 2007, blog entry "Testing Genes, Test Infection, and the Future of Developer Testing," Alberto Savoia declares that developers who resist unit testing far outnumber those who practice the craft, zealously or otherwise. "This spells trouble because I have found that when developers become managers, they bring with them
their attitudes about testing," says Savoia, founder and chief technology officer of Agitar Software in Mountain View, Calif. Savoia is worth listening to on the subject of testing and software generally. He has contributed to products that have won a bevy of awards, including JavaOne's Duke Award and Software Development magazine's Productivity Award. And his résumé includes a list of senior technical positions at Sun and Google to go along with that cherished Silicon Valley merit badge: starting and then selling a company for a minor fortune. Savoia's load testing company Velogic, founded in 1998, was sold to Keynote just two years later for $50 million. Beyond his experience, Savoia brings one of the most candid and engaging personalities in the software industry to an interview. It's clear after just a few minutes that the guy is having fun and has a well-developed sense of humor about the often stilted and obtuse world of testing and development. One example of this is the chosen moniker, Crap4j, of the open source project Savoia spawned with Bob Evans. The tool helps developers find code that's crap to maintain, either because it's too complex, doesn't have enough associated tests or both. I can't help but think as Savoia enthusiastically describes his career that in many ways, he's a case study as to why unit testing will never reach its tipping point. We don't talk specific numbers, but clearly that résumé of his implies an enviable degree of financial security. So why start yet another company in the ultra-niche testing market? Because, he explains, he loves this stuff. In the hyper-logical, left-brained
fashion that typifies most successful techies, Savoia says we're all constantly faced by a more or less binary set of choices about our lives that can be described by the equation X + Y = total well-being, financial and otherwise. X, usually a responsible position at an established company, is associated with guaranteed income. Y is associated with choice imbued with those elusive qualities of emotional engagement, satisfaction of intellectual curiosity and overall happiness. Once financial security is achieved, he continues, it always makes sense to pursue Y when confronted with an either-or choice about what to do next. The logic, which is unassailable, is not what's interesting here. Rather, it's that for Savoia, happiness is more about mucking about testing lines of code than, say, fishing in Montana, going to cooking school in the south of France, or in my case, buying courtside seats and losing myself in Big 10 basketball from November to March. Savoia's choice is sort of akin to a journalist
ting the jackpot with a best-selling book and then opting to stay at the rewrite desk for years thereafter—for fun. I’ve interviewed enough developers to suspect that even among the I-dream-incode crowd, Savoia is an outlier.
Limited Commercial Appeal
A more concrete reason for skepticism comes from several of the other interviewees for this column, including one at industry giant Cisco Systems. "Right now Cisco is in a latency curve with unit testing," says Andy Chessin, a technical leader at the San Jose, Calif.-based company. "Unit testing has a huge barrier to entry. If the group really isn't passionate about it, doesn't understand it or doesn't have the time to get started with it, they probably won't be driven to take the plunge."

"Right now, few people really understand what [API-level] unit testing involves or how to get started with it," he says. "But I think they will soon catch on."

Maybe, though I wouldn't bet on it.
"Simplicity is essential to a core unit testing framework; for this reason, the most successful unit test frameworks tend to stay small, in code size and in team size," says David Saff, an MIT doctoral student who is now one of the lead maintainers of JUnit. "In comparison to other successful open source projects, this promotes a proliferation of third-party extensions, while limiting contributions to the core framework. I think this somewhat limits the chance for a company to arise that would do the equivalent for, say, JUnit, that Covalent has done for Apache."

In other words, unit testing is destined to remain by and for a modest group of undoubtedly smart specialists. So while the worlds of writing and programming will always have a small priesthood obsessed with quality and constant revision, most of the rest of us will continue to muddle along as best we can—at least until we can hang up our reporter's notebooks and text editors for other pursuits. ý

Seen any good poetry related to programming? Best Practices columnist Geoff Koch wants to know: gkoch at stanfordalumni dot org.
TRAGEDY OF ERRORS
< continued from page 29

we must take care to see that the apex point is taken for computing the bounds. It's possible that the convex nonlinearity may be a curve instead of the inverted-V shape shown here. In such cases, we compute the apex point as the maximum value that is possible in the region of interest. In the case of convex nonlinearity, the apex point defines the upper bound; in the case of concave nonlinearity (b), it defines the lower limit. We have had numerous problems with such blocks, with cases failing when they should have passed. So take care of these blocks!
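To make the apex rule concrete, here is a minimal Python sketch (not from the article; the function names and the example block are hypothetical) of propagating [lower, nominal, upper] bounds through a non-monotonic block:

```python
def propagate_bounds(block, x_lower, x_nominal, x_upper, x_apex=None):
    """Propagate [lower, nominal, upper] test bounds through a block.

    For monotonic blocks, evaluating the three signal values is enough.
    For non-monotonic (convex or concave) blocks, the apex point must
    also be evaluated, or the true extreme output will be missed.
    """
    samples = [block(x) for x in (x_lower, x_nominal, x_upper)]
    if x_apex is not None and x_lower <= x_apex <= x_upper:
        # The apex sets the upper bound for a convex block
        # and the lower bound for a concave one.
        samples.append(block(x_apex))
    return min(samples), block(x_nominal), max(samples)

# Hypothetical convex (inverted-V) block peaking at x = 2.0:
convex = lambda x: 10.0 - abs(x - 2.0)

# Without the apex, x = [1.5, 1.8, 2.5] gives bounds of [9.5, 9.8] and the
# true peak of 10.0 is missed; including the apex recovers it.
print(propagate_bounds(convex, 1.5, 1.8, 2.5, x_apex=2.0))  # (9.5, 9.8, 10.0)
```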
Tragedy of an Error
The disastrous failure and tragic self-destruction of the European Ariane 5 expendable launch system was the result of a software bug. During Flight 501, its maiden voyage, a 64-bit floating-point number was converted to a 16-bit signed integer, and the value was too large to be represented. This out-of-range value caused a hardware exception, resulting in total system failure.

In India, we too have had a space vehicle fail due to an out-of-range value. Because of these experiences, we now test our systems with values 10 percent beyond the maximum possible value of any signal. If the nominal value considered is the maximum possible value, the upper bound is set 10 percent higher than that maximum, and the lower bound is as given by the sensors. If the minimum possible value is selected as the nominal, a value 10 percent lower is taken as the lower bound. For example, if the maximum possible pitch rate signal is 60.0 deg/s, this is selected as the nominal.
FIG. 6: POSITIVES AND NEGATIVES (axes: xL, x0, xU vs. yL, y0, yU)
Let’s say the upper and lower bounds are 62.4 and 57.5, respectively, as given by the software module. The upper bound is taken instead as 66.0, and the lower bound is retained at 57.5 deg/s.
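As a rough Python sketch (not from the article; the function name and interface are hypothetical), the 10 percent rule for the pitch-rate example above might look like this:

```python
def widen_bounds(nominal, lower, upper, nominal_is_max, margin=0.10):
    """Widen test bounds per the out-of-range policy described above.

    If the nominal is the maximum possible value of the signal, the upper
    bound is pushed 10 percent above it; if the nominal is the minimum
    possible value, the lower bound is pushed 10 percent below it. The
    other bound stays as reported by the sensors / software module.
    """
    if nominal_is_max:
        upper = nominal * (1.0 + margin)
    else:
        lower = nominal * (1.0 - margin)
    return lower, nominal, upper

# Maximum possible pitch rate, 60.0 deg/s, taken as the nominal; the module
# reports bounds of 57.5 and 62.4 deg/s. The upper bound becomes 66.0.
print(widen_bounds(60.0, 57.5, 62.4, nominal_is_max=True))  # (57.5, 60.0, 66.0)
```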
FIG. 7: NON-LINEAR SLOPES (piecewise linear, S1 < S2; axes: X = [xL, x0, xU] vs. yL, y0, yU)
Special Events
For events triggered by a particular signal, the trigger is set based only on the nominal value. This is important because when one trigger in turn drives the output of a second system, evaluating the trigger against the bounds rather than the nominal alone can produce misleading results.
FIG. 8: NON-MONOTONICS — (a) Convex Nonlinearity, (b) Concave Nonlinearity (positive and negative slope regions; axes: xL, x0, xU vs. Y0)
For example, let's say that Trigger 1 is set to True based on a speed signal of less than 70.0 km/h. When True, this trigger sets the output value of a second system to a constant, 13.0; when False, the second system computes a value based on an algorithm. Let the signal value X = [xL, x0, xU] = [69.5, 71.3, 73.0]. If the trigger is evaluated for each element, Trigger 1 takes the values [1, 0, 0], which causes the second system to produce [13.0, 4.0, 5.35]. The lower-bound channel is now 13.0, compared with, say, 3.23 if the trigger had been set based only on the nominal value, and it sits above both the nominal (4.0) and the upper bound (5.35). This would cause a great deal of confusion indeed!
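A small Python sketch of that propagation (not from the article; the algorithm below is a stand-in, so its numbers will not reproduce the 4.0, 5.35 and 3.23 quoted above):

```python
def second_system(trigger_is_true, speed, algorithm):
    """Second system output: a constant when the trigger is True,
    otherwise whatever the algorithm computes."""
    return 13.0 if trigger_is_true else algorithm(speed)

# Stand-in for the real algorithm, which the article does not give.
algorithm = lambda speed: 0.075 * speed - 1.0

x = [69.5, 71.3, 73.0]                    # [xL, x0, xU], speed in km/h

# Per-element evaluation: the trigger fires only in the lower-bound channel,
# so the output vector mixes the 13.0 constant with algorithm values and the
# "lower bound" ends up above both the nominal and the upper bound.
per_element = [second_system(speed < 70.0, speed, algorithm) for speed in x]

# Recommended: decide the trigger once, from the nominal value x0, and use
# that single state for all three channels.
nominal_state = x[1] < 70.0               # False here
from_nominal = [second_system(nominal_state, speed, algorithm) for speed in x]
```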
Avoid Tragedy
Software validation is always associated with system validation, the most important issue being whether the software has been designed to cope with system failures. Proof must be provided for this in any safety-critical application, and test cases must be designed to cover these aspects. Iron Bird is the final testbed, and maintaining the software at this level is difficult.

Always experiment with your system. The techniques detailed here have evolved after several experiments in the initial phase of the project. We modeled the system characteristics to develop the EVTOOL software. This has paid rich dividends and continues to do so as new versions keep coming. All the effort we put in results in something fruitful. The challenge of finding errors in the system gives the tester a pleasant feeling that is described beautifully by Shakespeare in "Much Ado About Nothing":

The pleasant'st angling is to see the fish
Cut with her golden oars the silver stream,
And greedily devour the treacherous bait. ý
Future Test
Get Development In Sync With QA
By Adam Kolawa

Development is writing code for the next release of the application. QA has a regression test suite that tests the previous release, but they're waiting for the end of the current development iteration before they start to update the test suite to cover the new and modified functionality. As a result, the code base is evolving significantly during the development phase, but the regression test suite is not… so by the time that QA receives the code from development, the code and the regression test suite are totally out of sync.

Once QA has the new version in hand, they try to run the old regression test suite against the new version of the application. It runs, but an overwhelming number of test case failures are reported because the code has changed so much.

At that point, QA often thinks, "Instead of trying to modify these test cases or test scripts for the new version, we might as well go ahead and test it by hand because it's the same amount of work, and even if I update it now, I'll still have to update it all over again for the next version." So they end up testing by hand, and typically come to the conclusion that automation is overrated.

That's how automation goes to hell in QA.

Keep in Sync
As a result of the divide between QA and development, the regression test suite is treated as an afterthought. To keep the test suite in sync with the code, the team needs to treat the regression test suite like it's a key part of the application—the part of the application that verifies whether the implemented functionality actually works. This has a couple of implications. One is that team leaders must allocate sufficient budget and time for regression test suite development and maintenance. I've found that the best results are achieved when there's roughly a 50/50 distribution of effort between writing code that represents the functionality of the application and writing code that verifies that functionality.

The other implication is that the team needs to modify their workflow so that QA works in parallel with development: updating the regression test suite as the developers update the code base. To achieve this, QA must become more tightly integrated with development. To start, development has to build test cases as they write code. This means that the team needs to define and enforce a policy that every time development implements a feature or use case, they add a test case to check that it's functioning correctly.

The role of QA, then, is to review these test cases as soon as they're written, as part of the code review procedure. Their goal here is to verify whether the test case actually represents the use case that is implemented in the code. If QA and development work together in this manner, the test suite is constantly updated in sync with the application.

To keep this up, you need to ensure that the test suite is constantly tested against the application. This means it must become part of the nightly build and test process. Every night, after the application is built, the regression test suite executes. If test failures are reported, then the test suite might be growing
out of sync with the application. Whenever that happens, the developers need to spend just a few minutes reviewing any test failures reported for their code, and then either updating the test cases (if the test failed because the functionality changed intentionally) or fixing the code (if the test failed because a modification introduced a defect). In my humble opinion, this is the only way that QA can be automated.
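As a minimal sketch of that nightly loop in Python (the build and test commands are hypothetical stand-ins for whatever tooling a team actually uses):

```python
import subprocess
import sys

def run(step, command):
    """Run one stage of the nightly process and report its outcome."""
    result = subprocess.run(command, shell=True)
    print(f"[nightly] {step}: {'OK' if result.returncode == 0 else 'FAILED'}")
    return result.returncode == 0

def nightly():
    # 1. Build the application from the latest sources.
    if not run("build", "make clean all"):
        sys.exit("Build broke; fix it before the regression run means anything.")

    # 2. Execute the regression test suite against tonight's build.
    if run("regression suite", "make regression-test"):
        return

    # 3. Failures mean the suite may be drifting out of sync with the code:
    #    developers review the failures for their own changes, then either
    #    update the test cases (intentional functionality change) or fix
    #    the code (the change introduced a defect).
    run("report failures", "make regression-report")

if __name__ == "__main__":
    nightly()
```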
Looking Back, Looking Ahead
Having been in this industry for 20 years now, I've witnessed many changes. Languages have come and gone, the level of programming abstraction has risen, and development processes have grown increasingly compressed and iterative. Yet, from assembly code to SOA, one thing remains the same: the need for an effective, reliable way to determine if code changes negatively impact application functionality.

One way of helping organizations overcome this challenge is to provide technologies that enable development teams to quickly build a regression test suite that includes unit tests, static analysis tests, load tests and anything else that can be used to identify changes. Our goal here is to help teams identify and evaluate functionality changes with as little effort as possible so that keeping the test suite in sync with the application is not an overwhelming chore.

The other part is to optimize the development "production line" to support these efforts. This involves implementing infrastructure components (including source control systems, nightly build systems, bug tracking systems, requirements management systems, reporting systems and regression testing systems) to suit the organization's existing workflow, linking these components together to establish a fully automated building/testing/reporting process, and mentoring the organization on how to leverage this infrastructure for process improvement. The result is greater productivity and fewer software defects. ý

Dr. Adam Kolawa is founder and CEO of Parasoft.
Software Test & Performance Conference SPRING 2008
April 15-17, 2008 • San Mateo Marriott • San Mateo, CA

SUPERB SPEAKERS! Michael Bolton • Jeff Feldstein • Michael Hackett • Jeff Johnson • Bj Rollison • Rob Sabourin • Mary Sweeney • Robert Walsh • AND DOZENS MORE!

TERRIFIC TOPICS!
• Agile Testing
• Test Automation
• UI Evaluation
• Java Testing
• Security Testing
• Testing Techniques
• Improving Web Application Performance
• Optimizing the Software Quality Process
• Developing Quality Metrics
• Testing SOA Applications
• Charting Performance Results
• Managing Test Teams

Register by Jan. 25 To Get The eXtreme Early-Bird Rate!
www.stpcon.com
ALTERNATIVE THINKING ABOUT APPLICATION SECURITY:
Hone Your Threat Detection (To A Telepathic Level).

Alternative thinking is attacking your own Web applications, finding vulnerabilities and destroying them with precision and vengeance— throughout the life of the application. It’s looking at application security through the eyes of a hacker to identify threats to your system and risks to your business. It’s harnessing the power of SPI Dynamics, recently acquired by HP, to redefine and expand your security abilities. (Please note: positive effects on your bottom line.) It’s assessing security the right way, from development to QA to operations—without slowing down the business. (Cue elated cheers.)
Technology for better business outcomes. hp.com/go/securitysoftware ©2008 Hewlett-Packard Development Company, L.P.