Testing .NET: Best Practices
VOLUME 4 • ISSUE 10 • OCTOBER 2007 • $8.95
www.stpmag.com
Pipeline Traffic Cop or Multicore Mensch?
Give Errors Some Class
Dollars To Donuts You're Not Being Served
True Performance Cases From Life on the Streets
The days of
‘Play with it until it breaks’ are over!
Learn how to thoroughly test your applications in less time. Read our newest white paper: “All-pairs Testing and TestTrack TCM.” Download it today at www.seapine.com/allpairs4
Seapine ®
TestTrack TCM
Software for test case planning, execution, and tracking You can’t ship with confidence if you don’t have the tools in place to document, repeat, and quantify your testing effort. TestTrack TCM can help you thoroughly test your applications in less time. Seapine ALM Solutions: TestTrack Pro Issue & Defect Management
TestTrack TCM Test Case Planning & Tracking
Surround SCM Configuration Management
QA Wizard Pro
Automated Functional Testing
In TestTrack TCM you have the tool you need to write and manage thousands of test cases, select sets of tests to run against builds, and process the pass/fail results using your development workflow. With TestTrack TCM driving your QA process, you’ll know what has been tested, what hasn't, and how much effort remains to ship a quality product. Deliver with the confidence only achieved with a well-planned testing effort.
• Ensure all steps are executed and in the same order, for more consistent testing.
• Know instantly which test cases have been executed, what your coverage is and how much testing remains.
• Track test case execution times to calculate how much time is required to test your applications.
• Streamline the QA/Fix/Re-test cycle by pushing test failures immediately into the defect management workflow.
• Cost-effectively implement an auditable quality assurance program.
Download your fully functional evaluation software now at www.seapine.com/stptcm or call 1-888-683-6456. ©2007 Seapine Software, Inc. Seapine, the Seapine logo and TestTrack TCM are trademarks of Seapine Software, Inc. All Rights Reserved.
VOLUME 4 • ISSUE 10 • OCTOBER 2007
Contents
12
COVER STORY: The Names Have Been Changed To Protect the Innocent
Performance, load and reliability testing pose challenges that are best met early in the design phase. Test your projects with true-life environments and scenarios, and make savvy use of tools and models. By Rex Black
22
The Prince Of The Pipeline
As processor architectures change and software adapts, your testing methodology must follow. Try these techniques to master the challenge of multicore. By Jim Falgout
26
Departments
Make Child’s Play of Equivalence Class Partitioning
ECP decomposes data into discrete subsets of valid and invalid classes. But it’s only as useful as you make it—the tester’s knowledge and skill are still paramount. By Bj Rollison
7 • Editorial Looking forward to STPCon and beyond.
8 • Contributors Get to know this month’s experts and the best practices they preach.
31
Are Developers Serving You With Testing Hooks?
What would you say if a developer approached you and asked how to modify an app to make it easier to do your job? Here’s how to get your testing needs served. By Torsten Zelger
9 • Feedback It’s your chance to tell us where to go.
10 • Out of the Box New products for testers.
36 • Best Practices Trolling for best practices in .NET testing? Move past Microsoft minutiae. By Geoff Koch
38 • Future Test To test SOA apps, take the user’s perspective and keep communication open. By Hon Wong
Take the handcuffs off quality assurance
Empirix gives you the freedom to test your way. Tired of being held captive by proprietary scripting? Empirix offers a suite of testing solutions that allow you to take your QA initiatives wherever you like. Download our white paper, Lowering Switching Costs for Load Testing Software, and let Empirix set you free: www.empirix.com/freedom.
See us at the Software Test & Performance Conference 2007, booth #401.
VOLUME 4 • ISSUE 10 • OCTOBER 2007

Editor
Edward J. Correia
+1-631-421-4158 x100
ecorreia@bzmedia.com
EDITORIAL Editorial Director Alan Zeichick +1-650-359-4763 alan@bzmedia.com
Copy Editor Laurie O’Connell loconnell@bzmedia.com
Contributing Editor Geoff Koch koch.geoff@gmail.com
ART & PRODUCTION Art Director LuAnn T. Palazzo lpalazzo@bzmedia.com
Art /Production Assistant Erin Broadhurst ebroadhurst@bzmedia.com
SALES & MARKETING
Publisher: Ted Bahr, +1-631-421-4158 x101, ted@bzmedia.com
Associate Publisher: David Karp, +1-631-421-4158 x102, dkarp@bzmedia.com
List Services: Lisa Fiske, +1-631-479-2977, lfiske@bzmedia.com
Advertising Traffic: Phyllis Oakes, +1-631-421-4158 x115, poakes@bzmedia.com
Reprints: Lisa Abelson, +1-516-379-7097, labelson@bzmedia.com
Director of Marketing: Marilyn Daly, +1-631-421-4158 x118, mdaly@bzmedia.com
Accounting: Viena Isaray, +1-631-421-4158 x110, visaray@bzmedia.com
READER SERVICE Director of Circulation
Agnes Vanek +1-631-443-4158 avanek@bzmedia.com
Customer Service/ Subscriptions
+1-847-763-9692 stpmag@halldata.com
Cover Photograph from Big Stock Photo
President Ted Bahr Executive Vice President Alan Zeichick
BZ Media LLC 7 High Street, Suite 407 Huntington, NY 11743 +1-631-421-4158 fax +1-631-421-4130 www.bzmedia.com info@bzmedia.com
Software Test & Performance (ISSN- #1548-3460) is published monthly by BZ Media LLC, 7 High St. Suite 407, Huntington, NY, 11743. Periodicals postage paid at Huntington, NY and additional offices. Software Test & Performance is a registered trademark of BZ Media LLC. All contents copyrighted 2007 BZ Media LLC. All rights reserved. The price of a one year subscription is US $49.95, $69.95 in Canada, $99.95 elsewhere. POSTMASTER: Send changes of address to Software Test & Performance, PO Box 2169, Skokie, IL 60076. Software Test & Performance Subscribers Services may be reached at stpmag@halldata.com or by calling 1-847-763-9692.
Ed Notes

For Performance And for Life

Edward J. Correia

I'll be seeing some of you during the Software Test & Performance Conference in Boston in the early days of this month. There, I hope to hear your input about what we've done right in the past year, and should therefore do more of. I also hope that you'll tell me how we might improve the magazine to make it more useful for your jobs. October marks the twelfth month that I've served as editor of this magazine, and I hope you have enjoyed the past year's articles and found them useful.

As is true of this month's issue, much of this year's fall conference is devoted to performance: improving the performance of your applications and of your testing. Among the full-day tutorials on day one are Rex Black's methods on evaluating your team's efficiency and effectiveness, Scott Barber's Testing Secrets in Context, and guidance from Bob Galen on how to create and lead a high-performance test organization. One-hour courses on day two include techniques on improving Web application performance using Six Sigma defect elimination techniques by Microsoft quality manager Mukesh Jain, performance-testing Web applications using the OpenSTA record-and-playback testing framework by consultant Dan Downing, and a course on performance testing for managers by Barber.

Also on day two is a course on continuous performance management using Java taught by Java tuning expert Steven Haines, runtime diagnostics testing by JInspired CTO William Louth, performance test data analysis by testing consultant Karen Johnson and performance-testing SOA by Barber. On day three, Barber takes on performance bottlenecks—understanding, finding and exploiting them—in a two-part class. Also on Thursday are LogiGear co-founder Michael Hackett's tips for turning your test team into a high-performance organization, performance testing for Web 2.0 by load-test expert consultant Donald Foss, Java EE performance tuning by Haines, rooting out Java EE CPU bottlenecks by Symphony consultant Sreekanth Makam, maximizing SQL Server 2005 performance by Mary Sweeney, and performance-testing obnoxious protocols by Mark Lustig.

It all starts with the opening keynote—also delivered by the busy Barber—titled "Performance Testing: It Isn't What You Might Think," in which the career performance tester challenges software testing myths.

Life on the Streets

Putting a lighthearted Dragnet spin on a serious topic, Rex Black offers five examples from the real world on how to improve—and destroy—application performance or the performance of your team's bottom line. Read it and avoid a moment that begins with the famous Dragnet television show theme: "Dum-da-dum-dum."

To address the needs of a growing population of multicore processors in today's enterprise, we've got pipeline processing expert Jim Falgout. He shares his extensive knowledge of ways to improve performance through parallel computing, techniques for identifying bottlenecks that can choke your application's throughput and benchmarks that will help you mark improvements in performance as you go.
Contributors

REX BLACK has a quarter-century of software and systems engineering experience, and is president of RBCS, a software, hardware and systems testing consultancy. Rex writes our cover article on performance testing, which begins on page 12. In it, he pulls from his dozen years as RBCS's principal consultant to share five real-world cases in which bad client decisions led to hard lessons. For each case, Rex tells how the problem was solved, again pulling from his real-world solutions. The names have been changed, but the cases are real, to help you avoid making the same mistakes.

Multicore processors are here, and will soon be part of most if not all new systems. So it's just a matter of time before your applications will be considered obsolete if they're not multicore savvy. To get you up to speed, we've got JIM FALGOUT, an expert on application parallelism. He's also the mastermind behind DataRush, a free, all-Java framework for building highly parallel applications marketed by Pervasive Software, where he is solutions architect. Beginning on page 22, Jim opens with a primer of the major parallel-programming techniques and then provides a plan for testing applications for benchmarking performance improvements.

Bj ROLLISON is a test architect in the Engineering Excellence group at Microsoft. Since 1994, he has also served as test lead of the setup team for international versions of Windows 95, international test manager for IE and other Web client projects, and was the director of test training. Beginning on page 26, Bj tackles equivalence class partitioning, a technique that reduces the total number of required tests by dividing inputs and/or outputs into classes and treating those as representative subsets that increase confidence in repeatable results. The real trick to ECP testing is understanding how to decompose I/O data, which Bj deftly describes.

Regular readers of the Test & QA newsletter might recall the name TORSTEN ZELGER, a trusted source for automation techniques, test-scripting without code and the occasional cartoon. Torsten also contributed a feature in May 2006 on building durable automation scripts. This time, the 16-year veteran of software programming, testing and automation shares his checklist for determining the automation feasibility of applications under test. His time-tested to-dos are standard operating procedure at Audatex Systems, an insurance-industry solutions provider where Torsten leads a team of nine testers. This article begins on page 31.

TO CONTACT AN AUTHOR, please send e-mail to feedback@bzmedia.com.
Feedback BEYOND THE OBVIOUS Regarding Edward J. Correia’s “Testers Spend Too Much Time Testing” (T&QA Report, Sept. 4, 2007), I think you left the “10” out of the company name. Saying the obvious is quite the consultant’s mastery, but how about something better than “As software programs are increasing in complexity, testing times only seem to have increased.” And as the globe warms, more cooling will be needed. Surely someone has tried, and maybe succeeded, to set up an automated testing system. What happened to them? Why hasn’t it shown up as a useful (read saleable) product? I’m willing to bet that the complexity of a general-purpose test system would well outweigh its usefulness. Way back, when I was on the Apollo project, we looked into that kind of thing and decided that the only thing useful would be if regression testing could be automated/repeated after the first cycle. What has happened since then? That is what I would like to know. Rand Fazar Roanoke, VA
WHY ALL THE BUGS, THEN? Regarding Edward J. Correia’s article “Testers Spend Too Much Time Testing,” if too much time is spent testing programs, then why do so many of them have so many bugs? Ring Higgins Terlingua, TX
THE REAL QUESTION
Regarding "Testers Spend Too Much Time Testing," the article is ludicrous—and maybe the purpose of it is to evoke such a reaction. Testers spend too much time testing? The real question is "Why does software require so much testing?" If there are six passes of functional testing instead of one, it's because the code is buggy and unstable, because the application is poorly architected and fixing one module breaks another, and/or because the application was not at code complete in the first and second (and possibly subsequent) passes. Add to that the percent of failed verifications in each pass, add to that any features that were blocked in a pass and could not be tested until the next pass, and it starts to become apparent that the amount of time required for testing is not independent of the state of the software to be tested. If an application is truly code complete for the first pass of functional testing, proper unit testing really has been done, and fixed bugs are verified before a build gets to QA or QC, testing requires much less time.

"As software programs are increasing in complexity, testing times only seem to have increased." Hmm. And that's not expected? What is expected with more complex applications? Fewer bugs? Increased stability? Shorter development and test cycles? Automation won't reduce the amount of testing required—that is still dependent on the type and quality of the software submitted for testing—it will merely (if done very well) make the testing more efficient and potentially execute tests 24x7 at very little additional cost. (With a combination of U.S. and offshore testers, test teams are not far off from that 24x7 efficiency anyway.) And even so, none of this is taking into account that automation is currently not sophisticated enough to do all of the testing anyway.

What is the real point of this article? To sell automation software? Fine. There is a place for automation in software testing. Review and present the reasons. But to instead posit that testers spend too much time testing on the basis of only 200 respondents (out of tens of thousands—and 200 is only 2 percent of 10,000), and to not compare that to the amount of time spent in development, the complexity of the project, and the mission-criticalness of the project, makes this article useless and deliberately misleading at best. Why would you publish this when there are so many other valid, scientific approaches to showing why automated testing is valuable, can be very cost effective, and that, in fact, the lack of automated testing could have adverse effects on some (not all) projects?
Theresa Marchwinski, Mason, OH

FLEXIBLE RATIO
I enjoyed reading Paul Joseph's article "How to Keep Quality Assurance From Being a Juggling Act" in your September 2007 issue. This is one of the reasons we call our SQA team by the more accurate title "Software Quality Engineers." One point of disagreement is around his belief that the QA/QC team size should be slightly more than one QA/QC to three developers. I do not believe you can give a general rule here, and each organization will be different. We produce embedded, real-time products, and our ratio is one SQE to eight SWEs. I've seen ranges from 2:1 to 1:12 be employed successfully.
Rick Anderson, Beaverton, OR
THE FULL-TESTING DIFFERENCE Regarding “Testers Spend Too Much Time Testing,” did you survey the level of satisfaction or the efficiency of those sites that took longer? The difference of fully testing is really visible in the products we buy at each release (MC and Oracle are poorly tested). Samuel Ramos Salem, OR FEEDBACK: Letters should include the writer’s name, city, state, company affiliation, e-mail address and daytime phone number. Send your thoughts to feedback@bzmedia.com. Letters become the property of BZ Media and may be edited for space and style. www.stpmag.com •
Out of the Box
Sony Ericsson Puts Its Devices Online For Testing Java ME Apps in a Virtual Lab Others have tried and failed. Now Sony Ericsson will take a stab at it. The company in August launched the Sony Ericsson Virtual Lab, a service that gives device developers and testers online access to existing and pre-production hardware designs for developing, testing and monitoring their Java ME applications. For a starting price of around US$100 per month, teams can access hardware once available only to Sony’s premier partners, and can avoid the shipping time and costs involved with moving physical devices around. The service reportedly can test everything except accelerometer (device movement) functions, including key presses, LCD viewing, ringtones, and sounds and network behavior of software. “Sony Ericsson is responding to developer feedback about having early access to phones and prototypes for testing purposes,” said Ulf Wretling, the company’s general manager of content planning and management. “The Virtual Lab is a fair-for-all, affordable means of [offering] pre-commercial phones to all Sony Ericsson Developer World members,” he said, referring to the company’s developer program. Five of the six models initially available—the W880 and W580 Walkman
The Virtual Lab simulates everything except physical device motion, the company says.
phones, K550 and K810 Cyber-shot phones and the T650—support Sony’s Java Platform 7. Also in the program is the Java Platform 8–based W910 Walkman, with support for the Mobile Services Architecture (JSR-248) and the OpenGL ES API (JSR-239). The service is provided through a partnership with Mobile Complete, which develops and markets Device Anywhere, a test solution for mobile
apps that permits remote testing across geographies and multiple carrier networks. Sony’s development program members can visit the Developer World Web site (www.SonyEricsson.com/ developer) and download a Java utility that takes care of application uploads, timeslot reservation, payment and managing testing sessions. The service is available now.
IDL Gets Data Moving
ITT Visual Information Solutions, a wholly owned subsidiary of ITT Corp., in August released IDL 6.4, improving the data visualization and analysis environment with data transfer capabilities. According to the company, the new capability permits data from remote servers and Open Geospatial Consortium (OGC) databases to be read and written using HTTP, FTP and other common protocols from within the IDL environment. The tool now also supports the OpenGL Shading Language, which "offloads computationally intensive image processing operations to a graphics card," helping to increase rendering speeds, the company says. The IDL engine now includes improved color mapping, new edge detection filters, noise functions and GIF support for transparency and animations. Version 6.4 reads and writes over HTTP and FTP.
Agitation in Automated Testing
PremiTech Monitors Vista With SaaS
Agitar Software hopes to shake up the test automation market with the release of AgitarOne 4.2, the latest version of its flagship test tool that it claims delivers “80 percent project code coverage out of the box.” Company CTO Alberto Savoia characterized AgitarOne 4.2 as a “game changer” for development and test teams, and “the most powerful automated test tool for Java—nothing else even comes close.” This is accomplished, he said, through the use of mocking, or the creation of mock objects that enable testing of otherwise untestable code. The mocking technology also helps increase coverage by mimicking database and file systems. Savoia estimates that to achieve 80 to 90 percent coverage manually, an average of three to five lines of JUnit code is required for every line of Java code. “[AgitarOne] eliminates the prospect of having to write hundreds of thousands of lines of JUnit code by hand,” he said. Savoia added that the new automation capabilities give testers time to “focus on testing the areas of code that can benefit most from human insight, intuition and domain knowledge,” enabling teams to have “exhaustive tests without exhausted developers.”
If you’re using Performance Guard to monitor application health, it might interest you that PremiTech, the Denmark-based software-as-a-service provider, has added support for granular transaction monitoring for Web- and client/server-based applications, and now works with Vista and the 64-bit edition of Windows XP. Performance Guard and tools like it are used to keep tabs on how applications are running to provide proactive problem solving and to assist with rootcause problem analysis. Performance Guard provides real-time performance data on IT systems from desktop and laptop PCs and other network endpoints. Launched last month, version 5.1 also offers improved setup and monitoring for Citrix log-in transactions, and grouping of Citrix users and servers.
Talend Sweet on Data Integration Open-source data-integration tools maker Talend last month announced that its Talend Open Studio now works with SugarCRM, enabling users of the free customer relationship–management software to connect with third-party databases without additional development or programming. For testers, this provides an easy graphical environment for connecting with and testing multiple database back ends. The GPL-licensed Open Studio (www.talend.com/download.php) can be downloaded now. OCTOBER 2007
Free, as in Web Vulnerability Scanner
We're all aware of the threats to Web applications posed by cross-site scripting. With the release last month by Acunetix of a free version of its Web Vulnerability Scanner, there's now one less reason not to protect your company's apps from the hordes of would-be evildoers. "Companies don't realize the danger their Web sites are under, and are therefore reluctant to invest in Web vulnerability scanners," said Jonathan Spiteri, technician manager at Acunetix. As a consequence, he added, IT staff in charge of enterprise often lack the tools to protect their Web sites. "The free XSS scanner will give security officers access to a professional cross-site scanning tool that will allow them to assess their Web sites for the danger," he said. Available now as a free download (www.acunetix.com/cross-site-scripting/scanner.htm), the free edition of WVS scans any Web site or Web application, reveals XSS vulnerabilities, their locations and related information, and suggests solutions for remediation. The company's fee-based edition also checks
for other vulnerabilities, including SQL injection, evaluates password strength on authentication pages, and can automatically audit shopping carts, forms and applications that generate dynamic content.
Voicemail, E-mail, ScreenMail? Creating screencasts is about to become easier—without getting more expensive. A company called QlipMedia in midSeptember was set to release QlipBoard, a free Windows utility that captures screenshots, mouse movement and keyboard and microphone input. Testers can use QlipBoard to document, comment on and illustrate/annotate images, application screens or Web pages and distribute those “Qlipit” animations back to developers via e-mail or the Web. Qlipits sent as files can be played back on Windows, Windows Mobile or Mac OS X. Recordings can be paused for image import and resumed; controls permit previews and navigation forward or backward by frame; indicators monitor input status and Qlipit length during recording and playback. At press time, a beta version of the tool was available for download (www.qlipmedia.com). When running, QlipBoard puts a small button in the upper right-hand corner of any window or dialog box that has focus. When pressed, a transparent, resizable selection box is placed over the dialog, and capture buttons are presented. When one of those capture buttons is clicked, the selected area of the screen is pasted into the QlipBoard application and added to the current stream of frames. Test, drawings and highlighting can then be added to the frame by testers or developers, along with verbal annotations. The end result is a .wmv file that can be saved or e-mailed directly from within the tool. E-mail recipients receive a link to page that plays the Qlipit through a browser and also contains the code necessary to embed it within a Web page. Send product announcements to stpnews@bzmedia.com www.stpmag.com •
By Rex Black
The deadline's Friday. This Friday. That's when this baby goes live; when the huddled masses from this naked city and others
across the globe descend on our creation. Our baby, that we nurtured from conception to the day it was born. What was my part in all this? Nothing major. Just to make sure this crackling cacophony of code was ready for prime time; that it doesn’t go belly-up when the pounding starts. When the throngs of ravenous consumers are unleashed on our servers. People. Users. Tired, hungry, angry and bored by monotonous life on the streets and looking for an escape, a way out, out of their tedious existence. But I’m not worried. I had a plan. A plan I’d put into place on day one of this project. Because performance, load and reliability bugs can really hurt. Inadequate performance can lead to user dissatisfaction, user errors and lost business, especially in Internet-based, business-to-business or business-to-consumer applications. Further, system failure under load often means a total loss of system functionality, right when need is highest—when those throngs are banging away on your system. Reliability problems also can damage your company’s reputation and business prospects, especially when money is involved. I’ve worked in software testing for the last 20 years, and many of the projects I’ve done included performance, load or reliability tests. Such tests tend to be complex and expensive compared to, say, testing for functionality, compatibility or usability. When preparing a product for deployment, one key question to ask is, “Are you sure there are no critical risks to your business that arise from potential performance, load or reliability problems?” For many applications these days, such risks are rampant.
Abandoned Shopping Carts, Degraded Servers For example, let’s say you were running an e-commerce business and 10 percent of your customers abandoned their shopping cart due to sluggish performance. How can you be sure that customers who’ve had a bad experience will bother to come back? Can your company stay in business while hemorrhaging customers this way? A colleague once told me that he had just finished a project for a banking client, during which the bank estimated its average cost of field failure was more than $100,000. That was the average; some failures would be more expensive, and many were related to performance, load or reliability bugs. Some might remember the Victoria’s Secret 1999 online holiday lingerie show, a bleeding-edge idea at that time. This was a good concept—timed propitiously right at the Christmas shopping season—but poorly executed. Their servers just couldn’t handle the load. And worse, they degraded gracelessly: As the load exceeded limits, new viewers were still allowed to connect, slowing everyone’s video feeds to a standstill. In a grimmer example, a few years ago, a new 999 emergency phone service was introduced in the United Kingdom. On introduction, it suffered from reliability (and other) problems, resulting in late ambulance arrivals and tragic fatalities. All of these debacles could have been avoided with performance, load and reliability testing.
Rex Black is president of RBCS, a software, hardware and systems testing consultancy.

Hard-Won Lessons

Like any complex and expensive endeavor, there are many ways to screw up a performance, load or reliability test. Based on my experience, you can avoid common goofs in five key ways:
• Configure performance, load and reliability test environments to resemble production as closely as possible, and know where test and production environments differ.
• Generate loads and transactions that model varied real-world scenarios.
• Test the tests with models and simulations, and vice versa.
• Invest in the right tools, but don't waste money.
• Start modeling, simulation and testing during design, and continue throughout the life cycle.

I've learned these hard-won lessons through personal experience with successful projects, as well as expensive project failures. All of the case studies and figures are real (with a few identifying details omitted), thanks to the kind permission of my clients. Let's look at each lesson in detail.
PEOPLE COMMONLY bring physical-world metaphors to system engineering and system testing. But software is not physical, and software engineering is not yet like other engineering fields. For example, our colleagues in civil engineering build bridges from stan-
dard components and materials such as rivets, concrete and steel. But software is not yet assembled from standard components to any great extent. Much is still designed and built to purpose, often from purpose-built components. Our colleagues in aeronautical engineering can build and test scale models of airplanes in wind tunnels. However, they have Bernoulli’s Law and other physical principles that can be used to extrapolate data from scale models to the real world. Software isn’t physical, so physics doesn’t apply. But the standard rules apply for a few software components. For example, binary trees and normalized relational databases have well-understood performance characteristics. There’s also a fairly reliable rule that once a system resource hits about 80 percent utilization, it starts to saturate and cause non-linear degradation of performance as load increases. However, it’s difficult and, for the most part, unreliable to extrapolate results from smaller environments into larger ones. In one case of a failed attempt to extrapolate from small test environ-
ments, a company built a security management application that supported large, diverse networks. The application gathered, integrated and controlled complex data related to the accounts and other security settings on servers in a managed network. They ran performance tests in a minimal environment, shown in Figure 1. This environment was adequate for basic functionality and compatibility testing, but it was far smaller than the typical customer data center. After selling the application to one customer with a large, complex network and security data set, the company found that one transaction the customer wanted to run overnight, every night, took 25 hours to complete.

FIG. 1: REAL PROBLEM—UNREAL SIMULATION. The minimal test environment: a central workstation on Token Ring and Ethernet segments connected to an iSeries server, an NT server, a Unix server, a VMS server and a mainframe.

For systems that deal with large data sets, the data is a key component of the test environment. You could test with representative hardware, operating system configurations and cohabitating software, but if you forget the data, your results might be meaningless.

Now for the proper test environment. For a banking application, we built a test environment that mimicked the production environment, as shown in Figure 2. Tests were run in steps, with each step adding 20 users to the load. Tests monitored performance for the beginning of non-linear degradation, the so-called "knees" in the performance curve. Realistic loads were placed on the system from the call-center side of the interface. The only differences between the test and production environments were in the bandwidth and throughput limits of the wide-area network that tied the call center to the data center. We used testing and modeling to ensure that these differences wouldn't affect the performance, load and reliability results in the production environment. In other words, we made sure that, under real-world conditions, the traffic between call center and data center would never exceed the traffic we were simulating, thus guaranteeing that there was no hidden chokepoint or bottleneck.

To the extent that you can't completely replicate the actual production environment, be sure that you identify and understand all the differences and how they might affect your results. However, more than two material differences will complicate extrapolation of results to the real world and call the validity of your test results into question.
FIG. 2: REAL SOLUTION—REAL WORLD. The banking test environment: call-center agent desktops connected over a WAN and routers to the data center's Ethernet segments and servers (Oracle database server, LaserPro, MQ server, Netscape Enterprise Server and the mainframe), with links to external credit bureau mainframes (TRW, etc.); customers, innocent non-customers and criminal hackers reach the system from home, office or the road through ISPs and the Internet.
THE NEXT STEP is to achieve realistic performance, load and reliability testing using realistic transactions, usage profiles and loads. Real-world scenarios should reflect not just realistic usage under typical conditions, but also the occurrence of regular events. Such events can include backups; time-based peaks and lulls in activity; seasonal events like holiday shopping and year-end closing; different classes of users such as experienced, novice and special-application users; and allowance for growth into the future. Also, don't forget about external factors such as constant and variable loads on LAN, WAN and Internet connections, the load imposed by cohabiting applications, and so forth.

Don't just test to the realistically foreseen limits of load, but rather try what my associates and I like to call "tipover tests." These involve increasing load until the system fails. At that point, what you're doing is both checking for graceful degradation and trying to figure out where the bottlenecks are.

Here's a case of testing with unrealistic load. One project we worked on involved the development of an interactive voice response server, such as those used in phone banking systems. The server was to support more than 1,000 simultaneous users. A vendor was developing the software for the telephony subsystem. However, during subsystem testing, the vendor's developers tested the server's performance by generating load using only half of the telephony cards on the system under test.
Based on a simple inspection of the load-generating software, we discovered that the system load imposed by the load generators was well below that of the telephony software we were testing. We warned the vendor and the client that these test results were meaningless, but our warnings were ignored. So, it was no surprise that the invalid tests "passed," and as with all false negatives, project participants and stakeholders were given false confidence in the server. The load generator that we built ran on an identical but separate host system. We loaded all the telephony ports on the system under test with representative inputs. The tests failed, revealing project-threatening design problems, which I'll discuss later.

The lesson here is to use nonintrusive load generators for most performance testing. With self-generated load, the tests executed can stress no more than half the system's ports, since the other half must be reserved for generating the test load. Performance, load and reliability testing of subsystems with inexact settings can yield misleading results.

However, there are some instances of load and reliability testing in which intrusive or self-generated load can work. A number of years ago, I led a system test team that was testing a distributed Unix application. The system would support a cluster of as many as 31 CPUs, which could be a mix of mainframes and PCs running Unix. For load generators, we built simple Unix/C programs that would use up CPU, memory and disk space resources, as well as generating files, interprocess communication, process migration and other cross-network activities. The load generators, simple though they were, allowed us to create worst-case scenarios in terms of the amount of resource utilization and the number of simultaneously running programs. No realistically foreseen application mix on the cluster could exceed the load created by those simple programs. Even better, randomness built into the programs allowed us to create high, spiking, random loads that we sustained for up to 48 hours. We considered it a passing test result if none of the systems crashed, none of the network bridges lost connectivity to either network, no data was lost, and no applications terminated abnormally.
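The generators described here were simple Unix/C programs, and the article does not prescribe any particular implementation. Purely as an illustration of the idea (random, spiking consumption of CPU, memory and disk, sustained for a long soak), here is a minimal sketch in Java; the class name, command-line parameters and burst sizes are all invented for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical stand-in for the simple resource-hog load generators described above. */
public class SpikeLoadGenerator {

    public static void main(String[] args) throws InterruptedException {
        int workers = Integer.parseInt(args.length > 0 ? args[0] : "8");
        long minutes = Long.parseLong(args.length > 1 ? args[1] : "60");
        long stopAt = System.currentTimeMillis() + minutes * 60_000;

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> hogResources(stopAt));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join(); // run until the soak period (for example, 48 hours) expires
        }
    }

    private static void hogResources(long stopAt) {
        Random random = ThreadLocalRandom.current();
        while (System.currentTimeMillis() < stopAt) {
            // CPU burst: spin on arithmetic for a random slice of up to two seconds.
            long spinUntil = System.currentTimeMillis() + random.nextInt(2000);
            double sink = 0;
            while (System.currentTimeMillis() < spinUntil) {
                sink += Math.sqrt(random.nextDouble());
            }

            // Memory burst: allocate and touch a random-sized buffer.
            byte[] buffer = new byte[random.nextInt(8 * 1024 * 1024) + 1];
            buffer[buffer.length - 1] = (byte) sink;

            // Disk burst: write and delete a scratch file.
            try {
                Path scratch = Files.createTempFile("loadgen", ".tmp");
                Files.write(scratch, buffer);
                Files.delete(scratch);
            } catch (IOException e) {
                throw new RuntimeException("Scratch file I/O failed", e);
            }

            // Random idle gap so the load spikes rather than staying flat.
            try {
                Thread.sleep(random.nextInt(1000));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Run one copy per node for the duration of the soak and watch for crashes, lost connectivity, lost data or abnormal terminations, exactly the pass/fail criteria described above.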
SO NOW IT’S CLEAR that realism of the environment and the load are important. Being complex endeavors, performance, load and reliability testing are easy to screw up, and these two areas of realism are especially easy to screw up. The key to establishing realism of environment and load before going live is to build a model or simulation of your system’s performance. Use that model or simulation to evaluate the performance of the system during the design phase and evaluate the tests during test execution. And since the model or simulation is equally subject to error, the tests also can evaluate the models. Furthermore, a model or simulation that is validated through testing is useful for predicting future performance and for doing various kinds of performance “what if” scenarios without having to resort to expensive performance tests as frequently. Spreadsheets are good for initial performance, load and reliability modeling and design. Once spreadsheets are in place, you can use them as design documents both for the actual system and for a dynamic simulation. The dynamic simulations allow for more what-ifs and are useful during design, implementation and testing. Here’s a case for using models and simulations properly during the system test–execution period. On an Internet appliance project, we needed to test server-side performance and reliability at projected loads of 25,000, 40,000 and 50,000 devices in the field. In terms of realism, we used a production environment and a realistic load generator, which we built using the development team’s unit testing harnesses. The test environment, including the load generators and functional and beta test devices, is shown in Figure 3.
Allowing some amount of functional and beta testing of the devices during certain carefully selected periods of the performance, load and reliability tests gave the test team and the beta users a realistic sense of user experience under load. This is all a pretty standard testing procedure. However, it got interesting when a dynamic simulation was built for design and implementation work. We were able to compare our test results against the simulation results at both coarse-grain and fine-grained levels of detail. Let’s look first at the coarse-grained level of detail, from a simulation data extract of performance, load and reliability test results from the mail servers (IMAP1 and IMAP2 in Figure 3). The simulation predicted 55 percent server CPU idleness at 25,000 devices. Table 1 shows the CPU idle statistics under worst-case load profiles, with 25,000 devices. We
can see that the test results support the simulation results at 25,000 devices. Table 2 shows the fine-grained analysis of the various transactions the IMAP servers had to handle, their frequency per connection between the device and the IMAP server, the average time the simulation predicted for the transaction to complete, and the average time that was observed during testing. The times are given in milliseconds. As you can see, we had good matches with connect and authorize transactions, not a bad match with select transactions, a significant mismatch with banner transactions, and a huge discrepancy with fetch transactions. Analysis of these test results led to the discovery of some significant problems with the simulation model's load profiles and to significant defects in the mail server configuration. Based on this case, we learned that it's important to design your tests and models so that you can compare the results between the two.
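That comparison is easy to automate once both data sets exist. The sketch below is only an illustration: the figures are the ones from Table 2, but the 20 percent tolerance, the class name and the report format are assumptions of this example, not anything the article specifies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical checker that compares simulated vs. observed transaction times (milliseconds). */
public class SimulationVsTestCheck {

    public static void main(String[] args) {
        Map<String, double[]> timings = new LinkedHashMap<>();
        // {predicted by the simulation, observed in testing}, taken from Table 2
        timings.put("connect",   new double[] {0.74,   0.77});
        timings.put("banner",    new double[] {2.57,   13.19});
        timings.put("authorize", new double[] {19.68,  19.08});
        timings.put("select",    new double[] {139.87, 163.91});
        timings.put("fetch",     new double[] {125.44, 5885.04});

        double tolerance = 0.20; // assumed: flag anything more than 20% off the prediction

        for (Map.Entry<String, double[]> entry : timings.entrySet()) {
            double predicted = entry.getValue()[0];
            double observed = entry.getValue()[1];
            double deviation = Math.abs(observed - predicted) / predicted;
            String verdict = deviation <= tolerance ? "OK" : "INVESTIGATE";
            System.out.printf("%-10s predicted=%8.2f ms  observed=%8.2f ms  deviation=%7.1f%%  %s%n",
                    entry.getKey(), predicted, observed, deviation * 100, verdict);
        }
    }
}
```

With a 20 percent tolerance, connect, authorize and select pass, while banner and fetch are flagged for exactly the kind of investigation described above.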
FIG. 3: INTERNET APPLIANCE PROJECT TEST ENVIRONMENT. The production-grade server farm (E-com1/2, Maildb1/2, IMAP1/2, DBMS1/2, Update1/2) sits behind a switch on Ethernet, with the load generator, the functional test devices and the beta devices attached through the intra/Internet.
FIG. 4: COMBINING COTS AND OPEN SOURCE. The end-to-end test architecture: QA Partner drove the call-center GUI applications on the CSA workstations, a custom-built load generator drove the IVR applications, and custom-built testing middleware coordinated the tests end-to-end across the PBX/ACD, the city IVRs, the CT-Connect server, the CSA and CM servers and the voice repository, all tied together over a WAN.
AS A GENERAL RULE, tools are necessary for performance, load and reliability testing. Trying to do these tests manually would be difficult at best, and not a best practice, to be sure. When you think of performance testing tools, one or two might come to mind, along with their large price tags. However, there are many suitable tools, some commercial, some open source and some custom-developed. For complex test situations, you may well need a combination of two or three.

The first lesson when considering tool purchases is to avoid the assumption that you'll need to buy any particular tool. First, create a high-level design of your performance, load and reliability test system, including test cases, test data and automation strategy. Second, identify the specific tool requirements and constraints for your test system. Next, assess the tool options to create a short-list of tools. Then hold a set of competitive demos by the various vendors and with the open source tools. Finally, do a pilot project with the demonstration winner. Only after assessing the results of the pilot should you make a large, long-term investment in any particular tool.

It's easy to forget that open source tools don't have marketing and advertising budgets, so they won't come looking for you. Neither do they include techni-
cal support. Even free tools have associated costs and risks. One risk is that everyone in your test team will use a different performance, load and reliability testing tool, resulting in a Tower of Babel and significant turnover costs. This exact scenario happened on the interactive voice-response server project, and prevented us from reusing scripts created by the telephony subsystem vendor's developers because they wouldn't coordinate with us on the scripting language. I can't stress the issue of tool selection mistakes enough; I've seen it repeated many, many times. In another case, a client spoke with
tool vendors at a conference and selected a commercial tool based solely on a salesperson's representations that it would work for them. While conferences are a great place to do competitive feature shopping, such on-the-spot purchasing is usually a bad idea.

TABLE 1: COARSE GRAIN
Server    CPU Idle
IMAP1     59%
IMAP2     55%

TABLE 2: FINE GRAIN (all times in milliseconds)
Transaction    Frequency         Simulation    Testing
Connect        1 per connect     0.74          0.77
Banner         1 per connect     2.57          13.19
Authorize      1 per connect     19.68         19.08
Select         1 per connect     139.87        163.91
Fetch          1 per connect     125.44        5,885.04
They soon realized they couldn't use the tool because no one on the team had the required skills. A call to the tool vendor's salesperson brought a recommendation to hire a consultant to automate their tests. That's not necessarily a bad idea, but they forgot to develop a transition plan or train the staff to maintain tests as they were completed. They also tried to automate an unstable interface, which in this case was the GUI. After six months, the consultant left, and the test team soon found that they couldn't maintain the tests, couldn't interpret the results and saw more and more false positives as the graphical user interface evolved. After a short period, the tool was abandoned, wasting the company about $500,000.

Here's a better way. While working on a complex system of WAN-connected interactive voice response servers linked to a large call center, we carefully designed our test system, then selected our tools using the piloting process described earlier. We ended up using the following tools:
• QA Partner (now Borland SilkTest) to drive the call-center applications through the graphical user interface
• Custom-built load generators to drive the interactive voice-response server applications
• Custom-built middleware to tie the two sides together and coordinate the testing end-to-end
We used building blocks like the TCL scripting language for our custom-built test components. The overall test system architecture is shown in Figure 4.
The total cost of the entire system, including the labor to build it, was less than $500,000, and it provided a complete end-to-end test capability, including performance, load and reliability testing.

LAST BUT CERTAINLY not least is to begin thinking about and resolving performance, load and reliability problems from day one of the project. Performance, load and reliability problems often arise during the initial system design, when resources, repositories, interfaces and pipes are designed with mismatched or too little bandwidth. These problems also can creep in when a single bad unit or subsystem affects the performance or reliability of the entire system. Even when interfaces are designed properly during implementation, interface problems between subsystems can arise, creating performance, load and reliability problems. Slow, brittle and unreliable system architectures usually can't be patched into perfection or even into acceptability.

I've got two related lessons here. First, you can and should use your spreadsheets and dynamic simulations to identify and fix performance, load and reliability problems during the design phase. Second, don't wait until the very end, but integrate performance, load and reliability testing throughout unit, subsystem, integration and system test. If you wait until the end and try throwing more hardware at your problems, it might not work. The later in the project you are, the fewer options you have to solve problems. This is doubly true for performance- and reliability-related design and implementation decisions.

During the interactive voice-response server project, one senior technical staff member built a spreadsheet model that predicted many of the performance and reliability problems we found in later system testing. Sadly, there was no follow-up on his analysis during design, implementation
or subsystem testing. By the time we got involved, during integration and system testing, the late discovery of problems led project management to try patching the interactive voice-response server, either by tweaking the telephony subsystem or adding hardware. Patching of one problem would resolve that particular problem and improve performance, only to uncover another bottleneck. This became a slow, painful, onion-peeling process that ultimately didn’t resolve the problems (see Figure 5). In a case where we did apply performance modeling and testing consistently and throughout the life cycle, in the Internet project, we started with a spreadsheet during initial server system design. That initial design was turned into a simulation, which was fine-tuned as detailed design work continued. Throughout development, performance testing of units and subsystems was compared with the simulation. This resulted in relatively few surprises during testing. We quickly learned that we’re not likely to get models and simulations right from the very start; we had to iteratively improve them as we went along. Your tests will probably also have problems at first, so plan to modify them to resolve discrepancies between the predicted and observed results.
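The spreadsheet models referred to above can start very small. As a purely illustrative sketch (the resource names and workload figures below are invented, not taken from either case study), the core of such a model is just arrival rate times service time per resource, checked against the rough 80 percent saturation threshold mentioned earlier.

```java
/** Toy capacity model: arrival rate x service time = utilization, checked against the ~80% saturation rule. */
public class CapacityModel {

    static void check(String resource, double arrivalsPerSecond, double serviceTimeSeconds) {
        double utilization = arrivalsPerSecond * serviceTimeSeconds;
        String status = utilization >= 0.80 ? "SATURATING - expect non-linear degradation" : "OK";
        System.out.printf("%-12s utilization=%5.1f%%  %s%n", resource, utilization * 100, status);
    }

    public static void main(String[] args) {
        // Invented design-phase estimates, not figures from the article.
        check("db-cpu",  120, 0.0055);   // 120 queries/sec x 5.5 ms CPU each
        check("app-cpu", 300, 0.0020);   // 300 requests/sec x 2 ms each
        check("disk",     90, 0.0100);   // 90 I/Os/sec x 10 ms each
    }
}
```

A few lines like these, kept current from design through system test, give you the same early warning the spreadsheet model provided on the voice-response project, and something concrete to compare test measurements against.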
FIG. 5: PATCHING'S NOT THE 'BEE'S KNEES'. Response time plotted against load for three runs (first test; second test, with patch; third test, more patching). Blue circles indicate "knees" in the performance curve where a resource saturated.

Time to Get Started

We've covered a lot of ground. Perhaps some of the challenges described here have also been part of your experience. You can start to implement some of these lessons by assessing your current practices in performance, load and reliability testing. How well or poorly does your company do in applying each of these five lessons? Which common mistakes are you making? What warnings are you not heeding? Which tips for success could you apply?

Based on this assessment, see if you can put together a short action plan that identifies the organizational pain and losses associated with the current approach. Then develop a path toward improvement. The goal is to avoid expensive, unforeseen field failures by assessing and improving your performance, load and reliability risk-mitigation strategy throughout the life cycle.

Once you have your plan in place, take it to management. Most of the time, managers will be receptive to a carefully thought-out plan for process improvement based on a clearly articulated business problem and a realistic solution. As you implement your improvements, don't forget to continually reassess how you're doing. I hope that my mistakes and those I've observed in others can help make your projects a success.
Multicore Testing Techniques Can Elevate A Test Team From Citizens to App Performance Royalty
By Jim Falgout
mong the newest entries into the popular lexicon is multicore. Once thought of as the lunatic fringe, multicore systems were
A
found mainly in the high-performance computing arenas, in science or academia. Today they’re a standard part of many personal computers. Turn on the television at any time of the day or night, and you’ll see evidence of the edification of the general population. Just 10 years ago, ads for DSL and fiber optics, unlimited texting and anywhere minutes, or free Wi-Fi hotspots would sail over the heads of all but the geekiest of geeks. What was a specialty niche is now expected of all applications. Code must be scalable and take advantage of multiple processors or processor cores. As such, testers will be expected to understand parallel computing techniques not just on the fringes, but in the general business-application
Prince of Multicore
population. And it often falls to testers to find places where applications can push the performance envelope. Will you be ready? One of the biggest challenges faced by programmers is the current way of doing things. We’ve lived in a single-threaded world for too long, and have relied mainly on processor speed-ups for performance gains. But processor makers have now turned to multicore technology to continue the performance march, putting a burden on developers and testers to learn to keep up. As processor architectures change and software adapts, so too must methodologies for testing and benchmarking applications. How else will you know if your apps are getting any faster? In this article, you’ll learn benchmarking and performance profiling techniques for data-intensive (sometimes known as batch) applications for multicore systems.
There are several useful techniques for measuring the performance of data-intensive applications as they're built or modified for multicore platforms, and many techniques for traditional apps still apply. But some parallel programming frameworks call for new methods of analyzing multicore performance. There are also some common performance pitfalls, and techniques to discover them.

Before attempting to test and benchmark a multicore-enabled application, some basic techniques for building scalable applications should be understood. With multiple processor cores, the obvious thing we want an application to do is to take advantage of all of the cores available to decrease its "wall-clock time." Wall-clock time is the total running time of an application from start to finish. An application that scales well will have a decreasing wall-clock time as the number of resources (cores) added to the system under test increases. One basic way of accomplishing this is to overlap application I/O with application compute cycles. This can be done using simple double buffering (a.k.a. producer-consumer) tech-
niques. The producer (disk reader) and consumer (compute node) each run in their own process or thread. All that's required is a means of communicating the data between the two threads and enough buffer space to ensure that one thread doesn't starve the other.

A more generalized form of overlapping I/O and compute is to use a pipelined architecture, in which all functions of the application, including the I/O and compute functions, are split into nodes. Each function is linked to the other using a data queue of some sort. This may be an in-memory queue for functions that live within the same process, a shared memory queue for interprocess communication or a network socket for intermachine hook-ups. A pipelined architecture can be quite flexible, in that it allows functions to be wired together in different ways to form complex applications. It also can produce highly scalable applications because each function within the application runs in its own thread or process. Pipelining is a technique that can add scalability to applications containing algorithms that may not be made to scale using other means.

Another technique for squeezing scalability out of an application is data partitioning. One of two primary methods, horizontal partitioning uses a "divide and conquer" principle to segment data into partitions. Each partition instantiates a copy of the functions to apply to the data. The number of partitions to use may be based on the number of cores available (dynamic) or some known division of the data (static). This technique allows a possibly costly calculation to be spread across many cores.

Another partitioning technique is vertical partitioning, which allows row set data to be split into individual columns for separate processing. This can be a useful performance technique because it restricts data movement through the application to only
what is needed. It also provides scalability by allowing different threads of control to handle each data column. Vertical partitioning must be used judiciously. Because input data sets for an application may contain many hundreds of columns, overuse of vertical partitioning can lead to the creation of too many threads, which puts stress on the underlying operating system or virtual machine.

Algorithmic tuning is a well-used technique for decreasing application runtime. The traditional way of tuning algorithms involves the use of a profiler to identify the "hot spot"—the section of code running the slowest. The programmer then attacks that code, looking for ways to get better performance out of the algorithm. In a multicore system this is applicable, but the methods for tuning algorithms have changed. To get scalability out of an algorithm, it must be chopped in some way to allow multiple threads of control to implement the algorithm in parallel. Tuning the algorithm now relies on multithreaded tuning techniques such as looking for a "hot lock," or a lock that may be too granular, causing the algorithm to lose scalability.

Jim Falgout is an expert on application parallelism and solutions architect at Pervasive Software.

FIG. 1: ON YOUR KNEES. Execution time (runtime in seconds) plotted against processor core count, from 1 to 32 cores.
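As a concrete illustration of the double-buffering and pipelining ideas above, here is a minimal sketch of a two-stage pipeline in plain Java: one reader thread and one compute thread decoupled by a bounded queue so that I/O and computation overlap. It is not DataRush or any particular framework, and the class name, sentinel value and per-record "computation" are all invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical two-stage pipeline: a disk reader (producer) feeding a compute stage (consumer). */
public class TwoStagePipeline {

    private static final String POISON_PILL = "\u0000EOF"; // sentinel marking end of input

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000); // bounded buffer between stages
        Path input = Path.of(args.length > 0 ? args[0] : "input.csv");

        Thread reader = new Thread(() -> {
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    queue.put(line);        // blocks if the compute stage falls behind
                }
                queue.put(POISON_PILL);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        }, "reader");

        Thread compute = new Thread(() -> {
            long fields = 0;
            try {
                for (String line = queue.take(); !POISON_PILL.equals(line); line = queue.take()) {
                    fields += line.split(",").length; // stand-in for the real per-record computation
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("Processed fields: " + fields);
        }, "compute");

        long start = System.nanoTime();
        reader.start();
        compute.start();
        reader.join();
        compute.join();
        System.out.printf("Wall-clock time: %.1f s%n", (System.nanoTime() - start) / 1e9);
    }
}
```

Horizontal partitioning falls out of the same structure: start several compute threads draining the same queue (or one queue per partition), and the costly stage spreads across however many cores are available.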
Benchmarking Crown Jewels
One of the most fundamental ways to measure application performance is to apply it to a known set of data. The wall-clock time can be captured and compared to previous runs to test for performance regressions. This method is still valid for multicore systems, but it tells only part of the story. To truly test scalability, we have to introduce data about the number of available CPUs and cores.

Several useful tools are available to measure CPU utilization; which ones you use depends on your operating system(s). On the Windows platform you have perfmon; Solaris provides DTrace. Each operating system offers some means of monitoring the CPU utilization of the system under test. Java developers can monitor thread usage and heap memory usage with JConsole and the JMX agent. Many tools also provide a way to capture a run's monitored data and "play back" the captured data for further evaluation. These utilities also allow tracking the usage of other system resources of interest to benchmarking, including I/O and cache rates, as well as thread and memory usage.

To test scalability, you must have the capability to limit the number of
cores available to execute your application. Again, this is a system-specific feature. Start with one core and work your way up to the system capacity, doubling as you go. During each run, measure the execution time and CPU utilization. Plot the execution time versus the number of cores and analyze the resulting curve (see Figure 1). As the number of cores increases, an application that scales will show a decreasing runtime. With perfect linear scaling, the runtime is halved for each doubling of CPU capacity, but this is rarely the case; any sign of scalability in an application is good.

Most applications that scale well will reach a point where the performance curve levels off as more resources are added. This point is known as the "knee." Once the knee is reached, additional performance improvements will come only by identifying and fixing bottlenecks. There may be several candidates: the application is I/O bound, the application has serialization that is causing a loss of utilization, or the application has run out of work to do.
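One way to run such a core-count sweep on Windows is to restrict the test process's processor affinity and time each run with a stopwatch. The sketch below is only an outline under stated assumptions: RunWorkload is a placeholder for the real scenario under test, and processor affinity is a Windows-specific mechanism; other systems need their own way of limiting cores.

```csharp
using System;
using System.Diagnostics;

// Minimal scalability-benchmark sketch (Windows-only): restrict the process
// to 1, 2, 4, ... cores via processor affinity, run the workload under test,
// and record the wall-clock time for each core count.
class ScalabilityBenchmark
{
    static void Main()
    {
        Process self = Process.GetCurrentProcess();
        int maxCores = Environment.ProcessorCount;

        for (int cores = 1; cores <= maxCores; cores *= 2)
        {
            // Affinity mask with the lowest 'cores' bits set, e.g. 0011 for two cores.
            self.ProcessorAffinity = (IntPtr)((1L << cores) - 1);

            Stopwatch timer = Stopwatch.StartNew();
            RunWorkload();                       // the application scenario under test
            timer.Stop();

            Console.WriteLine("{0} core(s): {1:F1} s", cores, timer.Elapsed.TotalSeconds);
        }
    }

    static void RunWorkload()
    {
        // Placeholder for the real data set and scenario being benchmarked.
    }
}
```

Plotting the printed times against the core counts produces the curve described above.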
I/O Traffic
If you suspect that the application is I/O bound, rerun the tests capturing I/O performance data using your performance-monitoring tool. If you see that disk work queues are constantly full, disk I/O rates are maxing out, or other indicators that the disk system is running at capacity during the test execution, you can surmise that you've got a problem. Switching to a striped file system and using different I/O channels for input and output files may help. Rework your I/O system as best you can and rerun the tests. If the application now shows better scalability, it was definitely I/O bound. This knowledge will help you help your customers better tune their systems for running your application.
Daytime Serial
Serialization is another cause for hitting the knee. Take a look at your CPU utilization graphs captured during test runs. If you see a sawtooth pattern (see Figure 2), you probably have this problem. The next step is to monitor your application's usage of locks during execution. Again, many tools that
monitor lock usage are available. Once you’ve set up your monitoring tool, execute the application again in a multicore configuration. You’ll most likely find a very “hot” lock, one that’s used repeatedly with many threads waiting in contention. You can apply varied techniques to lower this thread contention. Perhaps the lock is protecting a shared counter of some kind. In Java, the counter and the lock can be replaced with a java.util.concurrent class such as AtomicInteger that uses test and set techniques to prevent having to use a lock at all. Whatever the case, the application will have to be reworked to lower the thread contention or its scalability can’t be improved.
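The article's example names Java's AtomicInteger; the equivalent idea in C# is an interlocked update. The sketch below shows the before-and-after shape of such a change; the class and member names are illustrative.

```csharp
using System.Threading;

// Replace a lock-protected shared counter with an atomic update, removing
// the "hot" lock entirely. Names are illustrative.
class RequestCounter
{
    // Before: every increment takes the same lock, and threads queue up on it.
    private readonly object sync = new object();
    private int lockedCount;

    public void IncrementWithLock()
    {
        lock (sync) { lockedCount++; }
    }

    // After: a single atomic operation, with no shared lock to contend for.
    private int atomicCount;

    public void IncrementAtomically()
    {
        Interlocked.Increment(ref atomicCount);
    }
}
```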
MULTICORE TERMS AND CONDITIONS
In discussions of application performance, it's common to refer to the amount of time (wall-clock time) the application takes to process a set amount of data. Traditionally, developers and testers have worked to produce applications that perform well on systems with a single processor. However, such applications may not perform any faster on systems with multiple processors or multicore processors without modification. Scalable applications, on the other hand, will exhibit shorter wall-clock times as more processors or cores are available. So, scalability implies that more resources equals better performance.

Scaling also may be affected by system resources such as memory and I/O. This can be especially true for I/O-bound applications. When an application is I/O bound, it spends most of its time doing input and output, usually disk I/O. Adding more processor or core resources won't help this application's performance, but the addition of more I/O resources might. Additional I/O resources can come in the form of striping file systems, division of the file system to different I/O channels, adding memory to channel hardware, and so on.

CPU utilization is a report of processor usage by the operating system and/or application(s) as a percentage of its total capacity. With multicore systems, it's useful to monitor the utilization of all cores within a system. An application may tax one or several cores fully while leaving other cores with no work to do. When this happens, the application isn't displaying good scalability.

Serialization, in the context of this article, refers to the technique of using some kind of lock to protect a resource shared by multiple, concurrent threads of execution. Any application that includes serialization will eventually stop scaling. Reference Amdahl's Law for more information on this phenomenon (http://en.wikipedia.org/wiki/Amdahl's_law).
Hitting the Pipeline
Applications that use pipelined architectures have their own performance challenges. Pipelines contain functions that are wired together using data queues. As a function receives data, it runs its computation on that piece of data and sends the results down the pipeline. If you observe a pipelined application that isn't scaling well, it may be due to one or more functions in the application that are under-performing.

You can use many techniques to discover the "hot spots" in the pipeline. Hopefully the pipeline framework exposes performance information about each data queue. Using this information, you can determine which queues demonstrate the largest accumulated wait times—points where either the reader or writer of the queue had to wait to perform their action.
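If the framework doesn't expose queue wait times itself, a thin wrapper around each queue can collect them during a benchmark run. The following is a rough sketch of that idea, not an API from any particular pipeline framework; the class name and bounded-buffer details are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

// Sketch of a bounded pipeline queue that accumulates reader/writer wait
// times, so a test run can report which queues are the bottlenecks.
class InstrumentedQueue<T>
{
    private readonly Queue<T> items = new Queue<T>();
    private readonly int capacity;
    private long writerWaitMs;   // accumulated time producers spent blocked
    private long readerWaitMs;   // accumulated time consumers spent blocked

    public InstrumentedQueue(int capacity) { this.capacity = capacity; }

    public void Enqueue(T item)
    {
        lock (items)
        {
            Stopwatch w = Stopwatch.StartNew();
            while (items.Count >= capacity)     // queue full: writer must wait
                Monitor.Wait(items);
            writerWaitMs += w.ElapsedMilliseconds;

            items.Enqueue(item);
            Monitor.PulseAll(items);
        }
    }

    public T Dequeue()
    {
        lock (items)
        {
            Stopwatch w = Stopwatch.StartNew();
            while (items.Count == 0)            // queue empty: reader must wait
                Monitor.Wait(items);
            readerWaitMs += w.ElapsedMilliseconds;

            T item = items.Dequeue();
            Monitor.PulseAll(items);
            return item;
        }
    }

    public long WriterWaitMs { get { lock (items) return writerWaitMs; } }
    public long ReaderWaitMs { get { lock (items) return readerWaitMs; } }
}
```

Comparing WriterWaitMs and ReaderWaitMs across the queues after a run points at the slow stage.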
FIG. 2: CPU SAWTOOTH
Excessive wait times can be an indication of a slow reader or writer. Using this information, you can determine the cause of the poor performance within the target function. For example, I was once benchmarking an application that used a dataflow framework and observed that
the application didn't scale well and CPU utilization was lower than expected. Using the framework's performance-monitoring tools, I was able to find the function that appeared to be slow. When I ran the application in a Java performance-monitoring tool, with filtering to show the function's classes, I saw that the function was using a class library method that contained a lock on a static class member. The function was recoded to remove the not-so-thread-friendly library method. After the code fix, the application began demonstrating much better scalability.

The upcoming multicore wave places new demands on developers and testers, part of which is to learn techniques for benchmarking applications that have to evolve to take advantage of multicore systems. If your team uses these scalability techniques, you'll have methods to identify bottlenecks and improve scalability to meet those demands.
Le'go My Error
Turning Equivalence Class Partitioning Into Child's Play
By Bj Rollison
Bj Rollison is a test architect with Microsoft's Engineering Excellence group.

For the past six years, I've been teaching new testers at Microsoft and professional testers at conferences techniques and methods for solving complex testing problems. Techniques are useful for establishing a foundational examination of the software under test, and a given technique might be extremely useful to help prove or disprove the existence of one specific type or class of problem. The overall effectiveness of any particular technique is limited by the tester's knowledge of the system and their cognitive skill in applying a given technique in the appropriate situation. Techniques provide only a foundation, requiring the application of additional approaches and investigation to be truly effective.

One technique commonly used in determining which test data to use in a test is equivalence class partitioning. ECP is a functional testing technique generally focused on input and output data by decomposing data into discrete subsets of valid and invalid classes. The actual tests are derived from creating unions of valid class subsets until all those subsets have been used
in a test, and from exercising each invalid data subset individually. This might seem like a simple enough explanation, but ECP's overall effectiveness is based on the tester's ability to accurately decompose data into discrete valid and invalid class subsets. The better the tester understands the data type(s) they're working with—the programming language syntax and idiosyncrasies, operating system internals, etc.—the more effective this technique will be. Failure to understand these variables can result in incorrect application of the technique and the potential to miss critical defects.

ECP's value is twofold. First, it systematically reduces the number of tests from all the possible data inputs and/or outputs, and it provides a high degree of confidence that any other data in one particular subset will repeat the same result. For example, let's assume that we're testing an edit control for a given name that takes a string of uppercase Latin characters between A and Z, with
a minimum string length of one character and a maximum length of 25. To test all possible data inputs, the number of tests would be equal to 26^25 + 26^24 + 26^23 + ... + 26^1. Thus, the valid class subsets for this parameter include:
• Uppercase Latin characters between A and Z
• A string length between one and 25 characters
Based on this simple analysis of valid data, we can assert that a string of ABCDEF will produce the same result as a string of ZYXWVUTSRQPONM, and will also produce the same result as a string of KUGVDSFHKIFSDFJKOLNBF.

Secondly, we can effectively increase the breadth of data coverage by
randomly selecting test data from the given set, because the test data from a specific valid or invalid class subset should produce the same result (provided the data is correctly decomposed into discrete subsets and there is no errant behavior).

Using the previous example, we'll surely want to test the given name parameter using nominal real-world test data such as Bob, Nandu, Elizabeth or Ling. But static test data provides little value in terms of testing an algorithm's sturdiness after the first iteration (unless there are fundamental changes) and doesn't exercise data permutations well. By randomly selecting string length and valid characters, we can introduce variability into subsequent iterations of the test case and increase the probability of exposing unusual defects that might not otherwise be exposed using typical or real-world valid class data.
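A small helper can perform that random selection within the valid class subsets. This C# sketch, with illustrative names, draws a random length from 1 to 25 and random characters from A to Z, so each run of the test exercises a different member of the same equivalence class.

```csharp
using System;
using System.Text;

// Randomly select test data from the valid class subsets for the given-name
// edit control: uppercase Latin A-Z, length 1 to 25.
class GivenNameDataGenerator
{
    private static readonly Random rnd = new Random();

    public static string NextValidGivenName()
    {
        int length = rnd.Next(1, 26);                      // valid lengths: 1..25
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
            sb.Append((char)('A' + rnd.Next(0, 26)));      // valid chars: A..Z
        return sb.ToString();
    }
}
```

Calling NextValidGivenName() on every test iteration keeps the input inside the valid classes while varying the concrete data.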
Decomposing's Hairy
The most difficult and challenging aspect of employing ECP techniques is the decomposition of data into unique valid and invalid class subsets. This requires in-depth knowledge of the data (data type, encoding technology, etc.), how the program and the system process, manipulate, transfer and store data, common user inputs, and a historical analysis of the data set in terms of known problematic data, high-risk data and so on. Overgeneralization of the data for a specific class (not decomposed sufficiently) reduces the number of subsets in that particular class, but also increases the likelihood of missing
errors. However, if the data is hyper-analyzed (or overly decomposed into non-unique subsets), the number of subsets in that particular class increases, potentially increasing the number of redundant equivalence-class tests without increasing the probability of exposing new or different errors. My theory of data decomposition for ECP states the pitfalls of this technique and provides the basis for understanding why in-depth knowledge of the system and domain space is critical for the greatest success and the maximum effectiveness of this technique.

ECP focuses primarily on input and output data. Initially, the data is separated into two classes. Valid class data returns a positive result under normal circumstances; this data is expected not to result in an error condition or cause an unexpected failure. Invalid class data is expected to cause an error condition (and hopefully display an appropriate error message), or has a high probability of exposing errant behavior or failures.

Once the data is separated into valid and invalid classes, the tester must further analyze the data in each class to decompose it into discrete subsets within that class. Essentially, any data point in any individual subset (of a valid or invalid class) should produce the same exact result when used in a test. Using the example above, we don't have to test every single uppercase Latin character between A and Z, or every combination of characters with a maximum string length of 25 characters in the edit control for a given name. We can reasonably hypothesize that the character C will produce the same result as the character Q. So in this example, there are two valid class subsets: uppercase Latin characters between A and Z, and string length between one and 25 characters.

There is a set of heuristics to help testers
decompose the data into discrete subsets within each class. Heuristics are intended to act as guidelines; they're not always perfect, but they can be useful for decision making, troubleshooting and problem solving. Glenford Myers outlined four heuristics in his classic "The Art of Software Testing" (2nd edition, Wiley, 2004) to assist the tester in decomposing data:
• Range. A contiguous set of data in which any data point within the minimal and maximal boundary values is expected to produce the same result
• Number. A given number of values expected to produce the same result
• Unique. Data within a class or subset of a class that may be handled differently than other data within that class or subset of a class
• Special. Data that must or must not meet specific requirements
ECP in Action
Let's look at a simple example based partially on a program described by Paul C. Jorgensen in his book "Software Testing: A Craftsman's Approach" (CRC Press, 1995). Jorgensen's Next Date program takes three integer inputs representing a specific month, day and
year in the Gregorian calendar and returns the next Gregorian calendar date in the month/day/year format. The algorithm to determine the next date is based on the Gregorian calendar instituted by the 1582 papal bull “Inter gravissimas.” The program is implemented in C#, and the output will always be in the m/d/yyyy format regardless of the user’s locale setting (meaning that the output isn’t culture sensitive). On the surface, the input and output data for this program appear rather simple. But the ability to decompose the test data into discrete subsets of valid and invalid data requires in-depth system knowledge, and in this particular case the tester must also have indepth domain knowledge of the Gregorian calendar, the papal bull “Inter gravissimas” and even broad knowledge of English history. Table 1 illustrates the most effective breakdown of the data sets for this particular program.
Analyzing Parameter Subsets
The first parameter is the month. This parameter accepts a range of integers from 1 through 12, representing months of the year. However, at some point the implementation must distinguish between the 30-day months, 31-day months and February in leap and non-leap years.
TABLE 1: HOT DATES

Month
  Valid class subsets: v1 – 30-day months; v2 – 31-day months; v3 – February
  Invalid class subsets: i1 – >= 13; i2 – <= 0; i3 – any non-integer; i4 – empty*; i5 – >= 3 integers

Day
  Valid class subsets: v4 – 1 to 30; v5 – 1 to 31; v6 – 1 to 28; v7 – 1 to 29
  Invalid class subsets: i6 – >= 32; i7 – <= 0; i8 – any non-integer; i9 – empty*; i10 – >= 3 integers

Year
  Valid class subsets: v8 – 1582 to 3000**; v9 – non-leap year; v10 – leap year; v11 – century leap year; v12 – century non-leap year
  Invalid class subsets: i11 – <= 1581; i12 – >= 3001; i13 – any non-integer; i14 – empty*; i15 – >= 5 integers

Output
  Valid class subsets: v13 – 1/2/1582 to 1/1/3001
  Invalid class subsets: i16 – <= 1/1/1582; i17 – >= 1/2/3001

Unique or Special Dates
  Valid class subsets: v14 – 9/3/1752 to 9/13/1752; v15 – 1/19/2038
  Invalid class subsets: i18 – 10/5/1582 to 10/14/1582

* Based on the implementation in C#, an empty string will follow the same code path as a non-integer input; however, a test with an empty string may be considered a special case for parameters.
** The upper range of 3,000 for the year is an artificial boundary constraining the int data type. (It was arbitrarily chosen because I really don't expect to be around after 3000, so I really don't care about the next date after 1/1/3001.)
Therefore, we can segregate the inputs for month based on the number of days in a month and February, which is unique compared to the other valid month inputs. We could also create a fourth valid class subset of 1 through 12 to help identify the minima and maxima boundary conditions for this input parameter. But if we created only one subset of 1 through 12 for months, that would be an overgeneralization and result in missing critical tests. The invalid class subsets include integer inputs greater than 12 and less than 1, any non-integer value, an empty string, and any attempt to enter three or more characters into the control.

The day parameter is separated into four subsets based on the number of days for 30-day months, 31-day months and February in leap and non-leap years. The primary reason to decompose the dates into discrete subsets for each month type is to aid in the determination of specific boundary conditions. Equivalence class subsets often indicate specific boundary conditions. For example, the day parameter clearly identifies five specific boundary conditions (depending on the exact implementation) of 1, 28, 29, 30 and 31. Also, dividing the day ranges into exclusive subsets helps to ensure that we don't generate false positives. For example, if we overgeneralized the month and day parameter valid class subsets to just 1 through 12 for month and 1 through 31 for day, an automated test could generate an invalid input combination of 2 for month and 30 for day.

The year parameter includes one subset that specifies the range of valid years, one subset for leap years, one subset for non-leap years, one subset for century leap years and one subset for century non-leap years. Leap years, century leap years and non-leap years are easily determined by a mathematical formula and implemented in an algorithm similar to the following:
private static bool IsLeapYear(int year)
{
    // Standard Gregorian rule: divisible by 4 and either divisible by 400
    // or not divisible by 100 (note that (year % 400) % 100 == year % 100).
    if (((year % 4) == 0) && ((year % 400) == 0) ||
        ((year % 4) == 0) && (((year % 400) % 100) != 0))
    {
        return true;
    }
    return false;
}
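A quick sanity check of the four year-related valid classes from Table 1 can be run against that method. This fragment assumes the IsLeapYear method above is in scope; the years are just one example from each class.

```csharp
// Years chosen from Table 1's year classes.
Console.WriteLine(IsLeapYear(2003));  // v9  - non-leap year: False
Console.WriteLine(IsLeapYear(2004));  // v10 - leap year: True
Console.WriteLine(IsLeapYear(2000));  // v11 - century leap year: True
Console.WriteLine(IsLeapYear(1900));  // v12 - century non-leap year: False
```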
TABLE 2: THE NEXT DATE
ECP Test | Month    | Day         | Year | Other | Expected Result
1        | v1       | v4          | v8   |       | next date
2        | v2       | v5          | v8   |       | next date
3        | v3       | 28          | v9   |       | 3/1/yyyy
4        | v3       | 28          | v10  |       | 2/29/yyyy
5        | v3       | 28          | v11  |       | 2/29/yyyy
6        | v3       | 28          | v12  |       | 3/1/yyyy
7        |          |             |      | v14   | next date
8        |          |             |      | v15   | 1/20/2038
9        | i1       | v4 v5 v6 v7 | v8   |       | error msg
10       | i2       | v4 v5 v6 v7 | v8   |       | error msg
11       | i3       | v4 v5 v6 v7 | v8   |       | error msg
12       | i4       | v4 v5 v6 v7 | v8   |       | error msg
13       | i5       | v4 v5 v6 v7 | v8   |       | error msg
14       | v1 v2 v3 | i6          | v8   |       | error msg
15       | v1 v2 v3 | i7          | v8   |       | error msg
16       | v1 v2 v3 | i8          | v8   |       | error msg
17       | v1 v2 v3 | i9          | v8   |       | error msg
18       | v1 v2 v3 | i10         | v8   |       | error msg
19       | v1 v2 v3 | v6          | i11  |       | error msg
20       | v1 v2 v3 | v6          | i12  |       | error msg
21       | v1 v2 v3 | v6          | i13  |       | error msg
22       | v1 v2 v3 | v6          | i14  |       | error msg
23       | v1 v2 v3 | v6          | i15  |       | error msg
24       |          |             |      | i18   | error msg

Since determining leap years is based on a mathematical formula, testing any leap year and any century leap year between 1582 and 3000 will always produce the same result and return true. There are no specific boundary conditions related to leap years or century leap years. For this parameter, the specific input minima and maxima boundary values are 1582 and 3000, respectively.

Often, the application of the ECP technique focuses on inputs. However, outputs also should be analyzed and included in an equivalence class table. Again, the primary value of decomposing the data into discrete subsets is that subsets that include a range of values help define the minima and maxima boundary conditions. Subsets that include a unique value from a range of values also may be considered a specific boundary condition for boundary value analysis testing.

Finally, one unique date and two unique date ranges are more interesting than any of the other possible dates. The unique range of dates includes the invalid class subset of dates between
10/5/1582 and 10/14/1582. These dates were excluded from the Gregorian calendar as instituted by the papal bull "Inter gravissimas," and any date within this range should result in an error message or a message indicating the date is invalid according to the functional requirements.

The range of unique dates in the valid class is between 9/3/1752 and 9/13/1752. England and her colonies didn't adopt the Gregorian calendar until 1752, so they had to exclude 11 days to synchronize the calendar with the solar year. But since the program is based on the original Gregorian calendar instituted by the papal bull, this range of dates is more unique than others: a developer who remembers English or American history might inadvertently exclude these dates, which would actually be a defect in this particular case.

Also, 1/19/2038 is another date that is more unique than other possible valid dates. If for some reason we suspect that this program used the CPU tick count, we need to test this date (after 3:14:07 a.m.) to verify that the expected output is in fact 1/20/2038 and not 12/13/1901.
Equivalence Tests
There are approximately a half million possible inputs of valid dates between 1/1/1582 and 12/31/3000. If it took five seconds to manually input and verify the correct result of each valid date in that range, it would take approximately 29 days to test them all. Again, the hypothesis of ECP testing is that an input of 6/12/2001 would produce the next calendar date on the Gregorian calendar, as would 9/29/1899 or 4/3/2999. Therefore we don't have to test all possible dates, and doing so is probably not the best use of a tester's time.

Once the data is decomposed into discrete subsets, the next step is to define how the subsets will be used in a test. The most common approach is to create combinations of valid class subsets for all parameters until all valid class subsets have been included at least once in a validation test, and then to test each invalid class subset individually for each parameter while setting the other parameters to some nominal valid value. For example, the first test for valid input data would be a combination of valid class subsets v1, v4 (or v6 or v7) and v8. The test data represented by a combination of these three subsets is any 30-day month, any day between 1 and 30, and any year between 1582 and 3000. Table 2 lists the test data inputs for the Next Date program.

In this example, the ECP technique reduces the number of equivalence class tests for valid inputs to only eight, and 16 tests are required to check for proper error handling of the invalid class subset inputs. However, limiting the number of validation tests to eight might be a bit risky. This is where random selection of data points within specified subsets provides a greater breadth of coverage of all possible data. When we design the test (either manual or automated) for test 1, we can randomly select either 4, 6, 9 or 11 for the month input (or iterate through the 4, 6, 9 and 11 month inputs), select any integer value between 1 and 30 for the day input, and any integer value between 1582 and 3000 for the year input.
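Here is one way such a randomized version of ECP test 1 might look. This is a sketch only: GetNextDate is a stand-in for the article's Next Date program, the expected value is computed independently with DateTime, and all names are illustrative.

```csharp
using System;
using System.Globalization;

// ECP test 1 (subsets v1, v4, v8) with random selection inside each subset.
class EcpTest1
{
    static readonly int[] ThirtyDayMonths = { 4, 6, 9, 11 };   // subset v1
    static readonly Random rnd = new Random();

    static void Main()
    {
        for (int run = 0; run < 100; run++)
        {
            int month = ThirtyDayMonths[rnd.Next(ThirtyDayMonths.Length)];
            int day = rnd.Next(1, 31);          // subset v4: 1..30
            int year = rnd.Next(1582, 3001);    // subset v8: 1582..3000

            string expected = new DateTime(year, month, day).AddDays(1)
                .ToString("M/d/yyyy", CultureInfo.InvariantCulture);
            string actual = GetNextDate(month, day, year);

            if (actual != expected)
                Console.WriteLine("FAIL: {0}/{1}/{2} returned {3}, expected {4}",
                                  month, day, year, actual, expected);
        }
    }

    // Placeholder for the Next Date program under test.
    static string GetNextDate(int month, int day, int year)
    {
        throw new NotImplementedException();
    }
}
```

Repeating the loop on every run gives the breadth-of-coverage benefit described in the next paragraph.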
By executing the test several times and randomly selecting data from the specified subsets for that test, we increase our breadth of coverage of all data points within that defined subset, and design a test that provides flexibility for the tester or the automated test. There is no specific test for valid class subset v13 because we can assert that this is covered by the valid input ECP tests. There are likewise no ECP tests for the invalid class subsets i16 and i17. From an ECP perspective, a date less than or equal to 1/1/1582 is equivalent to the ECP test 19 where the year input is less than or equal to 1581. Similarly a test for i17 would experiment with outputs greater than 1/1/3001. This would require an input year of 3001 or greater, which is covered by ECP test 20. However, the subsets v13, i16 and i17 help the tester identify potential boundary conditions. This is another value of the ECP technique. The ECP subsets, especially the subsets that identify a range of values or unique values within a range, often highlight the minima and
maxima boundary conditions for a range of values or identify a specific value within a range of values that may represent a boundary condition under certain conditions.
Notice that the ECP technique does not specifically target minima and maxima boundary conditions, nor does it specifically include boundary value analysis or boundary testing. Boundary testing is a different technique that targets boundary class defects exposed by analyzing the values immediately above and immediately below the minima and maxima values of a range, or unique values that are more interesting than other values within the specified range. I suspect that some people try to combine both techniques to save time, but this is a dangerous practice because it may lead to missed tests or faulty assumptions. However, the ECP technique is sometimes used in conjunction with boundary testing because—when done properly—it's useful in identifying potential boundary conditions and provides a basic framework for boundary value analysis testing.
Knowledge Required
Equivalence class partitioning is not a silver bullet. It incorporates a set of techniques, a systematic approach to solving a given problem. This doesn't mean that the application of a technique is a brain-dead, rote activity; nor is an individual technique a magical panacea or the only approach that should be used for all testing problems. Quite the contrary: this technique provides a strong foundation from which other techniques and approaches can be applied.
Coming February 26-27 The Hilton Hotel New York City, NY
www.futuretest.net
Are You Being Served?
Here's What to Tell Developers About the Hooks You Need for Testing
By Torsten Zelger
If you've been testing software for a while, you've undoubtedly come across at least one application that gave you nothing to go on: a
program in which basic testability features were not considered at all. Too often, application testability isn't a factor during the design phase—or at any other time—before the tester sits down to begin testing. This is, of course, a huge problem for testers, particularly if automation is in the plan. If this problem sounds familiar, it might help to inform developers about your needs, and make sure they're being served.

But do you know what your needs are? Exactly what would you say if a developer approached you today and asked how to modify an app to make it easier to do your job (once you regained consciousness)? Over the years I've assembled a checklist of attributes to look for before even thinking of performing any automated software testing. The list has become mandatory for all testers in our company. And now I share that list with you.

Torsten Zelger is a test automation specialist at Audatex Systems in Switzerland, which develops software for the insurance industry.
Unique Object IDs
Maintenance of automated test scripts can be slimmed down dramatically if developers are clued in about providing adequate object-recognition mechanisms. If no mechanism exists to uniquely identify GUI controls, it would be inefficient and unwise to start GUI-based scripting: many of your scripts would become obsolete when GUI-control positions shift or new controls are added in between.
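On the development side, this can be as simple as giving every control a stable name. A minimal WinForms sketch, with illustrative form and control names, of the kind of hook to ask for:

```csharp
using System.Windows.Forms;

// Simple testability hook: every control gets a stable Name (and
// AccessibleName), so a GUI test tool can identify it by ID instead of
// by screen position.
public class ValuationForm : Form
{
    public ValuationForm()
    {
        Button submit = new Button();
        submit.Name = "btnSubmitValuation";          // stable ID for test scripts
        submit.AccessibleName = "Submit valuation";  // exposed to UI-automation APIs
        submit.Text = "Submit";
        Controls.Add(submit);
    }
}
```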
Dynamic Lists
Exact output isn't always predictable. Elements such as lists and search results can change with each use, and must therefore be checked. If such controls aren't equipped with some unique identification, the output is more difficult for the test automation engineer to verify and can lead to extensive validation-script development, a task more easily accomplished by means of manual testing.
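For dynamic lists, one helpful hook is a stable key on each generated item in addition to its display text. A minimal WinForms sketch; the key and text values are illustrative.

```csharp
using System.Windows.Forms;

// Each generated ListViewItem carries a stable key (Name) in addition to its
// display text, so a script can verify that the right records are present
// even when the visible text or ordering changes.
class ResultListBuilder
{
    public static ListView BuildResultList()
    {
        ListView results = new ListView();
        results.View = View.Details;
        results.Columns.Add("Vehicle");

        ListViewItem item = new ListViewItem("Roadster 2.0 Cabrio");
        item.Name = "vehicle-4711";   // stable key a test tool can look up
        results.Items.Add(item);

        return results;
    }
}
```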
Consistency
If development uses the same ID/name for controls with the same or similar purpose, fewer GUI objects need to be maintained in your scripts or in your repository. This makes automated tests easier to handle.

I was once given the task of making sure that users of a car valuation application could request a valuation for all vehicles offered by the GUI. The first step was to identify all models for each manufacturer, then to ascertain all submodels for each model (n*n*n). The next step was to click the Submit button and check whether the correct page showed up. The verification for a random selection of the tests was easy to perform
FIG. 1: SEEING RED
required us to include a special compomanually. If the of such tools and FIG. 2: SINGLE-SELECT nent into the Delphi project to enable next page confound uses for each. our test tool to see the data properties of tained the expected Both tools had a builtmost of the GUI controls. Corresponding controls, we were in recorder and both newsgroups are full of such examples. done. In case of recorded the selecan error after subtion of a list-box item mission, a manual Custom Controls correctly, but each in tester could have The use of painted and/or customized its own way. easily detected a red third-party GUI controls, or components The differences error message. But that contain excessive graphics, can preswere in the execution with about 1,500 ent an extra challenge for automated GUI of the saved recordvehicles, it was imtesting. This is true not only when it comes ings. While tool A did possible to do this to recognizing the controls, but also to exactly what was test manually in an your ability to get/read property data. recorded, tool B acceptable time frame. For example, a flight traffic–control returned an exception and was incapable The challenge in automating this test application, by design, is full of dynamiof completing the script. After some inveswas to catch the red error message (see cally displayed pictures. It’s therefore not tigation, it turned out that tool B had a Figure 1). The unique ID associated with very useful built-in the table-embedded error text was diffeature that unfortu- FIG. 3: MENU CODE ferent depending on the type of error nately didn’t work if that occurred. If a test didn't make it to the value of the comthe next page, the script knew somebo box consisted of a thing went wrong, but we wanted to special char, such as know exactly what went wrong without the OR char (see the need to parse the HTML code. Figures 2, 3 and 4). Tool B used the Frequent Changes OR char to drive the It’s worth verifying whether the object IDs test automation tool a good candidate for GUI automation get assigned a different name automatically to multi-select items in only one line of unless the developer provides you with each time the application is restarted or code (see Figures 5, 6 and 7). It used the a hotkey combination (or the tabulator if they’re assigned dynamically depending character as the separation indicator and button) to navigate through the graphion the actions performed on the GUI. failed to execute the command because cal icons (airplanes) using the keyboard. I’ve seen applications in which the name the combo box used the same character To find the target airplane in such a GUI of the submission button depended on the as part of the selectable values. The develsource from which the dialog was called. control, sometimes n or more hotkey clicks oper cooperated by using a different charneed to be performed depending on the Such dynamics have an extremely negative acter to name/value items in the list. number of airplanes shown on the screen. Instead of using The developer might need to provide you the value, we could FIG. 4: SCRIPT with an extra GUI element that is readable have asked the by your test tool so you can determine which developer to add a airplane has the focus. This approach would unique ID for each allow you to navigate through a custom conof the values. In this trol accurately, even though it’s a black box case, it was a failing and the number of airplanes could vary impact on script maintenance. 
While you’re of tool B, which could read only the valfrom one test to another. organizing your code into reusable funcue and not the ID. As another example, consider an tions, you’ll not only need to pass the object Language invoice system that updates the calculationname, you also need to know if the name The programming language used to output preview pane in real time each time is variable. develop the AUT also impacts your Such situations can greatly complicate automation project, since it may play an automation. It’s helpful to demonstrate to FIG. 5: MULTI-SELECT important role in whether the test tool developers what happens to your scripts can read the control’s property data. when they perform such changes without Developers might need to include extra notifying you. These demonstrations need code that provides automation tools with to be repeated periodically. the ability to communicate with the AUT. One test tool I’ve used would recogSpecial Chars Different GUI-automation tools do difnize the names of Java GUI controls only if they started with a dot (.) as prefix. It ferent things. What’s good for tool A was documented in the test tool’s refmight not work for tool B, or the other erence manual, but who reads manuals way around. the first time through? When attempting to automate my car Another AUT developed in Delphi valuation program, I employed a pair
32
• Software Test & Performance
OCTOBER 2007
CAST YOUR HOOKS
you add a new item or change a quantity. or by providing them FIG. 6: MENU CODE The preview-pane control may be a custom with an API to read, control from a third-party vendor that lacks update and create the ability to serve your test tool with any data where access to valuable information that could be used the database is not as an automated checkpoint. allowed or blocked In such a situation, it’s useful to check through firewalls. whether the preview pane’s output may I once tested a be collected in a readable format from Web-based wreck a different source, such as from a file. If auction system that the preview pane gets its information had several built-in from an API that you can “hook” to, it API functions for customers who didn't place bids, for instance). There also was suddenly opens a whole new set of poswant to use the Web browser UI. A servlet no special API function that allowed us sibilities for automation. Other automaAPI was used to upload wreck data, downto update the auction end-time to the curtion engineers might have already given load lists when the auction ended and rent time. The advantages of such feaup by that point, but not you. perform post actions after the winner of tures were obvious because changes in If the preview-pane output is the API were treated with much available in the file system, it might FIG. 7: SCRIPT more care than the GUI. But the be more efficient to perform tests tester’s request came so late to using the file(s) and keep only a development that the testability small set of manual tests to valifeatures were not given a high date the transformation. priority. Such hooks are powerful Obfuscation the auctioned wreck had been identified. instruments that testers often underrate Use of obfuscation tools to secure source This system was predestined to be fulor request too late. Consider the vast code or protect copyrights can impede ly and automatically tested through the amount of test cases you could upload if testing, particularly the reading of GUI interface. Such tests would have taken a the system provided an API under which component data. The best way to comfraction of the time to execute when comall the GUI fields could be filled into an bat this is to be sure developers give you pared to an equivalent GUI test suite. XML file and then uploaded to the sysaccess to code before it is obfuscated. However, there were two minor functions tem under test through a stub. missing that blocked us from putting The ability to read and write configuPrototype together a great automation job on this ration files or a user’s current settings Sometimes it can be hard to make sure all project with low maintenance costs and stored on a master database through a testayour controls can communicate with the fast execution. bility hook is an effective approach and GUI test-automation tool because of the The API was designed for the needs should be requested from the very beginlack of test cases or documentation, or of insurance companies. No API function ning. It’s also helpful for invoking differbecause the application is simply too comwas ever requested or considered for the ent behavior on the client application. plex for a test automation engineer to grab need of other stakeholders (dealers who Consider how much more vulnerable your all the controls in all the dialogs. GUI scripts are if the only way to At a later stage of develop- FIG. 
8: GUI COMPOSITE check a user’s configuration setment, you might need to autotings is by navigation through a mate a test that requires a dialog graphical user interface. with a control you never noticed The same principle applies to before. You might request a simmanual testing. Why not allow ple demo application in which testers to access the log files on a developers expose all the differserver through a Web-based interent GUI controls they intend to face? Of course, relying solely on use (see Figure 8). This might testing beneath the GUI also has help you to quickly identify its risks. It’s often not the way an where your test automation tool end user works with the system, may have problems reading or and problems in the GUI layer can accessing some controls. remain undetected. I’ve also experienced API implementations that Hooks and Stubs did a different job than the same Testing activities, automated or action performed at the GUI levmanual, can be more effective if el. In this case, we determined that testers have tools that allow quick developers had redundant code checks and/or changes to fields branches, of which only one was in the underlying persistence layworking correctly. er (see Figure 9). This can be Tool Dependency done by allowing testers to conPay attention to dependencies nect to the persistence database OCTOBER 2007
on other applications. If the application under test works together with other products by calling them or exchanging data, such tools need to be kept in mind as well. This could mean that you may have to perform an extra testability analysis for those tools.
AUTOMATION CHECKLIST
• Determine whether the automation test tool(s) can access error descriptions by forcing the application into error conditions.
• Restart the application after you've recorded the test scripts and check whether the object names remain the same.
• If developers keep a central repository of the object ID/NAMES (for example, resource.h for Visual C++), analyze its history to determine frequency of change.
• Identify list views to determine how your scripts cope with data sorted in different orders, and/or whether the correct control is still selected when list items are added since the script was recorded.
• Find out whether the test automation tool has difficulties reading certain GUI controls. If so, find alternatives such as keyboard hotkeys. Ask developers for such testability features.
• Check that ID/NAMES are not provided with special characters. This is especially important for projects targeting European countries.
• Identify programming languages used in the AUT and clarify support with the test-tool maker and whether add-ons are required.
• Inform developers about the impact on test scripts if code on the GUI level is obfuscated. Ideally, request that obfuscation be limited to the business- and data-access layers, or test a non-obfuscated version.
• Request advance implementations of all GUI objects to be used in the AUT.
• Ask for APIs that allow testing beneath the GUI and test important core functions by bypassing the GUI to improve script maintainability.
• Ask for testability hooks and the ability to submit debugging parameters.
• Estimate the number of expected builds over a certain period of time to determine whether automation is practical.
• Evaluate GUI stability and script accordingly.
• Know where automation scripts will be executed, on what machines and operating systems, and plan accordingly.
• If building automation scripts for a new version of an app, ask for a predecessor build and record a few scripts for the old release first. Then, before you upgrade to the new release, execute the automated regression tests to see what changes might be required in your scripts for the upgrade.

Number of Builds
If you get only a few builds to test, the effort required to automate testing might not be worth it. On the other hand, too many builds may stop you from maintaining existing scripts and/or limit the coverage of your tests. There is no exact number to check against; each project is different. I've experienced projects in which we received several builds a day, and testing was expected of each one. In other projects, builds came once a month. The latter allowed us to focus on script development, but when we needed a fix, the wait to adapt and test our scripts was long. The ideal solution is to find the right balance, or to have access to the source code and build your own intermediate version of the AUT whenever you need it.
AUT Maturity
When it comes to early testing, it's important to distinguish among automated component testing, API-based testing and GUI-based automation. Automated component and API-based test scripts can often be built in parallel with development. The same doesn't apply to GUI-based test scripts: minor changes to the interface often cause tests to fail. Starting GUI automation early in projects makes sense only if the task is to identify potential testability issues and to determine whether developers have introduced accurate and consistent naming of the GUI controls.
FIG. 9: UNDER THE HOOD
Productive test scripts for the GUI layer should be written only when the GUI is stable and working as close to 100 percent as possible. The earlier that GUI test scripts are started, the more time they'll require to maintain. On the other hand, the later they're started, the more likely you'll be assigned to another project, leaving the others with only a few working scripts, if any. Seek a balance here.
Tool Selection
Test automation tools are often limited to, or run best with, a certain operating system or configuration. If you limit your search to tools that run on multiple operating systems, you might also be restricting yourself in terms of quality or feature set. A better approach might be to select tools based on what's best for the target operating system.

The same applies to automation tools for Web browsers. Don't expect scripts to execute on different Web browsers without adaptation. A good strategy is to create an automation against one specific environment, then perform random checks on others manually. At the end, you might have to settle for just one test configuration running automatically, or consider different tools for different browsers.

Make the Case Early
In my experience, this checklist has helped us quite a lot at the start of a project to demonstrate what's needed for an AUT to facilitate test automation. It's easier for everyone if a set of test cases has already been identified with the goal of automation, rather than trying to determine later whether the AUT can be tested automatically.

This approach, to first investigate the application under test and then start preparing the automation, enables you to provide developers with early feedback on testability at a time when changes are easy to introduce. Once an application is released for acceptance testing, it becomes more difficult to win over developers for implementing your requirements or making them a high priority. Making your needs known early will serve you best.
Business demands IT value.
Compuware delivers.
Compuware Corporation is a recognized industry leader in enterprise software and IT services that help maximize the value of technology investments. We offer a powerful set of integrated solutions for enterprise IT including IT portfolio management, application development, quality assurance and IT service management. Our solutions address the major stages of the IT life cycle—and help govern and manage the IT business process as a whole. www.compuware.com
Best Practices
Beyond Microsoft Minutiae Trolling for best practices veteran with particular in .NET testing, it would be expertise in Java and Web easy enough to cast into applications. “The business the deep pool that is is the question.” MSDN and come up with a column that might well run No Manual UI Testing the entire length of this Businesses want customers month’s Software Test & who are happy and loyal, Performance. feelings that are engenSo too is it tempting to dered by things like intujump into the mosh pit that itive, commonsensical user Geoff Koch is one of the great software interfaces (UIs) and respondebates in recent years: Which is better, sive, highly available Web applications. Java or .NET? After corresponding with With their reusable development Glenn Morgan, however, I feel that either tools, .NET and other managed code of these approaches would ring somewhat frameworks make it simple enough to hollow for those tasked with stitching build all parts of a program, including its together applications in an all-Web world. UIs or application presentation layer. But “Most companies already have eletesting this user-facing code is another ment management in place to monitor matter entirely, especially when it comes network utilization, server CPU usage or to individual .aspx pages. application metrics,” says Morgan, tech“How can you test that the UI reacts to nical director of U.K.-based Net the user correctly? That the postback Consulting Ltd., when asked about broad events happen in the right order? That themes in testing with the .NET framethe correct next page is loaded after the work. “However, all this monitoring tends user completes the page?” writes consultto remain within the separate silos, and ant and trainer Justin Gehtland in a there is no joined-up view of performMarch 2004 article on NUnitASP for ance as a whole.” TheServerSide.NET, a community site for Morgan’s comments are reflective of a enterprise .NET developers. few broader industry truths. The first is NUnitASP is an open source tool that that even though managed frameworks provides the necessary scaffolding to do such as .NET make life easier for develunit testing with NUnit. Pages run in the opers, complexity in coding still reigns, in actual ASP.NET worker process, but the large measure because nearly every applideveloper creates a mock facade containcation is networked. The second is that er with which to test the UI to do unit testeven though tools exist to help, testing ing with NUnit. Originally ported from still is mostly about hands-on diligence JUnit and also open source, NUnit is a and disciplined organizational process. C#-based unit-testing framework for The third is that polemics about various .NET. technology platforms, though still Proprietary solutions also exist for bandied about with some vigor in the blothe problem of testing UIs built with gosphere, are uninteresting if not downand for .NET. For example, programright banal to today’s pragmatic tech mers willing to use the Infragistics NetAdvantage toolset to develop UIs problem solvers. for Windows Forms and ASP.NET “We should say, ‘Let’s look at what the applications can use HP QuickTest customer wants to do and where he wants Professional to automate functional to go,’” writes Ben Pashkoff, in his August and regression testing. 6 blog posting titled, ‘Is Java vs. 
.NET HP’s product can record and replay even relevant anymore?’ script commands for applications built “The technology is not the question,” with NetAdvantage, which has custom continues Pashkoff, a 20-year industry
libraries for building interface elements such as grids, menus and tabs. This tack holds some advantages over the time-sink option of using generic mouse coordinates to test the UI. However, despite HP’s marketing claims—a cheery whitepaper promises that the HP-Infragistics combo is “extremely easy to learn and intuitive to use” and will “dramatically lower personnel costs by running tests unattended”—it’s impossible to miss the fact that this approach is still labor- and skillintensive. Specifically, the manual plugand-chug run through the application test that’s recorded had better be smart and robust, or replaying it will be all but worthless. .NET is now nearly seven years old, if you date its origins back to the early 2002 release of version 1.0 and not to earlier beta releases. In true Microsoft fashion, the framework has continued to improve. And though the impression may linger that Java is the language of choice for Web developers, even the staunchest Microsoft critics have to acknowledge that .NET now offers much in the way of Web-friendly tools. Another Redmond trademark is a prodigious effort to support the community of Microsoft developers, especially through MSDN. Of course, the site is a logical first stop when looking for information on .NET testing. I’ll leave it to you to do your own in-depth searching. For those looking for a starting point, the February 2007 technical article “Reviewing Managed Code” by Microsoft’s John D’Addamio is worth flagging with a bookmark or del.icio.us tag, mostly for its description of FxCop, a tool for scrubbing .NET managed-code assemblies for adherence to .NET design guidelines. The problem with MSDN, however, is that it too often assumes the readerBest Practices columnist Geoff Koch filed this story from a Starbucks in Lethbridge, Alberta, Canada, where the locals always use the Celsius scale. Contact him at koch.geoff @gmail.com. OCTOBER 2007
hit, especially when dealing with latency associated with WAN connections. “I worked on an account for a transport company that was considering moving an application server to a remote data center,” Morgan says. “A look at one transaction showed some 18,000 request/response cycles. Add 100 milliseconds of latency to each of these, and the user response time would have been increased by 30 minutes.”
MS Transaction Tagging? Morgan hopes that Microsoft might some day add features to .NET to make it a bit easier to do performance testing in such environments. His suggestion is some sort of transaction tagging, which would allow developers and IT architects to track specific transactions from the client through the proliferating layers of back-end tiers. The current industry solutions, networktracing software that either extrapolates the experience of a single user or follows persistent information like Web cookies, simply don’t scale well to large environ-
ments, Morgan complains. “.NET seems to be in the ideal position to correct this situation, since all code passes through a common framework,” he says. “Tagging transactions at this level should be relatively easy, and would avoid the problem of having to create the feature in different codesets that would otherwise make it impractical to implement.” For now, Morgan seems content to avoid both the MSDN minutiae and the Java vs. .NET vitriol, instead focusing on mundane issues such as tracking and timing information in network packet headers. “The advantage of this approach is that there is no distinction between monitoring .NET, HTTP, Oracle or any other TCP/IP-based application,” he says. “Is this approach as accurate as some tools that track the keyboard-to-eyeball response times of an application? No, but it’s like measuring temperature in Celsius or Fahrenheit—they’ll both tell you when it starts to get hot, they’ll just use different numbers to do it.” ý
Index to Advertisers
Compuware – www.compuware.com
EclipseWorld – www.eclipseworld.net
Empirix – www.empirix.com/freedom
FutureTest 2008 – www.futuretest.net
Gomez – www.gomez.com/load-testing
Hewlett-Packard – www.hp.com/go/software
IBM – www.ibm.com/takebackcontrol/innovate
iTKO – www.itko.com/lisa
Pragmatic – www.softwareplanner.com
Seapine – www.seapine.com/stptcm
Future Test
Solving SOA Issues With Lean Principles
Hon Wong

A recent survey of CIOs of North American companies by the McKinsey management consultancy revealed two important trends in information technology for 2007: the migration to service oriented architectures and the introduction of lean manufacturing principles to data center operations. These trends have important implications for the discipline of software testing.

Since the inner workings, reliability and scalability of SOA services are not known or within the control of the developer, it's difficult to ensure the composite application's performance and reliability, even if its functionality can be validated using "typical" test cases generated by load-testing tools. The testing challenge is compounded by a lack of visibility into the Web infrastructure, which can cause unacceptable levels of end-user service.

The Right Stuff
The key to testing complex SOA applications lies, ironically, in applying lean manufacturing principles that aim for the elimination of waste. Based on W. Edwards Deming's classic work in team-centric management, statistical quality control and process improvement, these concepts allow companies like Toyota (with its Toyota Production System) to dominate their market segments.

Lean is about delivering the right things to the right place at the right time in the right quantity, while minimizing waste and staying open to change. Waste, in the case of IT, normally means over-provisioning of infrastructural resources such as servers, software licenses or network bandwidth. It can also mean wasting the end user's time due to bad performance.

SOA, while increasing complexity, offers a significant improvement in flexibility. Furthermore, with the Web, the browser becomes the common UI for all personal and enterprise applications, without requiring you to relearn application-specific UIs or commands.

Under lean principles, software test must ensure that the right things (transactions) are delivered to the right place (successful completion of such transactions) at the right time (adequate performance) prior to production deployment. It's not sufficient to validate an application's functionality and performance in the vacuum of the test lab. Instead, extend the test to real users executing real transactions in real time during beta test or early deployment. Moreover, information on real user experience should be fed back to architects, developers and IT ops to drive continued application and infrastructure tuning.

"The Machine That Changed the World: The Story of Lean Production" (by James Womack and co-authors) established five key elements:

Value: Understand the value that the user places on the product or service. For software testing, this involves measuring performance (e.g., the amount of time needed to refresh a Web page or perform catalog look-up) and functionality (e.g., broken links or page error).

The Value Stream: Value stream is the entire flow of the transaction
life cycle that extends from the browser (for Web-based SOA applications) to the database. Because of the complexity of Web-deployed SOA applications, software test should be able to track a transaction through the entire value stream in real life—not just in a test environment.

Flow: If transaction flow is interrupted, waste will occur. Test engineering must be able to obtain a holistic view from end to end so as to identify application hot spots or infrastructural bottlenecks.

Pull: Instead of overprovisioning IT resources across the entire infrastructure to minimize the occurrence of bottlenecks, software test should have the ability to pinpoint bottlenecks by tracking real-life use of the applications on real production infrastructure during beta test.

Perfection: The continued drive toward perfection is critical, especially for complex SOA applications. Software test should have a common workflow and communication platform so that actionable information can be relayed to other IT constituents. There should be a sharable "case file" that correlates data collected from physical (servers, network, etc.) and software (method calls, SQL queries, etc.) components to identify the cause(s) of performance problems or errors. As a result, interdepartmental triage teams won't be required to debate and re-create problems after the fact. Furthermore, the information can be used to tune the applications and infrastructure.

The benefits of SOA and lean manufacturing principles are realized only if software tests measure performance from the user's perspective during load or beta testing, and relate any problems to issues within the infrastructure or application. In addition, there must be a common communication platform for the various IT constituents to access the test data, so as to drive continued improvements.

Hon Wong is CEO of Symphoniq, a maker of tools for Web application performance management. He's also a prolific blogger at symphoniq.typepad.com.
Make sure your critical applications are never in critical condition. We've turned I.T. on its head by focusing on I.T. solutions that drive your business. What does this mean for Quality Management? It means efficiency that results in shorter cycle times and reduced risk to your company. It also means you can go live with confidence knowing that HP Quality Management software has helped thousands of customers achieve the highest quality application deployments and upgrades. Find out how. Visit www.hp.com/go/software. Technology for better business outcomes.
©2007 Hewlett-Packard Development Company, L.P.