Application Performance Monitoring: What it means in today’s complex software world
The first of three parts
BY DAVID RUBINSTEIN
SD Times, April 2020, www.sdtimes.com
Software continues to grow as the driver of today’s global economy, and how a company’s applications perform is critical to retaining customer loyalty and business. People now demand instant gratification and will not tolerate latency — not even a little bit.
As a result, application performance monitoring is perhaps more important than ever to companies looking to remain competitive in this digital economy. But today’s APM doesn’t look much like the APM of a decade ago.

Performance monitoring then was more about the application itself, and very specific to the data tied to that application. Back then, applications ran in data centers on-premises and were written as monoliths, largely in Java, tied to a single database. With that simple n-tier architecture, organizations were able to easily collect all the data they needed, which was then displayed in Network Operations Centers to systems administrators. The hard work
came from command-line launching of monitoring tools — requiring systems administration experts — sifting through log files to see what was real and what was a false alarm, and from reaching the right people to remediate the problem.

In today’s world, doing APM efficiently is a much greater challenge. Applications are cobbled together, not written as monoliths. Some of those components might be running on-premises while others are likely to be cloud services, written as microservices and running in containers. Data is coming from the application, from containers, Kubernetes, service meshes, mobile and edge devices, APIs and more. The
complexities of modern software architectures broaden the definition of what it means to do performance monitoring. “APM solutions have adapted and adjusted greatly over the last 10 years. You wouldn’t recognize them at all from what they were when this market was first defined,” said Charley Rich, a research director at Gartner and lead author of the APM Magic Quadrant, as well as the lead author on Gartner’s AIOps market guide. So, although APM is a mature practice, organizations are having to look beyond the application — to multiple clouds and data sources, to the network, to the IT infrastructure — to get the big picture of what’s going on with
their applications. And we’re hearing talk of automation, machine learning and being proactive about problem remediation, rather than being reactive.

“APM, a few years ago, started expanding broadly both downstream and upstream to incorporate infrastructure monitoring into the products,” Rich said. “Many times, there’s a problem on a server, or a VM, or a container, and that’s the root cause of the problem. If you don’t have that infrastructure data, you can only infer.”

Rekha Singhal, the Software-Computing Systems Research Area head at Tata Consultancy Services, sees two major monitoring challenges that modern software architectures present.
Gartner’s 3 requirements for APM

APM, as Gartner defines it in its Magic Quadrant criteria, is based on three broad sets of capabilities, and in order to be considered an APM vendor by Gartner, you have to have all three. Charley Rich, Gartner research director and lead author of its APM Magic Quadrant, explained:

The first one is digital experience monitoring (DXM). That, Rich said, is “the ability to do real user monitoring, injecting JavaScript in a browser, and synthetic transactions — the recording of those playbacks from different geographical points of presence.” This is critical for the last mile of a transaction and allows you to isolate and use analytics to figure out what’s normal and what is not, and understand the impact of latency. But, he cautioned, you can’t get to the root cause of issues with DXM alone, because it’s just the last mile.

Digital experience monitoring as defined by Gartner is to capture the UX latency errors — the spinner or hourglass you see on a mobile app, where it’s just waiting and nothing happens — and find out why. Rich said this is done by doing real user monitoring — for web apps, that means injecting JavaScript into the browser to break down the load times of everything on your page as well as background calls. It also requires the ability to capture screenshots automatically, and to capture entire user sessions. This, he said, “can get a movie of your interactions, so when they’re doing problem resolution, not only do they have the log data, actual data from what you said when a ticket was opened, and other performance metrics, but they can see what you saw, and play it back in slow motion, which often provides clues you don’t know.”

The second component of a Gartner-defined APM solution is application discovery, diagnostics and tracing. This is the technology to deploy agents out to the different applications, VMs, containers, and the like. With this, Rich said, you can “discover all the applications, profile all their usage, all of their connections, and then stitch that together to what we learn from digital experience to represent the end-to-end transaction, with all of the points of latency and bottlenecks and errors so we understand the entire thing from the web browser all the way through application servers, middleware and databases.”

The final component is analytics. Using AI, machine-learning analytics applied to application performance monitoring solutions can do event correlation, reduce false alarms, do anomaly detection to find outliers, and then do root cause analysis driven by algorithms and graph analysis.

— David Rubinstein
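To illustrate the real user monitoring piece Rich describes, here is a minimal sketch in TypeScript using the standard browser Performance API to break down page and resource load times and ship them off. The /apm/rum collector endpoint is a hypothetical stand-in for whatever agent or backend an APM product would provide; this is not any vendor’s actual instrumentation.

```typescript
// A minimal sketch of browser real user monitoring, assuming a hypothetical
// /apm/rum collector endpoint.
type ResourceTiming = { name: string; durationMs: number };

function collectPageTimings(): { pageLoadMs: number; resources: ResourceTiming[] } {
  // Navigation timing: how long the page itself took to load.
  const [nav] = performance.getEntriesByType("navigation") as PerformanceNavigationTiming[];
  const pageLoadMs = nav ? nav.loadEventEnd - nav.startTime : 0;

  // Resource timing: load time of every script, stylesheet, image and background call.
  const resources = (performance.getEntriesByType("resource") as PerformanceResourceTiming[])
    .map((r) => ({ name: r.name, durationMs: r.duration }));

  return { pageLoadMs, resources };
}

window.addEventListener("load", () => {
  // Wait one tick so loadEventEnd has been recorded, then ship the data
  // without blocking the page.
  setTimeout(() => {
    navigator.sendBeacon("/apm/rum", JSON.stringify(collectPageTimings()));
  }, 0);
});
```

A synthetic-transaction tool would replay recorded interactions against the same endpoints from different geographic locations; the passive measurement above only covers what real users actually experience.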
First, she said, is multi-layered distributed deployment using Big Data technologies such as Kafka, Hadoop and HDFS. The second is that modern software, also called Software 2.0, is a mix of traditional task-driven programs and data-driven machine learning models.

“The distributed deployment brings additional performance monitoring challenges due to cascaded failures, staggered processes and global clock synchronization for correlating events across the cluster,” she explained. “Further, a Software 2.0 architecture may need a tightly integrated pipeline from development to production to ensure good accuracy for data-driven models. Performance definitions for Software 2.0 architectures are extended to both system performance and model performance.”

Moreover, she added, modern applications are largely deployed on heterogeneous architectures, including CPUs, GPUs, FPGAs and ASICs. “We still do not have mechanisms to monitor the performance of these hardware accelerators and the applications executing on them,” she noted.
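Singhal’s last point about Software 2.0 can be made concrete with a small sketch. The names below (InferenceRecord, recordInference and so on) are illustrative only, not from any particular monitoring product: each inference is recorded with its latency, ground truth is attached when it later becomes available, and one summary rolls up system performance and model performance together.

```typescript
// A minimal sketch: one monitoring record covers both system performance
// (latency) and model performance (accuracy once ground truth arrives).
interface InferenceRecord {
  requestId: string;
  latencyMs: number;   // system performance
  prediction: string;
  actual?: string;     // filled in later, when ground truth is known
}

const records: InferenceRecord[] = [];

function recordInference(requestId: string, latencyMs: number, prediction: string): void {
  records.push({ requestId, latencyMs, prediction });
}

function recordGroundTruth(requestId: string, actual: string): void {
  const rec = records.find((r) => r.requestId === requestId);
  if (rec) rec.actual = actual;
}

// Roll up both dimensions of "performance" for a dashboard or alert rule.
function summarize(): { p95LatencyMs: number; accuracy: number } {
  const latencies = records.map((r) => r.latencyMs).sort((a, b) => a - b);
  const p95LatencyMs = latencies[Math.floor(0.95 * (latencies.length - 1))] ?? 0;
  const labeled = records.filter((r) => r.actual !== undefined);
  const correct = labeled.filter((r) => r.actual === r.prediction).length;
  const accuracy = labeled.length ? correct / labeled.length : 0;
  return { p95LatencyMs, accuracy };
}
```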
The new culture of APM

Despite these mechanisms for total monitoring not being available, companies today need to compete to be more responsive to customer needs. And to do so, they have to be proactive.

“We’re moving from a culture of responding ‘our hair’s on fire’ to being proactive,” said Joe Butson, co-founder of consulting company Big Deal Digital. “We have a lot more data … and we have to get that information into some sort of a visualization tool. And, we have to prioritize what we’re watching. What this has done is change the culture of the people looking at this information and trying to monitor and trying to move from a reactive to proactive mode.”

In the earlier days of APM, when things in an application slowed or broke, people would get paged. Butson said, “It’s fine if it happens from 9 to 5, when you have lots of people in the office, but then some poor person’s got the pager that night, and that just didn’t work because of what it meant in the MTTR — mean time to
recovery — depending upon when the event occurred, it took a long time to recover. In a very digitized world, if you’re down, it makes it into the press, so you have a lot of risk from an organizational perspective, and there’s reputation risk.”

High-performing companies are looking at data and anticipating what could happen, and that’s a really big change, Butson said. “Organizations that do this well are winning in the marketplace.”
Whose job is it, anyway?

With all of this data being generated and collected, more people in more parts of the enterprise need access to this information.

“I think the big thing is, 10-15 years ago, there were a lot of app support teams doing monitoring, I&O teams, who were very relegated to this task,” said Stephen Elliot, program vice president for I&O at research firm IDC. “You know, ‘identify the problem, go solve it.’ Then the war rooms were created. Now, with agile and DevOps, we have [site reliability engineers], we have DevOps engineers, there is a broader set of people that might own the responsibility, or have to be part of the broader process discussion.”

And that’s a cultural change. “In the NOCs, we would have had operations engineers and sys admins looking at things,” Butson said. “We’re moving across the silos and have the development people and their managers looking at refined views, because they can’t consume it all.”

It’s up to each segment of the organization looking at data to prioritize what they’re looking at. “The dev world comes at it a little differently than the operations people,” Butson continued. “Operations people are looking for stability. The development people really care about speed. And now that you’re bringing security people into it, they look at their own things in their own way. When you’re talking about operations and engineering and the business people getting together, that’s not a natural thing, but it’s far better to have the end-to-end shared vision than to have silos. You want to have a shared
understanding. You want people working together in a cross-functional way.”

Enterprises are thinking through the question of who owns responsibility for the performance and availability of a service. According to IDC’s Elliot, there is a modern approach to performance and availability. He said at modern companies, the thinking is, “‘We’ve got a DevOps team, and when they write the service, they own the service, they have full end-to-end responsibilities, including security, performance and availability.’ That’s a modern, advanced way to think.”

In the vast majority of companies, ownership for performance and availability lies with particular groups having different responsibilities. This can be based on the enterprise’s organizational structure, and the skills and maturity level that each team has. For instance, an infrastructure and operations group might own performance tuning. Elliot said, “We’ve talked to clients who have a cloud COE that actually has responsibility for that particular cloud. While they may be using utilities from a cloud provider, like AWS CloudWatch or CloudTrail, they also have the idea that they have to not only trust their data but then they have to validate it. They might have an additional observability tool to help validate the performance they’re expecting from that public cloud provider.”

In those modern organizations, site reliability engineers (SREs) often have that responsibility. But again, Elliot stressed skill sets. “When we talk to customers about an SRE, it’s really dependent on, where did these folks come from?” he said. “Were they reallocated internally? Are they a combination of skills from ops and dev and business? Typically, these folks reside more
along the lines of IT operations teams, and generally they have operating history with performance management, change management, monitoring. They also start thinking: Are these the right tasks for these folks to own? Do they have the skills to execute it properly?”

Organizations also have to balance that out with the notion of applying development practices to traditional I&O principles, and bringing a software engineering mindset to systems admin disciplines. And, according to Elliot, “It’s a hard transition.”

Compound all that with the growing complexity of applications running in the cloud as containerized microservices, managed by Kubernetes and using, say, an Istio service mesh in a multicloud environment. TCS’ Singhal explained that containers are not permanent, and microservices deployments have shorter execution times. Therefore, any instrumentation in these types of deployments could affect the guarantee of application performance, she said. As for functions as a service, which are stateless, application states need to be maintained explicitly for performance analysis, she continued.

It is these changes in software architectures and infrastructure that are forcing organizations to rethink how they approach performance monitoring, from a culture standpoint and from a tooling standpoint. APM vendors are adding capabilities to do infrastructure monitoring, which encompasses server monitoring, some amount of log file analysis, and some amount of network performance monitoring, Gartner’s Rich said. Others are adding or have added capabilities to map out business processes and relate the milestones in a business process to what the APM solution is monitoring.

“All the data’s there,” Rich said. “It’s in the payloads, it’s accessible through APIs.” He said this ability to visualize data can show you, for instance, why Boston users are abandoning their carts at a rate 20% greater than users in New York over the last three days, and come up with something in the application that explains it.
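Singhal’s point about short-lived containers and stateless functions can be sketched in a few lines. The wrapper below is a hypothetical TypeScript example, not any vendor’s agent; the collector.example.com URL and the checkout-function name are placeholders. Because nothing persists after the handler returns, the measurement has to be pushed out explicitly before the instance can be frozen or torn down.

```typescript
// A minimal sketch of instrumenting a short-lived function: measure the
// handler's duration and flush it to a (hypothetical) collector before exit.
async function instrumented<T>(name: string, handler: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await handler();
  } finally {
    const metric = { name, durationMs: Date.now() - start, at: new Date().toISOString() };
    // Await the flush so the runtime cannot tear the instance down before
    // the measurement leaves the process.
    await fetch("https://collector.example.com/metrics", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(metric),
    }).catch(() => { /* never let monitoring failures break the function itself */ });
  }
}

// Usage: wrap the real work so every invocation reports its own duration.
export const handler = () =>
  instrumented("checkout-function", async () => {
    // ... business logic would run here ...
    return { statusCode: 200 };
  });
```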