Observability: A guide for buyers

Observability gives organizations insights into the full user experience to drive better business outcomes

By David Rubinstein

June 2023

In a world that’s gone from developers delivering features for the business to delivering products to users, the customer experience is now the top driver for many organizations creating software.

So, how do these organizations know how their applications and websites are treating customers? Through observation. But today’s observability bears little resemblance to the days of performance monitoring and Network Operations Centers.

Observability grew out of the complexities of running cloud-native applications, where insights into third-party application components and their APIs, as well as into containers and infrastructure and more, are critical to keeping those applications running smoothly. That, coupled with the need for businesses to understand the experiences their customers and potential customers are having, means it’s critical to have an eye on every layer that impacts how an application is performing.

And it’s not just how the app itself is performing. Now it’s also about security and infrastructure issues, which have to be monitored to prevent disruptions that can cost businesses heavily.

“Ultimately, it’s about knowing what’s happening with the services, getting insights into the services, and ultimately, what impact they have on the business,” explained Carlos Casanova, principal analyst at Forrester. In the early talk of performance monitoring, terms like user experience were used to imply business outcomes. Today, he said, “We’ve kind of matured, and refined our terminology to be all the EX (employee experience) and DX (developer experience) and UX, and whoever else knows what the next X is. And I don’t think it will end there.”

A “State of Observability 2023” study released earlier this month by the Enterprise Strategy Group on behalf of digital transformation company Splunk revealed that downtime can cost organizations up to $500,000 per hour. Observability, the study notes, can help organizations resolve instances of unplanned downtime in mere minutes, when in the past, it could take hours or days.

Jason Haworth, chief product officer at Apica, explained: “As a customer, you might not actually care that DoubleClick ads aren’t loading quickly. However, as a user, if the DoubleClick ad is preventing the rest of my page from rendering and loading up, or I have these weird interstitial ads that pop up, and my entire website that I’m looking at changes, it’s probably a good idea to figure that out. And figure out what it is performance-wise customers are experiencing.”

Observability is not just for cloud services

Casanova said that although observability came into its own with roots in cloud native, it’s important to look at on-premises servers and even mainframes to get a look at the full chain of events within an application to see how it impacts the experience. But he said that a large percentage of the audience today doesn’t see observability going beyond the cloud-native space. “And that’s OK,” he added. “If they’re defining it more in that APM kind of realm, and addressing it there, then there’s nothing wrong with that, but it doesn’t address the full capacity of systems that are out there.”

But many organizations still have a big stake in running things on-premises. Casanova noted that migration to the cloud is still in its early stages, despite the rush to the cloud that the COVID-19 pandemic precipitated. “It’s not like we’re at 80% migration to the cloud. We’re probably in the 20% range,” he said. “So from an observability perspective, where it’s going, it still has a large opportunity to grow there.” Observability has been easier to do on cloud-native systems because the systems have been more designed and built to be observed, he said, as opposed to commercial components or legacy systems that have taken a lift-and-shift approach to cloud migration.

And there are some systems that are simply not willing to be observed, such as stealth devices. “If it’s deliberately keeping itself secret, it doesn’t want to put out too much into the logs. It doesn’t want to put out the telemetry that, say, a nefarious actor is going to act on. So how observable is that system?” Not very observable. However, he pointed out that these more sophisticated monitoring technologies, many of which employ some level of AI/ML, can infer and investigate actions in the system. “It’s easy to just collect the logs, metrics and traces, and then mash them together, and then spit it back out in some kind of visuals. Where the difficulty of the challenge, and where the sophistication of some of these tools comes in, is where they’re able to say, ‘I can fill in those gaps that are missing from the logs and traces and metrics with what I’ve seen before, and match it very closely to that.’”

www.ITOpsTimes.com

EDITOR-IN-CHIEF

David Rubinstein drubinstein@d2emerge.com

ART DIRECTOR

Mara Leonardi mleonardi@d2emerge.com

CONTRIBUTING WRITERS

Eli Cohen

PUBLISHER

David Lyman 978-465-2351 dlyman@d2emerge.com

MARKETING AND DIGITAL MEDIA SPECIALIST

Andrew Rockefeller arockefeller@d2emerge.com

How Apica provides insights into performance

Apica, which began in 2005 as a load testing company, transitioned beginning in 2010 into monitoring. That’s because they noticed organizations doing load testing were breaking it off into separate teams, which tend to use different scripting methodologies and pre-existing tools.

But it’s the convergence of these different toolsets that is becoming very powerful, according to Jason Haworth, chief product officer at Apica.

“We’ve actually been able to solve critical problems much, much quicker,” he said. “We created what we call these desktop application checks, which gives us the ability to do RPA functions on anything. Because the [Apica] agent can basically do anything that you can do with a Windows or Linux system, you can do this for any application in any protocol. So we specifically focused on browser checks, which is 85% of the internet traffic. But we can also do things like monitor web streams, and monitor them end to end for things like color code, color palette, how those pieces actually work. We can do another depth and go further into application flows. So we can basically do any protocol out there that exists on the internet by opening up and writing these things into some of our different client connections. But we have the ability to control a full Windows desktop environment. And we can load up virtual machines on it, we can look at virtual smart devices, all these different components. And again, apply the same testing functionality, and then monitor those pieces over time.”

“The idea,” he continued, “is that you’re going to get consistent testing bases across all these different scenario types. And that consistency is the big piece, because what you wind up having is some groups that do some testing with one tool, other groups that do testing with a different tool. And then things could slip through the cracks because they don’t cover each other’s gaps.”

Haworth pointed out that Apica is not a full application usability testing platform. “You give us the scenarios and then we load them up and then we can do them repeatedly in a CI/CD pipeline or at scale. Because we will basically take anyone’s scripts, convert them and make them work. And it’s this level of enablement that we provide that allows us, a small company, to be in some really big banks and big IoT shops these days.”


Defining observability leaders

By David Rubinstein

In its “State of Observability 2023” study, monitoring and digital transformation company Splunk defines leaders in observability as organizations with at least two years of experience with observability that have achieved the highest rank in five factors:

• Ability to correlate data across all observability tools

• Adoption of AI/ML technology within their observability tools

• Having specialized skills in observability

• Ability to cover both cloud-native and traditional application architectures

• Adoption of AIOps

Other key findings from the study include that observability leaders experience one-third fewer outages per year than beginners; have greater visual clarity that gives organizations the ability to find and fix problems faster; are confident in their ability to meet availability and performance requirements; unify visibility across environments; and understand that AIOps is instrumental to customer experience by determining the root cause of an issue, predicting problems before they become customer-facing incidents, and assessing the severity of an incident.

“With the rising complexity of today’s technology environments and the direct connection between reducing disruptions and optimal customer experiences, observability is fundamental to the successful operations of modern businesses,” Spiros Xanthos, senior vice president and general manager for the observability business at Splunk, said in the announcement of the report. “Observability enables businesses to keep their software and infrastructure reliable, systems secure and customers happy, making it a critical component to any organization’s resilience strategy.”

Where testing comes in

For digital businesses, the most important perspective is that of the people using your applications. So, Apica’s chief product officer Jason Haworth noted, if you’re not actively monitoring the full user experience, you’re not really monitoring, you’re simply observing. And observation, without definitive ways or actions to be done, is useless.

“I can look at something all the livelong day, I can look at my kitchen being dirty, because my kids left a mess. But if I don’t go in there and clean it, or yell at my kids, it’s gonna stay a mess. So that’s what these observability players are missing the most, that they have no way to compare the baseline. They have no way to look at how things actually work. And then they have no way to actually fix it in a relatively short time frame. So all these tools are dedicated towards finding component pieces for how these things work. But they’re all Easter eggs, because they’re not actually exposing the real problem.”

With many different components making up a user experience (user interface, security, third-party and open-source software, APIs, cloud environments and more), testing can provide an understanding of the health of the application. Testing teams can build tests and create different scenarios that test all those components, Haworth said. And then those test scripts can be re-run at scale for load testing, or the scripts can be broken into pieces to be made smaller for things like unit testing.

“Once the testing phase is over, you want to go to the typical blue/green deployment environment in the cloud,” he continued. “Once you do your blue testing, because you have a good baseline on what your performance looks like, then you take those same scripts and turn them on for monitoring.” That way, he said, you’re using the same scripts with consistent understanding, with the same set of time series data over a protracted period of time. “And what you can gather from that is actually a percentage of health over baseline. So not only can you understand what the health of your service is, up or down, but you can actually understand what the performance characteristics look like.” Instead of having to guess if you’re providing a good end user experience, you can actually tell if you are, he noted.
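One way to read Haworth’s “percentage of health over baseline” is as a comparison between the timings a script produced during testing and the timings the same script produces later as a monitor. The Python sketch below only illustrates that reading and is not Apica’s method: the function name `health_over_baseline`, the choice of the median, and the sample timings are all assumptions.

```python
from statistics import median


def health_over_baseline(baseline_ms, current_ms):
    """Return current performance as a percentage of the baseline.

    100 means the script runs as fast as it did during testing; values
    below 100 mean it is slower. The median is an assumed choice here,
    picked because it is robust to a few slow samples.
    """
    if not baseline_ms or not current_ms:
        raise ValueError("both sample lists must be non-empty")
    base = median(baseline_ms)
    now = median(current_ms)
    if now <= 0:
        raise ValueError("timings must be positive")
    return 100.0 * base / now


# Hypothetical timings (milliseconds) from the same script, first during
# blue testing and later while it runs as a monitor.
baseline = [200, 210, 190, 205, 195]
current = [250, 240, 260, 245, 255]
print(f"health over baseline: {health_over_baseline(baseline, current):.0f}%")
```

With these made-up numbers the monitor reports 80%, meaning the service is up but responding more slowly than it did during testing.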


How API observability solves microservices troubleshooting misery

By Eli Cohen

Microservices have gained popularity as an architecture for developing distributed applications that can scale and evolve rapidly.

Nevertheless, as microservices are embraced, observability becomes a critical element for guaranteeing the efficiency of troubleshooting and ongoing maintenance.

API observability enables the monitoring and analysis of the interactions between microservices through their APIs. Naturally, in a microservices architecture, the app is divided into multiple, independently deployable services, and communication between them happens through APIs, from traditional ones such as RESTful or SOAP to newer ones such as gRPC, GraphQL, or even asynchronous ones like Kafka.

API observability allows for the tracking of API calls, response times, and errors, and gives granular visibility into the behavior of a single API, which can help identify performance issues and ensure the reliability of the system. Without API observability, it can be challenging or even impossible to identify the root cause of issues and troubleshoot them accordingly.
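To make “tracking of API calls, response times, and errors” concrete, here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not how any particular product works: the `observed` decorator, the in-memory `stats` table and the `GET /orders` endpoint are all hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-endpoint counters; a real system would export these to a backend.
stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def observed(endpoint):
    """Decorator that records call count, error count and latency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats[endpoint]["errors"] += 1
                raise
            finally:
                stats[endpoint]["calls"] += 1
                stats[endpoint]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@observed("GET /orders")  # hypothetical endpoint name
def get_order(order_id):
    if order_id < 0:
        raise ValueError("unknown order")
    return {"id": order_id}


get_order(1)
try:
    get_order(-1)
except ValueError:
    pass

print(stats["GET /orders"]["calls"], stats["GET /orders"]["errors"])
```

Even this small amount of data per endpoint is enough to compute an error rate and an average latency, which is the granular, single-API view the paragraph above describes.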

How is observability different from monitoring?

Observability is the ability to understand the internal state of a system by analyzing the data it produces, such as logs, metrics, and traces. It allows developers to gain insights into the performance, behavior, and health of a system, even in distributed and dynamic environments like in microservices. It provides a comprehensive view of a system by allowing developers to track requests as they move through the various components of a microservices architecture.

Monitoring, on the other hand, uses logs, metrics, and alerts as well, but it is often limited to predefined metrics and thresholds, and can miss issues that are not covered by the defined metrics. In contrast, observability is a broader concept that encompasses monitoring but goes beyond it, hence allowing faster troubleshooting, improved collaboration, debugging, and solving issues before they become problems in production.

To summarize, observability enables teams to quickly identify and diagnose issues, reducing MTTR (Mean Time to Resolve) and improving overall quality.

The challenge of API observability in microservices

One critical component of observability in microservices is API observability, which is the ability to monitor and analyze API behavior. While critical, API observability can also pose significant challenges:

• API proliferation is a major challenge in API observability

• Developers and product owners lose track and control due to too many APIs

• Developers don’t have access to the organized data needed to resolve issues such as high error rates or latency

• API behavior is constantly changing, which adds to the complexity

• Third-party APIs also add to the complexity

• Different types of APIs, including synchronous (HTTP/gRPC) and asynchronous, lead to different data flows that can make it more challenging to trace requests and understand the behavior of the system

• Not specific to APIs but still relevant, the abundance of data generated by multiple observability tools can cause data overload, making it challenging to extract useful insights, and the manual maintenance of data collection is time-consuming and error-prone

As a result, cloud-native developers struggle to lower MTTR, improve developer experience, and maintain app quality in production. In addition, traditional APM tooling may fall short in providing sufficient insights for complex, cloud-native environments, as it wasn’t built from the ground up for monitoring APIs.

A day-to-day challenge explained: Root cause analysis for an API latency issue

Let’s demonstrate some of the above-mentioned challenges in microservices API observability and troubleshooting.


Imagine an organization that has a distributed system with multiple microservices. One of the developers is trying to perform a root cause analysis of an increase in API latency.

The developer struggles with quickly identifying the outliers among long API calls, with understanding the distribution of durations within the API call, and with zooming in on the calls that show the longest duration.

She also struggles with understanding the full context, meaning the data on what the specific API’s downstream dependencies are, in order to figure out the root cause and map the bottlenecks across the E2E flow.

The next challenge is identifying the API call that is responsible for the latency issue and determining whether it impacts all other traces or is just local in nature.

Another important point is reproduction: exploring whether the problem is still there and doesn’t represent a momentary issue.

This is just one simple example, but developers encounter similar issues on a daily basis: from API discovery to validating APIs, to investigating and troubleshooting issues.
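As a rough illustration of the first step in that example, finding the outliers among API call durations, the Python sketch below flags calls that sit far above the rest of the distribution. It is a sketch under assumptions, not a description of any vendor’s tooling: the trace IDs and durations are invented, and the interquartile-range rule with a factor of 1.5 is just one common convention for spotting outliers.

```python
from statistics import quantiles

# Hypothetical durations (ms) of calls to one API, keyed by trace ID.
durations = {
    "t1": 110, "t2": 95, "t3": 120, "t4": 105, "t5": 980,
    "t6": 100, "t7": 115, "t8": 90, "t9": 1250, "t10": 108,
}


def find_outliers(samples, factor=1.5):
    """Return trace IDs whose duration exceeds Q3 + factor * IQR, slowest first."""
    q1, _, q3 = quantiles(samples.values(), n=4)
    cutoff = q3 + factor * (q3 - q1)
    slow = [tid for tid, ms in samples.items() if ms > cutoff]
    return sorted(slow, key=samples.get, reverse=True)


print(find_outliers(durations))
```

The trace IDs it returns are the calls worth opening first; the questions about downstream dependencies and reproduction still need trace data that a list of durations alone cannot provide.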

A good observability approach and a few best practices can address these challenges and help quickly troubleshoot the example above, but the tooling should provide:

• API-level observability

• Dependencies between APIs

• API specs and their enforcement

[Figure: Analyzing bottlenecks in the E2E flow using Helios’ trace visualization tool]

API observability best practices

1. Enable auto-instrumentation. One of the best practices to overcome API observability challenges is through the use of auto-instrumentation. Instrumentation refers to the process of adding code to an application or service to collect data about its behavior and performance.

Instrumentation is valuable for troubleshooting microservices because it provides a comprehensive view that is otherwise almost impossible to obtain through other methods. It collects data by instrumenting every service. This can include metrics such as request latency, error rates, and resource usage.

In addition, it allows real-time monitoring, hence enabling developers to quickly identify and respond to issues as they arise. This can help reduce downtime and improve system reliability.

Another benefit is it provides a high level of granularity in terms of the data that is collected, such as tracing individual requests across multiple services to identify the source of an issue, rather than just seeing a high-level metric for the entire system.

It also lets developers define and track custom metrics specific to their apps and needs.

Auto-instrumentation, in particular, involves the automatic injection of code into an application or service without the need for manual coding. With auto-instrumentation, developers can track and monitor their API calls and services, getting access to the necessary data, without needing to manually instrument each service.

Auto-instrumentation can be implemented through OpenTelemetry, which supports automatic instrumentation of APIs and services, enabling developers to gain valuable insights into their application’s behavior, troubleshoot errors, and optimize their system.
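The idea behind auto-instrumentation, wrapping existing code from the outside so it reports on itself, can be shown with a toy Python sketch. This is not OpenTelemetry’s API or implementation; it is a standard-library illustration of the concept only, and `auto_instrument`, `recorded` and `InventoryClient` are hypothetical names.

```python
import time
from functools import wraps

recorded = []  # (name, seconds) pairs; stands in for a telemetry exporter


def _timed(name, func):
    """Return a wrapper that times func and records the result."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            recorded.append((name, time.perf_counter() - start))
    return wrapper


def auto_instrument(cls):
    """Wrap every public method of cls so calls are timed automatically."""
    for attr, value in list(vars(cls).items()):
        if callable(value) and not attr.startswith("_"):
            setattr(cls, attr, _timed(f"{cls.__name__}.{attr}", value))
    return cls


class InventoryClient:  # hypothetical service client with no telemetry code
    def reserve(self, sku):
        return f"reserved {sku}"

    def release(self, sku):
        return f"released {sku}"


auto_instrument(InventoryClient)  # one call; no edits inside the class

client = InventoryClient()
client.reserve("A-1")
client.release("A-1")
print([name for name, _ in recorded])
```

The point of the design is that the class itself is never edited. Production-grade auto-instrumentation applies the same wrapping idea to HTTP clients, web frameworks and database drivers instead of a hand-written class.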

2. Add distributed tracing to your monitoring stack. A useful best practice that has gained popularity is to add distributed tracing on top of logging and metrics. Distributed tracing refers to the process of tracking the flow of requests as they move through a distributed system.

With distributed tracing, developers can identify performance bottlenecks and troubleshoot errors by following a request’s journey through the system, from the initial API call to the final response. By using related tools, developers can observe the flow of requests, measure latency and error rates, and gain a holistic view of their system’s performance.
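The mechanism that lets a tracer follow a request’s journey is context propagation: each service records a span and passes the trace ID and its own span ID to whatever it calls next. The Python sketch below shows that mechanism in miniature, using only the standard library. It is an assumption-laden illustration rather than a real tracing library: the `span` helper, the header names and the two services are hypothetical, and a plain dictionary stands in for the HTTP headers a real system would use.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # finished spans; a real tracer would export these


@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Record one unit of work and yield the headers to pass downstream."""
    record = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
    }
    start = time.perf_counter()
    try:
        yield {"trace_id": record["trace_id"], "parent_id": record["span_id"]}
    finally:
        record["duration_s"] = time.perf_counter() - start
        spans.append(record)


def payment_service(headers):
    # Downstream service continues the trace it received from its caller.
    with span("charge-card", **headers):
        time.sleep(0.01)


def order_service():
    # Entry point: no incoming headers, so a new trace starts here.
    with span("create-order") as headers:
        payment_service(headers)


order_service()
for s in spans:
    print(s["name"], s["trace_id"][:8], "parent:", s["parent_id"])
```

Both spans share one trace ID, and the child’s `parent_id` points at the parent’s `span_id`. That link is what allows a tool to rebuild the request path and see where the time was spent.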

3. Enrich observability with trace visualization and granular error data. Using tools that smartly visualize spans and traces and add enriching data is another best practice for achieving API observability in microservices architectures.

Rather than relying on timeline views to understand the flow of requests through a system (which isn’t optimal for APIs), developers can use visualization tools that help them understand the context of the data they are analyzing. Smart visualization of distributed tracing data allows developers to easily understand the context of each trace and span, as well as view rich contextual error data. It allows developers not only to easily implement distributed tracing, but also to make it actionable by maximizing its potential.

4. Fight data overload with automated insights and error alerts. One of the main challenges in achieving API observability is the sheer volume of data generated by microservices architectures. With so much data being collected from various components in the system, it’s easy for developers to become overwhelmed and miss important insights.

To avoid data overload, it’s important to insist on automated insights and error alerts. Intelligent insights that highlight which areas of the system require attention, such as slow-performing APIs or high-error-rate services, can minimize MTTR.

Important insights include:

1. Auto-generated API spec

2. Dashboard that shows API behavior over time

3. Detection of changes in API behavior

4. Anomaly detection of API latency and behavior

Parting words

Achieving API observability in microservices is painful due to lack of control, data overload, and more. But this new world calls for new ways, and developers are adopting observability approaches that include distributed tracing, helping them reduce burnout, improve developer experience, optimize their app’s quality, and minimize root-cause analysis effort.

In the example described above, by using observability enriched with smart visualization and granular error data, the developer could view the outliers or long spans, understand span duration distribution and deep dive into the longest spans, and then identify and analyze the bottlenecks that caused the latency.

Eli Cohen is CEO and co-founder of Helios. Before co-founding Helios, a production-readiness platform for developers, Eli served as director of engineering, product manager, and engineering team leader at a variety of successful startups. Eli is an alumnus of the elite Israeli intelligence Unit 8200, and he holds both a B.Sc. in Computer Science and an MBA from the Hebrew University of Jerusalem.


A guide to observability tools

• Catchpoint: Catchpoint is the Internet Resilience Company. The leading global brands rely on the unparalleled visibility gained from Catchpoint’s Internet Performance Monitoring (IPM) suite across thousands of global vantage points to catch any issues in the internet stack before they impact the customers, workforce, networks, website performance, applications, and APIs. Learn more at: catchpoint.com

• Cisco AppDynamics: Cisco AppDynamics offers a full-stack monitoring and observability solution for traditional, hybrid, and cloud-native applications, available on-premises. Using both proprietary agents and OpenTelemetry, it provides visibility across the application, infrastructure, network, and security stacks.

• Datadog: Its SaaS-based observability and security solution includes application performance monitoring, infrastructure monitoring, log management, digital experience monitoring, network monitoring, and security. It takes logs, metrics, traces, events, and security signals from across the stack, coupled with metadata to provide context, to assess the overall user experience.

• Dynatrace: The company provides software intelligence to simplify enterprise cloud complexity and accelerate digital transformation. With AI and complete automation, its all-in-one platform provides answers, not just data, about the performance of applications, the underlying infrastructure and the experience of all users.

• Elastic: Based on its open-source Elastic Search Platform, the company’s solution for on-premises, cloud or hybrid deployments offers visibility into applications, infrastructure, services, containers, Kubernetes and more to identify the root cause of issues.

FEATURED PROVIDER

• Apica: Apica keeps the enterprise operating. Its Ascent platform delivers observability, automated root cause analysis, and advanced testing, allowing organizations to find and resolve complex digital performance issues before they negatively impact the bottom line. Today, business operations depend on understanding the health of multi-cloud, hybrid, and on-premises environments to keep business-critical applications online while providing an optimal user experience. Apica delivers detailed insights across these locations, on any device or app, helping organizations reduce, prevent and resolve outages and lost revenue.

• Grafana Labs: Grafana Cloud, a composable observability platform, integrates metrics, traces, and logs from other observability solutions through plug-ins. It takes advantage of open source observability software, including Prometheus, Loki, and Tempo.

• Honeycomb: The company’s datastore makes it possible for you to investigate user experience and quickly hone in on problems. Honeycomb organizes your telemetry data for fast, accurate exploration from the same UI, regardless of data type, allowing you to debug issues for a single user or complex patterns across multiple users and services.

• IBM Instana: Instana’s Enterprise Observability Platform ingests all performance metrics, traces all requests, and profiles every process across all major cloud platforms. Automatic discovery, monitoring, root cause analysis and feedback reduce the amount of stress when deploying code with immediate feedback on the performance and quality of your applications.

• LogicMonitor: Capabilities include infrastructure, network and cloud monitoring, AIOps, log analysis, server monitoring, website synthetics and application performance management. The company’s solution helps organizations identify performance bottlenecks early so customer experience is improved while compute requirements and operations expenses are reduced.

• New Relic: The New Relic platform offers application, infrastructure, network, browser and mobile monitoring, as well as log management. New Relic Grok is a generative AI assistant for observability.

• SolarWinds: The SolarWinds Platform is designed to connect with your critical business services, to provide flexibility, visibility, and control wherever your environment lives.

• Splunk: The Splunk platform enables end-to-end visibility from edge to cloud. Create custom dashboards and data visualizations to unlock insights from anywhere in your operations center, on the desktop, or in the field. Its platform can correlate data and alerts across disparate sources to gain contextual understanding of an incident.

• Sumo Logic: The company’s cloud-native, multi-tenant, secure platform helps you make data-driven decisions and reduces your time to investigate security and operational issues. Sumo Logic provides out-of-the-box integrations with AWS, Google Cloud, and Microsoft Azure, as well as your hybrid and on-premises environments.
