
Beyond Moore’s Law: Parallel Processing in Heterogeneous SoCs
By Brandon Lewis, Editor-in-Chief
With the dependable performance-per-watt gains of transistor scaling drawing to a close, how will future generations of processors access the compute necessary to efficiently execute demanding workloads? The answer may come via parallel processing on heterogeneous SoCs.
“We’ve been working on 7 nm for a long time, and during that time we not only saw the end of Moore’s law, but we also saw the end of Amdahl’s law and Dennard scaling,” says Manuel Uhm, Director of Silicon Marketing at Xilinx. “What that means is, if all we did was take an FPGA and just shrink those transistors to 7 nm from our previous node, which was 16 nm, and just call it a day, many customers trying to move over the exact same design might quite possibly end up with a design that quite frankly does not have any increase in performance and may, in fact, increase power consumption.
“And clearly that’s going totally the wrong way.”
To be clear, it’s not impossible to shrink silicon transistors below 7 nm; 5 nm devices are already in production. It’s that the underlying metal isn’t running any faster, and current leakage is on the rise.
Meanwhile, in the other direction, traditional multicore devices have hit scaling limitations of their own. Of course, those parallel processors have historically been homogeneous, “and the reality is there is no single processor architecture that can do every task optimally,” Uhm contends. “Not an FPGA, not a CPU, not a GPU.”
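The scaling limit Uhm alludes to follows directly from Amdahl’s law: the serial fraction of a workload caps overall speedup no matter how many identical cores run the parallel part. A minimal sketch, with illustrative numbers only:

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Overall speedup when only a fraction of the work parallelizes perfectly."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even with 95% of the work parallelized, speedup saturates below 20x,
# which is why simply adding more homogeneous cores stops paying off.
for n in (4, 16, 64, 1024):
    print(f"{n:>5} cores -> {amdahl_speedup(0.95, n):.2f}x")
```

With a 5 percent serial fraction, 1,024 cores deliver under a 20x speedup; heterogeneous blocks attack that serial remainder with hardware matched to the task instead.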
This isn’t to say parallelism can’t be advantageous in tackling the complex processing tasks presented by modern applications. Indeed, beyond Moore’s law and Dennard scaling, parallel computing may be our best option in high-performance computing (HPC) and other demanding use cases.
Yes, we still need parallel processing. But of the heterogeneous variety.
Heterogeneous Processing: Not Just for the Data Center
As mentioned, the bleeding edge of heterogeneous parallel processing technology is a response to performance walls in high-end applications. But these architectures are also becoming more commonplace in embedded computing environments.
Dan Mandell, Senior Analyst at VDC Research, points out that while “it is true that many heterogeneous processing architectures have been focused on high-end applications, particularly for the datacenter and HPC … miniaturization of FPGA SoCs and other heterogeneous accelerated silicon is top of mind for companies like Microsemi and Xilinx to bring more of these devices into intelligent edge infrastructure like edge/industrial servers and IoT gateways.”
According to Mandell, a key driver of general-purpose heterogeneous computing platforms in the embedded market “is a lot of hesitancy among OEMs and others today about committing to a hardware architecture.” The hesitation, he says, is a product of rapid evolutions in specialized accelerated silicon, as well as uncertainty in the frameworks and workloads that will be produced by the edge software and AI ecosystems in the coming years.
He expects all of these circumstances to “have a great influence in future semiconductor sourcing,” as well as how chip suppliers approach their processor roadmaps.
“The price and power envelope of most of these FPGA SoCs today will force suppliers to initially focus on relatively high-end, high-resource embedded and edge applications,” Mandell posits. “However, there is an active effort to make FPGA SoCs ‘size agnostic’ to eventually support even battery-powered connectivity devices.”
So as heterogeneous parallel processing becomes more commonplace, should embedded engineers prepare for a paradigm shift in system design? Deepu Talla, Vice President and General Manager of Embedded & Edge Computing at Nvidia, doesn’t think so.
“If you think about it, embedded processors have always used accelerators,” Talla says. “Even 20 years ago, there was an Arm CPU, there was a DSP, and then there was video encode/decode done in specific hardware, right? They’re fixed-function in some sense, but they’re all processing things in parallel.
“The reason you needed to do that was cost, power, size,” he continues. “The efficiency of the parallel processor is orders of magnitude more than just the CPU.”
Nvidia’s Xavier SoC, the device at the heart of its Jetson Xavier embedded platform, and the company’s next-generation Orin architecture, due in late 2021 or 2022, both integrate GPUs, Arm CPUs, deep learning accelerators, vision accelerators, encoders/decoders, and other specialized processing blocks.
However, one change embedded developers can expect as advanced heterogeneous SoCs become more prevalent is the use of network-on-chip (NoC) interconnects, which have progressed over the last decade from traditional on-chip buses like the AMBA interface. This provides “control over how you connect the CPU, GPU, your video encoder, deep learning accelerator, the display processor, the camera processor, the security processor, all those things,” Talla says.
NoCs help accelerate and optimize the flow of data from block to block across the SoC, helping workloads execute as efficiently as possible. NXP, for example, has leveraged both NoCs and traditional bus architectures in its versatile i.MX line of SoCs. Recently, the company announced the i.MX 9.
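The payoff of routing data between specialized blocks is that each task lands on the engine that executes it most efficiently. A toy sketch of that scheduling decision, using an entirely hypothetical energy-cost table (the block names and numbers are illustrative, not vendor data):

```python
# Hypothetical relative energy cost per task on each compute block of a
# heterogeneous SoC. Illustrative numbers only, not measured figures.
ENERGY_COST = {
    "matrix_multiply": {"cpu": 10.0, "gpu": 2.0, "npu": 1.0},
    "video_decode":    {"cpu": 8.0,  "gpu": 3.0, "codec": 0.5},
    "control_logic":   {"cpu": 1.0,  "gpu": 4.0},
}

def assign_block(task: str) -> str:
    """Pick the on-chip block with the lowest energy cost for a task."""
    costs = ENERGY_COST[task]
    return min(costs, key=costs.get)

for task in ENERGY_COST:
    print(task, "->", assign_block(task))
```

In a real SoC this mapping is made by the runtime, compiler, or driver stack rather than a lookup table, but the principle is the same: no single block wins every row.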
“Heterogeneous compute is something that we’ve actually been implementing for many years. I believe now is where we are really starting to hit that sweet spot of how we’re using it,” says Dr. Gowrishankar Chindalore, Head of Business & Technology Strategy for Edge Processing at NXP Semiconductors, Inc. “The same is happening with machine learning, because we’re using a CPU, GPU, DSPs, and neural processing unit (NPU) today.
“But part of the optimization, it’s not just the compute elements, it’s everything around the system that needs to happen,” he continues. “So where we’re focusing on improving efficiency, in addition to the heterogeneous compute, is looking at wastage through the whole flow in the chip, the vision pipeline, the video pipeline, the graphics pipeline.”
“Before with every process node, it’s like, ‘Oh great. I get double the performance at half the power consumption!’” Uhm says. “Those days are gone. Those days are absolutely gone for everybody. At 7 nm, those transistors start getting leaky now. And you just run into other kinds of problems that are, in many cases, we believe, insurmountable.
“And so, having come to that realization, we’re looking now at system-level problems,” he continues. “We’re putting all these things together and understanding all those trade-offs and making sure that we’re able to encompass as much of the processing as possible in a way that allows the performance and power budgets to be met. And again, those aren’t easy things anymore. We realized that we’re going to be able to offer increased performance or decrease power consumption, and in some cases it’s either/or. It’s not always a given that you’re going to get both.
“Again, no processor is optimal for everything. You can’t always increase performance and lower power consumption,” Uhm continues. “But focusing on this new architecture, a heterogeneous processor, essentially allows them to do that.”
“Because the more that we can do that, the more efficiency we get in performance and, clearly, the less energy that’s used to do the same function,” he adds.
Heading Towards a Heterogeneous World
Citing VDC Research’s 2020 IoT, Embedded & Mobile Processors technology report, Mandell expects the global market for embedded SoCs to “continue outgrowing the merchant markets for discrete semiconductors such as MPUs, MCUs, GPUs, etc. for the next several years,” as OEMs look to consolidate computing resources and multichip implementations. Over the long term, the demand for workload acceleration and processor optimization will only “drive a further uptick,” he says.
In the meantime, the way we measure performance and power consumption will have to change. As Mike Demler, Senior Analyst at The Linley Group, asserts in his firm’s Guide to Processors for Deep Learning, even new AI-centric benchmarks like TOPS/W are “misleading, because the real AI workloads never achieve close to 100 percent utilization.”
We will have to measure things like power efficiency with “a real workload, such as BERT NLP models, rather than a theoretical, architecture-based specification,” he says.
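Demler’s point is easy to make concrete: effective efficiency is peak throughput scaled by the utilization a real workload actually achieves. A quick sketch with made-up numbers:

```python
def effective_tops_per_watt(peak_tops: float, utilization: float, watts: float) -> float:
    """Efficiency delivered on a real workload, not the datasheet peak."""
    return peak_tops * utilization / watts

# Illustrative only: a chip marketed at 10 TOPS/W (100 peak TOPS at 10 W)
# delivers far less on a workload that keeps its compute units 30% busy.
datasheet = effective_tops_per_watt(100.0, 1.0, 10.0)  # theoretical peak
measured  = effective_tops_per_watt(100.0, 0.3, 10.0)  # e.g., a BERT-style model
print(datasheet, "vs", measured, "TOPS/W")  # 10.0 vs 3.0 TOPS/W
```

The gap between the two numbers is exactly why workload-based benchmarks are displacing architecture-based specifications.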
But does it even make sense to measure the processor complex in isolation anymore? Did it ever really matter? As it always has, the focus will be on what it delivers in the context of your system.