High Efficiency Video Coding (HEVC): Challenges & Benefits by Aricent Technology

SANJEEV VERMA, Principal Systems Engineer, Aricent

www.aricent.com

Display technologies play a crucial role in defining the user experience across a whole range of devices, from big-screen TVs to small handheld devices such as mobile phones. To satisfy the ever-increasing demand for a better visual experience, display technologies are continually evolving. In a relatively short span of time we have gone from High Definition (HD) to full HD, and now most devices offer Ultra HD (UHD) as well. Delivering heavy UHD content over legacy carriers is a huge challenge and demands a much more efficient video compression technology. The High Efficiency Video Coding (HEVC) standard addresses this problem and helps deliver high-resolution content smoothly even over a low-bandwidth connection.

This whitepaper discusses how HEVC adoption saves bandwidth and enables distribution of UHD content, and how an end user benefits in terms of higher resolution, smoother playback and higher bit-depth video quality. It provides insights into HEVC industry trends and the challenges involved in migrating to HEVC on currently available hardware platforms, and details the additional complexity introduced by the H.265 standard and the challenges in implementing the associated toolsets. The paper proposes a GPU-accelerated HEVC decoder for improved battery life and discusses a hybrid multithreading approach for better load balancing between CPU cores. It also touches upon profiling techniques for identifying hot spots in the code, and the cache memory considerations to follow while architecting video software for improved performance.

HEVC – Definition and Differentiator

The Joint Collaborative Team on Video Coding (JCT-VC) released the final publication of the High Efficiency Video Coding (HEVC) standard worldwide in Q4 2013. HEVC is a video coding standard that provides much better quality (at the same bit-rate) than its predecessor H.264 and enables a multimedia experience that goes beyond High Definition: Ultra-High Definition!

Native parallel tools (Tiles and Wavefront Parallel Processing, WPP) introduced in the standard make it a multi-core friendly codec. More exhaustive prediction modes, a hierarchical block partitioning strategy, and improved post-processing are a few of the key enhancements that enable HEVC to deliver the quality required by the UHDTV revolution.
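To illustrate why WPP suits multi-core decoding, the sketch below computes the earliest parallel step at which each coding block (a CTU, the HEVC processing unit described later in this paper) could start when every block row has its own thread. It assumes the commonly described wavefront dependency, in which a block waits for its left neighbour and for the above-right block of the previous row. The function name and the grid size are hypothetical and are not taken from any codec implementation.

```python
def wavefront_schedule(rows, cols):
    """Earliest parallel step at which each CTU can start in a wavefront order.

    Assumption: CTU (r, c) waits for its left neighbour (r, c - 1) and for the
    above-right neighbour (r - 1, c + 1), giving a two-CTU lag between rows.
    Each CTU is assumed to take one step to process.
    """
    step = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(step[r][c - 1])
            if r > 0:
                # The last column has no above-right CTU, so use the one above.
                deps.append(step[r - 1][min(c + 1, cols - 1)])
            step[r][c] = 1 + max(deps) if deps else 0
    return step


# Hypothetical 3x5 grid of CTUs.
schedule = wavefront_schedule(3, 5)
total_steps = schedule[-1][-1] + 1
```

Under these assumptions the 3x5 grid finishes in 9 steps instead of the 15 that a single thread would need, because each row starts two blocks behind the row above it.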


Benefits of HEVC

The higher compression offered by HEVC has opened the door to seamlessly streaming full HD content at 60/120 frames per second (fps) on channels that were originally provisioned for streaming full HD media at 30 fps. HEVC is a boon for online media hubs, IPTV companies, broadcasters and other network operators, as it enables them to deliver a compelling user experience even over low-speed broadband connections.

Enables Adaptive Streaming

Internet speed fluctuations, variations in the content bitrate and momentary increases in the (computational) complexity of the video can cause undesirable frame drops or re-buffering during streaming. Adaptive streaming is a technology that provides an option to switch between versions of the content encoded at various bit-rates, in accordance with the available bandwidth or CPU speed. MPEG-DASH, Microsoft Smooth Streaming (MSS) and Apple's HTTP Live Streaming (HLS) are a few of the leading technologies that address frame-drop issues and provide smooth playback on the user's device by switching to the appropriate version of the content.
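The switching decision at the heart of adaptive streaming can be sketched as follows. This is a minimal illustration and not the algorithm used by MPEG-DASH, MSS or HLS players; the bitrate ladder, the 0.8 safety factor and the function name are assumptions made for the example.

```python
def pick_representation(bitrates_kbps, measured_kbps, safety=0.8):
    """Return the highest encoded bitrate that fits the measured throughput.

    Only a fraction (safety) of the measured throughput is budgeted, so that
    small bandwidth dips do not immediately cause re-buffering.
    """
    budget = measured_kbps * safety
    candidates = [b for b in sorted(bitrates_kbps) if b <= budget]
    # Fall back to the lowest rung when even that exceeds the budget.
    return candidates[-1] if candidates else min(bitrates_kbps)


# Hypothetical bitrate ladder in kbps.
ladder = [1000, 2500, 5000, 8000]
choice_fast = pick_representation(ladder, measured_kbps=7000)
choice_slow = pick_representation(ladder, measured_kbps=900)
```

With this ladder a 7000 kbps connection selects the 5000 kbps version, and a 900 kbps connection falls back to the 1000 kbps version, the lowest available.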

A solution incorporating both MPEG-DASH and HEVC can leverage HEVC to encode the content with a very high compression ratio (even at low bitrates) and use MPEG-DASH for adaptive streaming, thus delivering unprecedented quality of experience to end users.

Enables UHDTV Broadcasting

It is not just online streaming: satellite television will also benefit greatly from HEVC adoption. Leading DTH service providers are planning to upgrade their content, which will then be delivered using HEVC. The DTH ecosystem is laying the foundation for UHD content delivery so that UHDTV broadcasting can become mainstream by 2016. Ultra-HD enabled televisions are already being manufactured by Sony, Samsung, LG and other consumer electronics leaders. NHK (Nippon Hōsō Kyōkai), a Japanese public broadcaster, is preparing to broadcast UHD content in Japan in the near future. In fact, NHK recently announced an 8K sensor that is capable of shooting at 120 fps.

Better Quality

A video standard is said to be more efficient if it achieves a better Peak Signal to Noise Ratio (PSNR), that is, loses less quality, for a given bit-rate during the encode-decode cycle. Fig. 1 compares the PSNR data obtained for the HEVC and H.264 codecs. HEVC consistently leads H.264 and delivers better PSNR at all the bitrates tested. Experiments reveal that HEVC is able to save almost 40 to 50 percent of the bit-rate for most of the standard content scenarios, and hence opens the door to 4K video streaming on current networks.
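For readers unfamiliar with the metric, PSNR figures such as those compared in Fig. 1 are conventionally derived from the mean squared error (MSE) between the original and the decoded samples as 10·log10(MAX² / MSE). The sketch below shows that calculation on flat sequences of samples. The function name is illustrative, and MAX = 255 assumes 8-bit video; it is not the measurement tool used for the experiments in this paper.

```python
import math


def psnr(reference, decoded, max_value=255):
    """PSNR in dB between two equally sized sequences of sample values."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every decoded sample is off by one level, so MSE = 1.
original = [100, 100, 100, 100]
reconstructed = [101, 99, 101, 99]
quality_db = psnr(original, reconstructed)
```

In this toy example the result is about 48.13 dB. A higher PSNR at the same bit-rate means that less distortion was introduced by the encode-decode cycle.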

Fig. 1: PSNR Comparison: H.264 vs HEVC (for Aricent generated high motion content). Axes: average PSNR against bitrate (2, 4 and 6 Mbps); series: HEVC and H.264.

Smoother Playback

Frequent re-buffering and jerky playback due to lack of bandwidth are very annoying and reduce the quality of the user experience. As a result, there are still a huge number of people who prefer to watch downloaded content rather than watching it online. HEVC can change this scenario by reducing the channel traffic by up to 50 percent. The bandwidth headroom this frees up can be used to avoid re-buffering and gives the user a smooth playback experience without interruptions.

HEVC Adoption Trend

Online streaming is fast becoming the most preferred medium to watch video. In fact, more and more people these days watch movies, TV programs, etc. on YouTube rather than on their TV sets.


Paid viewership is also increasing by the day, leading to a steep increase in consumers' average spend on online video streaming. According to statistics published by YouTube: "Over 6 billion hours of video are watched each month on YouTube, that's almost an hour for every person on Earth, and 50% more than last year. Around 100 hours of video content is uploaded to YouTube every minute." Given this scenario, content providers and aggregators must deliver content at a lower cost while improving the quality of the video. HEVC would play a significant role in further bringing down the cost of online streaming. With the current infrastructure, whatever a user spends on video streaming can be cut by up to 50% by using HEVC, because HEVC provides up to 50% more compression than legacy technologies. Alternatively, by deploying HEVC, the quality of the content can be upgraded without any extra load on the channels, and users can enjoy enhanced quality at the same cost. Using HEVC on 3G/4G networks is certainly going to reduce the cost for mobile users and would encourage more video viewing over mobile networks. In fact, Vodafone is already marketing itself as "A network for 24x7 streaming" with regard to e-learning and online video viewership.

Challenges with High Efficiency Video Coding (HEVC)

Computational needs in video coding have increased drastically since the Joint Collaborative Team on Video Coding (JCT-VC) announced the HEVC standard for video compression. While the higher compression offered by HEVC provides better quality, it also creates the need for equally efficient platforms and implementations that can handle the increased complexity brought by the standard. The sections below discuss the complexity of various modules of HEVC compared to the H.264 standard.

Increased Complexity in Intra-prediction

Intra (or IDR) frames act as key frames in the video coding process, and hence the prediction accuracy of intra frames plays a vital role in deciding the overall quality of the video. An intra frame acts as the initial reference frame for the other P or B predicted frames within a Group of Pictures (GOP). If there is a significant loss of quality in the intra-prediction process of a key frame, it can propagate heavily to the rest of the non-I frames until the next I frame arrives. Keeping this in mind, the HEVC standard defines a total of 35 different modes (planar, DC and 33 angular directions), while H.264 used a maximum of 9 modes for block-based intra-prediction. Searching in additional directions provides better quality, but at the same time the computational complexity increases multifold. Intra smoothing is another feature that brings further complexity to the processing of key frames.

Flexible Block Partitioning

H.264 divides the frame uniformly into processing units of size 16x16 called macroblocks. Macroblocks can be further divided into smaller blocks of size 8x8 or 4x4 for prediction purposes. H.265 has a much more complex image partitioning method and replaces macroblocks with a concept called the Coding Tree Unit (CTU), which allows quad-tree recursion based block partitioning. A frame is divided into CTUs, which can be of size 64x64, 32x32 or 16x16. A CTU may be split recursively into four parts called Coding Units (CUs), all the way down to 8x8. Fig. 2 depicts the quad-tree recursion based partitioning of a CTU pictorially. Each CU can be further divided into Prediction Units (PUs) in a symmetrical or asymmetrical way, as shown in Fig. 3.

Fig. 2: Quad Tree based recursion within a Coding Tree Unit (CTU). Block labels: CU (32x32), CU (16x16), CU (8x8).
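The quad-tree recursion described above can be sketched in a few lines. This is an illustration only: in a real encoder the split decision comes from a rate-distortion search, whereas here it is supplied by a hypothetical predicate `should_split`, and the function name is likewise an assumption.

```python
def split_ctu(x, y, size, should_split, min_cu=8):
    """Recursively partition a CTU into CUs; returns a list of (x, y, size).

    should_split(x, y, size) is a caller-supplied (hypothetical) decision
    function. Recursion stops at min_cu, the smallest CU size (8x8).
    """
    if size > min_cu and should_split(x, y, size):
        half = size // 2
        cus = []
        for dy in (0, half):
            for dx in (0, half):
                cus.extend(split_ctu(x + dx, y + dy, half, should_split, min_cu))
        return cus
    return [(x, y, size)]


# Split a 64x64 CTU once, then split only its top-left 32x32 quadrant again.
def demo_decision(x, y, size):
    return size == 64 or (size == 32 and x == 0 and y == 0)


coding_units = split_ctu(0, 0, 64, demo_decision)
```

With this decision function the CTU ends up as four 16x16 CUs in the top-left quadrant plus three 32x32 CUs, seven CUs in total. The many possible outcomes of this recursion are what make the partitioning search expensive.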

Fig. 3: Coding Unit (CU) Splits - Symmetrical and Asymmetrical. Panels: 2Nx2N, 2NxN, Nx2N, NxN (symmetrical); 2NxnU, 2NxnD, nLx2N, nRx2N (asymmetrical).
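The partition shapes named in Fig. 3 can be written out as rectangles to make the symmetric and asymmetric cases concrete. The sketch below assumes that the CU is 2N samples wide and that the asymmetric modes split it at one quarter of its size, which is how these modes are usually described. The function name and the (x, y, width, height) layout are illustrative choices, not part of any codec API.

```python
def pu_rectangles(mode, cu_size):
    """Return the PUs of a cu_size x cu_size CU as (x, y, width, height).

    Coordinates are relative to the top-left corner of the CU.
    """
    s, h, q = cu_size, cu_size // 2, cu_size // 4
    shapes = {
        "2Nx2N": [(0, 0, s, s)],
        "2NxN": [(0, 0, s, h), (0, h, s, h)],
        "Nx2N": [(0, 0, h, s), (h, 0, h, s)],
        "NxN": [(0, 0, h, h), (h, 0, h, h), (0, h, h, h), (h, h, h, h)],
        "2NxnU": [(0, 0, s, q), (0, q, s, s - q)],      # narrow PU on top
        "2NxnD": [(0, 0, s, s - q), (0, s - q, s, q)],  # narrow PU at bottom
        "nLx2N": [(0, 0, q, s), (q, 0, s - q, s)],      # narrow PU on the left
        "nRx2N": [(0, 0, s - q, s), (s - q, 0, q, s)],  # narrow PU on the right
    }
    return shapes[mode]


# A 32x32 CU in 2NxnU mode: a 32x8 PU above a 32x24 PU.
upper_lower = pu_rectangles("2NxnU", 32)
```

Each of the eight modes covers the whole CU exactly once, so an encoder that evaluates all of them performs up to eight separate prediction searches per CU.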

More versatile block sizes mean more complex motion estimation search algorithms in HEVC, which require more computational power. The dynamically changing CU split structure introduces many condition checks at the block level, which may not be straightforward to implement on deep-pipeline architectures such as ARMv7/v8.

Inter-prediction complexity has been increased in HEVC by using 8-tap interpolation filters, while H.264 used a maximum of 6 taps. Chroma interpolation uses a 4-tap filter, compared to the bilinear filter in H.264. Additionally, the motion vector prediction module becomes more computationally intensive with the introduction of merge and skip modes, as explained in [8].

Variable Size Block Transform

The HEVC standard supports 4x4, 8x8, 16x16 and 32x32 sizes for block transformation, while H.264 supports a uniform transform block size of 4x4 for the main profile. A versatile transform size methodology provides better compression, but at the same time performing the transform on bigger blocks becomes more complex from a Single Instruction, Multiple Data (SIMD) instruction and data cache perspective. Increased precision for the coefficients in the transform matrix further adds to the complexity of the overall transformation process. Fig. 4 captures how the transform unit (TU) size varies across an HEVC frame.

Fig. 4: TU Split variation in HEVC

Additional Post Processing (Sample Adaptive Offset)

Sample Adaptive Offset (SAO) is a toolset that has been added in HEVC after the de-blocking stage. It improves the PSNR by reducing ringing-related distortions and also enhances the visual quality of the video. In SAO, an offset is added to a pixel sample based on the SAO parameters signaled for a particular CTU. The offset also depends on neighboring pixel values and the direction indicated in the SAO parameters. While SAO brings additional computational complexity to a codec implementation, it also induces neighboring dependencies, making it challenging to implement on a parallel architecture like a GPU.

Addressing HEVC Challenges through Aricent's Offerings

Leading processor makers such as ARM®, Intel® and AMD® have been continuously striving to deliver faster yet low-power platforms to meet the computational needs of the ever-growing multimedia market. Single Instruction, Multiple Data (SIMD) Neon® technology, combined with the load-store architecture present in ARMv7 based processors (ARM Cortex-A8®, A9®, A15® etc.), enables parallel processing at the instruction level, where 128-bit wide vectors can be operated upon in a single instruction. This means the Neon co-processor can operate on either sixteen 8-bit elements or eight 16-bit elements in parallel for any arithmetic, logical or memory load/store operation. Similarly, Intel's SIMD instruction set extensions such as SSE4, AVX and AVX2 offer varied forms of parallel processing capability that can deliver the performance needed by HEVC.

With current silicon technology it may not be possible to increase the CPU clock beyond a certain point due to thermal issues. However, chip makers have recently launched heterogeneous Systems on Chip (SoCs) with multiple processing units, which can deliver the compute performance needed to meet the increasing demands of video algorithms. The Samsung® Exynos™, NVIDIA® Tegra® and Qualcomm® Snapdragon™ chipset series, to name just a few, are powered by the ARMv7 architecture and incorporate multiple CPU cores (running as high as 2.5 GHz) along with GPU Compute capability. These platforms undoubtedly provide greater computational power to video software makers, but at the same time programmers need to design and architect their software in a parallel way to extract the maximum performance out of multi-core systems.

Leveraging GPU Compute

The ARM Mali T6xx Graphics Processing Unit (GPU), with its 128-bit SIMD capabilities and parallel computing technology, is now being leveraged by video algorithm developers at Aricent to develop codec solutions with low power consumption and improved performance targeting Ultra HD resolution. The OpenCL APIs exposed by the Mali GPU facilitate quicker implementation of video algorithms, which reduces time-to-market for new products. By offloading certain modules of the HEVC video decoder to the GPU, not only is decoding made faster, but a lot of power that would otherwise have been consumed by the CPU is also saved, as GPUs are highly power-efficient compared to CPUs.

Effective CPU loading with Hybrid Multithreading

Parallel computing is becoming commonplace, and most performance-critical software is being ported to take advantage of multi-core architectures. Optimal load balancing can be a bottleneck if the software has not been suitably architected. Aricent proposed [10] a hybrid design approach that combines functional and spatial techniques of multithreading and effectively leverages a multi-core architecture to develop highly efficient video software in various content scenarios. By using the hybrid multithreading approach, Aricent is able to develop an HEVC decoder that is capable of delivering up to 90 frames per second at full HD (1920x1080 resolution) on a quad-core A15® based ARM® platform. The hybrid approach also showed better results in optimizing HEVC decoder software on the Intel® Core™ i5 architecture, with improved numbers for most of the content when compared to conventional multithreading techniques.

Identifying Hot Spots and Software Profiling

Identifying performance-critical functions in software is an important step in the optimization cycle. Typically 20% of the software runs 80% of the time and needs to be optimized for performance. These functions are found using profiling tools such as the GNU profiler gprof, DS-5 by ARM® and CodeXL by AMD®, to name a few. Profiling and optimization is an iterative process that is followed until the desired performance is achieved. Once performance-critical functions are identified, they are coded in assembly language to get the best performance. When used in conjunction with SIMD instructions, manually coded assembly functions can perform 4 to 5 times faster than compiler-optimized functions on most platforms.

Cache Friendly Memory Access

Rearranging data structures and modifying memory access patterns to suit the cache architecture is yet another important step in the optimization process. The code flow needs to be worked out based on the available cache memory and the levels of cache; for example, in HEVC a block-based decode pipeline is more cache-friendly than frame-based decoding. If the data cache is relatively large, one can choose to process a few blocks or a row at a time to gain additional performance from the code cache. In all scenarios, memory access patterns that allow consecutive memory access are recommended for CPU-based platforms. However, for an architecture like the AMD® Radeon® GPU, memory bank conflicts [11] need to be taken care of while deciding the memory access pattern. One may need to study the cache allocation and eviction policy to plan the data flow for the software.

Aricent HEVC Software Enabler

Aricent offers highly optimized HEVC software codecs that are deployed on various operating systems such as Android, iOS, Linux and Windows Phone, on both ARM and Intel based devices. The codecs are fully compliant with the HEVC standard and support full HD (1920x1080) and UHD resolutions including 2K and 4K. The software solutions have been highly optimized to achieve peak performance on various SoCs like Qualcomm Snapdragon, Samsung Exynos, Apple A6 and other next-generation chipsets, and support GPGPU offloading for better battery life. The HEVC decoder solution also enables multi-screen support for the varying resolutions of various consumer devices. The platform agnostic codecs can be [...] ideal for early adoption.

Conclusion

UHDTV broadcasting will become mainstream very soon and HEVC will play a vital role in delivering the compression required to complement the technology. VP9 is emerging as a competing technology to HEVC and has the advantage of being a license-free codec. Nevertheless, due to its better compression efficiency, wider color space/format coverage, and having originated from a more reliable standards body, HEVC will remain a leading technology for video compression in this decade.

References

1. Bingbing Xia, Fei Qiao, Huazhong Yang and Hui Wang, "An efficient methodology for transaction-level design of multi-core h.264 video decoder", Consumer Electronics (ICCE), 2011 IEEE International Conference, Jan. 2011
2. Kue-Hwan Sihn, Hyunki Baik, Jong-Tae Kim, Sehyun Bae and Hyo Jung Song, "Novel approaches to parallel H.264 decoder on symmetric multicore systems", Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference, Apr. 2009
3. Nishihara, K., Hatabu, A. and Moriyoshi, T., "Parallelization of H.264 video decoder for embedded multicore processor", Multimedia and Expo, 2008 IEEE International Conference, Apr. 2008
4. Falcao, G., Sousa, L. and Silva, V., "Massively LDPC Decoding on Multicore Architectures", Parallel and Distributed Systems, IEEE Transactions, Feb. 2011
5. Ngai-Man Cheung, Xiaopeng Fan, Au, O.C. and Man-Cheung Kung, "Video Coding on Multicore Graphics Processors", Signal Processing Magazine, IEEE, Issue 2, Mar. 2010
6. Yun-il Kim, Jong-Tae Kim, Sehyun Bae, Hyunki Baik and Hyo Jung Song, "H.264/AVC decoder parallelization and optimization on asymmetric multicore platform using dynamic load balancing", Multimedia and Expo, 2008 IEEE International Conference, June 23 2008-April 26 2008
7. ARM Limited, "Cortex™-A15 Revision: r2p0, Technical Reference Manual", http://infocenter.arm.com, Sept 2011
8. ITU-T, "Recommendation ITU-T H.265", www.itu.int, Apr. 2013
9. Sanjeev Verma, "Enabling GPU Compute on an ARM Mali-T600 GPU creates a power efficient HEVC decode solution", http://goo.gl/PxmuWS, Feb 2014
10. Sanjeev Verma, "Parallel Computing: Architecting video software for multi-core heterogeneous platforms", http://goo.gl/nTWj3B, Jul 2014
11. AMD, "AMD Accelerated Parallel Processing OpenCL Programming Guide", http://goo.gl/te0mB8, Jul 2014


Engineering excellence. Sourced

Aricent is the world's #1 pure-play product engineering services and software firm. The company has 20-plus years of experience co-creating ambitious products with the leading networking, telecom, software, semiconductor, Internet and industrial companies. The firm's 10,000-plus engineers focus exclusively on software-powered innovation for the connected world. frog, the global leader in innovation and design, based in San Francisco, is part of Aricent. The company's key investors are Kohlberg Kravis Roberts & Co. and Sequoia Capital.

info@aricent.com