High Efficiency Video Coding (HEVC): Challenges & Benefits
SANJEEV VERMA, Principal Systems Engineer, Aricent
www.aricent.com
Display technologies play a crucial role in defining the user experience for a whole range of devices, from big-screen TVs to small handheld devices such as mobile phones. To satisfy the ever-increasing demand for a better visual experience, display technologies are continually evolving. In a relatively short span of time we have gone from High Definition (HD) to full HD, and now many devices offer Ultra HD (UHD) as well. Delivering heavy UHD content over legacy carriers is a huge challenge and demands a much more efficient video compression technology. The High Efficiency Video Coding (HEVC) standard addresses this problem and makes it practical to deliver high-resolution content smoothly even over a low-bandwidth connection.

This whitepaper discusses how HEVC adoption saves bandwidth and enables distribution of UHD content, and how an end user benefits in terms of higher resolution, smoother playback and higher bit-depth video quality. It provides insights into HEVC industry trends and the challenges involved in migrating to HEVC on currently available hardware platforms, including the additional complexity introduced by the H.265 standard and the challenges of implementing the associated toolsets. The paper proposes a GPU-accelerated HEVC decoder for improved battery life and discusses a hybrid multithreading approach for better load balancing between CPU cores. It also touches upon profiling techniques for identifying hot spots in the code and the cache memory considerations to follow while architecting video software for improved performance.
HEVC – Definition and Differentiator

The Joint Collaborative Team on Video Coding (JCT-VC) released the final publication of the High Efficiency Video Coding (HEVC) standard worldwide in Q4 2013. HEVC is a video coding standard that provides much better quality (at the same bit-rate) than its predecessor H.264 and enables a multimedia experience that goes beyond High Definition: Ultra-High Definition.

Native parallel tools (Tiles and Wavefront Parallel Processing, WPP) introduced in the standard make it a multi-core friendly codec. More exhaustive prediction modes, a hierarchical block partitioning strategy, and improved post processing are a few of the key enhancements that enable HEVC to deliver the quality required by the UHDTV revolution.
Benefits of HEVC

The higher compression offered by HEVC has opened the door to seamlessly streaming full HD content at 60 or 120 frames per second (fps) on channels that were originally provisioned for full HD at 30 fps. HEVC is a boon for online media hubs, IPTV companies, broadcasters and other network operators, as it enables them to deliver a compelling user experience even over low-speed broadband connections.

Better Quality
A video standard is said to be more efficient if it achieves a better Peak Signal to Noise Ratio (PSNR), that is, loses less quality, for a given bit-rate during the encode-decode cycle. Fig-1 compares the PSNR data obtained for the HEVC and H.264 codecs. HEVC consistently leads H.264 and delivers better PSNR at all the bitrates measured. Experiments reveal that HEVC is able to save almost 40 to 50 percent of the bit-rate for most standard content scenarios, and hence opens the door to 4K video streaming on current networks.

Fig 1: PSNR Comparison: H.264 vs HEVC (for Aricent generated high motion content). [Plot of average PSNR (axis ticks 30 to 40) against bitrate (2, 4 and 6 Mbps), with one series each for HEVC and H.264.]

Smoother Playback
Frequent re-buffering and jerky playback due to lack of bandwidth are very annoying and reduce the quality of the user experience. As a result, a huge number of people still prefer to watch downloaded content rather than watching it online. HEVC can change this scenario by reducing the channel traffic by up to 50 percent. The bandwidth freed up in this way can be used as headroom to avoid re-buffering and gives the user a smooth playback experience without interruptions.

Enables Adaptive Streaming
Internet speed fluctuations, variations in the content bitrate and instantaneous increases in the (computational) complexity of the video can cause undesirable frame drops or re-buffering during streaming. Adaptive streaming is a technology that lets the player switch between versions of the content encoded at various bit-rates, in accordance with the available bandwidth or CPU speed. MPEG-DASH, Microsoft Smooth Streaming (MSS), and Apple's HTTP Live Streaming (HLS) are a few of the leading technologies that address frame drops and provide smooth playback on the user's device by switching to the version that suits current conditions.

A solution incorporating both MPEG-DASH and HEVC can leverage HEVC to encode the content with a very high compression ratio (even at low bitrates) and utilize MPEG-DASH for adaptive streaming, thus delivering a much better quality of experience to end users.

Enables UHDTV Broadcasting
Not just online streaming: satellite television will also benefit greatly from HEVC adoption. Leading direct-to-home (DTH) service providers are planning to upgrade their content, which will then be delivered using HEVC. The DTH ecosystem is laying the foundation for UHD content delivery so that UHDTV broadcasting can become mainstream by 2016. Ultra-HD enabled televisions are already being manufactured by Sony, Samsung, LG and other consumer electronics leaders. NHK (Nippon Hōsō Kyōkai), a Japanese public broadcaster, is preparing to broadcast UHD content in Japan in the near future. In fact, NHK recently announced an 8K sensor that is capable of shooting at 120 fps.

HEVC Adoption Trend
Online streaming is fast becoming the most preferred medium for watching video. In fact, more and more people these days watch movies, TV programs and the like on YouTube rather than on their TV sets.
The paid viewership is also increasing by the day, leading to a steep increase in consumers' average spend on online video streaming. According to statistics published by YouTube: "Over 6 billion hours of video are watched each month on YouTube, that's almost an hour for every person on Earth, and 50% more than last year. Around 100 hours of video content is uploaded to YouTube every minute." Given this scenario, content providers and aggregators must deliver content at a lower cost while improving the quality of the video. HEVC would play a significant role in further bringing down the cost of online streaming. With the current infrastructure, whatever a user spends on video streaming could be cut by up to 50% by using HEVC, because HEVC provides about 50% more compression than legacy technologies. Alternatively, by deploying HEVC, the quality of the content can be upgraded without any extra load on the channels, and users can enjoy enhanced quality at the same cost. Using HEVC on 3G/4G networks is likely to reduce the cost for mobile users and would encourage more video viewing over mobile networks. In fact, Vodafone is already marketing itself as "A network for 24x7 streaming" with regard to e-learning and online video viewership.

Challenges with High Efficiency Video Coding (HEVC)
Computational needs in video coding have increased drastically since the Joint Collaborative Team on Video Coding (JCT-VC) announced the HEVC standard for video compression. While the higher compression offered by HEVC provides better quality, it also creates the need for equally efficient platforms and implementations that can handle the increased complexity brought by the standard. The sections below discuss the complexity of various modules of HEVC compared to the H.264 standard.

Increased Complexity in Intra-prediction
Intra (or IDR) frames act as key frames in the video coding process, and hence the prediction accuracy of intra frames plays a vital role in deciding the overall quality of the video. An intra frame acts as the initial reference frame for the other P or B predicted frames within a Group of Pictures (GOP). If there is a significant loss of quality in the intra-prediction process of a key frame, it can propagate to the rest of the non-I frames until the next I frame arrives. Keeping this in mind, the HEVC standard defines a total of 35 intra-prediction modes, while H.264 used a maximum of 9 modes for block-based intra-prediction. Searching in additional directions provides better quality, but at the same time the computational complexity increases multifold. Intra smoothing is another feature that brings further complexity to key frame processing.

Flexible Block Partitioning
H.264 divides the frame uniformly into processing units of size 16x16 called macroblocks. Macroblocks can be further divided into smaller blocks of size 8x8 or 4x4 for prediction purposes. H.265 has a much more complex picture partitioning method and replaces macroblocks with a concept called the Coding Tree Unit (CTU), which allows quad-tree recursion based block partitioning. A frame is divided into CTUs, which can be of size 64x64, 32x32 or 16x16. A CTU may be split recursively into four parts called Coding Units (CUs), all the way down to 8x8. Fig-2 depicts the quad-tree recursion based partitioning of a CTU pictorially. Each CU can be further divided into Prediction Units (PUs) in a symmetrical or asymmetrical way, as shown in Fig-3.

Fig. 2: Quad Tree based recursion within a Coding Tree Unit (CTU). [Diagram of a CTU split into CUs of 32x32, 16x16 and 8x8.]

Fig. 3: Coding Unit (CU) Splits - Symmetrical (2Nx2N, 2NxN, Nx2N, NxN) and Asymmetrical (2NxnU, 2NxnD, nLx2N, nRx2N)
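The quad-tree recursion described for a CTU can be illustrated with a short Python sketch. It is not taken from any HEVC implementation: the function `split_ctu`, the `should_split` callback and the toy rule `top_left_only` are hypothetical names introduced here. A real encoder would typically base the split decision on a rate-distortion cost rather than on block position.

```python
from typing import Callable, List, Tuple

Block = Tuple[int, int, int]  # (x, y, size) of a square coding unit

MIN_CU_SIZE = 8  # smallest CU size named in the text


def split_ctu(x: int, y: int, size: int,
              should_split: Callable[[int, int, int], bool]) -> List[Block]:
    """Recursively split a square block into four quadrants (quad tree).

    Returns the leaf coding units in z-scan order
    (top-left, top-right, bottom-left, bottom-right).
    """
    if size > MIN_CU_SIZE and should_split(x, y, size):
        half = size // 2
        leaves: List[Block] = []
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(split_ctu(x + dx, y + dy, half, should_split))
        return leaves
    return [(x, y, size)]


def top_left_only(x: int, y: int, size: int) -> bool:
    """Hypothetical split rule: keep refining only the top-left corner."""
    return x == 0 and y == 0


# Partition one 64x64 CTU with the toy rule.
cus = split_ctu(0, 0, 64, top_left_only)
for cu in cus:
    print(cu)
```

With this toy rule the 64x64 CTU ends up as three 32x32 CUs, three 16x16 CUs and four 8x8 CUs, which together tile the CTU exactly. The data-dependent branching in `split_ctu` is a small-scale picture of the block-level condition checks that make the partitioning costly to implement.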
More versatile block sizes mean more complex motion estimation search algorithms in HEVC, which require more computational power. The dynamically changing CU split structure introduces many condition checks at block level, which may not be straightforward to implement efficiently on deep-pipeline architectures such as ARMv7/v8.

Inter-prediction complexity has increased in HEVC through the use of 8-tap interpolation filters, while H.264 used a maximum of 6 taps. Chroma interpolation uses a 4-tap filter, compared to the bilinear filter in H.264. Additionally, the motion vector prediction module becomes more computationally intensive with the introduction of merge and skip modes, as explained in [8].

Variable Size Block Transform
The HEVC standard supports 4x4, 8x8, 16x16 and 32x32 sizes for block transformation, while H.264 supports a uniform transform block size of 4x4 for the main profile. A versatile transform size provides better compression, but performing the transform on bigger blocks becomes more complex from the perspective of Single Instruction, Multiple Data (SIMD) instructions and the data cache. Increased precision for the coefficients in the transform matrix further adds to the complexity of the overall transformation process. Fig-4 captures how the transform unit (TU) size varies across an HEVC frame.

Fig 4: TU Split variation in HEVC

Additional Post Processing (Sample Adaptive Offset)
Sample Adaptive Offset (SAO) is a toolset that has been added in HEVC after the de-blocking stage. It improves the PSNR by reducing ringing-related distortions and also enhances the visual quality of the video. In SAO, an offset is added to a pixel sample based on the SAO parameters signaled for a particular CTU. The offset also depends on neighboring pixel values and the direction indicated in the SAO parameters. Besides adding computational complexity to the codec implementation, SAO introduces dependencies on neighboring samples, which makes it challenging to implement on a parallel architecture such as a GPU.

Addressing HEVC Challenges through Aricent's Offerings
Leading processor makers such as ARM®, Intel® and AMD® have been continuously striving to deliver faster yet low-power platforms to meet the computational needs of the ever-growing multimedia market. The SIMD Neon® technology, combined with the load-store architecture of ARMv7 based processors (ARM Cortex-A8®, A9®, A15® etc.), enables data-level parallelism in which 128-bit wide vectors can be operated upon in a single instruction. This means the Neon co-processor can operate on either sixteen 8-bit elements or eight 16-bit elements in parallel for an arithmetic, logical or memory load/store operation. Similarly, Intel's recent instruction set extensions such as SSE4, AVX and AVX2 offer varied forms of SIMD parallel processing that can deliver the performance needed by HEVC. With current silicon technology it may not be possible to increase the CPU clock beyond a certain extent due to thermal issues. However, chip makers have recently launched heterogeneous Systems on Chip (SoCs) with multiple processing units, which can deliver the compute performance required by increasingly demanding video algorithms. The Samsung® Exynos™, NVIDIA® Tegra® and Qualcomm® Snapdragon™ chipset series, to name just a few, are powered by the ARMv7 architecture and incorporate multiple CPU cores (running as high as 2.5 GHz) along with GPU Compute capability. These platforms provide greater computational power to video software makers, but programmers need to design and architect their software in a parallel way to extract the maximum performance out of multi-core systems.

Leveraging GPU Compute
The ARM Mali T6xx Graphics Processing Unit (GPU), with its 128-bit SIMD capabilities and parallel computing technology, is now being leveraged by video algorithm developers at Aricent to develop codec solutions with low power consumption and improved performance targeting Ultra HD resolution. The OpenCL APIs exposed by the Mali GPU facilitate quicker implementation of video algorithms, which reduces time-to-market for new products. Offloading certain modules of the HEVC video decoder to the GPU not only makes decoding faster but also saves a lot of power that would otherwise have been consumed by the CPU, as GPUs are highly power efficient compared to CPUs.

Effective CPU loading with Hybrid Multithreading
Parallel computing is becoming commonplace, and most performance-critical software is being ported to take advantage of multi-core architectures. Load balancing can become a bottleneck if the software has not been suitably architected. Aricent proposed [10] a hybrid design approach that combines functional and spatial techniques of multithreading and effectively leverages a multi-core architecture to develop highly efficient video software for various content scenarios. Using the hybrid multithreading approach, Aricent has been able to develop an HEVC decoder capable of delivering up to 90 frames per second at full HD (1920x1080) resolution on a quad-core A15® based ARM® platform. The hybrid approach also showed better results in optimizing HEVC decoder software on the Intel® Core™ i5 architecture, with improved numbers for most of the content compared to conventional multithreading techniques.

Identifying Hot Spots and Software Profiling
Identifying performance-critical functions in software is an important step in the optimization cycle. Typically 20% of the software runs 80% of the time, and it is this part that needs to be optimized for performance. It is found using profiling tools such as the GNU profiler gprof, DS-5 by ARM® and CodeXL by AMD®, to name a few. Profiling and optimization form an iterative process that is followed until the desired performance is achieved. Once performance-critical functions are identified, they are coded in assembly language to get the best performance. When used in conjunction with SIMD instructions, manually coded assembly functions can perform 4 to 5 times faster than compiler-optimized functions on most platforms.

Cache Friendly Memory Access
Rearranging data structures and modifying memory access patterns to suit the cache architecture is yet another important step in the optimization process. The code flow needs to be worked out based on the available cache memory and the number of cache levels; in HEVC, for example, a block-based decode pipeline is more cache friendly than frame-based decoding. If the data cache is relatively large, one can choose to process a few blocks or a whole row at a time to gain additional performance from the code (instruction) cache. In all scenarios, memory access patterns that allow consecutive memory access are recommended for CPU-based platforms. However, for architectures such as the AMD® Radeon® GPU, memory bank conflicts [11] need to be taken into account when deciding the memory access pattern. One may need to study the cache allocation and eviction policy to plan the data flow of the software.

Aricent HEVC Software Enabler
Aricent offers highly optimized HEVC software codecs that are deployed on various operating systems such as Android, iOS, Linux and Windows Phone, on both ARM and Intel based devices. The codecs are fully compliant with the HEVC standard and support full HD (1920x1080) and UHD resolutions including 2K and 4K. The software solutions have been highly optimized to achieve peak performance on various SoCs such as Qualcomm Snapdragon, Samsung Exynos, Apple A6 and other next-generation chipsets, and support GPGPU offloading for better battery life. The HEVC decoder solution also enables multi-screen support for the varying resolutions of consumer devices. The platform-agnostic codecs can be [...] ideal for early adoption.

Conclusion
UHDTV broadcasting will become mainstream very soon, and HEVC will play a vital role in delivering the compression required to complement the technology. VP9 is emerging as a competing technology to HEVC and has the advantage of being a license-free codec. Nevertheless, due to its better compression efficiency, its wider color space/format coverage, and its origin in a more reliable standards body, HEVC will remain a leading technology for video compression in this decade.

References
1. Bingbing Xia, Fei Qiao, Huazhong Yang and Hui Wang, "An efficient methodology for transaction-level design of multi-core h.264 video decoder", Consumer Electronics (ICCE), 2011 IEEE International Conference, Jan. 2011
2. Kue-Hwan Sihn, Hyunki Baik, Jong-Tae Kim, Sehyun Bae and Hyo Jung Song, "Novel approaches to parallel H.264 decoder on symmetric multicore systems", Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference, Apr. 2009
3. Nishihara, K., Hatabu, A. and Moriyoshi, T., "Parallelization of H.264 video decoder for embedded multicore processor", Multimedia and Expo, 2008 IEEE International Conference, Apr. 2008
4. Falcao, G., Sousa, L., and Silva, V., "Massively LDPC Decoding on Multicore Architectures", Parallel and Distributed Systems, IEEE Transactions, Feb. 2011
5. Ngai-Man Cheung, Xiaopeng Fan, Au, O.C. and Man-Cheung Kung, "Video Coding on Multicore Graphics Processors", Signal Processing Magazine, IEEE, Issue 2, Mar. 2010
6. Yun-il Kim, Jong-Tae Kim, Sehyun Bae, Hyunki Baik and Hyo Jung Song, "H.264/AVC decoder parallelization and optimization on asymmetric multicore platform using dynamic load balancing", Multimedia and Expo, 2008 IEEE International Conference, June 2008
7. ARM Limited, "Cortex™-A15 Revision: r2p0, Technical Reference Manual", http://infocenter.arm.com, Sept 2011
8. ITU-T, "Recommendation ITU-T H.265", www.itu.int, Apr. 2013
9. Sanjeev Verma, "Enabling GPU Compute on an ARM Mali-T600 GPU creates a power efficient HEVC decode solution", http://goo.gl/PxmuWS, Feb 2014
10. Sanjeev Verma, "Parallel Computing: Architecting video software for multi-core heterogeneous platforms", http://goo.gl/nTWj3B, Jul 2014
11. AMD, "AMD Accelerated Parallel Processing OpenCL Programming Guide", http://goo.gl/te0mB8, Jul 2014
Engineering excellence. Sourced.
Aricent is the world's #1 pure-play product engineering services and software firm. The company has 20-plus years of experience co-creating ambitious products with the leading networking, telecom, software, semiconductor, Internet and industrial companies. The firm's 10,000-plus engineers focus exclusively on software-powered innovation for the connected world. frog, the global leader in innovation and design, based in San Francisco, is part of Aricent. The company's key investors are Kohlberg Kravis Roberts & Co. and Sequoia Capital.
info@aricent.com
© 2014 Aricent. All rights reserved. All Aricent brand and product names are service marks, trademarks, or registered marks of Aricent in the United States and other countries.