

From Film to Computer Vision: Imaging in the Age of AI

By Bruno Artacho, PhD Candidate at RIT, and Dr. Andreas Savakis, RIT Professor and Center Director

The City of Rochester has a tradition of significant contributions to engineering innovation. The momentum created by Rochester's engineers, scientists, and business leaders placed the city in a prominent position during the 20th century. The city is the birthplace of leading technologies in industries ranging from imaging to education and healthcare. Continuing on that path, the 21st century has brought the advent of Artificial Intelligence (AI), fueled by innovations in Deep Learning.

Deep Learning is a data-driven approach to AI, in which a computer learns to perform advanced tasks with high competency by training on large datasets. Deep Learning deploys neural networks with many layers, i.e., deep architectures composed of brain-inspired artificial neurons, that learn from examples how to perform tasks such as classification of images, text, or sounds. These techniques are used increasingly often in smartphones, voice assistants, and autonomous vehicles, achieving results that often match or surpass human performance.
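As a minimal illustration of the idea, the sketch below stacks a few layers of artificial neurons into a small image classifier. The framework (PyTorch), the layer sizes, and the class count are illustrative assumptions for this article, not details of the research described here.

    import torch
    import torch.nn as nn

    # A tiny "deep" classifier: several stacked layers of artificial neurons,
    # each learning increasingly abstract features from example images.
    model = nn.Sequential(
        nn.Flatten(),                 # turn a 28x28 grayscale image into a vector
        nn.Linear(28 * 28, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 10),            # scores for 10 possible image classes
    )

    image = torch.rand(1, 1, 28, 28)  # stand-in for a training image
    scores = model(image)
    print(scores.argmax(dim=1))       # the class this (untrained) network would predict

In practice, such a network is trained by showing it many labeled examples and adjusting its weights until its predictions match the labels.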

Once again, Rochester's spirit of innovation helps position the region at the forefront of the development of deep learning methods. Rochester Institute of Technology (RIT) has been an important contributor by fueling innovation, educating young engineers, and fostering growth through collaboration with local industry. One research lab at RIT is the Vision and Image Processing Laboratory (VIP-lab), directed by Professor Andreas Savakis, which focuses on developing adaptable, robust, and efficient computer vision methods with state-of-the-art performance. Bruno Artacho is a doctoral candidate in RIT's Electrical and Computer Engineering Ph.D. program who conducts his research in the VIP-lab.

The field of computer vision deals with the development of algorithms and deep learning models that extract information from images in order to perform various tasks of interest, such as object detection and tracking, face recognition, and scene analysis. The VIP-lab team of graduate students, under the direction of Prof. Savakis, has been working on several computer vision tasks, including: human pose estimation (human-computer interfaces and health monitoring), visual object tracking (autonomous navigation and traffic monitoring), analyzing changes in satellite imagery (e.g., pre and post natural disasters), and segmenting the outlines of objects in an image (self-driving vehicles). This promising work has attracted funding from and partnerships with the Air Force Research Laboratory, the National Science Foundation, and various industrial partners.

Two computer vision tasks on which Bruno Artacho has focused during his doctoral research are semantic segmentation and human pose estimation. Semantic segmentation aims to extract meaningful information from every location in an image by labeling each pixel with a known semantic category, e.g., person, car, or stop sign. Applications of semantic segmentation include self-driving vehicles, automatic focus, and foreground/background detection for video calls. Human pose estimation focuses on detecting human body joints in images and recovering body posture under diverse conditions, enabling a multitude of applications including sports training, health monitoring and rehabilitation, sign language analysis, and automated security systems.
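To make the pixel-labeling idea concrete, the short sketch below assigns one class label to every pixel from a generic segmentation network's output scores. The class count, image size, and random scores are illustrative assumptions; this is not the authors' WASPnet.

    import torch

    # Per-pixel semantic labeling: a segmentation network outputs one score map
    # per class, and each pixel takes the class with the highest score.
    num_classes, H, W = 21, 240, 320            # e.g. person, car, stop sign, ...
    scores = torch.randn(1, num_classes, H, W)  # stand-in for network output
    labels = scores.argmax(dim=1)               # [1, H, W]: one class id per pixel
    print(labels.shape, labels.unique())        # every pixel now carries a semantic label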

In the past few years, Bruno Artacho has been working with Prof. Savakis on a series of methods that achieve state-of-the-art results for pose estimation. These methods include estimating 2D human pose from a single image or a video sequence with a modular and flexible framework called UniPose. The initial work on UniPose focused on images with a single person and was soon extended to UniPose+ for extracting 3D pose estimates from a single image, as well as OmniPose for detecting the poses of multiple individuals in an image.

The deep learning framework developed for UniPose is useful for both semantic segmentation and human pose estimation. The architecture utilizes a waterfall representation of image features at multiple scales, which increases accuracy by expanding the method's Field-of-View and allowing the network to analyze more information at multiple scales. The first method utilizing the waterfall configuration for semantic segmentation is called WASPnet. It achieves a better understanding of the overall contextual information of the image for every pixel by combining four parallel branches of image features at different scales within the neural network pipeline. The cascade of branches resembles a waterfall shape, giving the framework its name, and makes the approach both effective and computationally efficient.
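The sketch below conveys the general flavor of such a waterfall module: several dilated-convolution branches with growing receptive fields, each feeding the next, whose outputs are fused together. The channel sizes, dilation rates, and layer details are illustrative assumptions and do not reproduce the exact published module.

    import torch
    import torch.nn as nn

    class WaterfallBlock(nn.Module):
        def __init__(self, in_ch=256, branch_ch=64, rates=(1, 6, 12, 18)):
            super().__init__()
            # Branches with growing dilation rates enlarge the field-of-view.
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch if i == 0 else branch_ch, branch_ch,
                          kernel_size=3, padding=r, dilation=r)
                for i, r in enumerate(rates)
            ])
            self.fuse = nn.Conv2d(branch_ch * len(rates), branch_ch, kernel_size=1)

        def forward(self, x):
            outs, h = [], x
            for branch in self.branches:
                h = torch.relu(branch(h))   # each branch feeds the next ("waterfall")
                outs.append(h)
            return self.fuse(torch.cat(outs, dim=1))  # combine all scales

    feats = torch.randn(1, 256, 60, 80)     # stand-in backbone features
    print(WaterfallBlock()(feats).shape)    # multi-scale features at the same resolution

Cascading the branches lets later branches see the output of earlier ones, which is what distinguishes the waterfall arrangement from purely parallel multi-scale branches.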

Figure 1: WASPnet semantic segmentation examples. The network accurately learns the meaning of pixels from the images, facilitating tasks such as autonomous driving.

The success of the waterfall approach for semantic segmentation was then leveraged for single-person pose estimation with UniPose. The use of multiple scales in the neural network allows UniPose to better localize human body joints that are hidden in the image due to occlusion or overlap. The extraction of more detailed contextual information by UniPose results in more accurate pose estimation in both images and videos, such as those depicting sports or sign language. The UniPose method for 2D human pose estimation in images and videos was published in 2020 at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
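Pose estimators of this kind commonly output one confidence map (heatmap) per body joint; the sketch below shows how 2D joint positions can be read off such maps. The joint count, map size, and random values are illustrative assumptions rather than details of UniPose itself.

    import torch

    # Recover 2D joint locations from per-joint heatmaps by taking the peak of each map.
    num_joints, H, W = 17, 64, 48
    heatmaps = torch.rand(1, num_joints, H, W)   # stand-in network output
    flat = heatmaps.flatten(2).argmax(dim=2)     # index of the peak in each joint's map
    ys, xs = flat // W, flat % W                 # convert flat index to (row, col)
    joints = torch.stack([xs, ys], dim=2)        # [1, num_joints, 2] pixel coordinates
    print(joints[0, :3])                         # first three estimated joint positions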

The waterfall framework was then extended to more complex tasks in human pose estimation with UniPose+, an even more accurate state-of-the-art approach that also enables 3D pose estimation from a single image. UniPose+ was published in the top-ranked journal IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).

Figure 2: 3D Pose Estimation examples from UniPose+. From a simple image, the network is able to estimate the 3D positioning of the body.

The waterfall approach was also extended to images with multiple people, and two methods were developed for multi-person pose estimation. First, the OmniPose method greatly increased the accuracy of multi-person pose estimation in the setting where the locations of the people in the image are provided to the network by a separate detector. Building upon OmniPose, the BAPose approach enables the network not only to accurately estimate all human poses in the image, but also to locate every person in the image without requiring any additional processing or assistance with detection.
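The sketch below contrasts the two settings in schematic form: one function runs a single-person pose network on each detected person crop, while the other handles the whole image in one pass. The function names and the detector and pose-network callables are illustrative placeholders, not the published implementations.

    from typing import List
    import torch

    def top_down(image: torch.Tensor, detector, pose_net) -> List[torch.Tensor]:
        # Detector-assisted setting: a separate detector supplies person boxes first.
        poses = []
        for (x1, y1, x2, y2) in detector(image):    # one bounding box per person
            crop = image[:, :, y1:y2, x1:x2]        # crop that person out
            poses.append(pose_net(crop))            # estimate one pose per crop
        return poses

    def bottom_up(image: torch.Tensor, pose_net) -> torch.Tensor:
        # Detector-free setting: a single pass locates every person and their joints.
        return pose_net(image)

    # Toy stand-ins so the sketch runs end to end.
    image = torch.rand(1, 3, 240, 320)
    fake_detector = lambda img: [(10, 20, 110, 220), (150, 30, 300, 230)]
    fake_pose_net = lambda img: torch.rand(1, 17, 2)   # 17 (x, y) joint guesses
    print(len(top_down(image, fake_detector, fake_pose_net)),
          bottom_up(image, fake_pose_net).shape)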

The advances in pose estimation methods have been assisting scientists at RIT and the National Technical Institute of the Deaf (NTID) to better understand the evolution of sign languages, beyond what was previously possible. The pose estimation methods are capable of processing large amounts of data, expanding the sign distribution analysis to the entire sign language vocabulary. This allowed researchers to identify variations in the distribution of signs between commonly used and more complex signs. These findings were recently published in the journal Cognition.

Figure 3: Our BAPose method accurately detects human joints in challenging images that contain multiple people.

The developments from Bruno Artacho's doctoral research are a stepping stone towards the automation of complex tasks for sign language analysis, including sign language recognition, interfaces that use gestures as commands, and the analysis of posture and movements for sports or rehabilitation. Furthermore, the waterfall framework developed by Bruno Artacho was applied to hand pose estimation (HandyPose), vehicle pose estimation (VehiPose), and food segmentation (GourmetNet) in collaboration with VIP-lab master's students.

Deep learning and computer vision are becoming major drivers of investment in the economy, attracting interest from companies and diverse markets around the world. Rochester, aided by RIT and the other universities in the region, is well positioned to attract the talent that will continue this revolutionary engineering work, carrying the city's spirit of innovation into the future of artificial intelligence.

Bruno Artacho is a PhD Candidate at RIT. Andreas Savakis is an RIT Professor and Director of the Center for Human-aware Artificial Intelligence.
