CMU pioneers method to read body language
Researchers at Carnegie Mellon University’s Robotics Institute in Pittsburgh have enabled a computer to understand the body poses and movements of multiple people from video in real time, including, for the first time, the pose of each individual’s fingers. The method was developed with the help of the Panoptic Studio, a two-story dome embedded with 500 video cameras that serves as a massively multiview system for social motion capture.
CMU researchers say capturing the 3D structure and motion of a group of people engaged in a social interaction poses several core challenges: occlusion is functional and frequent; subtle motion must be measured over a space large enough to host a social group; and variation in human appearance and configuration is immense.
CMU Associate Professor of Robotics Yaser Sheikh said these methods for tracking 2D human form and motion open up new ways for people and machines to interact with each other. A self-driving car could get an early warning that a pedestrian is about to step into the street by monitoring body language.
“We communicate almost as much with the movement of our bodies as we do with our voice,” Sheikh said. “But computers are more or less blind to it.”
To encourage more research and applications, the researchers have released their computer code for both multiperson and hand-pose estimation. It is already being widely used by research groups, and more than 20 commercial groups, including automotive companies, have expressed interest in licensing the technology, Sheikh said. Sheikh and his colleagues will present reports on their multiperson and hand-pose detection methods at CVPR 2017, the Computer Vision and Pattern Recognition conference, July 21–26 in Honolulu.
The Panoptic Studio is a system organized around the thesis that social interactions should be measured through the perceptual integration of a large variety of viewpoints. The modularized system is designed around this principle, consisting of integrated structural, hardware, and software innovations. The system takes input from 480 synchronized video streams of multiple people engaged in social activities and produces the labeled time-varying 3D structure of anatomical landmarks on individuals in the space. The algorithmic contributions include a hierarchical approach for generating skeletal trajectory proposals, and an optimization framework for skeletal reconstruction with trajectory re-association. The insights gained from experiments in that facility now make it possible to detect the pose of a group of people using a single camera and a laptop computer.
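The core multiview idea described above, recovering a 3D landmark by fusing its 2D detections across many calibrated cameras, can be sketched with standard Direct Linear Transform (DLT) triangulation. The following Python/NumPy snippet is a generic illustration of that principle, not the researchers' actual reconstruction pipeline; the camera matrices and point are toy values chosen for the example.

```python
import numpy as np

def triangulate_point(projections, points_2d):
    """Triangulate one 3D landmark from its 2D detections in multiple
    calibrated views using the Direct Linear Transform (DLT).

    projections: list of 3x4 camera projection matrices
    points_2d:   list of (x, y) pixel detections, one per view
    Returns the 3D point as a length-3 array.
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the
        # homogeneous 3D point.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.asarray(rows)
    # The solution is the right singular vector with the smallest
    # singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point into a camera's image plane."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras observing the point (1, 2, 10).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # camera at origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # shifted camera
X_true = np.array([1.0, 2.0, 10.0])

X_hat = triangulate_point([P1, P2], [project(P1, X_true), project(P2, X_true)])
print(np.allclose(X_hat, X_true))  # True
```

With hundreds of views rather than two, the same least-squares formulation becomes strongly overdetermined, which is what makes the reconstruction robust to the frequent occlusions the researchers describe.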
“The Panoptic Studio supercharges our research,” Sheikh said. It is now being used to improve body, face, and hand detectors by training them jointly. And as the work progresses from 2D models of humans to 3D models, the facility’s ability to automatically generate annotated images will be crucial.
When the Panoptic Studio was built a decade ago with support from the National Science Foundation, it was not clear what impact it would have, Sheikh said: “Now, we’re able to break through a number of technical barriers primarily as a result of that NSF grant 10 years ago.”
In addition to Sheikh, Ph.D. student Tomas Simon and master’s degree students Zhe Cao and Shih-En Wei aided the multiperson pose estimation research. Sheikh, Ph.D. student Hanbyul Joo, Simon, and Iain Matthews, an adjunct faculty member in the Robotics Institute, worked on the hand-detection study. Gines Hidalgo Martinez, a master’s degree student, also collaborated on this work, managing the source code.