Maximum-margin structured learning with deep networks for 3D human pose estimation

Sijin Li, Weichen Zhang, Antoni B. Chan

Research output: Chapters, Conference Papers, Creative and Literary Works > RGC 32 - Refereed conference paper (with host publication) > peer-review

165 Citations (Scopus)

Abstract

This paper focuses on structured-output learning using deep neural networks for 3D human pose estimation from monocular images. Our network takes an image and a 3D pose as inputs and outputs a score value, which is high when the image-pose pair matches and low otherwise. The network structure consists of a convolutional neural network for image feature extraction, followed by two sub-networks for transforming the image features and pose into a joint embedding. The score function is then the dot product between the image and pose embeddings. The image-pose embedding and score function are jointly trained using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machines where the joint feature space is discriminatively learned using deep neural networks. We test our framework on the Human3.6M dataset and obtain state-of-the-art results compared to other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating that the network has learned a high-level embedding of body orientation and pose configuration.
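The scoring scheme described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-layer ReLU sub-networks, the dimensions (`N_JOINTS`, `FEAT_DIM`, `EMBED_DIM`), the use of mean per-joint distance as the margin, and the search over a small list of candidate poses are all assumptions made for illustration. Only the overall shape (two embeddings, a dot-product score, and a maximum-margin cost) comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are assumptions for illustration only.
N_JOINTS = 17     # assumed number of body joints
FEAT_DIM = 64     # assumed size of the CNN image feature vector
EMBED_DIM = 32    # assumed size of the joint image-pose embedding

# Hypothetical single-layer sub-networks (the paper's are deeper).
W_img = rng.normal(scale=0.1, size=(FEAT_DIM, EMBED_DIM))
W_pose = rng.normal(scale=0.1, size=(N_JOINTS * 3, EMBED_DIM))


def embed_image(feat):
    """Map a CNN feature vector into the joint embedding space."""
    return np.maximum(feat @ W_img, 0.0)


def embed_pose(pose):
    """Map a (N_JOINTS, 3) pose into the joint embedding space."""
    return np.maximum(pose.reshape(-1) @ W_pose, 0.0)


def score(feat, pose):
    """Score of an image-pose pair: dot product of the two embeddings."""
    return float(embed_image(feat) @ embed_pose(pose))


def pose_error(pose_a, pose_b):
    """Mean per-joint Euclidean distance, used here as the margin (assumed)."""
    return float(np.linalg.norm(pose_a - pose_b, axis=1).mean())


def max_margin_loss(feat, true_pose, candidate_poses):
    """Structured hinge loss with margin rescaling over a candidate set.

    The true pose should outscore every candidate by at least the
    candidate's pose error; the loss is the worst violation, floored at 0.
    """
    true_score = score(feat, true_pose)
    violations = [
        score(feat, p) + pose_error(p, true_pose) - true_score
        for p in candidate_poses
    ]
    return max(0.0, max(violations))


# Toy data standing in for a CNN feature and a ground-truth pose.
feat = rng.normal(size=FEAT_DIM)
true_pose = rng.normal(size=(N_JOINTS, 3))
candidates = [
    true_pose + rng.normal(scale=s, size=(N_JOINTS, 3)) for s in (0.1, 0.5, 1.0)
]
loss = max_margin_loss(feat, true_pose, candidates)
```

In training, a loss of this kind would be minimized with respect to the weights of both sub-networks and the image feature extractor, which is what makes the joint feature space learned rather than hand-designed.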
Original language: English
Title of host publication: Proceedings of the IEEE International Conference on Computer Vision
Publisher: IEEE
Pages: 2848-2856
Volume: 11-18-December-2015
ISBN (Print): 9781467383912
DOIs
Publication status: Published - Dec 2015
Event: 15th IEEE International Conference on Computer Vision (ICCV 2015) - Santiago, Chile
Duration: 11 Dec 2015 to 18 Dec 2015

Publication series

Name
Volume: 11-18-December-2015
ISSN (Print): 1550-5499

Conference

Conference: 15th IEEE International Conference on Computer Vision (ICCV 2015)
Place: Chile
City: Santiago
Period: 11/12/15 to 18/12/15
