Towards Viewpoint Invariant 3D Human Pose Estimation

Computer Science Department, Stanford University
European Conference on Computer Vision, 2016
*Indicates equal contribution


Abstract

We propose a viewpoint invariant model for 3D human pose estimation from a single depth image. To achieve viewpoint invariance, our deep discriminative model embeds local regions into a learned viewpoint invariant feature space. Formulated as a multi-task learning problem, our model is able to selectively predict partial poses in the presence of noise and occlusion. Our approach leverages a convolutional and recurrent network with a top-down error feedback mechanism to self-correct previous pose estimates in an end-to-end manner. We evaluate our model on a previously published depth dataset and a newly collected human pose dataset containing 100K annotated depth images from extreme viewpoints. Experiments show that our model achieves competitive performance on frontal views and state-of-the-art performance on alternate viewpoints.

Paper

Full Paper: arXiv:1603.07076

Bibtex Citation

@inproceedings{haque2016viewpoint,
    title = {Towards Viewpoint Invariant 3D Human Pose Estimation},
    author = {Haque, Albert and Peng, Boya and Luo, Zelun and Alahi, Alexandre and Yeung, Serena and Fei-Fei, Li},
    booktitle = {European Conference on Computer Vision (ECCV)},
    month = {October},
    year = {2016}
}

Dataset

Sample frames from the dataset: Side View, Side View (labeled), Top View, Top View (labeled).

Note: The dataset files have been compressed with tar/gzip-9 (max) settings. The compressed and uncompressed file sizes are shown below.
View | Split | Frames | People | Actions | Images | Depth Maps | Point Clouds | Labels
Side | Train | 39,795 | 16 | 15 | jpg (964M, 1.1G) | HDF5 (884M, 5.7G) | HDF5 (7.4G, 18G) | HDF5 (17M, 2.9G)
Side | Test | 10,501 | 4 | 15 | jpg (247M, 276M) | HDF5 (234M, 1.6G) | HDF5 (2.0G, 4.6G) | HDF5 (3.6M, 771M)
Top | Train | 39,795 | 16 | 15 | jpg (882M, 974M) | HDF5 (876M, 5.7G) | HDF5 (7.1G, 18G) | HDF5 (31M, 2.9G)
Top | Test | 10,501 | 4 | 15 | jpg (236M, 261M) | HDF5 (235M, 1.6G) | HDF5 (1.9G, 4.6G) | HDF5 (8.9M, 771M)

Data Schema

Each file contains several HDF5 datasets at the root level. Dimensions, attributes, and data types are listed below. The key refers to the (HDF5) dataset name. Let $n$ denote the number of images.
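
For example, the root-level datasets in a file can be inspected with h5py; this is a minimal sketch, using one of the label files from the Getting Started example below:

import h5py

# Print every root-level dataset name with its shape and data type.
with h5py.File('ITOP_side_test_labels.h5', 'r') as f:
    for key, dset in f.items():
        print(key, dset.shape, dset.dtype)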

Transformation
To convert from point clouds to a $240\times 320$ image, the following transformations were used. Let $x_{img}$ and $y_{img}$ denote the $(x,y)$ coordinate in the image plane. Using the raw point cloud $(x,y,z)$ real world coordinates, we compute the depth map as follows: $$x_{img} = \frac{x}{Cz} + 160 \mathrm{\quad and \quad} y_{img} = -\frac{y}{Cz} + 120$$ where $C = 3.50\times 10^{-3} = 0.0035$ is the intrinsic camera calibration parameter. This results in the depth map: $(x_{img}, y_{img}, z)$.
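
For reference, here is a minimal sketch of this projection in NumPy; the function name world_to_image is ours and not part of any released tooling:

import numpy as np

C = 3.50e-3  # intrinsic camera calibration parameter

def world_to_image(points):
    # Project real-world (x, y, z) coordinates in meters onto the 240x320 image plane.
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    x_img = x / (C * z) + 160
    y_img = -y / (C * z) + 120
    return np.stack([x_img, y_img], axis=-1)
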
Joint ID (index) Mapping
joint_id_to_name = {
  0: 'Head',
  1: 'Neck',
  2: 'R Shoulder',
  3: 'L Shoulder',
  4: 'R Elbow',
  5: 'L Elbow',
  6: 'R Hand',
  7: 'L Hand',
  8: 'Torso',
  9: 'R Hip',
  10: 'L Hip',
  11: 'R Knee',
  12: 'L Knee',
  13: 'R Foot',
  14: 'L Foot',
}

Depth Maps

Key | Dimensions | Data Type | Description
id | $(n,)$ | 8-bit unsigned integer (uint8) | Frame identifier in the form XX_YYYYY where XX is the person's ID number and YYYYY is the frame number.
data | $(n,240,320)$ | Half precision floating point (float16) | Depth map (i.e. mesh) corresponding to a single frame. Depth values are in real world meters (m).
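
Inverting the transformation from the previous section recovers real-world coordinates from a depth map. A minimal sketch, assuming the same calibration constant; the helper name is ours:

import numpy as np

C = 3.50e-3  # intrinsic camera calibration parameter from the Transformation section

def depth_map_to_points(depth_map):
    # Convert a (240, 320) depth map in meters to a (240*320, 3) array of real-world points.
    y_img, x_img = np.mgrid[0:240, 0:320]
    z = depth_map.astype(np.float32)
    x = (x_img - 160) * C * z
    y = -(y_img - 120) * C * z
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)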

Point Clouds

Key | Dimensions | Data Type | Description
id | $(n,)$ | 8-bit unsigned integer (uint8) | Frame identifier in the form XX_YYYYY where XX is the person's ID number and YYYYY is the frame number.
data | $(n,76800,3)$ | Half precision floating point (float16) | Point cloud containing 76,800 points (240x320). Each point is represented by a 3D tuple measured in real world meters (m).

Labels

Key | Dimensions | Data Type | Description
id | $(n,)$ | 8-bit unsigned integer (uint8) | Frame identifier in the form XX_YYYYY where XX is the person's ID number and YYYYY is the frame number.
is_valid | $(n,)$ | 8-bit unsigned integer (uint8) | Flag corresponding to the result of the human labeling effort. This is a boolean value (represented by an integer) where a one (1) denotes clean, human-approved data and a zero (0) denotes noisy human body part labels. If is_valid is zero, do not use any of the provided human joint locations for that frame.
visible_joints | $(n,15)$ | 16-bit integer (int16) | Binary mask indicating whether each human joint is visible or occluded. This is denoted by $\alpha$ in the paper. If $\alpha_j=1$, the $j^{th}$ joint is visible (i.e. not occluded); if $\alpha_j=0$, the $j^{th}$ joint is occluded.
image_coordinates | $(n,15,2)$ | 16-bit integer (int16) | Two-dimensional $(x,y)$ points corresponding to the location of each joint in the depth image or depth map.
real_world_coordinates | $(n,15,3)$ | Half precision floating point (float16) | Three-dimensional $(x,y,z)$ points corresponding to the location of each joint in real world meters (m).
segmentation | $(n,240,320)$ | 8-bit integer (int8) | Pixel-wise assignment of body part labels. The background class (i.e. no body part) is denoted by $-1$.
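
A short sketch of how these fields fit together, filtering to valid frames and applying the visibility mask; the file name matches the Getting Started example below and the statistic computed is only illustrative:

import h5py

labels = h5py.File('ITOP_side_test_labels.h5', 'r')

valid = labels['is_valid'][:].astype(bool)                  # human-approved frames only
visible = labels['visible_joints'][:][valid].astype(bool)   # (m, 15) visibility mask (alpha)
coords = labels['real_world_coordinates'][:][valid]         # (m, 15, 3) joint positions in meters

# Per-joint visibility rate across all valid frames.
print(visible.mean(axis=0))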

Code

Dependencies

The Python example below uses h5py, NumPy, and OpenCV (cv2). The C++ example uses the HDF5 C++ API (H5Cpp.h).

Getting Started

Python
import h5py
import numpy as np
import cv2

joint_id_to_name = {
  0: 'Head',
  1: 'Neck',
  2: 'R Shoulder',
  3: 'L Shoulder',
  4: 'R Elbow',
  5: 'L Elbow',
  6: 'R Hand',
  7: 'L Hand',
  8: 'Torso',
  9: 'R Hip',
  10: 'L Hip',
  11: 'R Knee',
  12: 'L Knee',
  13: 'R Foot',
  14: 'L Foot',
}

def main():
    depth_maps = h5py.File('ITOP_side_test_depth_map.h5', 'r')
    labels = h5py.File('ITOP_side_test_labels.h5', 'r')
    for i in range(depth_maps['data'].shape[0]):
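        # Skip frames whose labels were rejected during the human labeling effort (is_valid == 0).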
        if labels['is_valid'][i]:
            depth_map = depth_maps['data'][i].astype(np.float32)
            joints = labels['image_coordinates'][i]
            img = depth_map_to_image(depth_map, joints)
            cv2.imshow("Image", img)
            cv2.waitKey(0)
            # ...
            # Your code here
            # ...
    return 0

def depth_map_to_image(depth_map, joints=None):
    # Normalize depth to [0, 1], then convert to an 8-bit, color-mapped image.
    img = cv2.normalize(depth_map, depth_map, 0, 1, cv2.NORM_MINMAX)
    img = np.array(img * 255, dtype=np.uint8)
    img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
    img = cv2.applyColorMap(img, cv2.COLORMAP_OCEAN)
    if joints is not None:
        # Draw each joint location and its name on the image.
        for j in range(15):
            x, y = int(joints[j, 0]), int(joints[j, 1])
            cv2.circle(img, (x, y), 1, (255, 255, 255), thickness=2)
            cv2.putText(img, joint_id_to_name[j], (x + 5, y + 5), cv2.FONT_HERSHEY_SIMPLEX, 0.3, (255, 255, 255))
    return img

if __name__ == '__main__':
    main()
C++
#include <iostream>
#include <string>
#include <vector>
#include <H5Cpp.h>

using namespace std;
using namespace H5;

int main (void) {
    string file_name = "ITOP_side_train_depth_map.h5";
    string dataset_path = "/data";

    // Open HDF5 file
    H5File fp(file_name, H5F_ACC_RDONLY);
    DataSet dset = fp.openDataSet(dataset_path);
    DataSpace dspace = dset.getSpace();

    // Dataset dimensions
    int rank;
    hsize_t dims[3];
    rank = dspace.getSimpleExtentDims(dims, NULL);
    DataSpace memspace (rank, dims);

    // Load the entire dataset into a flat buffer
    // (HDF5 converts the stored float16 values to 32-bit floats on read).
    hsize_t n_entries = dims[0] * dims[1] * dims[2];
    vector<float> data_out(n_entries);
    dset.read(data_out.data(), PredType::NATIVE_FLOAT, memspace, dspace);

    for (hsize_t i = 0; i < dims[0]; i++) {
        // Copy frame i into a 2D buffer; the data is stored row-major as (n, 240, 320).
        vector<vector<float>> depth_map(dims[1], vector<float>(dims[2]));
        for (hsize_t j = 0; j < dims[1]; j++) {
            for (hsize_t k = 0; k < dims[2]; k++) {
                depth_map[j][k] = data_out[(i * dims[1] + j) * dims[2] + k];
            }
        }
        // ...
        // Your code here
        // ...
    }
    fp.close();
    return 0;
}
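
Depending on how HDF5 is installed on your system, this example typically builds with something like g++ example.cpp -lhdf5_cpp -lhdf5; exact include paths and library names vary across distributions.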
