Recurrent Attention Models for Depth-Based Person Identification

Computer Science Department, Stanford University
Computer Vision and Pattern Recognition, June 2016


We present an attention-based model that reasons on human body shape and motion dynamics to identify individuals in the absence of RGB information, hence in the dark. Our approach leverages unique 4D spatio-temporal signatures to address the identification problem across days. Formulated as a reinforcement learning task, our model is based on a combination of convolutional and recurrent neural networks with the goal of identifying small, discriminative regions indicative of human identity. We demonstrate that our model produces state-of-the-art results on several published datasets given only depth images. We further study the robustness of our model towards viewpoint, appearance, and volumetric changes. Finally, we share insights gleaned from interpretable 2D, 3D, and 4D visualizations of our model's spatio-temporal attention.


Full Paper: [pdf] (4.8 MB)

Bibtex Citation

@inproceedings{haque2016recurrent,
    author = {Haque, Albert and Alahi, Alexandre and Fei-Fei, Li},
    title = {Recurrent Attention Models for Depth-Based Person Identification},
    booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2016}
}
DPI-T Dataset

Depth-Based Person Identification from Top View Dataset


Note: The files have been compressed with tar/gzip-9 (max) settings. The compressed and uncompressed sizes are listed.
| Split | Frames | Videos | People | RGB Images | Depth Images | Depth Maps | Point Clouds |
|-------|--------|--------|--------|------------|--------------|------------|--------------|
| Train | 3,740 | 226 | 12 | jpg (121M, 129M) | jpg (70M, 80M) | HDF5 (66M, 1.1G) | HDF5 (940M, 3.3G) |
| Test | 4,010 | 262 | 12 | jpg (130M, 140M) | jpg (75M, 86M) | HDF5 (71M, 1.1G) | HDF5 (1.0G, 3.3G) |

Data Schema

Files are organized as XX/YYY/XX_YYY_ZZZZ.jpg, where XX is the person ID, YYY is the video ID, and ZZZZ is the frame ID. The HDF5 files follow the same XX/YYY hierarchy, with one dataset per video.

| Representation | Dimension | Data Type | Description |
|----------------|-----------|-----------|-------------|
| Depth Map | $(n, 240, 320)$ | Single-precision floating point (float32) | Depth map (i.e. mesh) corresponding to a single frame. Depth values are in real-world meters (m). |
| Point Clouds | $(n, 76800, 3)$ | Single-precision floating point (float32) | Point cloud containing 76,800 points (240x320). Each point is represented by a 3D tuple measured in real-world meters (m). |
Transformation from 3D to 2D
To convert from point clouds to a $240\times 320$ image, the following transformations were used. Let $x_{img}$ and $y_{img}$ denote the $(x,y)$ coordinate in the image plane. Using the raw point cloud $(x,y,z)$ real world coordinates, we compute the depth map as follows: $$x_{img} = \frac{x}{Cz} + 160 \mathrm{\quad and \quad} y_{img} = -\frac{y}{Cz} + 120$$ where $C = 3.8605\times 10^{-3} = 0.0038605$ is the intrinsic camera calibration parameter. This results in the depth map: $(x_{img}, y_{img}, z)$.
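The projection above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not official dataset code; the function name and the choice to resolve collisions by simple overwriting are our own.

```python
import numpy as np

# Intrinsic camera calibration parameter from the dataset description.
C = 3.8605e-3

def point_cloud_to_depth_map(points):
    """Project (x, y, z) world coordinates (meters) onto a 240x320 depth map.

    `points` is an (N, 3) array. Points with z <= 0 or that fall outside
    the image plane are skipped. Illustrative sketch only.
    """
    depth = np.zeros((240, 320), dtype=np.float32)
    for x, y, z in points:
        if z <= 0:
            continue
        x_img = int(round(x / (C * z) + 160))   # horizontal image coordinate
        y_img = int(round(-y / (C * z) + 120))  # vertical (y axis flipped)
        if 0 <= x_img < 320 and 0 <= y_img < 240:
            depth[y_img, x_img] = z
    return depth
```

A point at the optical axis, e.g. $(0, 0, 1)$, lands at the image center $(160, 120)$ with depth 1 m.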

Each file contains several HDF5 datasets at the root level. Dimensions, attributes, and data types are listed below. Let $n$ denote the number of images.
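As a concrete illustration of this layout, the sketch below writes a tiny HDF5 file that mimics the person-ID/video-ID hierarchy and reads it back with h5py. The filename, IDs, and frame counts are made up for the example.

```python
import h5py
import numpy as np

# Build a toy file mirroring the schema: /<person ID>/<video ID> holds an
# (n, 240, 320) float32 array of depth frames. All values here are dummies.
with h5py.File('toy_depth_map.h5', 'w') as f:
    f.create_dataset('01/001', data=np.zeros((5, 240, 320), dtype=np.float32))
    f.create_dataset('01/002', data=np.zeros((3, 240, 320), dtype=np.float32))

# Walk the hierarchy the same way the usage examples below do.
with h5py.File('toy_depth_map.h5', 'r') as f:
    for pid in f:            # person ID (group)
        for vid in f[pid]:   # video ID (dataset)
            n = f[pid][vid].shape[0]
            print(pid, vid, n)
```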



The training code is available on GitHub. See the README there for setup instructions and additional dependencies.


Using the Dataset

Python (h5py)
import h5py

depth_maps = h5py.File('DPI-T_test_depth_map.h5', 'r')

for pid in depth_maps.keys():
    for vid in depth_maps[pid].keys():
        num_frames = depth_maps[pid][vid].shape[0]
        for i in range(num_frames):  # range (xrange is Python 2 only)
            depth = depth_maps[pid][vid][i]
            # ...
            # Do your stuff
            # ...
Lua (Torch)
require 'hdf5'

local input_file = hdf5.open('DPI-T_train_depth_map.h5', 'r')
local data = input_file:read('/'):all()

for p_id, person in pairs(data) do
    for v_id, video in pairs(person) do
        local n_frames = video:size()[1]
        for i = 1,n_frames do
            local depth_map = video[i]
            -- ...
            -- Your code here
            -- ...
        end
    end
end
