We present Supervision by Registration and Triangulation (SRT), an unsupervised approach that uses unlabeled multi-view video to improve the accuracy and precision of landmark detectors. Learning from unlabeled data lets our detectors draw on the massive amounts of freely available unlabeled data rather than being limited by the quality and quantity of manual human annotations. The approach rests on two key observations: (1) detections of the same landmark in adjacent frames should be coherent with registration, i.e., optical flow; and (2) detections of the same landmark in multiple synchronized and calibrated views should correspond to a single 3D point, i.e., multi-view consistency. Registration and multi-view consistency are sources of supervision that do not require manual labeling, so they can be leveraged to augment existing training data during detector training. End-to-end training is made possible by differentiable registration and 3D triangulation modules. Experiments on 11 datasets, together with a newly proposed metric for measuring precision, demonstrate accuracy and precision improvements in landmark detection on both images and video. Detailed ablation studies analyze (1) different optical flow algorithms, (2) unlabeled data of different kinds and quantities, (3) different choices of loss calculation, (4) the effect of noisy annotations, (5) different hyper-parameters, and (6) failure cases.
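The two observations above can be illustrated with a minimal NumPy sketch. This is not the SRT implementation: the function names, the use of linear (DLT) triangulation, and the plain Euclidean error terms are assumptions made for illustration, and the code is not differentiable as written (an autodiff framework would be needed for end-to-end training).

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Linear (DLT) triangulation of one landmark from two or more views.

    projections: (V, 3, 4) camera projection matrices (assumed known).
    points_2d:   (V, 2) detections of the same landmark, one per view.
    Returns the estimated 3D point as a length-3 array.
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point X (length 3) with a 3x4 projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def multiview_consistency_loss(projections, points_2d):
    """Mean reprojection distance after triangulating the per-view detections.

    The value is near zero when all views agree on a single 3D point.
    """
    X = triangulate_dlt(projections, points_2d)
    reprojected = np.stack([project(P, X) for P in projections])
    return float(np.mean(np.linalg.norm(reprojected - points_2d, axis=1)))

def registration_loss(det_prev, det_next, flow_at_prev):
    """Mean distance between flow-propagated and detected landmarks.

    det_prev, det_next: (L, 2) detections in two adjacent frames.
    flow_at_prev:       (L, 2) optical-flow displacement at each det_prev.
    """
    propagated = det_prev + flow_at_prev
    return float(np.mean(np.linalg.norm(propagated - det_next, axis=-1)))
```

Both functions need only detector outputs, camera calibration, and optical flow as input, which is why no manual labels are required for these terms.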

Figure 1. The Supervision by Registration and Triangulation (SRT) framework takes labeled images and unlabeled, synchronized, and geometrically calibrated multi-view video as input to train an image-based landmark detector that is more precise on images and video, more stable on video, and more consistent in multi-view scenarios. OF and 3DT stand for Optical Flow and 3D Triangulation, respectively.


Figure 2. We ran experiments on 11 datasets. "HP" indicates human pose; "PF" and "PP" indicate Panoptic-Face and Panoptic-Pose, respectively.

Video 1. Applying SRT to improve HRNet on a video clip from 300-VW. Please view the video at full resolution for best effect. (Use Chrome to play the video.)

Figure 3. Effect of using various unlabeled datasets to enhance the regression-based detector. We report NME and P-error on three test subsets of 300-W, and NME, AUC@0.08, and P-error on 300-VW A, B, and C. X denotes that the corresponding supervision is not used. All experiments in this table use 49 landmarks, excluding those on the facial boundary. The SBR/SBT/SRT numbers are averaged over 3 runs.
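For readers unfamiliar with the accuracy metrics named in the caption, the sketch below shows how NME and AUC at a threshold are commonly computed. The normalizing distance (for example an inter-ocular distance or a face-box size) is not specified on this page and is left as an input, and the function names are hypothetical. P-error is not sketched because its definition is not given here.

```python
import numpy as np

def nme(pred, gt, norm_dist):
    """Normalized mean error per sample.

    pred, gt:  (N, L, 2) predicted and ground-truth landmark coordinates.
    norm_dist: (N,) normalizing distance per sample (an assumption; the
               choice of normalizer varies between benchmarks).
    Returns an (N,) array of errors.
    """
    per_landmark = np.linalg.norm(pred - gt, axis=-1)  # (N, L) pixel errors
    return per_landmark.mean(axis=1) / norm_dist

def auc_at(errors, threshold=0.08, steps=1000):
    """Area under the cumulative error distribution up to `threshold`,
    divided by the threshold so the result lies in [0, 1]."""
    xs = np.linspace(0.0, threshold, steps)
    # Fraction of samples whose error is at most each cut-off value.
    ced = np.array([(errors <= x).mean() for x in xs])
    # Trapezoidal integration written out explicitly.
    area = np.sum((ced[1:] + ced[:-1]) * np.diff(xs)) / 2.0
    return float(area / threshold)
```

Lower NME is better, while higher AUC is better: a detector whose per-sample errors are all zero scores an AUC of 1 under this definition.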


  title     = {Supervision by Registration and Triangulation for Landmark Detection},
  author    = {Dong, Xuanyi and Yang, Yi and Wei, Shih-En and Weng, Xinshuo and Sheikh, Yaser and Yu, Shoou-I},
  journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
  volume    = {},
  number    = {},
  keywords  = {Landmark Detection;Optical Flow;Triangulation;Deep Learning},
  doi       = {10.1109/TPAMI.2020.2983935},
  ISSN      = {1939-3539},
  year      = {2020},
  month     = {},
  note      = {\mbox{doi}:\url{10.1109/TPAMI.2020.2983935}}