Our model provides absolute-scale 3D human poses from stereo image pairs when only 2D ground truths are available for training.
In this paper, we propose a novel deep learning-based 3D underwater human pose estimator capable of providing absolute-scale 3D poses of scuba divers from stereo image pairs. While existing research has made significant advances in 3D human pose estimation, most methods rely on 3D ground truth for training. To overcome this, our approach leverages epipolar geometry to derive 3D information from 2D estimations. Our method estimates 2D human poses while capturing their corresponding disparity from binocular perspectives, thus avoiding the challenge of finding per-pixel correspondences in the textureless regions common underwater. Additionally, to reduce the sensitivity of our method to 2D annotation accuracy, we design an auto-refinement pipeline that automatically corrects biases introduced by human labeling. Experiments demonstrate that our approach significantly outperforms previous state-of-the-art methods in dynamic environments while being efficient enough to run on limited-capacity edge devices.
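To make the geometry concrete, the following minimal sketch shows how absolute-scale 3D keypoints can be recovered from a 2D pose and per-keypoint disparity under rectified stereo. The function and parameter names (`keypoints_to_3d`, `focal_px`, `baseline_m`) are illustrative assumptions, not the paper's actual interface.

```python
# Hedged sketch: back-projecting 2D keypoints with per-keypoint disparity
# into absolute-scale 3D under rectified stereo geometry. Names are
# illustrative; this is not the paper's exact implementation.
import numpy as np

def keypoints_to_3d(kpts_2d, disparity, focal_px, baseline_m, cx, cy):
    """Back-project 2D keypoints (N, 2) with per-keypoint disparity (N,)
    into camera-frame 3D points (N, 3) in metres.

    Standard rectified-stereo relations:
        Z = f * B / d
        X = (u - cx) * Z / f,  Y = (v - cy) * Z / f
    """
    d = np.clip(disparity, 1e-6, None)    # guard against division by zero
    z = focal_px * baseline_m / d         # absolute depth per keypoint
    x = (kpts_2d[:, 0] - cx) * z / focal_px
    y = (kpts_2d[:, 1] - cy) * z / focal_px
    return np.stack([x, y, z], axis=-1)
```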
We collected a training dataset that includes footage from both closed-water and open-water environments. It also features challenging poses that divers can perform easily underwater, such as swimming upside down. The 2D keypoints are first manually annotated by humans and then refined by our proposed auto-refinement pipeline.
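As one hedged illustration of what such refinement can do (the paper's exact pipeline may differ), rectified stereo implies that corresponding left and right keypoints lie on the same image row, so any row disagreement between the two annotations is labeling bias that can be corrected automatically:

```python
# Illustrative refinement step, assuming rectified stereo: enforce the
# epipolar constraint that matched keypoints share the same image row.
# This is a sketch, not the paper's full auto-refinement pipeline.
import numpy as np

def refine_row_consistency(kpts_left, kpts_right):
    """kpts_left, kpts_right: (N, 2) arrays of (u, v) annotations.
    Returns refined copies whose v-coordinates agree across views."""
    v_mean = 0.5 * (kpts_left[:, 1] + kpts_right[:, 1])  # consensus row
    left, right = kpts_left.copy(), kpts_right.copy()
    left[:, 1] = v_mean
    right[:, 1] = v_mean
    return left, right
```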
For quantitative analysis, we defined three metrics: (1) Depth, (2) Orientation, and (3) Pose, to evaluate the performance of different methods; a sketch of how such metrics might be computed follows the list below. We compare our results with TR [1] on the DiverPose dataset.
(1) The Depth Metric evaluates how accurately the model estimates the distance of the diver from the camera. We asked divers to perform 360-degree horizontal rotations within predefined distances from the camera.
(2) The Orientation Metric measures the model’s ability to determine the diver’s orientation. Here, divers stood or knelt at a fixed position, initially facing the camera, and rotated clockwise in 45-degree increments, covering 8 different orientations.
(3) The Pose Metric validates if the model can accurately estimate the relative human pose in 3D space. We asked divers to move their arms or legs either in front of or behind their bodies.
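A minimal sketch of how these three metrics might be scored is given below; the error definitions (absolute root-depth error, horizontal facing angle, root-aligned MPJPE) are illustrative assumptions rather than the exact evaluation protocol.

```python
# Hedged sketch of plausible scoring functions for the Depth, Orientation,
# and Pose metrics. Definitions are assumptions for illustration only.
import numpy as np

def depth_error(pred_root_z, gt_distance_m):
    """Depth: absolute error (metres) between the estimated root depth
    and the predefined camera-to-diver distance."""
    return abs(pred_root_z - gt_distance_m)

def orientation_error(pred_facing_xz, gt_facing_xz):
    """Orientation: angle (degrees) between estimated and ground-truth
    facing directions projected onto the horizontal (x-z) plane."""
    a = pred_facing_xz / np.linalg.norm(pred_facing_xz)
    b = gt_facing_xz / np.linalg.norm(gt_facing_xz)
    return np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))

def pose_error(pred_kpts, gt_kpts):
    """Pose: mean per-joint position error (MPJPE) after root alignment,
    so only the relative 3D pose is scored. Assumes joint 0 is the root."""
    pred = pred_kpts - pred_kpts[0]
    gt = gt_kpts - gt_kpts[0]
    return np.linalg.norm(pred - gt, axis=-1).mean()
```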
PedX [2] is a dataset that provides high-resolution stereo images and LiDAR data with manual 2D and automatic 3D annotations. We apply our proposed refinement pipeline and model architecture to PedX using the provided 2D bounding boxes. In the visualization, the provided ground truth is shown in red and the model estimations in blue. The results demonstrate that our method can determine pedestrian location and orientation in complex urban intersections, indicating that it generalizes to real-world scenarios.
[1] J. Zhao, T. Yu, L. An, Y. Huang, F. Deng, and Q. Dai, “Triangulation Residual Loss for Data-efficient 3D Pose Estimation”, NeurIPS, 2023.
[2] W. Kim, M. S. Ramanagopal, C. Barto, M.-Y. Yu, K. Rosaen, N. Goumas, R. Vasudevan, and M. Johnson-Roberson, “PedX: Benchmark Dataset for Metric 3-D Pose Estimation of Pedestrians in Complex Urban Intersections”, IEEE Robotics and Automation Letters, 2019, pp. 1940-1947.