While humans naturally use their hands to communicate and manipulate objects, current robotic systems often struggle with complex manual tasks. To bridge this gap, researchers have increasingly turned to machine learning models that process images of humans performing manual activities. These models aim to enhance robotic manipulation and, by extension, improve how robots interact with both humans and the objects in their environment. Similar models could also improve human-machine interfaces, including augmented and virtual reality (AR and VR) systems. Training such systems effectively, however, requires access to high-quality, annotated datasets of human interactions with objects.
Meta Reality Labs recently unveiled a promising new dataset called HOT3D, which could accelerate machine learning research on hand-object interactions. The dataset, introduced in a paper published on the arXiv preprint server, contains over 833 minutes of high-quality egocentric video footage: the recordings capture users handling a variety of objects from their own point of view, mirroring what they see as they perform manual tasks.
As the researchers explain, HOT3D is a publicly available resource designed for egocentric hand and object tracking in 3D, providing over 3.7 million images of users interacting with objects. The dataset includes 19 subjects interacting with 33 different rigid objects, as well as multi-modal signals like eye gaze and scene point clouds. Additionally, the dataset contains detailed ground-truth annotations, including the 3D poses of objects, hands, and cameras, as well as 3D models of both hands and objects.
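To make those annotation types concrete, the sketch below shows one way a per-frame record could be organized in code. The class and field names are illustrative assumptions for this article, not the official HOT3D schema or loading API.

```python
# Illustrative sketch only: a hypothetical per-frame record mirroring the
# annotation types described above (hand, object, and camera poses, eye gaze,
# scene point clouds). Names are assumptions, not the official HOT3D format.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

import numpy as np


@dataclass
class FrameAnnotation:
    timestamp_ns: int                               # capture time of the frame
    camera_pose: np.ndarray                         # 4x4 world-from-camera transform
    hand_poses: Dict[str, np.ndarray]               # "left"/"right" -> hand pose parameters
    object_poses: Dict[str, np.ndarray]             # object id -> 4x4 world-from-object transform
    eye_gaze: Optional[Tuple[float, float]] = None  # gaze direction (yaw, pitch), if available
    point_cloud: Optional[np.ndarray] = None        # Nx3 scene points, if available


def visible_objects(frame: FrameAnnotation) -> List[str]:
    """Return the ids of objects annotated in this frame (toy helper)."""
    return list(frame.object_poses.keys())
```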
The HOT3D dataset covers a broad range of scenarios, from simple tasks like picking up and observing objects to more complex actions commonly seen in everyday settings, such as manipulating kitchen utensils, handling food, or typing on a keyboard. Data was collected using two key devices developed by Meta: the Project Aria glasses and the Quest 3 VR headset.
Project Aria, a pair of lightweight glasses built for augmented reality (AR) research, captures video and audio while tracking the wearer's eye movements and the positions of objects in their field of view. The Quest 3, a commercially available VR headset, was used to capture a second, complementary set of recordings. As the researchers note, ground-truth poses were acquired with a professional motion-capture system that relied on small optical markers attached to the hands and objects during data collection.
To assess HOT3D's value for robotics and computer vision applications, the researchers trained baseline models for three key tasks: 3D hand tracking, 6 degrees of freedom (6DoF) object pose estimation, and 3D lifting of unknown in-hand objects. Their experiments showed that models trained on HOT3D's multi-view data outperformed those trained on single-view data by a significant margin.
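As a rough illustration of what a 6DoF object pose represents, the minimal sketch below applies a rotation-plus-translation transform to a few object model points and projects them into two hypothetical headset cameras. Every value in it (intrinsics, stereo baseline, poses) is a made-up placeholder rather than data from HOT3D, and it is not the evaluation method used in the paper; it only shows why each additional view adds constraints that make pose estimation easier.

```python
# Minimal sketch, not from the paper: a 6DoF object pose (rotation + translation)
# mapping 3D model points into two camera views. All numbers are placeholders.
import numpy as np


def to_homogeneous(points: np.ndarray) -> np.ndarray:
    """Append a column of ones so 3D points can be multiplied by 4x4 transforms."""
    return np.hstack([points, np.ones((points.shape[0], 1))])


def project(points_world: np.ndarray, world_from_camera: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project world-frame points into pixel coordinates of one pinhole camera."""
    camera_from_world = np.linalg.inv(world_from_camera)
    pts_cam = (camera_from_world @ to_homogeneous(points_world).T).T[:, :3]
    pix = (K @ pts_cam.T).T
    return pix[:, :2] / pix[:, 2:3]  # perspective divide by depth


# A toy 6DoF object pose: 90-degree rotation about z plus a translation.
theta = np.pi / 2
world_from_object = np.array([
    [np.cos(theta), -np.sin(theta), 0.0, 0.10],
    [np.sin(theta),  np.cos(theta), 0.0, 0.00],
    [0.0,            0.0,           1.0, 0.50],
    [0.0,            0.0,           0.0, 1.00],
])

# Object model points (object frame) moved into the world frame via the pose.
model_points = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.0, 0.05, 0.0]])
points_world = (world_from_object @ to_homogeneous(model_points).T).T[:, :3]

# Two hypothetical headset cameras: each extra view contributes its own
# reprojection constraints, the intuition behind the multi-view advantage.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cam_left = np.eye(4)
cam_right = np.eye(4)
cam_right[0, 3] = 0.06  # 6 cm baseline between the two cameras

print(project(points_world, cam_left, K))
print(project(points_world, cam_right, K))
```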
As the researchers concluded, “The evaluated multi-view methods, uniquely enabled by HOT3D, significantly outperform their single-view counterparts.” This highlights the substantial potential of multi-view egocentric data in advancing various robotics tasks.
The HOT3D dataset is open-source and available for download through the Project Aria website, allowing researchers around the globe to leverage it in their work. This dataset has the potential to further the development of technologies related to human-machine interaction, robotics, and computer vision, offering crucial insights for improving the capabilities of next-generation systems.
By Impact Lab