VTDexManip: A Dataset and Benchmark for Visual-Tactile Pretraining and Dexterous Manipulation With Reinforcement Learning
Qingtao Liu1, Yu Cui1, Zhengnan Sun1, Gaofeng Li1, Jiming Chen1, Qi Ye1
1Zhejiang University

Abstract

Vision and touch are essential sensory modalities in human manipulation. While prior works have leveraged human manipulation videos for pretraining robotic skills, most are limited to vision-language modalities and simple grippers. To address this gap, we collect a large-scale visual-tactile dataset of humans performing 10 daily manipulation tasks across 182 objects. Our dataset is the first to support complex dexterous manipulation with visual-tactile information.

We introduce a benchmark of 6 dexterous tasks and develop a framework for learning visual-tactile policies with reinforcement learning. We compare 17 baseline and pretraining strategies to assess the impact of different modalities.

Key results show: (1) incorporating sparse binary tactile signals boosts policy performance by ~20%, (2) joint visual-tactile pretraining improves generalization across tasks, and (3) the visual-tactile policy is robust to sensor noise and viewpoint variation.
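As a rough illustration of finding (1), the minimal sketch below binarizes raw tactile readings into sparse contact on/off signals and appends them to the policy observation. The threshold, array shapes, and function names are illustrative assumptions, not the calibration or interface used in our pipeline.

import numpy as np

def binarize_tactile(raw_tactile: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    # Convert raw per-taxel readings into sparse binary contact signals.
    # The threshold is a placeholder, not the sensor calibration used in the paper.
    return (raw_tactile > threshold).astype(np.float32)

def build_observation(proprio: np.ndarray, visual_feat: np.ndarray,
                      raw_tactile: np.ndarray) -> np.ndarray:
    # Concatenate proprioception, a visual feature vector, and binary touch
    # into a single policy observation vector.
    return np.concatenate([proprio, visual_feat, binarize_tactile(raw_tactile)], axis=-1)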

Dataset Statistics

★ Overview of our collected visual-tactile dataset, which includes 10 diverse manipulation tasks and 182 objects. (a) Our collection system. (b) The number of trajectories and objects. (c) The number of total frames (On: frames with contact; Off: frames without contact). (d) The distribution of the number of frames. (e) t-SNE of the tactile signals.
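A minimal sketch of how a t-SNE embedding like the one in panel (e) can be produced from flattened tactile frames; the file names, array shapes, and label scheme below are placeholders rather than the released dataset layout.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: one flattened tactile frame per row, plus a task id per frame.
tactile_frames = np.load("tactile_frames.npy")   # shape (N, num_taxels), assumed
task_labels = np.load("task_labels.npy")         # shape (N,), assumed

# Project the high-dimensional tactile frames to 2D for visualization.
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(tactile_frames)

plt.scatter(embedding[:, 0], embedding[:, 1], c=task_labels, s=2, cmap="tab10")
plt.title("t-SNE of tactile signals")
plt.show()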

Benchmark Overview

★ The proposed benchmark includes 6 complex dexterous manipulation tasks with visual-tactile RL training and evaluation, designed to test generalization across tasks and modalities. (a) The six tasks of our manipulation platform. (b) The 18 pretrained and non-pretrained models in our benchmark. (c) The policy learning framework, which combines proprioceptive inputs and perception representations to guide actions within an MDP, with skills learned via PPO.
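A minimal sketch of the actor-critic interface implied by (c): proprioception is fused with a (possibly frozen) pretrained visual-tactile representation, and the resulting policy is trained with PPO. All dimensions, layer sizes, and names are illustrative assumptions, not the benchmark's actual architecture.

import torch
import torch.nn as nn

class VisuoTactilePolicy(nn.Module):
    # Actor-critic head over [proprioception ; perception representation].
    def __init__(self, proprio_dim=24, repr_dim=256, action_dim=22, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(proprio_dim + repr_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mu = nn.Linear(hidden, action_dim)        # mean of a Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.value = nn.Linear(hidden, 1)              # critic head used by PPO

    def forward(self, proprio, perception_repr):
        h = self.backbone(torch.cat([proprio, perception_repr], dim=-1))
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        return dist, self.value(h)                     # action distribution, value estimate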

Video

BibTeX

@inproceedings{liu2025vtdexmanip,
title={VTDexManip: A Dataset and Benchmark for Visual-tactile Pretraining and Dexterous Manipulation with Reinforcement Learning},
author={Qingtao Liu and Yu Cui and Zhengnan Sun and Gaofeng Li and Jiming Chen and Qi Ye},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=jf7C7EGw21}
}

Contact

For questions and collaborations, contact:

l_qingtao@zju.edu.cn, qi.ye@zju.edu.cn