Introducing a novel representation that captures object surface features and spatial relations, enabling generalizable skills across grasping, in-hand reorientation, and bimanual handover.
Robotic dexterous manipulation is challenging because of the high number of degrees of freedom (DoFs) and complex contacts. While existing methods focus on sample efficiency, less attention has been paid to representations that generalize across complex hand-object interactions.
We propose DexRep, a novel representation capturing object surface features and spatial relations. Our method achieves an 87.9% success rate on 5,000+ unseen objects and boosts performance by 20% to 40% in reorientation and bimanual handover tasks.
Fig 1. The DexRepNet++ framework: Integrating geometric surface features and spatial hand-object relations.
A novel geometric and spatial representation for robust hand-object interaction learning.
Trained on 40 objects and validated on 5,000+ unseen objects, achieving state-of-the-art success rates.
Seamless deployment on real robotic platforms, zero-shot or with minimal fine-tuning.
We evaluate DexRep on single-hand and dual-hand tasks, demonstrating superior generalization and robustness.
Fig 2. Policies learned with DexRep perform grasping, in-hand reorientation, and handover tasks.
Policies learned with only 40 objects generalize to 5,000+ unseen shapes.
Dynamic rotation to reach target orientations with high precision.
Smooth transfer between two hands, boosting success rates by up to 40%.
Fig 3. Quantitative comparison of component contributions across random seeds.
Through extensive ablation experiments on different representation components, we derive the following core insights:
Empirical results show that the surface feature ($f_s$) is the most critical individual component: on its own, it achieves a success rate of approximately 93.5% on unseen objects.
Peak performance (96.6% success rate) is reached only when the occupancy ($f_o$), surface ($f_s$), and local-geometry ($f_l$) features are combined, particularly for objects with complex topologies.
Integrating global PointNet features (pGlo) degrades performance, indicating that local geometric representations transfer significantly better than global ones.
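The surface-related feature discussed above can be illustrated with a minimal sketch. It assumes, which this page does not state, that such a feature pairs each hand keypoint with its nearest object surface point and records the distance, the direction to that point, and the surface normal there. The function names `nearest_surface_features` and `flatten`, the six-point toy object, and the 7-values-per-keypoint layout are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def nearest_surface_features(hand_points, object_points, object_normals):
    """Pair each hand keypoint with its closest object surface point.

    Returns one (distance, unit_direction, normal) tuple per hand keypoint.
    Brute-force nearest neighbour, adequate for a small illustrative cloud.
    """
    features = []
    for h in hand_points:
        nearest = min(range(len(object_points)),
                      key=lambda i: math.dist(h, object_points[i]))
        p = object_points[nearest]
        d = math.dist(h, p)
        if d > 0.0:
            # Unit vector pointing from the hand keypoint to the surface point.
            direction = tuple((pc - hc) / d for pc, hc in zip(p, h))
        else:
            direction = (0.0, 0.0, 0.0)
        features.append((d, direction, tuple(object_normals[nearest])))
    return features

def flatten(features):
    """Concatenate per-keypoint features into one flat observation vector."""
    vec = []
    for d, direction, normal in features:
        vec.append(d)
        vec.extend(direction)
        vec.extend(normal)
    return vec

# Toy object (assumption): six points on a unit sphere; normals equal positions.
object_points = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                 (0.0, -1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
object_normals = object_points
hand_points = [(2.0, 0.0, 0.0), (0.0, 0.0, -3.0)]  # hypothetical keypoints

feats = nearest_surface_features(hand_points, object_points, object_normals)
vector = flatten(feats)  # 7 values per hand keypoint
```

Because every value is expressed relative to the nearest surface point rather than to the whole object, the same vector layout applies to any object shape, which is one plausible reading of why local features transfer better than a global embedding.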
| Configuration | 2-Finger | 3-Finger | 4-Finger |
|---|---|---|---|
| DexRep | 65.4% | 78.2% | 81.5% |
Hand Agnosticism: DexRep models the spatial-geometric relationship of interactions rather than joint signals, allowing seamless adaptation to robots with different finger counts.
85.0% success rate under partial observations
Noise Tolerance: Voxel-based encoding provides inherent tolerance to sensor noise and occlusions typical of commodity cameras.
Zero-Shot Transfer: Despite incomplete point clouds, DexRep maintains a minimal sim-to-real gap (under 5% drop).
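The noise-tolerance claim for voxel-based encoding can be illustrated with a small sketch. It assumes a binary occupancy grid over a cube centred on a reference frame (for example, the hand); the function name `occupancy_grid`, the 4x4x4 resolution, the 8 cm cube, and the sample points are all hypothetical values chosen for illustration and are not taken from the paper.

```python
import math

def occupancy_grid(points, center, half_extent, resolution):
    """Binary occupancy over a cube of side 2*half_extent centred at `center`.

    Points falling outside the cube are ignored.
    """
    cell = 2.0 * half_extent / resolution
    grid = [[[0] * resolution for _ in range(resolution)]
            for _ in range(resolution)]
    for p in points:
        idx = [math.floor((pc - cc + half_extent) / cell)
               for pc, cc in zip(p, center)]
        if all(0 <= i < resolution for i in idx):
            grid[idx[0]][idx[1]][idx[2]] = 1
    return grid

center = (0.0, 0.0, 0.0)
half_extent = 0.04   # 8 cm cube (assumption)
resolution = 4       # 2 cm cells (assumption)

# Two surface points plus one far-away outlier that lies outside the cube.
clean = [(0.012, 0.013, 0.011), (-0.031, 0.009, 0.028), (0.2, 0.0, 0.0)]
# The same points shifted by 2 mm per axis, mimicking sensor noise.
noisy = [(0.014, 0.015, 0.013), (-0.029, 0.011, 0.030), (0.2, 0.0, 0.0)]

clean_grid = occupancy_grid(clean, center, half_extent, resolution)
noisy_grid = occupancy_grid(noisy, center, half_extent, resolution)
occupied = sum(v for plane in clean_grid for row in plane for v in row)
```

In this example the 2 mm perturbation stays within each 2 cm cell, so both grids are identical, and the outlier is simply dropped. Noise that pushes a point across a cell boundary would still flip a voxel, so the tolerance depends on the cell size relative to the noise level.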
Fig 4. Our experimental setup: Allegro Hand v4, Unitree Z1 arm, and Azure Kinect DK sensors.
@ARTICLE{liu2026dexrepnet++,
author={Liu, Qingtao and Sun, Zhengnan and Cui, Yu and Li, Haoming and Li, Gaofeng and Shao, Lin and Chen, Jiming and Ye, Qi},
journal={IEEE Transactions on Robotics},
title={DexRepNet++: Learning Dexterous Robotic Manipulation With Geometric and Spatial Hand-Object Representations},
year={2026},
volume={42},
number={},
pages={799-818},
keywords={Hands;Geometry;Grasping;Robots;Encoding;Handover;Training;Shape;Feature extraction;Visualization;Deep learning in robotics and automation;dexterous manipulation;hand-object representation;reinforcement learning (RL)},
doi={10.1109/TRO.2026.3651669}
}