Let's Talk About Language! Investigating Linguistic Diversity in Embodied AI Datasets

Abstract

The linguistic quality of Embodied AI (EAI) datasets is underexplored. We present a feature extraction pipeline that quantifies diversity across token- and sentence-level traits such as lexical variation and syntactic complexity. Applied to multiple EAI datasets, our analysis reveals a reliance on repetitive language that may hinder generalization. A feature-guided paraphrasing case study on LIBERO-10 shows that minor syntactic shifts can cut OpenVLA’s success rate by over 50%, underscoring the value of fine-grained linguistic analysis for dataset design and model evaluation.
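As a rough illustration of what token- and sentence-level diversity features can look like, the following minimal sketch computes three simple statistics over a list of instructions. It is not the pipeline described in the abstract: the function names, the naive whitespace tokenizer, and the choice of features (type-token ratio, mean instruction length as a crude stand-in for complexity, and the share of unique instructions) are all assumptions made for illustration.

```python
from collections import Counter

PUNCT = ".,!?;:\"'"


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on whitespace, strip punctuation."""
    stripped = (tok.strip(PUNCT) for tok in text.lower().split())
    return [tok for tok in stripped if tok]


def diversity_features(instructions):
    """Hypothetical helper: simple diversity statistics for a list of instruction strings."""
    if not instructions:
        return {"type_token_ratio": 0.0, "mean_length": 0.0, "unique_instruction_ratio": 0.0}

    tokens = [tok for ins in instructions for tok in tokenize(ins)]
    counts = Counter(tokens)
    n_tokens = len(tokens)
    # Instructions that differ only in case or surrounding whitespace count as duplicates.
    unique_instructions = {ins.lower().strip() for ins in instructions}

    return {
        # Distinct tokens divided by total tokens: low values indicate repetitive vocabulary.
        "type_token_ratio": len(counts) / n_tokens if n_tokens else 0.0,
        # Average number of tokens per instruction: a crude proxy, not a syntactic measure.
        "mean_length": n_tokens / len(instructions),
        # Share of instructions that are not repeats of another instruction.
        "unique_instruction_ratio": len(unique_instructions) / len(instructions),
    }


if __name__ == "__main__":
    sample = ["Pick up the mug.", "pick up the mug.", "Put the bowl on the plate"]
    print(diversity_features(sample))
```

Real syntactic complexity measures (for example, parse-tree depth) would need a parser, which this sketch deliberately leaves out to stay dependency-free.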

Publication
1st Workshop on Safely Leveraging Vision-Language Foundation Models in Robotics: Challenges and Opportunities