Let's Talk About Language! Investigating Linguistic Diversity in Embodied AI Datasets

Selma Liliane Wanna, Agnes Luhtaru, Ryan Barron, Jonathan Salfity, Juston Moore, Cynthia Matuszek, Mitch Pryor

January 2025

Abstract

The linguistic quality of Embodied AI (EAI) datasets is underexplored. We present a feature extraction pipeline that quantifies diversity across token- and sentence-level traits such as lexical variation and syntactic complexity. Applied to multiple EAI datasets, our analysis reveals a reliance on repetitive language that may hinder generalization. A feature-guided paraphrasing case study on LIBERO-10 shows that minor syntactic shifts can cut OpenVLA’s success rate by over 50%, underscoring the value of fine-grained linguistic analysis for dataset design and model evaluation.

Type

Conference

Publication

1st Workshop on Safely Leveraging Vision-Language Foundation Models in Robotics: Challenges and Opportunities