Humanoids 2025
Dynamic RDMM: Scalable, Controllable Dataset Generation for Instruction-Grounded Robot Learning
Dynamic RDMM treats dataset construction as a programmable system, using hierarchical templates and constraint-aware generation to create symbolic supervision for robot learning, evaluation, and deployment.
Shady Nasrat, Minseong Jo, Seonil Lee, Seung-Joon Yi
Table of contents
- Introduction
- Related work
- Methods
- Generation Pipeline Overview
- Template and Ontology Design
- Controllability and Task Rebalancing
- Dataset Scale and Coverage
- Formal Definition
- RDMM Dataset Evaluation and Use Case Validation
- Model Training Setup
- Dataset-Level Accuracy
- Real-World Deployment
- Conclusion
- Task Descriptions, Ratios, and Actions
Introduction
Robotic systems that can follow natural-language instructions promise to make intelligent agents more accessible and useful in everyday environments. Recent advances in instruction-tuned Large Language Models (LLMs) have demonstrated compelling capabilities in translating user commands into structured plans, enabling robots to reason over complex task sequences. However, achieving robust, real-world instruction-following behavior remains challenging—especially when LLMs are deployed on embodied agents operating in noisy, dynamic household environments.
One of the central obstacles is the shortage of domain-specific, symbolically grounded training datasets. General-purpose web-scale corpora offer vast linguistic variety, but lack the structure, action alignment, and feasibility constraints needed for robotic execution. Conversely, existing robotics datasets—such as ALFRED[1] and TEACh[2]—provide grounded examples, but are typically static, narrow in scope, or costly to expand. As a result, robot learning pipelines suffer from a data bottleneck: when models underperform on a specific task, researchers have no quick, controllable way to inject targeted supervision without manual annotation. This gap is especially apparent in open competition settings such as RoboCup@Home[3].
In this work, we introduce the Dynamic RDMM Dataset—and more importantly, the controllable dataset generation engine that produces it. RDMM is a text-to-text dataset in which each sample maps a natural-language instruction to a structured action sequence, composed of symbolic primitives (e.g., MoveTo(kitchen), Pickup(milk), Respond("I’m here")). These sequences can be interpreted directly by robotic planners or mapped onto platform-specific skills, making the dataset readily deployable. Unlike previous datasets, RDMM is not a static benchmark—it is a parametric data engine enabling controllable, symbolic instruction-action supervision at scale.
The core of D-RDMM is a two-stage generation process:
Hierarchical template expansion: A compact YAML library encodes 23 common household tasks (e.g., guiding a person, delivering an item, answering a question) using nested logic structures. These templates recursively expand into semantically valid multi-step instructions.
Dynamic content generation: Templates are populated with verbs, object types, room names, and personal references drawn from curated embedding dictionaries. Semantic constraints ensure physical plausibility (e.g., “put the pizza on the table” is allowed; “put the microwave on the sandwich” is not).
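The hierarchical expansion of Stage 1 can be sketched as a small recursive routine. The template library and its schema below are hypothetical illustrations, not the released YAML format:

```python
import random

# Hypothetical template library: <placeholder> tokens refer to subtemplates;
# entries with no placeholders are terminal values. The released YAML uses
# its own schema -- this is only an illustrative sketch.
TEMPLATES = {
    "follow_task": ["follow <person>", "follow <person> and <delivery>"],
    "person": ["a person", "the person wearing <item>"],
    "delivery": ["deliver the <object> to them"],
    "item": ["glasses", "blue pants", "yellow shoes"],
    "object": ["apple juice", "sandwich"],
}

def expand(template: str) -> str:
    """Recursively replace each <placeholder> with a sampled expansion
    until only terminal text remains."""
    while "<" in template:
        start = template.index("<")
        end = template.index(">", start)
        key = template[start + 1:end]
        choice = random.choice(TEMPLATES[key])
        template = template[:start] + choice + template[end + 1:]
    return template

random.seed(0)
print(expand(random.choice(TEMPLATES["follow_task"])))
```

Because subtemplates may themselves contain placeholders, a single top-level template yields instructions of varying depth, which is what produces the low-, medium-, and high-complexity variants described above.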
This system offers several advantages critical for robot learning:
Scalability: From just 23 task templates, D-RDMM can generate over 100,000 valid instruction–action pairs.
Task controllability: Researchers can dynamically rebalance data by adjusting task-specific weights without reauthoring templates.
Curriculum and ablation support: Lexical variation, compositional complexity, and object diversity can all be programmatically adjusted to match experimental needs.
To validate D-RDMM, we fine-tune three open-source LLMs (LLaMA-3-8B, Mistral-7B, Qwen-0.5) using only the 1,860-sample seed set. These models achieve 93% accuracy on held-out samples and generalize to previously unseen instructions. Deployed on a mobile robotic platform at RoboCup@Home, D-RDMM-trained models reliably execute composite instructions in a real-world, multi-user environment.
Our contributions are three-fold:
A controllable dataset generation framework for instruction-following tasks in robotic environments;
The Dynamic RDMM Dataset, featuring 1,860 expert-validated pairs with the capacity to scale to 100k+ examples;
Empirical validation showing that small, well-structured datasets can train LLMs to reason over symbolic action plans and generalize in real-world robotic applications.
By treating dataset generation as a parameterized process rather than a static artifact, we turn data design into a flexible tool in the robot learning loop—paving the way for faster, more adaptive development of LLM-based instruction-following systems.
Methods
The Dynamic RDMM Dataset is not a static collection, but the output of a controllable, parameterized generation engine. It is constructed through a two-stage algorithmic process that transforms a compact set of YAML templates into thousands of semantically grounded, text-to-text instruction–action pairs. This modular architecture enables researchers to programmatically scale, rebalance, and adapt the dataset to match specific training and evaluation needs in robot learning, summarized in Fig. .
In addition to dataset generation, the full end-to-end execution framework—including speech recognition (STT), text-to-speech (TTS), visual perception models, person tracking, and symbolic planning—was deployed on a mobile platform and orchestrated using RDMM-trained language models. This integrated AI stack allows natural language instructions to be grounded in multimodal real-world execution. Details of the complete robotic system, including software and hardware integration, are described in our companion paper [21].
Generation Pipeline Overview
The D-RDMM dataset is generated via a two-stage process that supports structured language generation and controllable task complexity.
Stage 1 — Hierarchical Template Expansion. Each task category (e.g., follow, serve, guide) is defined by high-level templates written in YAML. These templates are hierarchical: they include nested references to lower-level subtemplates that describe entities (e.g., “a person wearing item”), actions, or spatial configurations. The system performs recursive expansion, layer by layer, until every placeholder resolves to a terminal value. This process naturally creates variations in instruction complexity:
Low-complexity instruction: “Follow a person”
Medium-complexity: “Follow the person wearing glasses”
High-complexity: “Follow the person wearing glasses and deliver the apple juice to them”
Stage 2 — Dynamic Content Generation. Once the expanded templates contain only atomic placeholders, the system fills them using curated embedding dictionaries for verbs, object classes, rooms, and person names. Combinations are sampled using a task weighting vector \(w\), and filtered through logical rules to eliminate physically implausible instructions (e.g., Put(microwave, sandwich) is invalid).
This two-stage pipeline enables large-scale, controllable generation of realistic instruction–action pairs while supporting task rebalancing and complexity tuning for curriculum learning.
Template and Ontology Design
Unlike prior RoboCup-style generators, D-RDMM outputs complete instruction–action pairs with symbolic structure, supports balanced sampling, and allows controlled scaling without additional human annotation. To achieve this, D-RDMM defines an ontology of entities and their affordances:
21 locations (e.g., desk, wardrobe, coffee table)
6 room types (e.g., kitchen, bedroom, office)
7 object classes covering 50+ items (e.g., snacks, drinks, toys)
14 person names (e.g., Kai, Noah, Riley)
Placeholders are only filled with values that pass semantic filters. For example, Pour(sandwich, red bowl) is disallowed, while Pour(milk, red bowl) is permitted. Verb–object and object–location pairings are checked against pre-defined grammar constraints and affordance maps.
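A minimal sketch of such a semantic filter, assuming hypothetical affordance sets (`POURABLE`, `CONTAINERS`) in place of the dataset's actual dictionaries and affordance maps:

```python
from itertools import product

# Illustrative affordance sets: only pourable objects may be poured, and
# only into containers. These names are hypothetical stand-ins for the
# dataset's curated dictionaries.
POURABLE = {"milk", "juice", "water"}
CONTAINERS = {"red bowl", "cup"}

def is_plausible(verb: str, obj: str, target: str) -> bool:
    """Reject physically implausible verb-object-target combinations."""
    if verb == "Pour":
        return obj in POURABLE and target in CONTAINERS
    return True  # other verbs would have their own checks

objects = ["milk", "sandwich"]
targets = ["red bowl"]
valid = [(o, t) for o, t in product(objects, targets)
         if is_plausible("Pour", o, t)]
print(valid)  # [('milk', 'red bowl')] -- the sandwich combination is filtered out
```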
Controllability and Task Rebalancing
A key feature of D-RDMM is that dataset size and task balance are configurable. Researchers specify a task weighting vector \(w \in \mathbb{R}^{23}\) and a global limit generate_amount. Increasing \(w_{\text{follow}}\) immediately generates more person-following samples without modifying templates.
This makes D-RDMM particularly well-suited for:
Curriculum learning (increasing complexity or lexical diversity)
Task-specific augmentation (targeted fine-tuning on underperforming behaviors)
Ablation studies (removing or isolating specific instruction types)
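The rebalancing mechanism can be sketched as weighted sampling over task categories: raising one component of \(w\) immediately shifts the task mix without touching any template. Task names and the 3:1:1 weighting below are illustrative:

```python
import random

# Sketch of task rebalancing: samples are drawn in proportion to a task
# weighting vector w, subject to a global generate_amount budget.
# The task list and weights here are illustrative, not the full 23-task vector.
tasks = ["follow", "serve", "guide"]
w = [3.0, 1.0, 1.0]          # triple the share of person-following samples
generate_amount = 1000

random.seed(0)
sampled = random.choices(tasks, weights=w, k=generate_amount)
counts = {t: sampled.count(t) for t in tasks}
print(counts)  # roughly 600 follow / 200 serve / 200 guide
```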
Dataset Scale and Coverage
The seed release of the dataset contains 1,860 expert-verified samples across 23 task types (see Appendix Table ). By adjusting generation parameters, the same setup can scale to over 100,000 unique samples in under a minute on a standard CPU.
Each sample references one or more elements from the dataset’s semantic ontology and exhibits natural linguistic variation in verb choice, object type, and phrasing (e.g., “go behind the person wearing yellow shoes” vs. “follow the person wearing blue pants”).
Formal Definition
Formally, the dataset is generated as:
\[D = \bigcup_{t \in T} \bigcup_{g \in G_t} \left\{ \text{Apply}(g, \mathbf{e}) \,\middle|\, \mathbf{e} \in \prod_{j=1}^{n} E[p_j] \right\}\]
Where:
\(T\) is the set of task categories.
\(G_t\) is the set of templates for task \(t\).
\(E[p_j]\) is the list of valid substitutions for placeholder \(p_j\).
\(\mathbf{e} = (e_1, \dots, e_n)\) is a sampled combination from the Cartesian product.
Apply replaces placeholders in \(g\) with \(\mathbf{e}\) to yield a resolved instruction–action pair.
This definition ensures every generated sample is syntactically correct and grounded in valid robotic semantics.
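The set definition transcribes almost directly into code: for each task \(t\) and template \(g\), every combination \(\mathbf{e}\) from the Cartesian product of the placeholders' substitution lists is applied. The tiny dictionaries and single template below are hypothetical:

```python
from itertools import product

# Toy substitution lists E[p_j] and a single-template G_t; both are
# hypothetical illustrations of the formal definition, not dataset content.
E = {
    "object": ["milk", "snack"],
    "room": ["kitchen", "office"],
}
G = {"deliver": ["bring the {object} to the {room}"]}

def apply(g: str, e: dict) -> str:
    """Apply(g, e): replace each placeholder in g with its value from e."""
    return g.format(**e)

D = set()
for t, templates in G.items():          # union over tasks t in T
    for g in templates:                 # union over templates g in G_t
        placeholders = ["object", "room"]   # p_1 ... p_n for this template
        for combo in product(*(E[p] for p in placeholders)):
            D.add(apply(g, dict(zip(placeholders, combo))))

print(sorted(D))  # 2 x 2 = 4 resolved instructions
```

In the real system the inner product is additionally filtered by the semantic constraints and sampled under the task weighting vector \(w\), rather than enumerated exhaustively.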
RDMM Dataset Evaluation and Use Case Validation
We conducted both quantitative evaluation on the D-RDMM dataset and real-world deployment to validate its effectiveness as a training resource for robotic decision-making models. While detailed experiments, training procedures, and benchmarking results of D-RDMM-trained language models are presented in a separate research paper [21], this section summarizes the key findings relevant to the dataset’s quality, usability, and practical impact.
Model Training Setup
To validate the usability of D-RDMM, we fine-tuned three publicly available large language models: LLaMA3-8B[22], Mistral-7B[23], and Qwen-0.5[24]. These models were trained end-to-end on the instruction-action pairs generated by D-RDMM across all task types. The models learned to map natural language commands to structured robot control sequences using only text-based input and output.
We additionally evaluated two prompting-based baselines — ChatGPT-4o and ChatGPT-4o-mini — using a 20-shot setup with representative D-RDMM samples. While these models demonstrate strong general language ability, they lack task-specific grounding and structured output alignment.
Dataset-Level Accuracy
We evaluated the D-RDMM-trained models on a held-out subset of the dataset. Accuracy was computed based on exact match between predicted output and ground truth action sequence. As shown in Fig., all three D-RDMM-trained models achieved consistently high accuracy across the dataset, validating the dataset’s utility in teaching complex task reasoning and structured robotic behaviors.
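The exact-match criterion above is strict: a prediction scores only if the entire action sequence matches the ground truth verbatim. A minimal sketch, with illustrative sample pairs rather than actual dataset entries:

```python
# Exact-match accuracy: a prediction counts as correct only if the full
# predicted action sequence equals the ground-truth sequence verbatim.
def exact_match_accuracy(predictions, references):
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

# Illustrative pairs (hypothetical, not taken from the dataset):
preds = ["MoveTo(kitchen), Pickup(milk)", "Respond('hello')"]
refs  = ["MoveTo(kitchen), Pickup(milk)", "Respond('hi')"]
print(exact_match_accuracy(preds, refs))  # 0.5
```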
Real-World Deployment
Beyond offline evaluation, we deployed D-RDMM-trained models on a physical robotic platform at the RoboCup@Home competition, where the robot was tasked with executing natural language instructions across a variety of household scenarios. The model-controlled system demonstrated reliable performance in person-following, object delivery, navigation, and multi-step planning tasks — even when interacting with previously unseen entities and descriptions.
Although systematic real-world metrics such as task success rate or response latency were not formally recorded, the robot was able to interpret natural instructions and complete task sequences effectively in a live, unstructured environment. As illustrated in Fig. and Fig. , D-RDMM-trained models executed complex sequential behaviors such as breakfast preparation and grocery tidying in live competition settings.
Conclusion
We introduced the Dynamic RDMM Dataset, a controllable, scalable, and semantically grounded data generation framework for training language models to perform robotic decision making. Each data sample consists of a natural-language instruction paired with a structured symbolic action program, enabling precise instruction-following in domestic settings.
The dataset is constructed through a two-stage pipeline—hierarchical template expansion and dynamic content generation—that produces over 100,000 unique, task-aligned instruction–action pairs from a compact set of expert-defined templates. D-RDMM supports programmable control over dataset size, task balance, and linguistic diversity, making it a flexible tool for robot learning pipelines, ablation studies, and curriculum-driven training.
We validated the dataset by training multiple open-source LLMs, achieving high accuracy and robust real-world performance in the RoboCup@Home competition. By treating dataset generation as a dynamic process rather than a static artifact, D-RDMM transforms data curation into a tunable component of robot learning, accelerating experimentation and deployment.
We release all templates, code, and seed samples to enable reproducible research and further development of instruction-grounded robotics systems.
Task Descriptions, Ratios, and Actions
This appendix summarizes the task types and symbolic actions used in the RoboCup@Home deployment. Table lists the 23 instruction templates along with their generation ratios. Table provides a description of the action primitives used in the competition.
| Task | Description | Amount |
|---|---|---|
| follow | Follow person to location | 48 |
| pour | Pour object into container | 47 |
| bringdesc | Bring object from location | 134 |
| complex_pose | Recognizing human posture | 105 |
| complex_countobj | Count object at location | 64 |
| countobj | Count object at location | 79 |
| descper | Describe person | 62 |
| 2users | Answer the second user’s question | 113 |
| goBeaconDoSth | Go to place and do something | 111 |
| serve | Put object onto a designated spot | 62 |
| guide | Guide person to location | 106 |
| store | Put object into storage | 37 |
| complex_put_on | Bring, put on object, and answer | 79 |
| complex_deliver | Deliver object and answer | 150 |
| mgreet | Greet person and answer | 110 |
| descobj | Describe object and answer | 77 |
| simple | Simple action and answer | 18 |
| complex_est | Identify extreme attributes | 37 |
| complex_greetdress | Greet person by outfit and answer | 139 |
| complex_countperson | Count people by outfit and answer | 49 |
| complex_guidedress | Guide person by outfit and answer | 179 |
| questions | Answer a simple question | 42 |
| time | Tell the time | 12 |
| Total | | 1860 |
| Actions | Description |
|---|---|
| Respond | Respond to the user |
| Move_To | Move to a location |
| Pour_In | Pour an object into a container |
| Search_Object | Search for an object |
| Search_Person | Search for a person |
| Pickup | Pick up an object |
| Place_On | Place the picked-up object |
| Place_Next | Place the picked-up object next to a reference object |
| Give_To | Give the object to the user |
| Open | Open the door |
| Close | Close the door |
| Vision_Ask | Ask the vision system and return the answer to Answer |
| Answer | Receive the answer from Vision_Ask, Count_Person, or Count_Object |
| Follow | Follow the person |
| New_Request | Listen to a question from second user and answer it |
| Count_Person | Count people with a given attribute and return the answer to Answer |
| Count_Object | Count a specific object and return the answer to Answer |
| Ask_Name | Ask a person for their name and return the answer to Answer |
| What_Time | Tell the time |
| What_Day | Tell the date |
| What_Tomorrow | Tell tomorrow’s date |
References
Authors are with the Faculty of Electrical Engineering, Pusan National University, Busan, South Korea. seungjoon.yi@pusan.ac.kr (Corresponding author: Seung-Joon Yi).