Robot Learning Part 1.5: Action Chunking with Transformers (ACT)
- Action chunking predicts sequences of actions instead of single actions, dramatically reducing compounding errors
- ACT combines CVAE + Transformers to achieve 80-90% success with just 50 demonstrations (10 minutes of data)
- Key innovation: Temporal ensembling with receding horizon control enables smooth, precise manipulation
- Performance: Chunk size k=100 improves success from 1% to 44% — action chunking is transformative
- Trade-off: ACT is faster and more data-efficient than Diffusion Policy, but less flexible for highly multimodal behaviors
Why Action Chunking Matters
Imagine teaching a robot to pick up a mug. In traditional behavioral cloning (recall from Part 1), the robot predicts one tiny action at a time — 50 micro-adjustments per second. Each prediction is an opportunity for error. Over a 10-second task, that’s 500 chances to drift off course. Errors compound exponentially, and the robot quickly enters states it never saw during training.
Action chunking solves this elegantly: instead of predicting the next 0.02 seconds of motion, predict the next 2 seconds all at once. The effective decision-making horizon drops from 500 steps to just 5. Compounding errors are reduced by a factor of 100.
This mirrors how humans naturally organize information. We don’t memorize phone numbers as 10 individual digits — we chunk them into groups (e.g., area code + prefix + line). Similarly, skilled pianists don’t think about individual finger movements; they execute entire musical phrases as cohesive units. Action chunking brings this same principle to robot learning.\(^{[3]}\)
This single idea — predicting action sequences instead of single actions — is the foundation of Action Chunking with Transformers (ACT), one of the most influential recent breakthroughs in robot imitation learning.
The Problem with Single-Step Prediction
Compounding Errors in Behavioral Cloning
Standard behavioral cloning treats robot learning as supervised learning: given observation \(o_t\), predict action \(a_t\). Simple, right?
The problem: errors compound over time. Here’s why:
- Distribution shift: Even a tiny prediction error moves the robot slightly off the demonstrated trajectory
- Novel states: The robot enters states not present in the training data
- More errors: Predictions from these novel states are even less accurate
- Exponential divergence: In continuous control, errors grow exponentially with task horizon
For a 10-second task at 50Hz (500 timesteps), this compounds into catastrophic failure.
Think of driving a car. If your steering is off by 1 degree, you barely notice after 10 feet. But after a mile, you’ve drifted into another lane entirely. In robot learning, each prediction is like a tiny steering adjustment. Small errors accumulate until the robot is completely off track — trying to grasp objects that aren’t there or colliding with obstacles.
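To make the arithmetic concrete, here is a deliberately simplified, illustrative sketch (not from the paper): treat each decision as adding a small independent error to the robot's position. Real compounding is worse, because errors also push the robot into states it never saw during training, but even this random-walk model shows why fewer decisions means less drift.

```python
import numpy as np

# Toy 1-D drift model (illustrative only): each *decision* adds a small
# independent error, so fewer decisions per task means less accumulated drift.
rng = np.random.default_rng(0)

def accumulated_drift(num_decisions, per_decision_error=0.01, trials=10_000):
    """Standard deviation of final position error after `num_decisions` noisy decisions."""
    errors = rng.normal(0.0, per_decision_error, size=(trials, num_decisions))
    return errors.sum(axis=1).std()

print(accumulated_drift(500))  # single-step policy: 500 decisions -> ~0.22
print(accumulated_drift(5))    # chunked policy (k=100): 5 decisions -> ~0.02
```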
The Non-Markovian Problem
There’s a deeper issue: human demonstrations aren’t perfectly Markovian. Consider a human pausing mid-motion before continuing. The observation is identical during the pause and when movement resumes, yet the action should be different.
Single-step prediction cannot solve this — it’s theoretically impossible to map identical observations to different actions. The policy needs temporal context: “I’ve been stationary for 1 second, time to move” versus “I just started the pause.”
Temporal Dependencies Matter
Robot manipulation is inherently sequential:
- Reaching: Smooth acceleration and deceleration phases
- Grasping: Coordinated finger closure over time
- Placement: Gradual force application to avoid dropping objects
Single-step policies must learn these temporal patterns implicitly through recurrent architectures or history stacking. Action chunking makes temporal structure explicit by predicting entire motion sequences.
Action Chunking: The Core Idea
From Points to Trajectories
Action chunking changes the learning objective from predicting individual actions to predicting sequences:
\[ \begin{aligned} \text{Standard BC:} \quad & \pi(a_t | o_t) && \text{(predict single action)} \\ \text{Action Chunking:} \quad & \pi(a_{t:t+k} | o_t) && \text{(predict k-step sequence)} \end{aligned} \tag{1}\]
At timestep \(t\), the policy outputs \(k\) future actions: \([a_t, a_{t+1}, ..., a_{t+k-1}]\).
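As a shape-level sketch (not the ACT architecture), the only interface change from standard behavioral cloning is the output: one observation in, a k-step action sequence out. The observation dimension and network below are illustrative placeholders.

```python
import torch
import torch.nn as nn

class ChunkedPolicy(nn.Module):
    """Minimal chunked policy: maps one observation to k future actions."""
    def __init__(self, obs_dim=512, action_dim=14, chunk_size=100):
        super().__init__()
        self.chunk_size, self.action_dim = chunk_size, action_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 1024), nn.ReLU(),
            nn.Linear(1024, chunk_size * action_dim),
        )

    def forward(self, obs):                  # obs: (batch, obs_dim)
        out = self.net(obs)                  # (batch, k * action_dim)
        return out.view(-1, self.chunk_size, self.action_dim)  # (batch, k, action_dim)
```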
How Horizon Reduction Works
Consider a 10-second manipulation task:
| Approach | Control Freq | Total Steps | Decisions | Error Opportunities |
|---|---|---|---|---|
| Single-step | 50Hz | 500 | 500 | 500 |
| Action chunking (k=100) | 50Hz | 500 | 5 | 5 |
Key insight: The robot still executes 500 actions, but makes only 5 high-level decisions instead of 500 micro-decisions. Compounding errors are reduced by 100×.
Receding Horizon Control
Action chunking implements a form of receding horizon control (also called Model Predictive Control):
- Plan: Generate a sequence of k future actions
- Execute: Perform only the first few actions
- Replan: Get a new observation and generate a new sequence
- Repeat: The planning horizon continuously “recedes” forward
This balances planning (looking ahead) with reactivity (incorporating new observations).
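A hedged pseudocode sketch of this loop is shown below; `policy.predict_chunk`, `env`, and `execute_steps` are hypothetical stand-ins rather than any particular API.

```python
def receding_horizon_control(policy, env, execute_steps=10, max_steps=500):
    """Plan a chunk, execute only its first few actions, then replan."""
    obs = env.reset()
    step = 0
    while step < max_steps:
        plan = policy.predict_chunk(obs)        # 1. plan: k future actions
        for action in plan[:execute_steps]:     # 2. execute only the first few
            obs = env.step(action)
            step += 1
        # 3. replan from the latest observation on the next loop iteration
    return obs
```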
When you drive, you don’t just focus on the next foot of road — you look ahead several car lengths to plan your path. But you don’t execute your entire plan rigidly. As you move forward and gain new information (a car changing lanes, brake lights ahead), you continuously update your plan. The “horizon” of your planning window moves forward with you, always looking ahead but constantly adapting.\(^{[4]}\)
ACT Architecture
Action Chunking with Transformers (ACT) was introduced in the 2023 paper “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” by Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn.\(^{[1]}\) It combines action chunking with a Conditional Variational Autoencoder (CVAE) and Transformer architecture adapted from DETR (DEtection TRansformer).\(^{[5]}\)
System Overview
As shown in Figure 2, ACT consists of three main components:
- Vision Backbone: ResNet-18 processes RGB images from 4 camera views
- CVAE Encoder (training only): Encodes demonstrated action sequences into latent variable \(z\)
- Policy Decoder: Transformer generates k-step action sequences conditioned on images, joint positions, and \(z\)
The Role of CVAE
Why use a Variational Autoencoder instead of simple supervised learning?
The Multimodality Problem
Recall from Part 1 that human demonstrations exhibit natural variability:
- Task strategies: Different valid approaches to the same task
- Execution speed: Hurried vs. careful movements
- Motion style: Direct paths vs. cautious, curved trajectories
Standard supervised learning averages these variations, often producing infeasible “middle ground” actions. The CVAE solves this by modeling the distribution of valid action sequences.
Think of the latent variable \(z\) as capturing how you perform a task, separate from what task you’re doing. When placing a mug, \(z\) might encode:
- Approach from left vs. right
- Fast, confident motion vs. slow, careful placement
- High arc trajectory vs. low, direct path
During training, the CVAE learns that all these styles are valid. During inference, setting \(z=0\) (the prior mean) produces the “average” style that works robustly.\(^{[6]}\)
How CVAE Works
Training:
The encoder takes the current observation \(o_t\) and the actual demonstrated action sequence \(a_{t:t+k}\) and compresses them into a latent variable:
\[z \sim q_\phi(z | o_t, a_{t:t+k}) = \mathcal{N}(\mu_\phi(o_t, a_{t:t+k}), \sigma_\phi^2(o_t, a_{t:t+k})) \tag{2}\]
To enable backpropagation through this stochastic sampling, we use the reparameterization trick:
\[z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) \tag{3}\]
This separates the randomness (\(\epsilon\), which has no learnable parameters) from the deterministic transformations of \(\mu\) and \(\sigma\) (which can be optimized via gradients).
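A minimal PyTorch sketch of Equation 3, assuming the encoder parameterizes the variance as a log-variance (a common convention, not necessarily the official code):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps, keeping the randomness in eps so gradients flow through mu and log_var."""
    std = torch.exp(0.5 * log_var)   # sigma recovered from the log-variance
    eps = torch.randn_like(std)      # eps ~ N(0, I), no learnable parameters
    return mu + std * eps
```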
The decoder takes \(z\) and the current observation and predicts the action sequence:
\[a_{t:t+k} \sim p_\theta(a_{t:t+k} | o_t, z) \tag{4}\]
The training loss has two terms:
\[\mathcal{L} = \mathcal{L}_{\text{reconst}} + \beta \cdot \mathcal{L}_{\text{KL}} \tag{5}\]
where:
- Reconstruction loss (\(L_1\) norm): \(\mathcal{L}_{\text{reconst}} = \|a_{t:t+k} - \hat{a}_{t:t+k}\|_1\) ensures accurate action prediction
- KL divergence: \(\mathcal{L}_{\text{KL}} = D_{KL}(q_\phi(z|o_t, a_{t:t+k}) \| \mathcal{N}(0, I))\) regularizes the latent distribution toward a standard Gaussian prior
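A sketch of this two-term objective, assuming the encoder outputs (mu, log_var) and the decoder outputs \(\hat{a}\) of shape (batch, k, action_dim); this is illustrative rather than the official implementation:

```python
import torch
import torch.nn.functional as F

def act_loss(a_hat, a_target, mu, log_var, kl_weight=10.0):
    """L1 reconstruction plus beta-weighted KL to a standard Gaussian prior (Eq. 5)."""
    reconst = F.l1_loss(a_hat, a_target)
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
    kl = (-0.5 * (1 + log_var - mu.pow(2) - log_var.exp())).sum(dim=-1).mean()
    return reconst + kl_weight * kl
```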
Inference:
At deployment, the encoder is not used (we don’t have ground-truth actions). Instead:
- Sample \(z \sim \mathcal{N}(0, 1)\) from the prior (or simply set \(z=0\))
- Pass \(z\) and current observation through the decoder
- Get the k-step action sequence
At inference, ACT actually uses a deterministic \(z=0\), not random samples. The CVAE’s role is during training — it helps the decoder learn to generate valid actions despite variability in demonstrations. By forcing the encoder to compress diverse demonstrations into a well-structured latent space, the decoder becomes robust to demonstration noise. At test time, \(z=0\) produces reliable, consistent behavior.\(^{[6]}\)
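A minimal inference-time sketch under these assumptions; `decoder` and `latent_dim` stand in for ACT's transformer decoder and latent size, which are configuration-dependent:

```python
import torch

@torch.no_grad()
def predict_chunk(decoder, images, joints, latent_dim=32):
    """Deterministic inference: fix z at the prior mean and decode a k-step chunk."""
    z = torch.zeros(images.shape[0], latent_dim)  # z = 0, the prior mean
    return decoder(images, joints, z)             # (batch, k, action_dim)
```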
What Does z Represent?
The latent variable \(z\) captures stylistic variations in execution:
- Execution speed (fast vs. slow)
- Movement smoothness (jerky vs. fluid)
- Force profiles (aggressive vs. gentle)
- Task decomposition strategies
By sampling different \(z\) values during training, the model learns to generate diverse but valid behaviors. During inference, using \(z=0\) (the prior mean) produces the “average” execution style.
Transformer Architecture
ACT adapts the DETR (DEtection TRansformer) architecture\(^{[5]}\) for action prediction:
Encoder:
- 4 layers, 8 attention heads, 512 hidden dimensions
- Input: concatenated vision features (1200×512 from 4 cameras), joint positions (512), and \(z\) (512)
- Output: fused multi-modal representation (1202×512)
Decoder:
- 7 layers, 8 attention heads, 512 hidden dimensions
- Uses fixed sinusoidal positional embeddings as queries
- Cross-attends to the encoder output
- Output: k×14 tensor (k timesteps × 14 joint positions for the bimanual robot)
Total parameters: ~80 million (lightweight compared to VLA models with billions)
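A rough, shape-level PyTorch sketch of the decoder side using the numbers above; it is not the official implementation (for brevity, the fixed sinusoidal queries are replaced by frozen random ones):

```python
import torch
import torch.nn as nn

class ChunkDecoder(nn.Module):
    """Positional queries cross-attend to fused encoder tokens, one query per future timestep."""
    def __init__(self, hidden_dim=512, nheads=8, num_layers=7, chunk_size=100, action_dim=14):
        super().__init__()
        layer = nn.TransformerDecoderLayer(hidden_dim, nheads, dim_feedforward=3200, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # ACT uses fixed sinusoidal embeddings; frozen random queries keep this sketch short.
        self.queries = nn.Parameter(torch.randn(chunk_size, hidden_dim), requires_grad=False)
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, encoder_tokens):           # encoder_tokens: (batch, 1202, hidden_dim)
        q = self.queries.expand(encoder_tokens.shape[0], -1, -1)
        return self.head(self.decoder(q, encoder_tokens))  # (batch, chunk_size, action_dim)
```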
Temporal Ensembling
Here’s where ACT gets particularly clever. Instead of predicting a chunk once every k timesteps, ACT queries the policy at every timestep, producing overlapping action chunks.
These are combined using exponential weighting:
\[a_t = \frac{\sum_{i=0}^{k-1} w_i \cdot A_t[i]}{\sum_{i=0}^{k-1} w_i}, \quad w_i = \exp(-m \cdot i) \tag{6}\]
where:
- \(A_t[i]\) is the predicted action for timestep \(t\) from the \(i\)-th chunk that covers it, indexed so that \(i=0\) is the oldest prediction (the paper's convention)
- \(m\) is a hyperparameter controlling how quickly new predictions are incorporated
- A smaller \(m\) means new observations are incorporated faster

Why this works: Imagine multiple painters sketching the same portrait from slightly different angles. Each sketch captures the subject differently, but averaging them reduces individual errors while preserving essential features. Temporal ensembling does the same for actions — multiple predictions for the same timestep are blended, with older predictions weighted slightly higher so the robot commits to smooth motion while still folding in new observations.
Benefits:
- Smoothness: Averages out prediction noise without introducing bias
- Consistency: Reduces the impact of modeling errors from any single prediction
- Performance gain: 3-4% improvement over no ensembling\(^{[1]}\)
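A small sketch of Equation 6, assuming you keep, for the current timestep, the predictions from every chunk that still covers it, ordered oldest first; the buffer layout is my own, not the official code:

```python
import numpy as np

def ensemble_action(predictions_for_t, m=0.01):
    """predictions_for_t: list of action vectors for the current timestep, oldest chunk first."""
    preds = np.stack(predictions_for_t)              # (num_chunks, action_dim)
    weights = np.exp(-m * np.arange(len(preds)))     # w_i = exp(-m * i), w_0 = oldest
    return (weights[:, None] * preds).sum(axis=0) / weights.sum()
```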
Training ACT
Data Requirements
ACT is remarkably data-efficient:
- 50 demonstrations per task (8-14 seconds each)
- Total collection time: 10-20 minutes of human demonstrations
- Control frequency: 50Hz
- Action space: 14D absolute joint positions (2 arms × 7 DOF each)
Training Configuration
Key hyperparameters from the official implementation:
chunk_size = 100 # Predict 2 seconds ahead (100 steps at 50Hz)
hidden_dim = 512 # Transformer hidden dimension
num_encoder_layers = 4
num_decoder_layers = 7
nheads = 8 # Multi-head attention heads
dim_feedforward = 3200
kl_weight = 10 # β in the KL divergence term
learning_rate = 1e-5
batch_size = 8
num_epochs = 2000  # 2000-5000; train longer for smoothness

Training Procedure
python imitate_episodes.py \
--task_name sim_transfer_cube_scripted \
--ckpt_dir checkpoints \
--policy_class ACT \
--kl_weight 10 \
--chunk_size 100 \
--hidden_dim 512 \
--batch_size 8 \
--num_epochs 2000 \
--lr 1e-5

Training time: ~5 hours on an RTX 2080 Ti
Inference time: 0.01 seconds per forward pass (supports 50Hz control)
Critical Training Tips
From the official repository:
“If your ACT policy is jerky or pauses in the middle of an episode, just train for longer! Success rate and smoothness can improve way after loss plateaus.”
Practical guidelines:
- Train for 5000+ epochs for real-world tasks (not just until the loss plateaus)
- Use L1 loss (not L2) for precise positioning tasks — L1 provides better gradients for exact alignment
- Set the KL weight to β=10 (balances reconstruction vs. regularization)
- The CVAE is essential for human data (success drops from 35% to 2% without it)\(^{[1]}\)
One of the most important practical findings: smoothness and success rate continue improving for thousands of epochs after the reconstruction loss has plateaued. The policy is learning subtle temporal coordination that doesn’t show up in per-timestep error metrics. Always train longer than you think you need.
Performance Results
Benchmark Tasks
ACT was evaluated on 6 real-world fine manipulation tasks:
- Threading cable ties through small holes
- Slotting batteries with millimeter precision
- Opening translucent condiment cups (requires pinching, prying, tearing)
- Inserting power cables
- Sliding Ziploc bags closed
- Wiping spills with bimanual coordination
Success Rates
| Task | ACT Success | Best Baseline |
|---|---|---|
| Slide Ziploc | 88% | 27% |
| Slot Battery | 96% | 2% |
| Open Cup | 84% | 0% |
| Thread Ties | 92% | — |
These results represent dramatic improvements over:
- BC-ConvMLP (standard behavioral cloning)
- BeT (Behavior Transformers with history)
- RT-1 (discretized actions)
- VINN (non-parametric retrieval)
Ablation Study Highlights
Impact of chunk size:
| Chunk Size | Success Rate |
|---|---|
| k=1 (single-step) | 1% |
| k=10 | 12% |
| k=50 | 38% |
| k=100 | 44% |
The jump from 1% to 44% shows that action chunking is transformative, not just a marginal improvement. This 44× improvement validates the core insight: predicting sequences dramatically reduces compounding errors.
Impact of CVAE:
- With CVAE: 35.3% success
- Without CVAE: 2% success
Impact of temporal ensembling:
- With ensembling: 38.6%
- Without ensembling: 35.3%
ACT vs Diffusion Policy
Both ACT and Diffusion Policy (which we’ll cover in Part 2) use action chunking, but differ in the generative model:
| Aspect | ACT | Diffusion Policy |
|---|---|---|
| Generative Model | Conditional VAE | Diffusion Model |
| Inference | 1 forward pass | 16-100 denoising steps |
| Speed | Fast (0.01s) | Slower (0.16s) |
| Data Efficiency | High (50 demos) | Moderate (more data needed) |
| Multimodality | Good (but unimodal at inference) | Excellent |
| Training Time | Fast (5 hours) | Slower |
| Parameters | ~80M | Similar |
| Consistency | High (rigid adherence to demos) | Flexible, adaptive |
When to Choose Each
Use ACT when:
- Fast training and inference are critical
- Data is limited (50-100 demonstrations)
- Tasks require consistent, repeatable execution
- Precision bimanual manipulation is needed
Use Diffusion Policy when:
- Highly multimodal behavior is needed
- Tasks involve extensive repetitive motions
- More training data is available
- Flexibility and adaptability matter more than speed
Example: ACT sometimes “freezes” mid-task on repetitive up-down motions, while Diffusion Policy handles these smoothly. However, ACT trains faster and requires less data.\(^{[2]}\)
The ALOHA System
ACT was developed as part of the ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) project, which makes bimanual manipulation research accessible.
Hardware Specifications
- Cost: <$20,000 per system (each arm ~$5,000)
- Tracking error: <2cm (average 0.68cm)
- Control frequency: 50Hz
- Cameras: 4 RGB at 480×640 resolution
- DOF: 14 total (2 arms × 7 DOF each)
Mobile ALOHA Results
The system extends to mobile manipulation with Mobile ALOHA, where co-training on existing static ALOHA demonstrations alongside new mobile demonstrations improves performance on the mobile tasks.
Co-training gains:
- Average improvement: 34% across tasks
- Individual task boosts: +45%, +20%, +80%, +95%, +80%
Mobile tasks (>80% success):
- Sautéing and serving shrimp
- Opening two-door wall cabinets
- Lifting a wine glass while wiping with the other hand
- Multi-step cooking sequences
Using ACT with LeRobot
The easiest way to get started with ACT is through LeRobot, Hugging Face’s robot learning framework.
Installation
pip install lerobot

Training
lerobot-train \
--dataset.repo_id=${HF_USER}/your_dataset \
--policy.type=act \
--output_dir=outputs/train/act_your_dataset \
--policy.device=cuda \
--wandb.enable=true \
--policy.repo_id=${HF_USER}/act_policy

Pre-trained Models
LeRobot provides pre-trained ACT policies on the Hugging Face Hub:
- lerobot/act_aloha_sim_insertion_human
- lerobot/act_aloha_sim_transfer_cube_human
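A hedged sketch of loading one of these checkpoints in Python; the import path and observation keys vary across LeRobot versions and datasets, so treat both as assumptions and check the documentation for your installed version.

```python
import torch
# Import path is version-dependent (older releases use lerobot.common.policies.act.modeling_act)
from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("lerobot/act_aloha_sim_insertion_human")
policy.eval()

# Placeholder batch; the camera key and state dimension depend on the dataset the policy was trained on
batch = {
    "observation.state": torch.zeros(1, 14),
    "observation.images.top": torch.zeros(1, 3, 480, 640),
}
with torch.no_grad():
    action = policy.select_action(batch)  # one action per call; chunking is handled internally
```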
Limitations and Future Work
Current limitations:
- Repetitive motions: Can freeze mid-task on cyclic movements
- Multimodal scenarios: Inference uses a single \(z\) sample (unimodal)
- Per-task training: The original ACT is not multi-task (though extensions exist)
Recent extensions:
- Bi-ACT: Incorporates force/torque feedback for contact-rich tasks
- NL-ACT: Adds natural language conditioning for instruction following
- One ACT Play: Learns from single demonstrations
- 3D-ACT: Uses point cloud inputs instead of RGB images
What’s Next?
In Part 2, we’ll explore Diffusion Policy in depth — understanding how diffusion models are adapted for action generation, why they handle multimodality better than VAEs, and how they compare to ACT in practice.
Action Chunking with Transformers (ACT) demonstrates that the right abstraction can transform learning efficiency. By predicting action sequences instead of single actions, we gain four key benefits:
Horizon reduction: Compounding errors drop by 100× (from 500 decisions to 5 when k=100). This directly addresses the exponential error growth problem in behavioral cloning.
Temporal coherence: Explicit modeling of motion primitives and temporal dependencies. The policy learns to generate smooth, coordinated motion sequences rather than disconnected micro-actions.
Data efficiency: 80-90% success with just 10 minutes of human demonstrations. The CVAE framework handles demonstration variability, while temporal ensembling ensures smooth execution.
Practical deployment: Fast inference (0.01s per forward pass) enables real-time 50Hz control on consumer hardware.
The combination of action chunking (Equation 1), CVAE for handling demonstration variability (Equation 5), and temporal ensembling for smoothness (Equation 6) creates a powerful framework for learning precise manipulation from limited data.
Whether you’re building research systems or deploying real robots, ACT provides a practical, accessible entry point into modern imitation learning. The LeRobot implementation and ALOHA hardware designs make it easier than ever to experiment with these techniques.
Next up: In Part 2, we’ll dive into Diffusion Policy and understand why diffusion models are becoming the gold standard for robot action generation.
References
[1] Zhao, T. Z., et al. “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” Robotics: Science and Systems (RSS), 2023.
[2] Chi, C., et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” Robotics: Science and Systems (RSS), 2023.
[3] Miller, G. A. “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information.” Psychological Review 63(2), 1956.
[4] Camacho, E. F., & Alba, C. B. “Model Predictive Control.” Springer, 2013.
[5] Carion, N., et al. “End-to-End Object Detection with Transformers (DETR).” ECCV, 2020.
[6] Kingma, D. P., & Welling, M. “Auto-Encoding Variational Bayes.” ICLR, 2014.
Recommended Resources
- ALOHA Project Website — Hardware designs and demonstration videos
- ACT Official GitHub Repository — Original implementation
- LeRobot ACT Documentation — Easiest way to get started
- The importance of action chunking in imitation learning — Technical deep dive by Haonan Yu
- Dissecting ACT — Detailed technical breakdown