
Lifting Embodied World Models
for Planning and Control

New York University · BAIR, UC Berkeley

Summary

World models predict future observations conditioned on an input action. For complex embodiments such as a human agent, these actions may require specifying motion for each individual joint, greatly increasing the model's input dimensions. This high dimensionality makes world models difficult to control manually and exponentially more expensive for search-based planning methods like the Cross-Entropy Method (CEM).

Our solution to this problem is to lift the world model to a higher level of abstraction. We train a lightweight policy to map a high-level action to sequences of low-level actions. Composing this policy with a frozen world model creates a Lifted World Model (LWM) that predicts a longer-horizon future observation from a single high-level action. This lifting reduces the dimensionality and length of action sequences without the need to finetune or change the world model.
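To make the composition concrete, here is a minimal sketch, assuming a trained policy that maps one high-level action to T low-level actions and a frozen world model with a per-step predictor (both names are illustrative stand-ins, not the released code):

```python
# Minimal sketch of lifting (illustrative names, not the released code):
# compose a trained high-level policy with a frozen low-level world model.

def lifted_world_model(world_model, policy, obs, pose_ctx, high_level_action):
    """Predict the observation T steps ahead from one high-level action."""
    # The policy expands the high-level action into T low-level joint actions.
    low_level_actions = policy(obs, pose_ctx, high_level_action)  # (T, action_dim)
    # The frozen world model rolls the joint actions out step by step.
    for action in low_level_actions:
        obs = world_model.step(obs, action)  # no finetuning of the world model
    return obs  # predicted observation o_{t+T}
```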

For an egocentric, embodied agent, we introduce waypoints as high-level actions. Each waypoint is a 2D projection of a 3D goal position onto the current observation and is specified for each leaf joint of the embodiment: the pelvis, head, left hand, and right hand. Importantly, waypoints are interpretable, low-dimensional, easy to specify, and easy to search over.

We explore using waypoints as goal-conditioning for our policy compared to using goal observations and find that waypoints better capture the desired goal pose. We also combine our policy with a PEVA world model and perform CEM planning in high-level waypoint space for an embodied human agent. We find that planning with the LWM reduces mean joint error (MJE) by 3.8× compared to planning in low-level joint space. Our method also performs better across CEM compute budgets, longer goal horizons, and longer initial distances to the goal pose. For additional experiments on waypoint visibility, ablations with 3D waypoints and decomposed action generation, and planning with 3D waypoints and in unseen environments, see our paper.

Method

Here we introduce the three components of our method: a high-level waypoint action space for controlling egocentric, embodied agents; a waypoint-conditioned policy that predicts sequences of low-level joint actions; and a Lifted World Model (LWM) for planning and control, formed by combining the policy with a low-level PEVA world model.

High-level Waypoint Actions

Each waypoint is the 2D projection into the current observation o_t of a near-term 3D goal position for one leaf joint of the embodied agent. For the human XSens embodiment, we use one waypoint for each leaf joint (pelvis, head, left hand, right hand), all drawn directly on the current frame. We sample goal poses 8 timesteps (2 seconds) into the future. This reduces the action-space dimensionality from 8 timesteps × 48 joint action dimensions (384 total) for PEVA to just 4 × 2 = 8 dimensions for our high-level Lifted World Model.

During training, ground-truth waypoints are obtained by projecting the goal pose p_g onto observation o_t via forward kinematics and the camera matrix. At test time, waypoints can be specified manually by the user or searched for with CEM.
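As a rough sketch of this step, assuming leaf-joint world positions have already been computed via forward kinematics and a standard pinhole camera model applies (names and conventions here are ours, not the paper's exact code):

```python
import numpy as np

# Illustrative waypoint generation (not the authors' exact code): project
# 3D leaf-joint goal positions onto the current image plane.

def waypoints_from_goal_pose(leaf_joints_3d, world_to_cam, K):
    """leaf_joints_3d: (4, 3) world positions of pelvis, head, left hand,
    right hand (from forward kinematics); world_to_cam: (4, 4) extrinsic;
    K: (3, 3) intrinsics. Returns (4, 2) pixel-space waypoints."""
    pts_h = np.concatenate([leaf_joints_3d, np.ones((4, 1))], axis=1)  # homogeneous
    cam = (world_to_cam @ pts_h.T).T[:, :3]  # camera-frame coordinates
    pix = (K @ cam.T).T                      # perspective projection
    return pix[:, :2] / pix[:, 2:3]          # divide by depth -> 2D waypoints
```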

Fig. 2: Specifying goals using waypoints. Leaf joints from goal pose p_g are drawn as waypoints on the current observation o_t to create the annotated frame o_t^ann. Note how the goal pose p_g is not obvious from the goal observation o_g.
Fig. 3: Waypoint generation for training. Ground-truth waypoints are obtained by projecting leaf joints of the goal pose p_g (back left) onto observation o_t, seen from the current pose p_t (front right).

Goal-Conditioned Policy

Conditioned on a high-level waypoint action, the policy uses image and pose context, together with the relative spacing of the waypoints, to infer a 3D goal pose and generate a meaningful sequence of T = 8 low-level joint actions that move towards it. During training, we randomly mask waypoints to teach the policy to infer reasonable goals for the unconditioned leaf joints. This improves performance and allows the policy to handle sparse goals at inference time, for example, specifying only a pelvis waypoint for navigation.
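A minimal sketch of the masking, assuming waypoints are stored as a (batch, 4, 2) tensor of pixel coordinates and a fixed sentinel value marks dropped waypoints (the masking probability and sentinel are illustrative choices, not the paper's values):

```python
import torch

def mask_waypoints(waypoints, p=0.3, mask_value=-1.0):
    """Randomly drop leaf-joint waypoints during training.

    waypoints: (B, 4, 2) pixel coordinates for pelvis, head, left hand,
    right hand. Each waypoint is dropped independently with probability p
    (p and the sentinel mask_value are assumptions for illustration).
    """
    keep = torch.rand(waypoints.shape[:2], device=waypoints.device) > p  # (B, 4)
    masked = torch.where(keep.unsqueeze(-1), waypoints,
                         torch.full_like(waypoints, mask_value))
    return masked, keep  # keep can be fed to the policy as a visibility flag
```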

Fig. 4: LWM rollouts. Given waypoints drawn onto the current observation, the policy generates a sequence of joint actions and the world model rolls them out, generating a sequence of eight observations.

Lifted World Model and Search

Composing the policy with the frozen low-level world model yields an LWM, which predicts o_{t+T} from o_t, pose context, and a single high-level action. With the LWM we can run CEM planning in high-level action space by sampling waypoints, simulating the predicted goal observation, and updating the waypoint prior using the best-performing trajectories.
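A hedged sketch of the search loop, reusing the lifted_world_model sketch above; the Gaussian prior, elite count, and the cost function scoring the predicted observation against the goal are our assumptions, while the 6 iterations × 64 samples match the budget reported in Table 4:

```python
import numpy as np

def lifted_cem(lwm, obs, pose_ctx, cost_fn, iters=6, n_samples=64, n_elite=8):
    """CEM over the 8-d waypoint space: 4 leaf joints x (u, v) coordinates."""
    mu, sigma = np.zeros(8), np.ones(8)  # Gaussian prior over waypoint actions
    for _ in range(iters):
        actions = np.random.randn(n_samples, 8) * sigma + mu
        # Score each sampled waypoint action by simulating its outcome.
        costs = np.array([cost_fn(lwm(obs, pose_ctx, a.reshape(4, 2)))
                          for a in actions])
        elites = actions[np.argsort(costs)[:n_elite]]  # lowest-cost samples
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu.reshape(4, 2)  # best high-level waypoint action found
```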

Fig. 5: CEM Planning with the LWM. The action prior, sampling, and distribution updates all occur in the high-level action space (green). The policy translates the selected high-level action to joint actions for execution. Planning with the low-level world model directly searches in the 48-dimensional joint space (blue).

Experimental results

We evaluate on the Nymeria dataset of egocentric human motion. Performance is measured by mean joint error (MJE) in meters between the final predicted pose and the ground-truth goal pose. MJE is reported for leaf joints, intermediate joints, and all joints together.
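For reference, MJE is simply the mean Euclidean distance over a set of joints; a short sketch (the joint-index grouping is our assumption for illustration):

```python
import numpy as np

def mean_joint_error(pred_pos, gt_pos, joint_idx=None):
    """Mean Euclidean distance in meters between predicted and ground-truth
    3D joint positions, both (J, 3); joint_idx selects leaf, intermediate,
    or all joints (the index grouping is an assumption)."""
    if joint_idx is not None:
        pred_pos, gt_pos = pred_pos[joint_idx], gt_pos[joint_idx]
    return np.linalg.norm(pred_pos - gt_pos, axis=-1).mean()
```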

Goal-Conditioned Policy

We evaluate the policy on both unconditioned (no goal information) and goal-conditioned action generation. Conditioning on a goal observation o_g barely improves MJE: the NoMaD-style baseline reduces all-MJE by only 1.3 cm over unconditioned generation. In comparison, waypoints are an effective goal-conditioning signal, improving goal-conditioned all-MJE by 8.8 cm over unconditioned generation. Waypoint masking further improves goal-conditioned performance by teaching the model to infer a goal position when a waypoint is not visible. Moving forward, we use the waypoint-masking model because lifting the world model relies on strong waypoint goal-conditioning.

Model                           | Unconditioned MJE   | Goal-conditioned MJE
                                | Leaf   Int.   All   | Leaf   Int.   All
Initial distance (no action)    | 0.445  0.419  0.426 | 0.445  0.419  0.426
Random weights (untrained net)  | 0.749  0.707  0.718 | 0.735  0.692  0.703
Base policy (NoMaD)             | 0.427  0.397  0.405 | 0.414  0.384  0.392
+ architecture                  | 0.406  0.375  0.384 | 0.388  0.359  0.367
+ pose context                  | 0.359  0.329  0.337 | 0.343  0.316  0.323
+ waypoint conditioning         | 0.353  0.323  0.331 | 0.262  0.236  0.243
+ waypoint masking              | 0.437  0.408  0.415 | 0.243  0.219  0.226
Table 1: Policy ablations. Each row adds a component to the base diffusion policy; Nymeria validation set, MJE in meters.

Qualitatively, the policy generalizes well to counterfactual waypoints not present in the data. It is also context-sensitive: given the same waypoints, the policy can use image and pose context to generate sensible actions in different scenarios.

[Rollout examples: Ground truth · Move to the right · Climbing onto the bench]
Fig. 6: Counterfactual waypoints. Given counterfactual waypoints not present in the data, the policy generates plausible actions over eight rollout steps.
[Rollout examples: Grasping a pot · Open hallway]
Fig. 7: Contextual generation. Given the same waypoints, the policy produces different actions depending on the context. For example, grasping a pot in front of a stove vs. walking forward in an open hallway.

Lifted World Model Planning

We perform search-based planning using our LWM. We sample 128 hybrid navigation + interaction tasks from Nymeria and compare direct CEM in the joint action space (PEVA CEM) against CEM in the high-level waypoint space using our Lifted World Model (Lifted CEM). PEVA CEM reduces all-MJE by only 8.8 cm, while Lifted CEM reduces it by 33 cm.

Method                        | Planning MJE
                              | Leaf   Int.   All
Initial distance (no action)  | 0.724  0.697  0.704
Unconditioned policy          | 0.677  0.641  0.650
Image-cond. policy            | 0.605  0.578  0.585
PEVA CEM (joint space)        | 0.637  0.608  0.616
Lifted CEM (ours)             | 0.411  0.359  0.374
Table 4: Search-based planning. 6 CEM iterations, 64 samples per iteration. MJE in meters. Lifted CEM in the 8-d waypoint space substantially outperforms direct CEM in the 48-d joint space.
[Rollout panels: Ground truth · Lifted World Model · Goal]
Fig. 8: Planning rollouts. Action sequences and observations generated from waypoints found with CEM planning (middle), shown alongside ground-truth rollouts (left) and the goal observation (right).

Lifted CEM also scales better with compute: it outperforms PEVA CEM at every CEM-iteration and sample budget tested. PEVA CEM in joint space gets worse after the first CEM iteration, illustrating how naive search struggles in high-dimensional action spaces.

Fig. 9: Compute scaling. Cumulative-min MJE over CEM iterations at varying samples per iteration (n = 8, 16, 64). Searching in lifted action space scales better than in joint space.

Varying Goal Horizons

We compare Lifted CEM and PEVA CEM when goals are sampled from varying time horizons and at varying distances. PEVA CEM is allowed to vary its plan length to match the horizon; Lifted CEM predicts only 8 timesteps, matching the pretrained policy's training horizon. We find that Lifted CEM performs better on goals sampled from longer time horizons and when the goal pose is further away.

Fig. 10: Goal horizon. MJE vs. goal temporal distance (timesteps). The vertical line marks T = 8, the policy's training horizon. Lifted CEM beats PEVA CEM at every horizon, including those well past the training horizon.
Fig. 11: Initial distance. MJE vs. goal spatial distance (separated into 20 quantile buckets). The dashed y=x line is the baseline when no action is taken. Lifted CEM consistently outperforms PEVA CEM while both methods struggle at initial MJE ≤ 0.3 m.

Takeaway

We present a method for lifting low-level world models to a higher level of abstraction by training a lightweight policy that maps high-level actions to sequences of low-level actions. We instantiate it for a human-like embodiment using a PEVA world model and a policy conditioned on high-level waypoints. Our approach is lightweight, requires no additional data, and works with off-the-shelf world models. Search-based planning with Lifted CEM reduces mean joint error by 3.8× compared to planning in low-level joint space, and maintains its advantage across compute budgets and goal horizons. We believe that waypoints are one effective action space for egocentric, embodied agents, and that, moving forward, the action spaces of world models will define not only what they learn but how we use them.

BibTeX

@article{wang2026lifting,
  title   = {Lifting Embodied World Models for Planning and Control},
  author  = {Wang, Alex N. and Darrell, Trevor and Izmailov, Pavel
             and Bai, Yutong and Bar, Amir},
  journal = {arXiv preprint arXiv:2604.26182},
  year    = {2026}
}