Observation
See our colab tutorial for how to customize cameras.
Observation mode
The observation mode defines the observation space.
All ManiSkill2 environments take the observation mode (obs_mode
) as one of the input arguments of __init__
.
In general, the observation is organized as a dictionary (with an observation space of gym.spaces.Dict
).
There are two raw observations modes: state_dict
(privileged states) and image
(raw visual observations without postprocessing). state
is a flat version of state_dict
. rgbd
and pointcloud
apply post-processing on image
.
state_dict
The observation is a dictionary of states. It usually contains privileged information such as object poses. It is not supported for soft-body environments.
agent
: robot proprioceptionqpos
: [nq], current joint positions. nq is the degree of freedom.qvel
: [nq], current joint velocitiesbase_pose
: [7], robot position (xyz) and quaternion (wxyz) in the world framecontroller
: controller states depending on the used controller. Usually an empty dict.
extra
: a dictionary of task-specific information, e.g., goal position, end-effector position.
state
It is a flat version of state_dict. The observation space is gym.spaces.Box
.
image
In addition to agent
and extra
, image
and camera_param
are introduced.
image: RGB, depth, and other images taken by cameras
{camera_uid}
:Color
: [H, W, 4],np.float32
. RGBA.Position
: [H, W, 4],np.float32
. The first 3 dimensions stand for (x, y, z) coordinates in the OpenGL/Blender convension. The unit is meter.
camera_param
: camera parameterscam2world_gl
: [4, 4], transformation from the camera frame to the world frame (OpenGL/Blender convention)extrinsic_cv
: [4, 4], camera extrinsic (OpenCV convention)intrinsic_cv
: [3, 3], camera intrinsic (OpenCV convention)
Unless specified otherwise, there are two cameras: base_camera (fixed relative to the robot base) and hand_camera (mounted on the robot hand). Environments migrated from ManiSkill1 use 3 cameras mounted above the robot: overhead_camera_{i}.
rgbd
We postprocess the raw image observation to obtain RGB-D images.
image
: RGB, depth, and other images taken by cameras{camera_uid}
:rgb
: [H, W, 3],np.uint8
. RGB.depth
: [H, W, 1],np.float32
. 0 stands for an invalid pixel (beyond the camera far).
pointcloud
We postprocess the raw image observation to obtain a point cloud in the world frame. image
is replaced by pointcloud
.
pointcloud
:xyzw
: [N, 4], point cloud fused from all cameras in the world frame. “xyzw” is a homogeneous representation.w=0
for infinite points (beyond the camera far), andw=1
for the rest.rgb
: [N, 3], corresponding colors of the fused point cloud
Note that the point cloud does not contain more information than RGB-D images unless specified otherwise.
+robot_seg
rgbd+robot_seg
or pointcloud+robot_seg
can be used to acquire the segmentation mask of robot links. robot_seg
is appended.
pointcloud+robot_seg
:robot_seg
: [N, 1], a binary mask where 1 for robot and 0 for others.
rgbd+robot_seg
:{camera_uid}
robot_seg
: [N, 1], a binary mask where 1 for robot and 0 for others.
Ground-truth Segmentation
Ground-truth segmentation can be used to generate training data for computer vision, reinforcement learning, and many other applications.
env = gym.make(env_id, camera_cfgs={"add_segmentation": True})
There will be an additional key: Segmentation.
For obs_mode="rgbd"
:
image
:{camera_uid}
Segmentation
: [H, W, 4],np.uint32
. The 1st dimension is mesh-level (part) segmentation. The 2nd dimension is actor-level (object/link) segmentation.
For obs_mode="pointcloud"
:
pointcloud
:Segmentation
: [N, 4],np.uint32
More Details on Mesh and Actor-Level segmentations
An “actor” is a fundamental object that represents a physical entity (rigid body) that can be simulated in SAPIEN (the backend of ManiSkill2). An articulated object is a collection of links interconnected by joints, and each link is also an actor. In SAPIEN, scene.get_all_actors()
will return all the actors that are not links of articulated objects. The examples are the ground, the cube in PickCube, and the YCB objects in PickSingleYCB. scene.get_all_articulations()
will return all the articulations. The examples are the robots, the cabinets in OpenCabinetDoor, and the chairs in PushChair. Below is an example of how to get actors and articulations in SAPIEN.
import sapien.core as sapien
scene: sapien.Scene = ...
actors = scene.get_all_actors() # actors excluding links
articulations = scene.get_all_articulations() # articulations
for articulation in articulations:
links = articulation.get_links() # links of an articulation
In ManiSkill2, our environments provide interfaces to wrap the above SAPIEN functions:
env.get_actors()
: return all task-relevant actors excluding links. Note that some actors might be excluded fromenv._scene.get_all_actors()
.env.get_articulations()
: return all task-relevant articulations. Note that some articulations might be excluded fromenv._scene.get_all_articulations()
.
The segmentation image is a [H, W, 4]
array. The second channel corresponds to the ids of actors. The first channel corresponds to the ids of visual meshes (each actor can consist of multiple visual meshes).
Thus, given the actors, you can use the ids of these actors (actor.id
) to query the actor segmentation to segment out a particular object. For example,
import mani_skill2.envs, gymnasium as gym
import numpy as np
env = gym.make('PickSingleYCB-v0', obs_mode='rgbd', camera_cfgs={'add_segmentation': True})
obs, _ = env.reset()
print(env.get_actors()) # e.g., [Actor(name="ground", id="14"), Actor(name="008_pudding_box", id="15"), Actor(name="goal_site", id="16")]
print([x.name for x in env.get_articulations()]) # ['panda_v2']
# get the actor ids of objects to manipulate; note that objects here are not articulated
target_object_actor_ids = [x.id for x in env.get_actors() if x.name not in ['ground', 'goal_site']]
# get the robot link ids (links are subclass of actors)
robot_links = env.agent.robot.get_links() # e.g., [Actor(name="root", id="1"), Actor(name="root_arm_1_link_1", id="2"), Actor(name="root_arm_1_link_2", id="3"), ...]
robot_link_ids = np.array([x.id for x in robot_links], dtype=np.int32)
# obtain segmentations of the target object(s) and the robot
for camera_name in obs['image'].keys():
seg = obs['image'][camera_name]['Segmentation'] # (H, W, 4); [..., 0] is mesh-level; [..., 1] is actor-level; [..., 2:] is zero (unused)
actor_seg = seg[..., 1]
new_seg = np.zeros_like(actor_seg)
new_seg[np.isin(actor_seg, robot_link_ids)] = 1
for i, target_object_actor_id in enumerate(target_object_actor_ids):
new_seg[np.isin(actor_seg, target_object_actor_id)] = 2 + i
obs['image'][camera_name]['new_seg'] = new_seg
# print(np.unique(new_seg))
However, the actor segmentations do not contain finegrained information on object parts, such as handles and door surfaces of cabinets. In this case, you need to leverage the mesh-level segmentations and the fine-grained visual bodies of actors to obtain the segmentations to e.g., handles and door surfaces. For example,
import mani_skill2.envs, gymnasium as gym
import numpy as np
env = gym.make('OpenCabinetDoor-v1', obs_mode='rgbd', camera_cfgs={'add_segmentation': True})
obs, _ = env.reset()
print(env.get_actors()) # e.g., [Actor(name="ground", id="20"), Actor(name="visual_ground", id="21")], which are not very helpful
print([x.name for x in env.get_articulations()]) # e.g., ['mobile_panda_single_arm', '1017']
# We'd like to obtain fine-grained part segmentations such as handles, so we need to obtain the finegrained visual bodies that correspond to each cabinet link
# get the names and ids of cabinet visual bodies and manually group them based on semantics
cabinet_links = env.cabinet.get_links() # e.g., [Actor(name="base", id="22"), Actor(name="link_1", id="23"), Actor(name="link_0", id="24")]
cabinet_visual_bodies = [x.get_visual_bodies() for x in cabinet_links]
cabinet_visual_body_names = np.concatenate([[b.name for b in cvbs] for cvbs in cabinet_visual_bodies]) # e.g., array(['shelf-10', 'shelf-11', 'frame_horizontal_bar-26', ...])
cabinet_visual_body_ids = np.concatenate([[b.get_visual_id() for b in cvbs] for cvbs in cabinet_visual_bodies]).astype(np.int32) # e.g., array([15, 16, 17, 18, 19, ...])
all_handle_ids = []
all_door_ids = []
all_drawer_ids = []
all_cabinet_rest_ids = []
for i in range(len(cabinet_visual_body_names)):
# print(cabinet_visual_body_names[i], cabinet_visual_body_ids[i])
if 'handle' in cabinet_visual_body_names[i]:
all_handle_ids.append(cabinet_visual_body_ids[i])
elif 'door_surface' in cabinet_visual_body_names[i]:
all_door_ids.append(cabinet_visual_body_ids[i])
elif 'drawer_front' in cabinet_visual_body_names[i]:
all_drawer_ids.append(cabinet_visual_body_ids[i])
else:
all_cabinet_rest_ids.append(cabinet_visual_body_ids[i])
all_handle_ids = np.array(all_handle_ids)
all_door_ids = np.array(all_door_ids)
all_drawer_ids = np.array(all_drawer_ids)
all_cabinet_rest_ids = np.array(all_cabinet_rest_ids)
# get the robot link ids
robot_links = env.agent.robot.get_links() # e.g., [Actor(name="root", id="1"), Actor(name="root_arm_1_link_1", id="2"), Actor(name="root_arm_1_link_2", id="3"), ...]
robot_link_ids = np.array([x.id for x in robot_links], dtype=np.int32)
# get the segmentations of different cabinet parts and the robots
for camera_name in obs['image'].keys():
seg = obs['image'][camera_name]['Segmentation'] # (H, W, 4); [..., 0] is mesh-level; [..., 1] is actor-level; [..., 2:] is zero (unused)
mesh_seg = seg[..., 0]
actor_seg = seg[..., 1]
# visual body ids correspond to mesh-level segmentations
semantic_grouped_seg = np.zeros_like(mesh_seg)
semantic_grouped_seg[np.isin(mesh_seg, all_handle_ids)] = 1
semantic_grouped_seg[np.isin(mesh_seg, all_door_ids)] = 2
semantic_grouped_seg[np.isin(mesh_seg, all_drawer_ids)] = 3
semantic_grouped_seg[np.isin(mesh_seg, all_cabinet_rest_ids)] = 4
# link ids correspond to actor-level segmentations, since a link is a subclass of an actor
semantic_grouped_seg[np.isin(actor_seg, robot_link_ids)] = 5
obs['image'][camera_name]['semantic_grouped_seg'] = semantic_grouped_seg
print(f"Summary of # points in the processed segmentaion map for camera {camera_name}:", [(semantic_grouped_seg == x).sum() for x in range(6)])