Transporter Network

Task : Pick & Place

Contribution

In this work, we show that it is possible to achieve sample efficient end-to-end learning without object-centric representations – enabling the same model architecture to generalize to tasks with unseen objects, variable numbers of objects, deformable objects, and piles of small objects.
We propose a simple model architecture that learns to attend to a local region and predict its spatial displacement, while retaining the spatial structure of the visual input.
Our method uses 3D reconstruction to project visual data onto a spatially consistent representation as input, with which it is able to better exploit equivariance for inductive biases that are present within the geometric symmetries of the data for more efficient learning.

Model Input & Output

Input : visual observations captured by an RGB-D camera, demonstration data

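The visual observation is represented as a regular grid of pixels, where each pixel carries information about a location in the scene at which an object could be picked or placed. During training, the model learns from the demonstration data which object to pick and where to place it.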

Output : pick location (Tpick), place location (Tplace)

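The pick location is determined with a Fully Convolutional Network (FCN): the pixel with the highest action-value function (Qpick) is chosen as the pick location. (Convolution is equivariant to translation, so if the object shifts in the image, the predicted pick location shifts with it.)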
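A minimal sketch of the pick selection step described above, assuming the FCN output is already available as a dense H x W array of Qpick values. The function name `select_pick` and the toy Q map are hypothetical and only illustrate the argmax and the translation property; they are not the paper's code.

```python
from typing import Tuple

import numpy as np


def select_pick(q_pick: np.ndarray) -> Tuple[int, int]:
    """Return the (u, v) pixel with the highest value in a dense H x W Q map."""
    flat_index = int(np.argmax(q_pick))
    u, v = np.unravel_index(flat_index, q_pick.shape)
    return int(u), int(v)


# Toy Q map standing in for an FCN output (values are made up for illustration).
q_map = np.zeros((4, 5))
q_map[2, 3] = 1.0
pick = select_pick(q_map)

# Translation equivariance, illustrated on the output map:
# shifting the map shifts the selected pixel by the same offset.
shifted_map = np.roll(q_map, shift=(1, -2), axis=(0, 1))
pick_shifted = select_pick(shifted_map)
```

Here `pick` is the peak of the original map and `pick_shifted` is the same peak moved by one row down and two columns left.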

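The place location is found by template matching: the visual features of the picked object are transported to the candidate place locations to find the best one. Concretely, a local crop around the pick location is matched against the candidate place locations through cross-correlation.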
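The template matching step can be sketched as follows. This is a simplified illustration under several assumptions: the query and key feature maps are single-channel toy arrays rather than learned deep features, rotations of the crop are not searched, and the helper names (`crop_around`, `cross_correlate`, `select_place`) are hypothetical.

```python
import numpy as np


def crop_around(features, center, half):
    """Cut a (2*half+1) x (2*half+1) crop centred on `center` (assumes it fits)."""
    u, v = center
    return features[u - half:u + half + 1, v - half:v + half + 1]


def cross_correlate(query, key):
    """Dense 'valid' cross-correlation of a small query with a larger key map."""
    qh, qw = query.shape
    out_h = key.shape[0] - qh + 1
    out_w = key.shape[1] - qw + 1
    scores = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            scores[i, j] = np.sum(key[i:i + qh, j:j + qw] * query)
    return scores


def select_place(scores, half):
    """Pick the best-scoring window and report its centre pixel."""
    i, j = np.unravel_index(int(np.argmax(scores)), scores.shape)
    return int(i) + half, int(j) + half


# Toy data: the same 3 x 3 pattern appears around the pick pixel in the
# query map and around the desired place pixel (5, 4) in the key map.
pattern = np.array([[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]])
query_features = np.zeros((8, 8))
query_features[1:4, 1:4] = pattern
key_features = np.zeros((8, 8))
key_features[4:7, 3:6] = pattern

half = 1
crop = crop_around(query_features, center=(2, 2), half=half)
scores = cross_correlate(crop, key_features)
place = select_place(scores, half)
```

The score map peaks where the crop overlaps the matching pattern exactly, so `place` is the centre of that pattern in the key map.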

Method

Pick : the RGB-D image is unprojected into a 3D point cloud and then converted back into a 2D image through orthographic projection. Each pixel of the resulting image covers a fixed area of 3D space, which gives a spatially consistent representation.
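The unprojection and orthographic projection can be sketched as below. This is an illustrative sketch only: it assumes a pinhole camera with known intrinsics and a known camera-to-world transform, rasterises height only (no color channels), and all names, bounds, and the toy camera setup are hypothetical.

```python
import numpy as np


def unproject(depth, intrinsics):
    """Back-project an H x W depth image (metres) into N x 3 camera-frame points."""
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    cols, rows = np.meshgrid(np.arange(w), np.arange(h))
    x = (cols - cx) * depth / fx
    y = (rows - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


def to_world(points, cam_to_world):
    """Apply a 4 x 4 homogeneous camera-to-world transform to N x 3 points."""
    homogeneous = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return (homogeneous @ cam_to_world.T)[:, :3]


def orthographic_heightmap(points, bounds, pixel_size):
    """Rasterise world points top-down; each pixel covers pixel_size x pixel_size."""
    (x_min, x_max), (y_min, y_max) = bounds
    width = int(round((x_max - x_min) / pixel_size))
    height = int(round((y_max - y_min) / pixel_size))
    heightmap = np.zeros((height, width))
    col = np.floor((points[:, 0] - x_min) / pixel_size).astype(int)
    row = np.floor((points[:, 1] - y_min) / pixel_size).astype(int)
    inside = (col >= 0) & (col < width) & (row >= 0) & (row < height)
    # Keep the tallest point that falls into each cell.
    np.maximum.at(heightmap, (row[inside], col[inside]), points[inside, 2])
    return heightmap


# Toy setup: a camera 1 m above the table looking straight down.
# The table is at depth 1.0 m; one pixel sees an object 0.1 m tall.
depth = np.ones((4, 4))
depth[1, 2] = 0.9
intrinsics = (4.0, 4.0, 1.5, 1.5)  # fx, fy, cx, cy (made-up values)
cam_to_world = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

points = to_world(unproject(depth, intrinsics), cam_to_world)
heightmap = orthographic_heightmap(
    points, bounds=((-0.5, 0.5), (-0.5, 0.5)), pixel_size=0.25
)
```

Because the grid is defined in world coordinates, every cell of `heightmap` corresponds to the same 0.25 m x 0.25 m patch of the workspace regardless of where the camera is.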

As in the equation below, the pick location is the pixel location (u,v) in the visual observation that is most likely to succeed, i.e. the one with the largest Qpick value.