Visual Graph Memory with Unsupervised Representation for Visual Navigation

How can an agent navigate through an environment to reach a given target observation, using only visual sensory signals?

SNU AI
SNU AIIS Blog


By Seyeon An

How do we navigate our way home from work? We take the surrounding environment as our cue, remember our target, and keep going until what we see matches what we expect to see at the target. This mechanism matches that of a visual navigation task, in which an agent has to navigate through an environment based solely on visual sensory signals, such as RGB-D images from a camera.

In this work, we focus on visual target navigation, in which the agent has to find a target location in an unfamiliar environment given only an image of that location. To do so, the agent needs to build an internal representation of the environment from its sensory inputs and its own actions, so that it can navigate effectively through a large-scale environment.

Figure: environment representations for navigation. Left: a metric map (SLAM-based); right: a topological map. Learning-based implicit maps are another alternative.

So how can we represent the environment for navigation?

1) Metric Maps (SLAM based)

Metric maps have been widely used to represent the environment for navigation. They are accurate, but the quality of the map relies heavily on the provided geometric information. Moreover, building and maintaining an accurate map from raw sensory signals is computationally expensive, and this kind of map is not well suited to embedding task-related information.

2) Topological Maps

While a topological representation of the environment can be less accurate than a metric map, it is more convenient to build and maintain thanks to its concise and sparse structure. This makes it a better fit for noisy and dynamic environments, and task-related information can be embedded in each node.

Yet existing topological-map-based approaches have limitations as well. First, place recognition from raw sensory inputs requires human supervision in the form of manually specified annotation rules and geometric information. This need for human intervention makes such approaches ill-suited to solving navigation tasks autonomously. Second, most existing methods can only operate on a pre-built graph, so they demand sufficient exploration time before the actual navigation task can begin.

So we came up with a new graph-structured memory for visual navigation, the visual graph memory (VGM). It consists of unsupervised image representations obtained from the navigation history, and it does not need any manually specified, geometry-based labels for training.

We also present a navigation agent designed to utilize the proposed VGM. This agent can be trained in an end-to-end manner even though the size of the graph varies, and it does not need any pre-exploration before the actual navigation task.

Problem Formulation: Visual Target Navigation

First, we’ll give you an overview of what a visual target navigation task is:

The objective of this task is to arrive at the target location based on image observations o_t and a target image o_target. No geometric information such as GPS, odometry, or instructions to follow is provided. We denote by o_t the observation at time t, which is an RGB-D panoramic image of the current location.

The agent's action space is discrete, with four options: STOP, MOVE_FORWARD, TURN_LEFT, and TURN_RIGHT.
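To make the setup concrete, here is a minimal sketch of such an episode under these assumptions. The `env` and `agent` interfaces are hypothetical placeholders, not the Habitat API:

```python
# A minimal sketch of the visual target navigation loop. The environment and
# agent interfaces below are hypothetical, used only to illustrate the setup.
from enum import Enum


class Action(Enum):
    STOP = 0
    MOVE_FORWARD = 1
    TURN_LEFT = 2
    TURN_RIGHT = 3


def run_episode(env, agent, max_steps=500):
    """Run one visual target navigation episode."""
    o_t, o_target = env.reset()        # panoramic RGB-D observation and target image
    for _ in range(max_steps):
        # The agent sees only images: no GPS, odometry, or instructions.
        action = agent.act(o_t, o_target)
        if action == Action.STOP:
            break
        o_t = env.step(action)         # the next panoramic RGB-D observation
    return env.success()               # e.g., stopped within 1 m of the target
```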

Visual Graph Memory (VGM)

An overview of the Visual Graph Memory (VGM) method.

We tackle this task with an agent that builds and uses a VGM. The agent consists of two parts: a memory update module and a navigation module.

1. Memory Update Module

First, the memory update module builds and maintains the VGM. The current observation image is encoded into an embedding by the image encoder F_loc, which is trained with unsupervised representation learning.

Specifically, we use prototypical contrastive learning, a recently proposed unsupervised representation learning method:

This method resembles the Expectation-Maximization algorithm and proceeds as follows (a simplified code sketch follows the list):

  1. First, it builds a set of prototypes of the image features by clustering.
  2. Then, the network is trained to encode each image so that its feature is similar to the other features in the same cluster.
  3. By iterating these two stages, the network learns to encode an image into a feature that preserves the local smoothness of each image instance while also capturing the hierarchical semantic structure of the whole dataset.
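As a rough illustration, the two stages could be implemented as the following simplified training loop. This sketch uses a single clustering granularity and omits the momentum encoder of the original method; the function name `train_f_loc` and the assumption that the data loader yields image batches in a fixed order are ours:

```python
# A simplified sketch of prototypical contrastive learning for the encoder F_loc.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def train_f_loc(f_loc, loader, optimizer, num_prototypes=256, epochs=10, temp=0.1):
    for _ in range(epochs):
        # E-step: embed the dataset and cluster the features into prototypes.
        with torch.no_grad():
            feats = torch.cat([F.normalize(f_loc(x), dim=1) for (x,) in loader])
        kmeans = KMeans(n_clusters=num_prototypes).fit(feats.cpu().numpy())
        prototypes = F.normalize(
            torch.as_tensor(kmeans.cluster_centers_, dtype=torch.float32), dim=1)
        assignments = torch.as_tensor(kmeans.labels_, dtype=torch.long)

        # M-step: pull each feature toward its own prototype and push it away
        # from the other prototypes (a ProtoNCE-style objective).
        for batch_idx, (x,) in enumerate(loader):
            z = F.normalize(f_loc(x), dim=1)
            logits = z @ prototypes.T / temp
            start = batch_idx * loader.batch_size
            target = assignments[start:start + x.size(0)]
            loss = F.cross_entropy(logits, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```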

Then, the VGM is updated based on the similarities between the current observation embedding e_t, obtained with the pretrained image encoder F_loc, and the node embeddings of the VGM.

First, the module calculates a set of cosine similarities between the current observation embedding e_t and the nodes in the VGM.

  • If there is a node v_i whose cosine similarity s_i is higher than a threshold s_th, the module determines that the agent is near v_i in the VGM.
  • If v_i is the same as the last localized node v_n (i.e., i = n), v_n is updated: its embedding is replaced with the current feature e_t.
  • If they are different, a new edge between v_i and v_n is added.
  • If all the similarities are lower than the threshold, we can infer that the agent has never been to a place similar to the current location. In this case, a new node is added (see the sketch below).
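A minimal sketch of this update rule is shown below. The class and attribute names are illustrative, and connecting a newly added node to the previously localized node is our reading of the graph-building scheme:

```python
# A minimal sketch of the VGM update rule using cosine similarity.
import torch
import torch.nn.functional as F


class VisualGraphMemory:
    def __init__(self, s_th=0.8):
        self.s_th = s_th        # localization threshold on cosine similarity
        self.nodes = []         # node embeddings, one tensor per node
        self.edges = set()      # undirected edges as (i, j) index pairs
        self.last = None        # index of the last localized node v_n

    def update(self, e_t):
        """Update the graph with the current observation embedding e_t = F_loc(o_t)."""
        if self.nodes:
            sims = torch.stack([F.cosine_similarity(e_t, v, dim=0) for v in self.nodes])
            i = int(sims.argmax())
            if sims[i] >= self.s_th:                 # the agent is near node v_i
                if i == self.last:                   # same as the last localized node:
                    self.nodes[i] = e_t              #   refresh its embedding
                else:                                # a different node was recognized:
                    self.edges.add((min(i, self.last), max(i, self.last)))
                self.last = i
                return
        # No node is similar enough: this place is new, so add a node
        # (and connect it to the previously localized node, if any).
        self.nodes.append(e_t)
        new = len(self.nodes) - 1
        if self.last is not None:
            self.edges.add((self.last, new))
        self.last = new
```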

2. Navigation Module

The navigation module processes the updated VGM and estimates the most appropriate action for the navigation task. We use a graph convolutional network (GCN) F_enc, a powerful tool for processing graph-structured data, to encode the VGM. An image encoder F_vis provides two feature embeddings, r_t and r_target, from the current and target observations.

Then, we apply an attention mechanism over the encoded nodes using the current observation and the target observation. This produces two context vectors, one for the current observation and one for the target, that summarize the situational context stored in the memory. The context vectors and the current embedding features are fed into the recurrent action policy network.

Finally, the recurrent action policy network outputs an action based on the context vectors and the current observation. In this way, the proposed agent navigates through the environment by iteratively updating the VGM and processing it into an action.
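To make the data flow concrete, here is a rough sketch of how such a module could be wired together. The single-layer graph convolution, layer sizes, and module names are illustrative and not the paper's exact architecture:

```python
# A rough sketch of the navigation module: GCN-style encoding of the VGM nodes,
# attention reads with the current and target embeddings, and a recurrent policy.
import torch
import torch.nn as nn


class NavigationModule(nn.Module):
    def __init__(self, dim=512, num_actions=4):
        super().__init__()
        self.gcn = nn.Linear(dim, dim)           # stand-in for one graph conv step (F_enc)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.policy = nn.GRUCell(3 * dim, dim)   # recurrent action policy
        self.head = nn.Linear(dim, num_actions)

    def forward(self, node_feats, adj, r_t, r_target, h):
        # Graph encoding: each node aggregates its neighbors (one simplified GCN step).
        encoded = torch.relu(self.gcn(adj @ node_feats))            # (N, dim)
        nodes = encoded.unsqueeze(0)                                # (1, N, dim)
        # Attention: read the memory with the current and target observations.
        c_t, _ = self.attn(r_t.view(1, 1, -1), nodes, nodes)        # current context
        c_g, _ = self.attn(r_target.view(1, 1, -1), nodes, nodes)   # target context
        # Recurrent policy: combine the contexts with the current embedding.
        x = torch.cat([c_t.flatten(), c_g.flatten(), r_t.flatten()]).unsqueeze(0)
        h = self.policy(x, h)
        return self.head(h), h                                      # action logits, new hidden state
```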

Experiments

We conducted several experiments on the Habitat simulator with the Gibson dataset.

  • All models are trained on 72 scenes and evaluated on 14 unseen scenes, under various difficulty settings.
  • All the baselines are trained with Imitation Learning (IL) first and are fine-tuned with Reinforcement Learning (RL).
  • Two evaluation metrics are used: the success rate (SR) and success weighted by path length (SPL), as sketched below.
  • A navigation episode counts as a success when the agent takes the STOP action within 1 m of the target location.
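For reference, both metrics can be computed from per-episode records as follows; the dictionary keys are illustrative, and SPL follows its standard definition:

```python
# Success rate and SPL (success weighted by path length) from per-episode records.
def success_rate(episodes):
    return sum(ep["success"] for ep in episodes) / len(episodes)


def spl(episodes):
    """Mean of S_i * l_i / max(p_i, l_i) over episodes, where S_i is the success
    indicator, l_i the shortest-path length, and p_i the agent's path length."""
    total = 0.0
    for ep in episodes:
        total += ep["success"] * ep["shortest_path"] / max(ep["agent_path"], ep["shortest_path"])
    return total / len(episodes)
```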

Among the compared approaches, the proposed VGM brings a performance improvement over other types of memory models and achieves state-of-the-art results:

Evaluation results of the baselines and our model. (SR: success rate, SPL: success weighted by path length)

Conclusion

We have proposed VGM, a graph-structured memory that is built during navigation with a single image encoder. This encoder is trained with an effective unsupervised representation learning method, which leads to the following achievements:

  • Even though the encoder is trained without any geometric information, it is sufficient for building a graph memory that reflects the spatio-temporal relationships among the observations.
  • Using the VGM, the proposed navigation agent learns a goal-oriented policy directly from the graph representation.

The experimental results show that the proposed method improves over state-of-the-art approaches that use various other types of navigation memories. We hope that VGM will serve as a meaningful benchmark in the search for better navigation methods.

Acknowledgements

We thank Obin Kwon and the co-authors of the paper “Visual Graph Memory with Unsupervised Representation for Visual Navigation” for their contributions and discussions in preparing this blog post. The views and opinions expressed in this blog are solely those of the authors.

This post is based on the following paper:

  • Visual Graph Memory with Unsupervised Representation for Visual Navigation, Obin Kwon, Nuri Kim, Yunho Choi, Hwiyeon Yoo, Jeongho Park, Songhwai Oh, International Conference on Computer Vision (ICCV) 2021, Paper.

This post was originally published on our Notion blog on January 6, 2022.


AIIS is an intercollegiate institution of Seoul National University, committed to integrating and supporting AI-related research at Seoul National University.