This project lies at the intersection of robotics, computer vision, and machine learning, and addresses how to train robots effectively on a variety of tasks. In our ICML 2021 paper, we addressed the question of how robots can imitate experts when the robot and expert domains differ, e.g., in dynamics, viewpoint, or embodiment. We showed how to do this with unpaired and unaligned trajectory observations from the expert, i.e., directly from expert demonstrations without knowing the expert policy. We learn to translate across domains using a proxy task and a cycle-consistency constraint. We also impose an additional consistency constraint on the temporal position of states across the two domains. This builds upon our previous work at NeurIPS 2019, which performed imitation learning from expert trajectories by decomposing the complex main task into smaller sub-goals. Such approaches open up the possibility of using imitation in robotics and of learning from weakly supervised sources such as videos.
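To make the two constraints above concrete, the following is a minimal sketch of a cycle-consistency loss and a temporal-position consistency loss. It is an illustration under stated assumptions, not the implementation from the paper: states are plain lists of floats, the translators and position estimators are hand-written toy functions rather than learned networks, and all names (`cycle_consistency_loss`, `temporal_position_loss`, `to_robot`, and so on) are hypothetical.

```python
from typing import Callable, List, Sequence

State = List[float]
Translator = Callable[[State], State]
PositionEstimator = Callable[[State], float]


def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cycle_consistency_loss(
    states: List[State], to_robot: Translator, to_expert: Translator
) -> float:
    """Mean round-trip error: expert -> robot -> expert should recover the input."""
    return sum(squared_error(s, to_expert(to_robot(s))) for s in states) / len(states)


def temporal_position_loss(
    states: List[State],
    to_robot: Translator,
    expert_position: PositionEstimator,
    robot_position: PositionEstimator,
) -> float:
    """Mean squared gap between the estimated temporal position of a state
    and the estimated temporal position of its translation."""
    return sum(
        (expert_position(s) - robot_position(to_robot(s))) ** 2 for s in states
    ) / len(states)


# Toy stand-ins for learned components (assumptions, for illustration only).
def to_robot(s: State) -> State:
    return [2.0 * v for v in s]  # hypothetical expert-to-robot translation


def good_inverse(s: State) -> State:
    return [0.5 * v for v in s]  # exact inverse of to_robot


def bad_inverse(s: State) -> State:
    return list(s)  # ignores the scaling, so the cycle does not close


def expert_position(s: State) -> float:
    return s[0] / 2.0  # toy estimate of normalized progress in the expert domain


def robot_position(s: State) -> float:
    return s[0] / 4.0  # toy estimate of normalized progress in the robot domain


expert_states = [[0.0, 1.0], [1.0, 2.0], [2.0, 4.0]]

perfect_cycle = cycle_consistency_loss(expert_states, to_robot, good_inverse)
broken_cycle = cycle_consistency_loss(expert_states, to_robot, bad_inverse)
aligned_positions = temporal_position_loss(
    expert_states, to_robot, expert_position, robot_position
)
```

In this toy setting the round trip through `good_inverse` gives zero cycle loss, while `bad_inverse` gives a positive loss that a learner would be pushed to reduce. In practice both translators and both position estimators would be trained jointly from unpaired trajectories.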
In a separate line of work, published at CVPR 2022, we considered multi-task learning under resource constraints. Multi-task learning commonly suffers from competition for resources among tasks, especially when model capacity is limited. This challenge motivates models that allow control over the relative importance of tasks and the total compute cost at inference time. In this work, we proposed a controllable multi-task network that dynamically adjusts its architecture and weights to match the desired task preference as well as the resource constraints. In contrast to existing dynamic multi-task approaches, which adjust only the weights within a fixed architecture, our approach affords the flexibility to dynamically control the total computational cost and to better match the user-preferred task importance.
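The control interface described above (a task preference plus a compute budget, both supplied at inference time) can be illustrated with a toy allocator. This sketch is not the network from the paper and does not predict architectures or weights; it only shows, under simplifying assumptions, how a fixed budget of compute units might be split across tasks in proportion to user preference. The function name `allocate_units` and the notion of integer "compute units" are hypothetical.

```python
from typing import Dict


def allocate_units(preferences: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` integer compute units across tasks in proportion to
    their preference weights, using largest-remainder rounding so that the
    allocations always sum exactly to the budget."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if any(p < 0 for p in preferences.values()):
        raise ValueError("preferences must be non-negative")
    total = sum(preferences.values())
    if total <= 0:
        raise ValueError("at least one preference must be positive")

    # Ideal (fractional) share of the budget for each task.
    shares = {task: budget * p / total for task, p in preferences.items()}
    # Start from the floor of each share.
    allocation = {task: int(share) for task, share in shares.items()}
    # Hand out the remaining units to the largest fractional remainders.
    leftover = budget - sum(allocation.values())
    by_remainder = sorted(
        shares, key=lambda task: shares[task] - allocation[task], reverse=True
    )
    for task in by_remainder[:leftover]:
        allocation[task] += 1
    return allocation


# The same preference vector under two different budgets.
preference = {"segmentation": 0.75, "depth": 0.25}
large_budget = allocate_units(preference, budget=8)
small_budget = allocate_units(preference, budget=4)
```

Changing either the preference vector or the budget changes the allocation without retraining anything, which is the kind of inference-time control the paragraph above describes. A real controllable network would map these inputs to an architecture and its weights rather than to a simple unit count.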
In our NeurIPS 2022 paper, we proposed AVLEN (Audio-Visual-Language Embodied Navigation), an embodied agent that is trained to localize an audio event by navigating a 3D visual world. The agent may also seek help from a human (oracle), who provides assistance in free-form natural language. To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone that learns: (a) high-level policies that choose between navigating with audio cues and querying the oracle, and (b) lower-level policies that select navigation actions based on the agent's audio-visual and language inputs. The policies are trained by rewarding success on the navigation task while minimizing the number of queries to the oracle. To empirically evaluate AVLEN, we present experiments on the SoundSpaces framework for semantic audio-visual navigation tasks. See a demonstration of our system here.
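The two-level decision structure and the query-penalized reward described above can be sketched as follows. This is a toy illustration, not AVLEN itself: the policies are hand-written rules rather than learned networks, and the observation format, option names, action names, and reward constants (`success_bonus`, `query_penalty`) are all assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

Observation = Dict[str, float]


def episode_reward(
    success: bool,
    num_queries: int,
    success_bonus: float = 10.0,  # hypothetical value
    query_penalty: float = 0.5,  # hypothetical value
) -> float:
    """Reward task success and charge a fixed cost for every oracle query."""
    return (success_bonus if success else 0.0) - query_penalty * num_queries


def hierarchical_step(
    observation: Observation,
    choose_option: Callable[[Observation], str],
    audio_policy: Callable[[Observation], str],
    language_policy: Callable[[Observation, str], str],
    oracle: Callable[[Observation], str],
) -> Tuple[str, str]:
    """One decision step: the high-level policy picks an option, then the
    matching low-level policy picks the navigation action."""
    option = choose_option(observation)
    if option == "query":
        instruction = oracle(observation)  # free-form natural-language help
        return option, language_policy(observation, instruction)
    return option, audio_policy(observation)


# Toy stand-ins for the learned components (assumptions, for illustration only).
def choose_option(obs: Observation) -> str:
    return "query" if obs["audio_confidence"] < 0.3 else "audio"


def audio_policy(obs: Observation) -> str:
    return "move_forward"


def language_policy(obs: Observation, instruction: str) -> str:
    return "turn_left" if "left" in instruction else "move_forward"


def oracle(obs: Observation) -> str:
    return "turn left at the door"


uncertain_step = hierarchical_step(
    {"audio_confidence": 0.1}, choose_option, audio_policy, language_policy, oracle
)
confident_step = hierarchical_step(
    {"audio_confidence": 0.9}, choose_option, audio_policy, language_policy, oracle
)
reward_with_two_queries = episode_reward(success=True, num_queries=2)
```

Because every query lowers the episode reward, a high-level policy trained against a reward of this shape has an incentive to ask for help only when the audio cue alone is unlikely to lead to success.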
This work has been supported by NSF, UCOP, and MERL.