Мы разработали иерархический алгоритм обучения с подкреплением, который обучает высокоуровневым действиям, полезным для решения ряда задач, что позволяет быстро решать задачи, требующие тысяч временных шагов. Our algorithm, when applied to a set of navigation problems, discovers a set of high-level actions for walking and crawling in different directions, which enables the agent to master new navigation tasks quickly.

Paper Code

Humans solve complicated challenges by breaking them up into small, manageable components. Grilling pancakes consists of a series of high-level actions, such as measuring flour, whisking eggs, transferring the mixture to the pan, turning the stove on, and so on. Humans are able to learn new tasks rapidly by sequencing together these learned components, even though the task might take millions of low-level actions, i.e., individual muscle contractions.

С другой стороны, современные методы обучения с подкреплением работают путем перебора грубой силы по низкоуровневым действиям, требуя огромного количества попыток решения новой задачи, и становятся очень неэффективными при решении задач, которые занимают много времени. большое количество тактов.

Наше решение основано на идее иерархического обучения с подкреплением, когда агенты представляют сложное поведение в виде короткой последовательности высокоуровневых действий, что позволяет нашим агентам решать гораздо более сложные задачи: в то время как для решения может потребоваться 2000 низкоуровневых действий, иерархическая политика превращает их в последовательность из 10 высокоуровневых действий, и гораздо эффективнее выполнять поиск в последовательности из 10 шагов, чем в последовательности из 2000 шагов.

Meta-Learning Shared Hierarchies

MLSH

Our algorithm, meta-learning shared hierarchies (MLSH), learns a hierarchical policy where a master policy switches between a set of sub-policies. The master selects an action every every N timesteps, where we might take N=200. A sub-policy executed for N timesteps constitutes a high-level action, and for our navigation tasks, sub-policies correspond to walking or crawling in different directions.

In most prior work, hierarchical policies have been explicitly hand-engineered. Instead, we aim to discover this hierarchical structure automatically through interaction with the environment. Taking a meta-learning perspective, we define a good hierarchy as one that quickly reaches high reward quickly when training on unseen tasks. Hence, the MLSH algorithm aims to learn sub-policies that enable fast learning on previously unseen tasks.

Мы тренируемся на распределении по задачам, разделяя вложенные политики, изучая новую основную политику для каждой выбранной задачи.Посредством многократного обучения новых основных политик этот процесс автоматически находит вложенные политики, которые соответствуют основной политике. динамика обучения.

Experiments

After running overnight, an MLSH agent trained to solve nine separate mazes discovered sub-policies corresponding to upwards, rightwards, and downwards movement, which it then used to navigate its way through the mazes.

In our AntMaze environment, a Mujoco Ant robot is placed into a distribution of 9 different mazes and must navigate from the starting position to the goal. Our algorithm is successfully able to find a diverse set of sub-policies that can be sequenced together to solve the maze tasks, solely through interaction with the environment. This set of sub-policies can then be used to master a larger task than the ones they were trained on (see video at beginning at post).

Training on separate maze environments leads to the automatic learning of sub-policies to solve the task.

Code

Мы выпускаемcode for training the MLSH agents, as well as the MuJoCo environments we built to evaluate these algorithms.