Decision Transformer: Reinforcement Learning via Sequence Modeling

Lili Chen*, Kevin Lu*, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas*, Igor Mordatch*

UC Berkeley, Facebook AI Research, Google Brain

arXiv / github

Can standard language modeling frameworks train effective policies for reinforcement learning?

Abstract

We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

Offline reinforcement learning as a sequence modeling problem

We investigate shifting our perspective on reinforcement learning (RL) by posing sequential decision making problems in a language modeling framework. While conventional work in RL has used specialized frameworks relying on Bellman backups, we propose to instead model trajectories with sequence modeling, enabling us to use strong and well-studied architectures such as transformers to generate behaviors. To illustrate this, we study offline reinforcement learning, where we train a model from a fixed dataset rather than collecting experience in the environment. This enables us to train RL policies using the same code as a language modeling framework, with minimal changes.

Decision Transformer: autoregressive sequence modeling for RL

We take a simple approach: each modality (return, state, or action) is passed into an embedding network (a convolutional encoder for images, a linear layer for continuous states). The embeddings are then processed by an autoregressive transformer model, and a linear output layer is trained to predict the next action given the previous tokens.
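The token layout described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the return token at each timestep is the sum of rewards from that timestep to the end of the trajectory, and that the action is predicted from the output at the corresponding state position. The helper names (`returns_to_go`, `interleave`, `causal_mask`) and the toy trajectory are hypothetical, and the embedding networks and transformer are omitted.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Sum of rewards from each timestep to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


def interleave(
    rtg: Sequence[float], states: Sequence[object], actions: Sequence[object]
) -> List[Tuple[str, object]]:
    """Lay out one trajectory as (return, state, action) triples, in time order."""
    tokens: List[Tuple[str, object]] = []
    for r, s, a in zip(rtg, states, actions):
        tokens.append(("return", r))
        tokens.append(("state", s))
        tokens.append(("action", a))
    return tokens


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy trajectory (hypothetical data): a sparse reward arrives on the last step.
rewards = [0.0, 0.0, 1.0]
states = ["s1", "s2", "s3"]
actions = ["a1", "a2", "a3"]

rtg = returns_to_go(rewards)
tokens = interleave(rtg, states, actions)
mask = causal_mask(len(tokens))

# Assumption: the output at each state position is trained to predict
# the action that immediately follows it in the sequence.
action_prediction_positions = [
    i for i, (kind, _) in enumerate(tokens) if kind == "state"
]
```

In this layout each timestep contributes three tokens, so a trajectory of length T becomes a sequence of 3T tokens, and the causal mask ensures that the prediction for an action only sees the returns, states, and actions that precede it.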

Evaluation is also easy: we initialize the sequence with a desired target return (e.g. 1 for success or 0 for failure) and the starting state in the environment. Unrolling the sequence -- similar to standard autoregressive generation in language models -- yields a sequence of actions to execute in the environment.
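The evaluation procedure can be sketched as a short rollout loop. This is an illustration under stated assumptions rather than the released code: `ToyChain` is a hypothetical environment, `stand_in_policy` is a hand-written placeholder for the trained model, and the loop assumes that the target return is reduced by each observed reward before the next action is generated.

```python
from typing import Callable, List, Tuple


class ToyChain:
    """Hypothetical 1-D environment: start at 0, reward 1 on reaching `goal`."""

    def __init__(self, goal: int = 3, max_steps: int = 10) -> None:
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self) -> int:
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.pos += action
        self.steps += 1
        reached = self.pos == self.goal
        reward = 1.0 if reached else 0.0
        done = reached or self.steps >= self.max_steps
        return self.pos, reward, done


Policy = Callable[[List[float], List[int], List[int]], int]


def stand_in_policy(returns: List[float], states: List[int], actions: List[int]) -> int:
    """Placeholder for the trained model: move right while return is still wanted."""
    return 1 if returns[-1] > 0 else 0


def rollout(env: ToyChain, policy: Policy, target_return: float) -> Tuple[float, List[int]]:
    """Unroll the sequence: condition on the target return and the start state."""
    returns = [target_return]
    states = [env.reset()]
    actions: List[int] = []
    total = 0.0
    done = False
    while not done:
        action = policy(returns, states, actions)
        state, reward, done = env.step(action)
        actions.append(action)
        total += reward
        # Assumption: the remaining target shrinks by the reward just received.
        returns.append(returns[-1] - reward)
        states.append(state)
    return total, actions


success_total, success_actions = rollout(ToyChain(), stand_in_policy, target_return=1.0)
failure_total, failure_actions = rollout(ToyChain(), stand_in_policy, target_return=0.0)
```

With a real model, `stand_in_policy` would be replaced by a forward pass over the embedded return, state, and action tokens, taking the action predicted at the most recent state position.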