Offline RL is renowned for its sample efficiency and has attracted widespread attention in both academia and industry. However, despite the abundance of emerging algorithms, it remains unclear which methods excel at specific tasks or which are superior for most robotic applications.
I benchmarked classic and cutting-edge offline RL algorithms to objectively assess their performance. To this end, I provide the following clarifications:
Given the growing interest in Embodied AI, the testing centered on robotic locomotion and manipulation tasks using MuJoCo (Hopper, Walker2d, Humanoid, Swimmer, HalfCheetah, Ant) and MetaWorld.
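For reference, the six locomotion tasks can be instantiated through Gymnasium's MuJoCo suite. This is only a minimal sketch: the `-v4` version suffixes are assumptions and should be matched to your installed Gymnasium release.

```python
# The six MuJoCo locomotion tasks covered in this benchmark.
# The "-v4" suffixes are assumptions; match them to your installed Gymnasium release.
TASKS = ["Hopper-v4", "Walker2d-v4", "Humanoid-v4",
         "Swimmer-v4", "HalfCheetah-v4", "Ant-v4"]

try:
    import gymnasium as gym
    for task in TASKS:
        env = gym.make(task)
        # Print each task's observation and action dimensionality.
        print(task, env.observation_space.shape, env.action_space.shape)
        env.close()
except Exception as exc:  # gymnasium or the MuJoCo bindings may not be installed
    print(f"Skipping environment construction: {exc}")
```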
The evaluated algorithms range from classics such as DDPG, SAC, and TD3 to recent frontiers such as REDQ and DroQ.
Disclaimer: These results are provided solely for academic reference.
Preview
Performance on MuJoCo for Offline RL
MuJoCo: Hopper / MuJoCo: Walker2d / MuJoCo: Humanoid
MuJoCo: Swimmer / MuJoCo: HalfCheetah / MuJoCo: Ant
Note
The tag 'office' denotes the official source-code implementation, while 'self' denotes my own implementation from scratch (without using third-party APIs).
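The run labels below follow a fixed `year-venue-algorithm-source` naming scheme (e.g. `2018-ICML-SAC-office`). A small helper to split such a label into its fields (the helper name is my own, for illustration only):

```python
def parse_tag(tag: str) -> dict:
    """Split a run label like '2018-ICML-SAC-office' into its fields.

    'office' marks the official source-code implementation,
    'self' marks a from-scratch reimplementation.
    """
    year, venue, algo, source = tag.split("-")
    return {"year": int(year), "venue": venue, "algo": algo, "source": source}

print(parse_tag("2018-ICML-SAC-office"))
# → {'year': 2018, 'venue': 'ICML', 'algo': 'SAC', 'source': 'office'}
```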
The experimental code for this project is not publicly available here. If interested, please contact me or refer to the official implementation.
Video of Initial and Final Performance for MuJoCo-Hopper
Below is the performance of classic and state-of-the-art Offline RL algorithms on MuJoCo-Hopper. The video on the left shows the initial policy, while the video on the right shows the performance after 1M samples of online training:
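The initial/final comparison amounts to rolling out the same policy before and after training and averaging its return. A generic evaluation loop might look like the sketch below; `policy` is any callable mapping observations to actions, and `DummyEnv` is a stand-in I added so the snippet is self-contained:

```python
def evaluate(env, policy, episodes=10, seed=0):
    """Average undiscounted return of `policy` over several episodes."""
    returns = []
    for ep in range(episodes):
        obs, info = env.reset(seed=seed + ep)
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / episodes


class DummyEnv:
    """Stand-in environment with the Gymnasium-style step API (illustration only)."""
    def __init__(self, horizon=5):
        self.horizon = horizon
    def reset(self, seed=None):
        self.t = 0
        return 0.0, {}
    def step(self, action):
        self.t += 1
        # obs, reward, terminated, truncated, info
        return float(self.t), 1.0, False, self.t >= self.horizon, {}


print(evaluate(DummyEnv(), policy=lambda obs: 0))  # → 5.0 (reward 1.0 over 5 steps)
```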
2016-ICLR-DDPG-office: Initial / Final MuJoCo-Hopper
2016-ICLR-DDPG-self: Initial / Final MuJoCo-Hopper
2018-ICML-SAC-office: Initial / Final MuJoCo-Hopper
2018-ICML-SAC-self: Initial / Final MuJoCo-Hopper
2018-ICML-TD3-office: Initial / Final MuJoCo-Hopper
2018-ICML-TD3-self: Initial / Final MuJoCo-Hopper
2021-ICLR-REDQ-self: Initial / Final MuJoCo-Hopper
2022-ICLR-DroQ-self: Initial / Final MuJoCo-Hopper
Analysis of Performance for MuJoCo-Walker2d
Training Curve MuJoCo-Walker2d
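Training curves like this are usually smoothed before plotting so trends are readable across seeds. A common choice is an exponential moving average (a sketch of the general technique, not necessarily the exact smoothing used for these plots):

```python
def ema_smooth(values, alpha=0.9):
    """Exponential moving average, as commonly used to smooth RL training curves.

    Higher alpha gives a smoother (but more lagged) curve.
    """
    out, prev = [], values[0]
    for v in values:
        prev = alpha * prev + (1 - alpha) * v
        out.append(prev)
    return out

print(ema_smooth([0.0, 10.0, 10.0, 10.0], alpha=0.5))
# → [0.0, 5.0, 7.5, 8.75]
```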
Video of Initial and Final Performance for MuJoCo-Walker2d
Below is the performance of classic and state-of-the-art Offline RL algorithms on MuJoCo-Walker2d. The video on the left shows the initial policy, while the video on the right shows the performance after 1M samples of online training:
2016-ICLR-DDPG-office: Initial / Final MuJoCo-Walker2d
2016-ICLR-DDPG-self: Initial / Final MuJoCo-Walker2d
2018-ICML-SAC-office: Initial / Final MuJoCo-Walker2d
2018-ICML-SAC-self: Initial / Final MuJoCo-Walker2d
2018-ICML-TD3-office: Initial / Final MuJoCo-Walker2d
2018-ICML-TD3-self: Initial / Final MuJoCo-Walker2d
2021-ICLR-REDQ-self: Initial / Final MuJoCo-Walker2d
2022-ICLR-DroQ-self: Initial / Final MuJoCo-Walker2d
Analysis of Performance for MuJoCo-Humanoid
Training Curve MuJoCo-Humanoid
Video of Initial and Final Performance for MuJoCo-Humanoid
Below is the performance of classic and state-of-the-art Offline RL algorithms on MuJoCo-Humanoid. The video on the left shows the initial policy, while the video on the right shows the performance after 1M samples of online training:
2016-ICLR-DDPG-office: Initial / Final MuJoCo-Humanoid
2016-ICLR-DDPG-self: Initial / Final MuJoCo-Humanoid
2018-ICML-SAC-office: Initial / Final MuJoCo-Humanoid
2018-ICML-SAC-self: Initial / Final MuJoCo-Humanoid
2018-ICML-TD3-office: Initial / Final MuJoCo-Humanoid
2018-ICML-TD3-self: Initial / Final MuJoCo-Humanoid
2021-ICLR-REDQ-self: Initial / Final MuJoCo-Humanoid
2022-ICLR-DroQ-self: Initial / Final MuJoCo-Humanoid
Analysis of Performance for MuJoCo-Swimmer
Training Curve MuJoCo-Swimmer
Video of Initial and Final Performance for MuJoCo-Swimmer
Below is the performance of classic and state-of-the-art Offline RL algorithms on MuJoCo-Swimmer. The video on the left shows the initial policy, while the video on the right shows the performance after 1M samples of online training:
2016-ICLR-DDPG-office: Initial / Final MuJoCo-Swimmer
2016-ICLR-DDPG-self: Initial / Final MuJoCo-Swimmer
2018-ICML-SAC-office: Initial / Final MuJoCo-Swimmer
2018-ICML-SAC-self: Initial / Final MuJoCo-Swimmer
2018-ICML-TD3-office: Initial / Final MuJoCo-Swimmer
2018-ICML-TD3-self: Initial / Final MuJoCo-Swimmer
2021-ICLR-REDQ-self: Initial / Final MuJoCo-Swimmer
2022-ICLR-DroQ-self: Initial / Final MuJoCo-Swimmer
Analysis of Performance for MuJoCo-HalfCheetah
Training Curve MuJoCo-HalfCheetah
Video of Initial and Final Performance for MuJoCo-HalfCheetah
Below is the performance of classic and state-of-the-art Offline RL algorithms on MuJoCo-HalfCheetah. The video on the left shows the initial policy, while the video on the right shows the performance after 1M samples of online training:
2016-ICLR-DDPG-office: Initial / Final MuJoCo-HalfCheetah
2016-ICLR-DDPG-self: Initial / Final MuJoCo-HalfCheetah
2018-ICML-SAC-office: Initial / Final MuJoCo-HalfCheetah
2018-ICML-SAC-self: Initial / Final MuJoCo-HalfCheetah
2018-ICML-TD3-office: Initial / Final MuJoCo-HalfCheetah
2018-ICML-TD3-self: Initial / Final MuJoCo-HalfCheetah
2021-ICLR-REDQ-self: Initial / Final MuJoCo-HalfCheetah
2022-ICLR-DroQ-self: Initial / Final MuJoCo-HalfCheetah
Analysis of Performance for MuJoCo-Ant
Training Curve MuJoCo-Ant
Video of Initial and Final Performance for MuJoCo-Ant
Below is the performance of classic and state-of-the-art Offline RL algorithms on MuJoCo-Ant. The video on the left shows the initial policy, while the video on the right shows the performance after 1M samples of online training:
2016-ICLR-DDPG-office: Initial / Final MuJoCo-Ant
2016-ICLR-DDPG-self: Initial / Final MuJoCo-Ant
2018-ICML-SAC-office: Initial / Final MuJoCo-Ant
2018-ICML-SAC-self: Initial / Final MuJoCo-Ant
2018-ICML-TD3-office: Initial / Final MuJoCo-Ant
2018-ICML-TD3-self: Initial / Final MuJoCo-Ant
2021-ICLR-REDQ-self: Initial / Final MuJoCo-Ant
2022-ICLR-DroQ-self: Initial / Final MuJoCo-Ant