Lukas Schwarz

Solving classic control problems with DQN

In the previous article, I explained the DQN algorithm and provided a general implementation for arbitrary Gym environments. I recommend reading this article first. Here, I take the algorithm and apply it to the classic control problem CartPole-v0. All the code can be found on GitHub.

Setup

Let us first define some general functions which we will use later, together with the imports needed in the following. To obtain reproducible results, we define a function which sets all involved random seeds

import random
from datetime import datetime

import numpy as np
import tensorflow as tf

def set_seed(env, seed=1):
    np.random.seed(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    env.seed(seed)
    env.action_space.seed(seed)
    env.observation_space.seed(seed)

Next, we define a function which creates a sequential Keras model of dense layers representing the action-value function. The input dimension corresponds to the observation space dimension of our problem and the output dimension to the number of possible actions. In the following examples, we will use relu activation functions and an Adam optimizer to optimize the mean-squared TD error.

def create_model(env, layers, lr):
    state_dim = env.observation_space.shape[0]
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(
        layers[0], input_dim=state_dim, activation="relu"))
    for i in range(1,len(layers)):
        model.add(tf.keras.layers.Dense(layers[i], activation="relu"))
    model.add(tf.keras.layers.Dense(env.action_space.n))
    model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(lr=lr))
    model.summary()
    return model
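For example, for the CartPole-v0 environment used below, this yields a network with input dimension 4 and output dimension 2, which can quickly be verified as follows (the layer sizes and learning rate here are just example values)

import gym
env = gym.make("CartPole-v0")
model = create_model(env, layers=[24, 24], lr=1e-3)
print(model.input_shape)   # > (None, 4)
print(model.output_shape)  # > (None, 2)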

Next, we define a train function which takes the environment and the training parameters as arguments. The function creates the model, sets up a file writer for logging and finally calls the dqn function from the previous article.

def train(env, p):
    set_seed(env, p["seed"])
    model = create_model(env, p["layers"], p["learning_rate"])

    # Setup filewriter
    timestr = datetime.now().strftime("%Y%m%d-%H%M%S")
    fn = timestr + "_" + p["name"]
    logdir = "logs/" + fn
    checkpointdir = "checkpoints/" + fn
    file_writer = tf.summary.create_file_writer(logdir)
    file_writer.set_as_default()

    # Log hyperparameters
    hyperparameters = [tf.convert_to_tensor([k, str(v)]) for k, v in p.items()]
    tf.summary.text("hyperparameters", tf.stack(hyperparameters), 0)

    # Train
    dqn(env, model, p["gamma"], p["epsilon"], p["epsilon_decay"],
        p["epsilon_min"], p["episodes"], p["buffer_size"], p["batch_size"],
        p["target_update_freq"], p["checkpoint_freq"],
        checkpoint_path=checkpointdir)

    return model

The training progress stored in the log files can be visualized by starting a TensorBoard instance via

$ tensorboard --logdir logs/

Finally, we define a function to visualize and evaluate the behavior of the trained agent

def play(env, model, episodes=10):
    for i in range(episodes):
        done = False
        state = env.reset()
        steps = 0
        ret = 0
        while not done:
            # Act greedily with respect to the learned action-value function
            Qs = model.predict(np.array([state]))[0]
            action = np.argmax(Qs)
            state, reward, done, info = env.step(action)
            env.render()
            steps += 1
            ret += reward
            print("\rEpisode {}: Step {}, Return {}".format(i, steps, ret),
                end="")
        print("")

Having everything set up, let's turn to some concrete examples.

CartPole-v0

The task of the CartPole-v0 problem is to balance a pole upright on a cart which can move left or right. The episode ends if the angle of the pole deviates more than $\pm 12$ degrees from the vertical or if the cart moves more than 2.4 units away from the center. It also ends automatically after 200 steps. Each time step gives a reward of $+1$. The observation or state vector is four-dimensional and contains the cart position, the cart velocity, the pole angle and the angular velocity of the pole. The two possible actions are pushing the cart to the left or to the right. This information can also be obtained by inspecting the environment directly

import gym
env = gym.make("CartPole-v0")
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low)
print(env.action_space)
# > Box(4,)
# > [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
# > [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
# > Discrete(2)

If we wanted to construct a policy by hand, we would have to analyze the observation in more detail to determine what is important and how to handle each situation. Using reinforcement learning, we can ignore these details and simply feed the numeric observation vector to the algorithm, which figures out by itself how to use the information.
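To make this contrast concrete, a hand-crafted baseline could look like the following sketch; the heuristic_policy function is purely illustrative and not part of the DQN code. It simply pushes the cart towards the side the pole is leaning to.

def heuristic_policy(state):
    # Illustrative hand-crafted rule: push the cart towards the side the
    # pole is leaning to (action 1 = push right, action 0 = push left)
    cart_pos, cart_vel, pole_angle, pole_vel = state
    return 1 if pole_angle > 0 else 0

state = env.reset()
done = False
while not done:
    state, reward, done, info = env.step(heuristic_policy(state))
    env.render()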

As a model, we choose a neural network with two hidden layers of 24 units each. This architecture is chosen somewhat arbitrarily and could be tuned further. However, the input dimension of the network is always 4 (four-dimensional observation space) and the output dimension is always 2 (two possible actions). With the help of the functions defined above, the training of the model is as simple as

env = gym.make("CartPole-v0")
p = {
    "seed": 1,
    "gamma": 0.9,
    "epsilon": 1,
    "epsilon_decay": 0.95,
    "epsilon_min": 0.01,
    "episodes": 100,
    "buffer_size": 2000,
    "batch_size": 32,
    "target_update_freq": 20,
    "checkpoint_freq": 100,
    "layers": [24,24],
    "learning_rate": 1e-3,
    "name": "test"
}
model = train(env, p)
play(env, model)

After the training, the model can be reloaded from the saved checkpoints with

model = create_model(env, p["layers"], p["learning_rate"])
model.load_weights("checkpoints/XXXXXXXX-XXXXXX_name/weights-00000XXX-0000XXXX")
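The reloaded weights can then be evaluated with the play function as before

play(env, model)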

With the parameters shown above, the return over time during training is displayed in the following figure

This is not too bad: after 55 episodes, the maximum episode length of 200 steps is reached for the first time. After 77 episodes a longer plateau at 200 is reached, yet after 90 episodes the return drops again. The training of these 100 episodes with roughly 10000 steps takes about 25 min on a Core i5-7600 CPU. In principle, one could stop here and just use, for example, the parameters of episode 80. The rendered result of 10 episodes is shown below; the maximum episode length of 200 steps is always reached with this learned policy.

Problem solved! Well, maybe not completely. As we can see in the training progress figure above, the learning is not stable. But how stable is it in general with respect to random noise? And how sensitive is the learning to hyperparameters like gamma, epsilon_decay, batch_size etc., which were so far chosen by best guess? To get more insight, I ran a small hyperparameter study: I varied the parameters gamma, epsilon_decay and target_update_freq while keeping the rest of the parameters as before. Furthermore, for each set of parameters I ran the training with several different seeds

def hyperparameter_search(env):
    list_gamma = [0.9,0.95,0.99]
    list_epsilon_decay = [0.8,0.9,0.95]
    list_target_update_freq = [1,10,20,50]

    for gamma in list_gamma:
        for epsilon_decay in list_epsilon_decay:
            for target_update_freq in list_target_update_freq:
                for seed in range(5):
                    p = {
                        "seed": seed,
                        "gamma": gamma,
                        "epsilon": 1,
                        "epsilon_decay": epsilon_decay,
                        "epsilon_min": 0.01,
                        "episodes": 100,
                        "buffer_size": 2000,
                        "batch_size": 32,
                        "target_update_freq": target_update_freq,
                        "checkpoint_freq": 100,
                        "layers": [24,24],
                        "learning_rate": 1e-3,
                        "name": (str(gamma) + "-" + str(epsilon_decay) + "-"
                            + str(target_update_freq)) + "_" + str(seed)
                    }
                    train(env, p)

This small study already amounts to $3\times 3 \times 4 \times 5 = 180$ training runs. Estimating each run to take 15-20 min on average, the whole study takes about $180 \times 17.5\,\text{min} \approx 52$ hours, i.e. roughly 2 days, on the above-mentioned CPU.

The result of the hyperparameter study is shown below

Each row corresponds to a fixed value of gamma, each column to a fixed value of epsilon_decay. In each cell, the return curves for the four different values of target_update_freq are shown. The solid line is the average over the five runs with different random seeds, while the shaded band corresponds to the respective standard deviation.
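For reference, such an aggregate curve can be produced with a few lines of matplotlib. The sketch below assumes that the per-episode returns of the five runs have already been extracted from the TensorBoard logs and saved as returns.npy; the file name and the array shape are assumptions for illustration.

import matplotlib.pyplot as plt
import numpy as np

# Assumed export: array of shape (n_seeds, n_episodes) containing the
# episode returns of the five runs for one hyperparameter combination
returns = np.load("returns.npy")

mean = returns.mean(axis=0)
std = returns.std(axis=0)
episodes = np.arange(len(mean))

plt.plot(episodes, mean)  # solid line: mean over the seeds
plt.fill_between(episodes, mean - std, mean + std, alpha=0.3)  # band: +/- std
plt.xlabel("Episode")
plt.ylabel("Return")
plt.show()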

One conclusion that can be drawn immediately from the figure is that there is a huge variance for all parameter sets. This means that for a fixed set of hyperparameters, a training might be successful in one run, but might fail completely in another run with different random initial conditions. However, one trend is observable: the training seems to become better going from the lower left to the upper right, i.e. the most successful trainings can be found for gamma=0.9 and epsilon_decay=0.95. Here, I judge a training by three criteria: a fast increase of the return, a high final return and a low variance. One can further see that the parameter target_update_freq does not seem to have a big influence on the training.

What can one learn from this study? First of all, the differences between the results for the various hyperparameters are not too big. Especially due to the high variance, one should run a training with several different random initial conditions anyway, as a training might fail even for the best parameters. This also means that the specific choice of hyperparameters is not that critical (at least in the range explored here), as the agent can be trained successfully within 100 episodes for any of these parameters; for some of them it might just take a few more tries. Furthermore, I only trained the agent for a maximum of 100 episodes. Training longer will probably lead to a stable training result for a larger set of parameters.

Of course, this hyperparameter search is far from exhaustive. The range of the varied parameters was quite limited and only a few parameters were varied at all; for example, the network architecture (number of layers and hidden units) was not touched. So I cannot rule out that a much more stable training is possible. Yet, since the training takes quite some time, I stopped further experiments here.