Reinforcement Learning : Creating a Custom Environment


In the previous blog, I had gone through the training of an agent for a mountain car environment provided by gym library. But what if we need the training for an environment which is not in gym. Sometimes we will need to create our own environments. This blog is all about creating a custom environment from scratch.

The environment which we will be creating here will be a grid containing two policemen, one thief and one bags of golds. The goal of the thief is to get the bag without being caught by the policemen.

All the parameters for training the model will be similar to that of the earlier post except some codes for the custom environment.

Now, let us write a python class for our environment which we will call a grid.

import numpy as np
from PIL import Image
import cv2
import pickle
import time
SIZE = 10

class Grid:
  def __init__(self,size=SIZE):
    self.x = np.random.randint(0, size)
    self.y = np.random.randint(0, size)

  def subtract(self, other):
    return (self.x-other.x, self.y-other.y)
  def isequal(self, other):
    if(self.x-other.x==0 and self.y-other.y==0):
      return True
      return False

  def action(self, choice):
    Gives us 8 total movement options. (0,1,2,3,4,5,6,7)
    if choice == 0:
      self.move(x=1, y=1)
    elif choice == 1:
      self.move(x=-1, y=-1)
    elif choice == 2:
      self.move(x=-1, y=1)
    elif choice == 3:
      self.move(x=1, y=-1)
    elif choice == 4:
    elif choice == 5:
      self.move(x=0, y=1)
    elif choice == 6:
      self.move(x=-1, y=0)
    elif choice == 7:
      self.move(x=0, y=-1)
  def move(self, x=False, y=False):
    if not x:
      self.x += np.random.randint(-1, 2)
      self.x += x
    if not y:
      self.y += np.random.randint(-1,2)
      self.y += y
    if self.x<0:
    if self.x>=SIZE:
      self.x = SIZE-1
    if self.y<0:
    if self.y>=SIZE:
      self.y = SIZE-1

Now, that we have defined our grid, we will assign one grid to each of the policemen, thief and the bag of gold. One more thing to note is that we will give a negative penalty to the thief when it touches the police and some positive reward when it touches the gold bag. In this case, our observation space will be the difference of coordinates of the thief from each of the policemen and the gold bag. We will also assign some colours for the police, thief and gold bag.

Let us write the code for training our game.

episodes = 10000
move_penalty = -1
police_penalty = -100
gold_penalty = 50
show_every = 1000
learning_rate = 0.2
gamma = 0.9

# for coloring 
thief_key = 1
police_key = 2
gold_key = 3

# RGB color coding
d = {1:(255, 0, 0), 2:(0,255,0), 3:(0,0,255)}

q_table = {}
for a in range(-SIZE+1, SIZE):
  for b in range(-SIZE+1, SIZE):
    for c in range(-SIZE+1, SIZE):
      for d in range(-SIZE+1, SIZE):
        for e in range(-SIZE+1, SIZE):
          for f in range(-SIZE+1, SIZE):
            q_table[((a,b),(c,d),(e,f))]= [np.random.uniform(-8, 0) for i in range(8)]

for eps in range(episodes):
  police1 = Grid()
  police2 = Grid()
  gold = Grid()
  thief = Grid()
  show = False
    show = True
  for i in range(200):
    dstate = (police1.substract(thief), police2.subtract(thief), gold.subtract(thief))
    action = np.random.randint(0,8)
    if(thief.x==police1.x and thief.y==police1.y):
      reward = police_penalty
    elif(thief.x==police2.x and thief.y==police2.y):
      reward = police_penalty
    elif(thief.x==gold.x and thief.y==gold.y):
      reward = gold_penalty
      reward = move_penalty
    new_dstate = (police1.substract(thief), police2.subtract(thief), gold.subtract(thief))
    max_future_qval = np.max(q_table[new_dstate])
    current_qval = q_table[dstate][action]
    if reward == gold_penalty:
      new_qval = gold_penalty
      new_qval = (1 - learning_rate) * current_qval + learning_rate * (reward + gamma * max_future_qval)
    q_table[dstate][action] = new_qval
      env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8) # 3 is the number of channels for RGB image
      env[gold.x][gold.y] = d[gold_key]
      env[thief.x][thief.y] = d[thief_key]
      env[police1.x][police1.y] = d[police_key]
      env[police2.x][police2.y] = d[police_key]
      image = Image.fromarray(env, 'RGB')
      image = image.resize((300, 300))
      cv2.imshow("ENV", np.array(image))
      if reward == gold_penalty or reward == police_penalty:
        if cv2.waitKey(500) and 0xFF == ord('q'):
        if cv2.waitKey(1) & 0xFF == ord('q'):
      if reward == gold_penalty or reward == police_penalty:

I will quickly go through the code once again. Initially, I have defined the different penalties, number of episodes, learning rate and the discount factor. Then I created a q-table in the form of a dictionary with random values in it for all the 8 possible actions. Then I have written the code for training the model in which for each episode, the thief takes a maximum of 200 steps and stops before if it reaches the gold bag. After every 1000 episodes, I am showing the trace of the path of the thief using cv2 and PIL image library. One important thing to note was that what happens if the thief reaches one of the policemen or the gold bag. It may cause a pause and that’s why we are breaking the loop in case those things occur.

Note: The environment we just created requires a q-table which will be huge in memory and thus it may take too much time to create the q-table itself. The recommended solution is to reduce the number of policemen or the grid size so that you can visualise it in a usual computer. If you already have the q-table, it is recommended to save it as a binary pickle file and load while training the model.

I have trained the model for 1 policeman and 5X5 grid and the results per 100 episodes are as shown in the video below:


So in this video, the blue colour box is the thief, green is the gold and red is the police. Whenever you see a combination of red and blue but no green, this means the thief has reached the gold bag.

Hope the second blog in the series was fun too. Keep Learning, Keep Sharing 😊