
Volume 6, Issue 4, April – 2021 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Deep Reinforcement Learning on Atari 2600


T Manoj, J Rohit, A Sai Sashank & Asst. Prof. BJ Praveena
Dept. of Computer Science and Engineering, Matrusri Engineering College, Saidabad, Telangana, India.

Abstract:- In reinforcement learning, the traditional Q-Learning method solves the game by iterating over the full set of states. Using a Q-table to implement Q-Learning is fine in small discrete environments. However, we often find that there are too many states to track. An example is Atari games, which can show a large variety of different screens; in this case, the problem cannot be solved with a Q-table. This paper uses a deep neural network instead of a Q-table to solve it. Atari games are displayed at a resolution of 210 by 160 pixels, with 128 possible colors for each pixel. This is still technically a discrete state space, but it is very large to process. To reduce this complexity, we performed some minimal image preprocessing. Finally, from the experimental results, it is concluded that DQN can make the agent obtain high scores in an Atari game, and that experience replay can make the model training better.

Keywords:- Deep Reinforcement Learning; Artificial Intelligence; Image Processing.

I. INTRODUCTION

The combination of Reinforcement Learning and Deep Learning is Deep Reinforcement Learning (DRL). With the improvement of computing power and processing technology, deep learning has gained a significant advantage over traditional methods in the field of artificial intelligence. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs and use these to generalize past experience to new situations. In reinforcement learning the agent does not have knowledge of which actions to take; instead, the agent learns from the consequences of its actions. Deep learning has strong perceptual ability but is weak in decision making. In contrast, reinforcement learning performs well in decision making but has weak perceptual ability [1]. Therefore, the combination of the two provides a way to solve the problem of perceptual decision-making in complex systems.

Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning on Atari 2600 games [10].

II. PRELIMINARIES

In Q-learning, the Q function, also called the state-action value function, specifies how good an action a is in the state s. Most environments have a very large number of states and, in each state, many actions to try. It would be time-consuming to go through all the actions in each state. A better approach is to approximate the Q function with some parameter θ. We can use a neural network with weights θ to approximate the Q value for all possible actions in each state. As we are using a neural network to approximate the Q function, we call it a Q network.

Assume the environment is in a state s from the state space S. The agent takes an action a from the action space A by obeying the policy π(a | s). A may be discrete or continuous. When an action is performed, the agent transitions into a new state s′ and receives a scalar reward r. If, for every action, the reward and the next state can be observed, we can formulate the following iterative algorithm to learn the Q value.

If the episode ends at this step, Q(s, a) is updated as follows:

Q(s, a) ← Q(s, a) + α [r − Q(s, a)]    (1)

Otherwise, Q(s, a) is updated as follows:

Q(s, a) ← Q(s, a) + α [r + γ max_{a′ ∈ A} Q(s′, a′) − Q(s, a)]    (2)

In Eq. (2), γ (0 ≤ γ ≤ 1) is the discount factor applied to the rewards r at each time step and α (0 < α ≤ 1) is the learning rate; Q(s, a) estimates the maximum sum of discounted rewards achievable by a policy π(a | s) after making an observation s and taking an action a.

The optimal action-value function obeys the Bellman equation, following this intuition: if the optimal value Q*(s′, a′) of the sequence s′ at the next time step were known for all possible actions a′, then the optimal strategy is to select the action a′ maximizing the expected value of r + γ Q*(s′, a′) [10]. In Q-Learning, a policy called ε-greedy follows the greedy strategy with probability 1 − ε and selects a random action with probability ε. ε continues to decrease over the course of training.
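To make the update rule concrete, the following is a minimal tabular sketch of Eqs. (1) and (2) together with the ε-greedy policy. It is written in Python; the names (epsilon_greedy, q_update, ALPHA, GAMMA) and the constant values are illustrative assumptions rather than code from the paper. In the DQN setting, the table lookup Q[(state, action)] is replaced by a neural network with weights θ.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99   # learning rate and discount factor (illustrative values)
Q = defaultdict(float)     # Q[(state, action)] -> estimated action value


def epsilon_greedy(state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_update(state, action, reward, next_state, done, actions):
    """Apply Eq. (1) when the episode terminates, Eq. (2) otherwise."""
    if done:
        target = reward                                                      # Eq. (1)
    else:
        target = reward + GAMMA * max(Q[(next_state, a)] for a in actions)   # Eq. (2)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```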

The algorithm of DQN is presented in Fig. 1.

Fig. 1. The algorithm of DQN.
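Fig. 1 survives here only as an image, so the sketch below summarizes the procedure it depicts as described in [10]: interact with the environment using an ε-greedy policy, store transitions in an experience replay memory, and train the Q network on random minibatches sampled from that memory, with a periodically synchronized target network as in [10]. The PyTorch code, the classic Gym-style step/reset interface, and all names such as q_net and target_net are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F


def dqn_train(env, q_net, target_net, optimizer, n_actions,
              steps=1_000_000, gamma=0.99, batch_size=32,
              buffer_size=100_000, sync_every=1_000, epsilon=1.0):
    """Sketch of the DQN loop with experience replay (cf. Fig. 1 and [10]).

    Epsilon decay is omitted for brevity; env is assumed to follow the classic
    Gym interface returning (observation, reward, done, info).
    """
    replay = deque(maxlen=buffer_size)        # experience replay memory
    state = env.reset()
    for step in range(steps):
        # epsilon-greedy action selection on the current state
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax(dim=1).item())

        next_state, reward, done, _ = env.step(action)
        replay.append((state, action, reward, next_state, float(done)))
        state = env.reset() if done else next_state

        if len(replay) >= batch_size:
            # train on a random minibatch of stored transitions
            s, a, r, s2, d = zip(*random.sample(replay, batch_size))
            s = torch.as_tensor(np.array(s), dtype=torch.float32)
            s2 = torch.as_tensor(np.array(s2), dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64)
            r = torch.as_tensor(r, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)

            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                # target r + gamma * max_a' Q(s', a'); no bootstrap on terminal states
                target = r + gamma * (1.0 - d) * target_net(s2).max(dim=1).values
            loss = F.smooth_l1_loss(q_sa, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step % sync_every == 0:
            target_net.load_state_dict(q_net.state_dict())
```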

III. PROPOSED METHOD

The resolution of the Atari game screen is 210 x 160, as shown in Fig. 2. It would clearly take a lot of computation and memory to feed the raw pixels directly to the model. So, we downsample the frames to 84 x 84, convert the RGB values to grayscale, and feed this pre-processed game screen as the input to the model.

Fig. 2. The RGB image with 210 x 160 pixels.

Fig. 3. The grayscale image with 84 x 84 pixels.
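A possible implementation of this preprocessing step is sketched below using OpenCV and NumPy. The exact resizing method and the pixel scaling used by the authors are not specified, so those details are assumptions.

```python
import cv2
import numpy as np


def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Convert a 210 x 160 x 3 RGB Atari frame to an 84 x 84 grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                    # drop color channels
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)  # downsample to 84 x 84
    return small.astype(np.float32) / 255.0                           # scale to [0, 1] (assumed)
```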
Pong is a table-tennis-themed arcade video game with simple two-dimensional graphics, manufactured by Atari and originally released in 1972. In Pong, a player scores when the ball passes the other player, and an episode is over when one of the players reaches 21 points. In the OpenAI Gym version of Pong, the agent is displayed on the right and the opponent on the left. Consider carefully whether the dynamics of the game can be determined from a single fixed image. There is certainly ambiguity in such an observation: we cannot know in which direction the ball is going.

The solution is to maintain several observations from the past and use them as the state. In the case of Atari games, the authors of the paper suggested stacking 4 subsequent frames together and using them as the observation at every step. For this reason, the preprocessing stacks four frames together, resulting in a final state space size of 84 x 84 x 4. The final input to the neural network, shown in Fig. 5, therefore consists of 4 stacked grayscale images of 84 x 84 pixels each. After the images are processed, the neural network needs to be built.
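The frame stacking described above can be sketched with a fixed-length deque, as below. The class name and the channel-first layout (4 x 84 x 84) are illustrative choices, not taken from the paper.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keep the last k preprocessed frames and expose them as one state."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # at the start of an episode, repeat the first frame k times
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(frame)
        return self.state()

    def state(self) -> np.ndarray:
        # shape (4, 84, 84): four stacked 84 x 84 grayscale frames
        return np.stack(self.frames, axis=0)
```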

Fig. 4. Four subsequent frames of Pong, 84 x 84 pixels each.

The first hidden layer convolves 32 filters of 8x8 with stride 4 over the input image and applies a rectifier nonlinearity. The second hidden layer convolves 64 filters of 4x4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3x3 with stride 1, followed by a rectifier. The final hidden layer is fully connected and consists of rectifier units. The output layer is a fully connected linear layer with a single output for each valid action [10]. The number of valid actions in Pong is 6.
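The layer description above translates directly into the following PyTorch sketch. The number of units in the fully connected hidden layer is not stated in the text, so 512 is assumed, following [10].

```python
import torch
import torch.nn as nn


class DQNNetwork(nn.Module):
    """Convolutional Q-network mapping a 4 x 84 x 84 state to one Q value per action."""

    def __init__(self, n_actions: int = 6):   # Pong has 6 valid actions
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 32 filters of 8x8, stride 4
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 64 filters of 4x4, stride 2
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 64 filters of 3x3, stride 1
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),   # fully connected rectifier layer; 512 units assumed as in [10]
            nn.ReLU(),
            nn.Linear(512, n_actions),    # one linear output per valid action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))
```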

Fig. 5. Final input size (84 x 84 x 4).

IV. EXPERIMENTAL RESULT

In this paper, we demonstrate that the DQN architecture can successfully learn control policies in the Pong environment with only very minimal prior knowledge, receiving only the pixels and the game score as inputs. The reward structure of Pong is as follows: the reward is 0 for every frame, +1 for every ball missed by the other player, and -1 when the ball is missed by the agent. In the process of training the model, it is possible that there will be many cases of ineffective training.

Following previous approaches to playing Atari 2600 games, we also use a simple frame-skipping technique. More precisely, the agent sees and selects actions on every 4th frame instead of every frame, and its last action is repeated on the skipped frames.

The values of all the hyperparameters and optimization parameters were selected by performing an informal search on the game Pong. The values of the hyperparameters used in training are as follows: the discount factor γ = 0.99, the planned training length is 1,000,000 steps, the initial exploration (ε initial value) and the final exploration (ε final value) are 1.0 and 1e-2, respectively, and the random sampling batch size is 32.
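For reference, the hyperparameters listed above can be gathered in a small configuration object, together with a linear ε-decay from the initial to the final exploration value. The length of the decay (decay_steps) is not given in the paper and is only an assumed placeholder.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    gamma: float = 0.99           # discount factor
    total_steps: int = 1_000_000  # planned training steps
    epsilon_start: float = 1.0    # initial exploration
    epsilon_final: float = 1e-2   # final exploration
    batch_size: int = 32          # random sampling batch size
    frame_skip: int = 4           # repeat each chosen action on 4 frames


def epsilon_at(step: int, cfg: TrainConfig, decay_steps: int = 100_000) -> float:
    """Linearly anneal epsilon; decay_steps is an assumed value, not from the paper."""
    frac = min(step / decay_steps, 1.0)
    return cfg.epsilon_start + frac * (cfg.epsilon_final - cfg.epsilon_start)
```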
Note that, during the experiment, the average scores and the weights of the neural network were stored.

Then the agent model was put to the test on the Pong game, and its highest scores were recorded. Fig. 6 shows the average rewards over the 1,000,000 training steps.

Fig. 6. The graph plot of episode reward over 1M steps.
V. CONCLUSION

In this paper, an agent model has been created to verify the effectiveness of experience replay. From the training result (the agent's average score per step, see Fig. 6), it can be seen that the effect of training grows very slowly for roughly the first 300,000 steps; after that, the agent learns faster and more steadily.

We used a network architecture, hyperparameter values and learning procedure that take high-dimensional data (visual images) and the number of actions available in the game as input, with only very minimal prior knowledge. Our method was able to train large neural networks using reinforcement learning and stochastic gradient descent in a stable manner.

Finally, it can be concluded that the DQN algorithm, combined with a deep convolutional neural network and some image processing methods, can make the agent play the video game well, and that experience replay can improve the training quality.

REFERENCES

[1]. R. Tan, J. Zhou, H. Du, et al. An modeling processing method for video games based on deep reinforcement learning. In: 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), IEEE, 2019, pp. 939-942.
[2]. D. Silver, A. Huang, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
[3]. Wu Xiru, Huang Guoming and Sun Lining. Fast visual identification and location algorithm for industrial sorting robots based on deep learning. Jiqiren/Robot, November 1, 2016.
[4]. M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML, 3720, 317-328 (Springer, 2005); S. Legg and M. Hutter. Universal Intelligence: a definition of machine intelligence. Minds Mach. 17, 391-444 (2007).
[5]. A. Moore and C. Atkeson. Prioritized sweeping: reinforcement learning with less data and less real time. Mach. Learn. 13, 103-130 (1993).
[6]. J. O'Neill, B. Pleydell-Bouverie, D. Dupret and J. Csicsvari. Play it again: reactivation of waking experience and memory. Trends Neurosci., 33, 220-229 (2010).
[7]. S. Lange and M. Riedmiller. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural Netw., 1-8 (2010).
[8]. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning. MIT Press, 1st edition, 1998.
[9]. Eric Wiewiora. Potential-based shaping and Q-value initialization are equivalent. J. Artif. Intell. Res. (JAIR), 2003.
[10]. V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement learning. Nature, 2015.
