The Gated Recurrent Unit (GRU) is a simplified version of Long Short-Term Memory (LSTM). Let’s see how it works in this article.
This article explains how gated recurrent units (GRUs) work. GRUs are much easier to understand with prior knowledge of Long Short-Term Memory (LSTM), so I strongly recommend learning about LSTMs beforehand. You can check out my article on LSTMs.
Gated recurrent units, aka GRUs, are the toned-down or simplified version of Long Short-Term Memory (LSTM) units. Both are used to make a recurrent neural network retain useful information for longer, and both do that job well. Their relative performance varies from one use case to another: for one use case an LSTM might work better, and for another a GRU might. We have to try both and then use the one with higher performance for our actual model training.
However, there are some advantages to using GRUs in your network.
As we will see in this article, GRUs have only two gates, which makes them faster to train. This is useful when you have less memory and processing power.
They also tend to give good results on small datasets.
Now let’s understand how GRUs work.
Note the meaning of the variables used in the equations and diagrams below:
Wxz, Wxr, and Wxg are the weight matrices connecting the input vector x(t) to each of the three blocks (the two gates and the basic RNN block).
Whz, Whr, and Whg are the weight matrices connecting the previous state h(t-1) to each of the three blocks.
bz, br, and bg are the bias terms of the three blocks.
Basic info about GRUs
At a high level, a GRU can be considered an improved version of a simple RNN unit, with the same number of inputs and outputs. However, the internal structure of a GRU is slightly different.
Unlike LSTMs, GRUs have only one state vector. One might say that the LSTM’s long-term and short-term state vectors are combined into a single vector in the case of GRUs.
The GRU is made of three blocks. They are
Basic RNN block, g(t)
A gate that handles forgetting useless information and remembering new important information, z(t)
A gate that determines what portion of the previous state should be used as input to the basic RNN block, r(t)
The GRU takes two inputs. They are
Previous cell’s state, h(t-1)
Training data input, x(t)
The GRU cell outputs two terms. They are
Current cell’s state, h(t)
Prediction for the current cell, y(t)
Now, let’s understand how each of the outputs is calculated using the inputs.
Calculating the current cell’s state
Basically, the cell’s state is found by adding two quantities. The first quantity carries whatever is kept from the previous state (the rest is forgotten). The second quantity carries the new information that should be remembered from the training data input.
Let’s understand the first quantity.
As you can see in the above diagram, z(t) takes the previous state and the training data input as inputs. It acts somewhat like a forget gate here: the value of the z(t) gate determines what portion of the previous state is kept and what portion is forgotten in this time step. The previous state and the training input are multiplied by their corresponding weights, and the bias is added to their sum. Applying the sigmoid function to this sum gives us the value of z(t). The element-wise multiplication of the previous state and z(t) gives us the first quantity that we need to calculate the current time step’s state. The first quantity is shown in green in the above diagram.
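Written out with the variables defined earlier (using σ for the sigmoid function and ⊙ for element-wise multiplication), this step is:

z(t) = σ(Wxz · x(t) + Whz · h(t-1) + bz)
first quantity = z(t) ⊙ h(t-1)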
Let’s understand the second quantity.
The value of z(t) is calculated as explained above for the first quantity. But here (1 - z(t)) acts as an input gate. The basic RNN block g(t) takes two inputs. They are
Training data input, x(t)
Element-wise multiplication of r(t) and the previous state, h(t-1)
The calculation of r(t) is the same as that of z(t), except for the weights and biases used. The value of r(t) tells us what portion of the previous state should be given as input to the basic RNN block g(t).
Both inputs of g(t) are multiplied by their corresponding weights, and the bias term is added to their sum. This final sum is then passed through the hyperbolic tangent function, giving us the value of g(t).
Now, the element-wise multiplication of g(t) with (1 - z(t)), aka the input gate, gives us the second quantity required to calculate the current time step’s state. The second quantity is shown in green in the above diagram.
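Written out in the same notation, these two calculations and the resulting second quantity are:

r(t) = σ(Wxr · x(t) + Whr · h(t-1) + br)
g(t) = tanh(Wxg · x(t) + Whg · (r(t) ⊙ h(t-1)) + bg)
second quantity = (1 - z(t)) ⊙ g(t)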
Now that we have found both quantities, their addition gives us the current time step’s state.
In GRU cells, the prediction is equal to the current state. Thus, since we have already calculated the current state, we have also calculated the prediction for the current time step.
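Putting the two quantities together:

h(t) = z(t) ⊙ h(t-1) + (1 - z(t)) ⊙ g(t)
y(t) = h(t)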
Until now, we have looked at the different parts that make up the GRU one by one, to understand each of them better. So, let’s combine all the parts that we have seen so far.
This is the whole GRU network diagram.
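To tie all the parts together in code, here is a minimal NumPy sketch of a single GRU time step, following the calculations described above. The function name gru_step, the argument layout, and the weight shapes are choices I made purely for illustration; this is a sketch, not a reference implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev,
             Wxz, Whz, bz,
             Wxr, Whr, br,
             Wxg, Whg, bg):
    """One GRU time step following the equations above.

    x_t    : training data input x(t), shape (input_size,)
    h_prev : previous cell's state h(t-1), shape (hidden_size,)
    Wx*    : weights for the input connections, shape (hidden_size, input_size)
    Wh*    : weights for the previous-state connections, shape (hidden_size, hidden_size)
    b*     : bias vectors, shape (hidden_size,)
    """
    # Gate that decides what portion of the previous state is kept vs. forgotten
    z = sigmoid(Wxz @ x_t + Whz @ h_prev + bz)
    # Gate that decides what portion of the previous state feeds the basic RNN block
    r = sigmoid(Wxr @ x_t + Whr @ h_prev + br)
    # Basic RNN block, fed with x(t) and the gated previous state r(t) * h(t-1)
    g = np.tanh(Wxg @ x_t + Whg @ (r * h_prev) + bg)
    # First quantity + second quantity = current cell's state
    h_t = z * h_prev + (1.0 - z) * g
    # In a GRU, the prediction equals the current state
    y_t = h_t
    return h_t, y_t

# Tiny usage example with random weights (sizes chosen only for illustration)
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
W_in = lambda: rng.standard_normal((n_h, n_in))
W_h = lambda: rng.standard_normal((n_h, n_h))
b = lambda: np.zeros(n_h)
h, y = gru_step(rng.standard_normal(n_in), np.zeros(n_h),
                W_in(), W_h(), b(), W_in(), W_h(), b(), W_in(), W_h(), b())
```

To process a whole sequence, this step is simply applied repeatedly, feeding each output state back in as h(t-1) for the next time step.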
I hope you like the article. The diagrams in the article are drawn by me by hand. I hope they are intuitive enough (and not too messy) to understand the GRU clearly. If you have any thoughts on the article then please let me know.
Have a great day!