Week 4: Fun Architectures

I’ll be sharing some wonderful neural network architectures for working with sequences of text! Here they are:

  • Vanilla Recurrent Neural Net (RNN)
  • Gated Recurrent Unit (GRU)
  • Long Short-Term Memory (LSTM)

Note that GRUs and LSTMs are also RNNs. An RNN is a neural network architecture in which the connections between nodes form a time-dependent sequence. Moreover, an RNN maintains an internal memory while processing a sequence, which lets it recognize patterns in the data. GRUs and LSTMs are RNNs with more complicated architectures that encode additional inductive biases, allowing them to capture more sophisticated patterns in data than traditional feed-forward networks can.

Vanilla RNN


$h_t = \sigma(W_h h_{t-1} + W_x x_t)$

$\hat{y}_t = \text{softmax}(W_s h_t)$

Suppose we wish to give an RNN some text, in this case the lyrics to a touching song, so that it can predict the next sequence of words and keep giving us heartfelt lyrics. Using a vanilla RNN, we would start off with a sequence of words represented as a sequence of vectors. Let’s say we have the sentence: “I love you for so many reasons, which means I love you for all seasons” (The Fuzz - I Love You For All Seasons). We would represent this as the sequence $X = x_1, x_2, \ldots, x_n$, where $x_1$ is a word vector representing “I”, $x_2$ is a word vector representing “love”, and $x_t$ is the word vector at the $t$-th time step.

We would give the first unit of the RNN the word vector $x_1$ and some arbitrary initial hidden state $h_0$ (usually just the zero vector). It would produce $h_1$, a hidden representation of our current word sequence “I”. We could then use that hidden representation to predict the next word in our sequence, $\hat{y}_1$. We would pass $h_1$ to the second unit, which would use the next word in the sequence, $x_2$ (“love”), to produce $h_2$, the hidden representation of our current word sequence “I love”. We could then use $h_2$ to predict the next word in our sequence, $\hat{y}_2$. At any given time step $t$, our RNN takes the representation of our past sequence $h_{t-1}$ and the current word $x_t$, produces a hidden representation of our current sequence $h_t$, and can give us a prediction for the next word in our sequence, $\hat{y}_t$.
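
To make this concrete, here is a minimal NumPy sketch of a single vanilla RNN step following the two equations above. The function name `rnn_step`, the toy dimensions, and the random initialization are purely illustrative, and I'm using tanh as the activation $\sigma$.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary dimension.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, h_prev, W_h, W_x, W_s):
    """One vanilla RNN step: new hidden state and next-word distribution."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t)   # h_t = sigma(W_h h_{t-1} + W_x x_t)
    y_hat = softmax(W_s @ h_t)                # y_hat_t = softmax(W_s h_t)
    return h_t, y_hat

# Toy dimensions: 10-word vocabulary, 8-dimensional hidden state (illustrative only).
vocab_size, hidden_size = 10, 8
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_s = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

x_1 = np.zeros(vocab_size); x_1[0] = 1.0   # one-hot vector standing in for "I"
h_0 = np.zeros(hidden_size)                # arbitrary initial hidden state
h_1, y_hat_1 = rnn_step(x_1, h_0, W_h, W_x, W_s)
print(y_hat_1.sum())  # distribution over the next word, sums to 1
```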

Thus, by feeding our RNN some lyrics to a song (a sequence of words), it can repeatedly predict another word and take that prediction as an input to itself to produce the next one. With this continuous generation, we could end up with new sentences, paragraphs, essays, or maybe even a book if we let the RNN run long enough!
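
A sketch of that generation loop, reusing the illustrative `rnn_step` and toy parameters from the previous snippet, might look like this (greedy decoding, i.e. always picking the most likely word, is just one simple choice):

```python
def generate(seed_index, steps, h_0, W_h, W_x, W_s, vocab_size):
    """Feed each predicted word back in as the next input (greedy decoding)."""
    h = h_0
    x = np.zeros(vocab_size); x[seed_index] = 1.0
    output = [seed_index]
    for _ in range(steps):
        h, y_hat = rnn_step(x, h, W_h, W_x, W_s)
        next_index = int(np.argmax(y_hat))       # greedy: pick the most likely word
        output.append(next_index)
        x = np.zeros(vocab_size); x[next_index] = 1.0   # feed the prediction back in
    return output  # indices into the vocabulary, to be mapped back to words

print(generate(0, steps=5, h_0=h_0, W_h=W_h, W_x=W_x, W_s=W_s, vocab_size=vocab_size))
```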

Since we process data in a sequential manner, representing our previous sequence as a hidden state vector $h_{t-1}$, the inductive bias encoded in this neural network architecture is that we can use what we have seen in the past to determine what we might see in the future. This is further encoded in our matrices $W_h$ and $W_x$. Since we multiply $W_h$ with our past hidden state $h_{t-1}$, $W_h$ encodes which features of the past are important for determining the future. Since we multiply $W_x$ with our current word $x_t$, $W_x$ encodes which features of the present are important for determining the future.

This model has the right inductive biases for dealing with sequences and is thus successful, but it has very few parameters and therefore less flexibility than a model with more. Moreover, vanilla RNNs suffer from something called the vanishing/exploding gradient problem.

As time progresses, the gradient of our loss function tends either to zero, in which case our model does not update during gradient descent and no learning occurs, or it explodes to numbers so large that we cannot compute them, which also prevents us from updating our model. Let me give you some insight into why this occurs. If we have a loss function $E$, then we need to compute the following gradient to update our model at time $T$:

$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$

This total gradient at time step $T$ requires us to sum the gradient of the loss at each time step $t$. Now let’s expand the gradient at each time step $t$:

$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$

Let’s expand another crucial term:

$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$

We have an upper bound for $\frac{\partial h_j}{\partial h_{j-1}}$ given by the bounds of two matrix norms:

$\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \le \left\| W^T \right\| \left\| \text{diag}\left[ f'(h_{j-1}) \right] \right\| \le \beta_W \beta_h$

Thus we can say

$\left\| \frac{\partial h_t}{\partial h_k} \right\| = \left\| \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right\| \le (\beta_W \beta_h)^{t-k}$

If $\beta_W \beta_h$ is less than one, then $(\beta_W \beta_h)^{t-k}$, and thus $\frac{\partial h_t}{\partial h_k}$, tends towards zero; and if $\beta_W \beta_h$ is greater than one, then $(\beta_W \beta_h)^{t-k}$, and thus $\frac{\partial h_t}{\partial h_k}$, tends towards an incredibly large number that we can’t compute. Thus our total gradient:

$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W}$

will tend towards either zero or an uncomputable number.
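
Here is a quick numerical illustration of that bound, using two representative values of $\beta_W \beta_h$ (these numbers are purely illustrative and not taken from any trained network):

```python
import numpy as np

# (beta_W * beta_h)^(t-k) for a "small" and a "large" product of norms
# (0.9 and 1.1 are illustrative values, not measurements from a real model).
for beta in (0.9, 1.1):
    gaps = np.arange(0, 101, 20)      # t - k: how far back in time we look
    factors = beta ** gaps
    print(f"beta_W * beta_h = {beta}: {factors}")

# beta = 0.9 shrinks towards zero as the gap grows (vanishing gradient),
# while beta = 1.1 grows by orders of magnitude (exploding gradient).
```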

Thankfully, fantastic folks have devised other great models that mitigate the vanishing gradient problem while incorporating more inductive biases and flexibility!

Gated Recurrent Unit (GRU)


$r_t = \sigma(W_r x_t + U_r h_{t-1})$

$\tilde{h}_t = \tanh(r_t \circ U h_{t-1} + W x_t)$

$z_t = \sigma(W_z x_t + U_z h_{t-1})$

$h_t = (1 - z_t) \circ \tilde{h}_t + z_t \circ h_{t-1}$

With GRUs, we process sequences of text in the same sequential manner, but the model has some extra parameters that help alleviate our vanishing gradient problem, all while encoding inductive biases we shall discuss!

Reset Gate:

The reset gate uses the current word in our sequence, together with the hidden state representing our past sequence, to determine how much to consider the past sequence when forming a new memory of our sequence.

New Memory:

New memory is formed by taking our current word in the sequence and determining its importance in the new memory with the $W$ matrix. It then uses the reset gate and the $U$ matrix to determine how important the past sequence is in forming a new memory. By explicitly including a reset gate and the $U$ matrix, we have a more expressive model that is better able to utilize the important components of the past sequence while discarding irrelevant components.

Update Gate:

The update gate is utilized to determine how much our newly formed memory and our past sequence should contribute to representing our new sequence.

Hidden State:

Our new hidden state for our current sequence is computed by taking a weighted sum of the newly formed memory and our past sequence’s hidden state. These weights are determined by the update gate. Since the update gate is computed with a sigmoid function, its values are between zero and one. Thus $z_t$ and $1 - z_t$ are between zero and one and add up to one. Given these weights, when features of the previous hidden state are given importance in determining the new hidden state, the corresponding features of the newly formed memory are given less importance.
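
Putting the four GRU equations together, here is a minimal NumPy sketch of a single GRU step. The function name `gru_step`, the parameter layout, and the toy random initialization are my own illustrative choices; the math follows the formulas above, with `*` doing the element-wise products.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One GRU step following the equations above."""
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)            # reset gate
    h_tilde = np.tanh(r_t * (U @ h_prev) + W @ x_t)    # new memory
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)            # update gate
    h_t = (1 - z_t) * h_tilde + z_t * h_prev           # weighted sum: new vs. old
    return h_t

# Toy usage with random parameters (shapes are purely illustrative).
rng = np.random.default_rng(0)
d_x, d_h = 10, 8
mats = {name: rng.normal(scale=0.1, size=shape)
        for name, shape in [("W_r", (d_h, d_x)), ("U_r", (d_h, d_h)),
                            ("W_z", (d_h, d_x)), ("U_z", (d_h, d_h)),
                            ("W",   (d_h, d_x)), ("U",   (d_h, d_h))]}
h_t = gru_step(np.ones(d_x), np.zeros(d_h), **mats)
print(h_t.shape)  # (8,)
```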

Long Short-Term Memory (LSTM)


$i_t = \sigma(W_i x_t + U_i h_{t-1})$

$f_t = \sigma(W_f x_t + U_f h_{t-1})$

$o_t = \sigma(W_o x_t + U_o h_{t-1})$

$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})$

$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$

$h_t = o_t \circ \tanh(c_t)$

Input Gate:

Uses the input word and past hidden state to determine how the new word in our sequence should influence our final new memory formation.

Forget Gate:

Determines how much the past memory should influence our new final memory formation. It can allow us to completely forget what we learned before, or to continue to propagate our past memory!

New Memory:

We form a new memory utilizing our past hidden state representing our past sequence, and the new word we encountered in our sequence.

Final Memory:

For our final memory formation, we use the input gate to determine what’s important from our new memory, and the forget gate to determine what’s important from our past memory to form a final memory!

Output Gate:

Our finalized memory is rich in information, so the output gate determines what aspects of this new memory are important for representing a new hidden state for our current sequence.

Hidden State:

The new hidden state is the new representation of our current sequence!
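
Putting it all together, here is a minimal NumPy sketch of a single LSTM step, with the gates and memories named as in the equations above. Bundling the weight matrices into a `params` dictionary and the random toy initialization are just illustrative choices of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params holds the W_* and U_* matrices from the equations."""
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev)      # input gate
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev)      # forget gate
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev)      # output gate
    c_tilde = np.tanh(params["W_c"] @ x_t + params["U_c"] @ h_prev)  # new memory
    c_t = f_t * c_prev + i_t * c_tilde    # final memory: keep some old, add some new
    h_t = o_t * np.tanh(c_t)              # new hidden state, filtered by output gate
    return h_t, c_t

# Toy usage with random parameters (shapes are purely illustrative).
rng = np.random.default_rng(0)
d_x, d_h = 10, 8
params = {f"W_{g}": rng.normal(scale=0.1, size=(d_h, d_x)) for g in "ifoc"}
params.update({f"U_{g}": rng.normal(scale=0.1, size=(d_h, d_h)) for g in "ifoc"})
h_1, c_1 = lstm_step(np.ones(d_x), np.zeros(d_h), np.zeros(d_h), params)
print(h_1.shape, c_1.shape)  # (8,) (8,)
```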
