Jekyll2019-05-21T04:53:15+00:00https://edy-barraza.github.io/feed.xmlHomeEdgar BarrazaEdgar Barrazaeab326@cornell.eduFinal Project: Transformer Knowledge Distillation2019-05-07T00:00:00+00:002019-05-07T00:00:00+00:00https://edy-barraza.github.io/week12<p>After learning about the Transformer and its incredible capacity to work with language, I was excited to integrate it
into a tool for people to use. I had thought about using it for question answering, text summarization, or natural language generation,
so I decided to play with BERT on my laptop. My computer was huffing and puffing, and running out of RAM. Given all the trouble
a laptop was having, I couldn’t imagine what would happen if I tried to run BERT on a phone. This inspired me to consider training
smaller transformer language models with similar levels of performance to the big models.</p>
<ul class="table-of-content" id="markdown-toc">
<li><a href="#introduction-transformer-success--limitations" id="markdown-toc-introduction-transformer-success--limitations">Introduction: Transformer Success & Limitations</a></li>
<li><a href="#background--knowledge-distillation" id="markdown-toc-background--knowledge-distillation">Background: Knowledge Distillation</a></li>
<li><a href="#interpretation-more-training" id="markdown-toc-interpretation-more-training">Interpretation: More Training?</a></li>
<li><a href="#approach-truncated-softmax" id="markdown-toc-approach-truncated-softmax">Approach: Truncated Softmax</a></li>
</ul>
<h2 id="introduction-transformer-success--limitations">Introduction: Transformer Success & Limitations</h2>
<p><img src="/assets/images/transformer_gang_white.png" alt="Recent Transformers" /></p>
<p>Recent transformer models have reached unparalleled success in Natural Language Processing. Since GPT, BERT, and GPT-2 were released,
they have kept pushing the state of the art in question answering, translation, reading comprehension, text summarization, and more!
As they become more powerful, they also become larger and larger, causing them to take up more RAM and have longer wait times for inference.
For a transformer to back a tool for someone seeking resources online in the United States, it is imperative that the tool be available on mobile, as most
people have access to the internet on their smartphones but not everyone has a laptop. With the newest transformer for
Natural Language Processing, GPT-2, reaching 1.5 billion parameters, it is necessary to reduce the size of these models to ensure
mobile support. I will describe the transformer model to present ways we can reduce its size!</p>
<p><img src="/assets/images/transformer_block.png" alt="Transformer Block" /></p>
<p>We shall look at the transformer at the highest level and then zoom into its components. The transformer is built by stacking
multiple transformer blocks! Words are first represented as vectors (rank 1 tensors), and thus a sequence of words is represented
as a list of vectors (a rank 2 tensor). Data flowing through a transformer block first encounters a multi-headed attention unit, whose output is residually connected
to the block’s input before a Layer Normalization Unit. The output of this layer normalization feeds into a Feed-Forward Network, and is residually connected to
the output of the Feed-Forward Network before another Layer Normalization Unit.</p>
<p>Layer normalization takes inputs, and each input is normalized along the feature dimension. The Feed-Forward Network takes an input and combines
each of its features for a new representation. The residual connections help propagate features that could otherwise be altered or lost during these transformations.
The expressive power of the Transformer comes from its use of attention, so we will pay special attention to the multi-headed attention unit.</p>
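<p>As a rough sketch of the layer normalization step described above (a minimal NumPy version; real implementations also learn a per-feature scale and shift, which I omit here):</p>

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each word vector (row) along its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# A "sequence" of 3 words, each a 4-dimensional vector (a rank 2 tensor).
seq = np.array([[1.0, 2.0, 3.0, 4.0],
                [0.0, 0.0, 1.0, 1.0],
                [5.0, 5.0, 5.0, 5.0]])
normed = layer_norm(seq)
```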
<p><img src="/assets/images/multiheaded_attention.png" alt="Multi-Headed Attention Unit" /></p>
<p>The multi-headed attention unit first takes the vector representation of words, and then projects these vectors to many lower-dimensional sub-spaces.
Each of these lower-dimensional subspaces can be interpreted as a vector space that emphasizes certain aspects of human language. So if our multi-headed
attention unit has 8 heads, it’s possible that one of these heads projects our word vectors into a space that focuses on the sad aspect of words.</p>
<p>For each of these subspaces, the Transformer computes attention on the projected vectors, and concatenates the output of these attention
computations so we are in the same dimensional vector space as we started off in. The attention computation itself gives us the inductive
bias that makes transformers so much more successful than Recurrent Neural Networks. This inductive bias is that each word in a sequence pays
a certain amount of attention to the other words in the sequence, paying more attention to the other words that are relevant while paying
little to no attention to the irrelevant words. This is structured into the scaled dot-product attention. Scaled dot-product attention is computed as such:</p>
<p>The sequences of word vectors (rank 2 tensors) that were projected to lower-dimensional subspaces are dotted together, the dot products are scaled by the
square root of the vector dimension, and then the softmax across the sequence is computed. This assigns a weight to each word in the sequence: the amount
of attention to pay to that word. The softmax weights are then dotted with the projected word vectors.</p>
<p><img src="/assets/images/attention_concat.png" alt="Attention Concat" />
<img src="/assets/images/attention_compute.png" alt="Attention Compute" /></p>
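<p>A minimal NumPy sketch of the scaled dot-product attention computation described above (shapes and variable names are my own; real implementations batch this over many heads at once):</p>

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) projections of the word vectors for one head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dot products, scaled by sqrt(dimension)
    weights = softmax(scores, axis=-1)   # how much attention each word pays to the others
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))  # a 5-word sequence projected to an 8-dim subspace
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```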
<p>Thus, in order to have a smaller transformer model, we can reduce the number of transformer blocks, reduce the
original dimension of the word vectors, reduce the projected word vector dimension, or reduce the number of heads
in the multi-headed attention unit.</p>
<h2 id="background--knowledge-distillation">Background: Knowledge Distillation</h2>
<p><img src="/assets/images/knowledge_distillation.png" alt="Knowledge Distillation" /></p>
<p>You can think of knowledge distillation as having a larger, well trained teacher neural network that teaches a smaller neural network
how to perform like itself. You feed both neural networks the same data, and in the loss function you reward the student for producing
an output similar to the teacher, and update the parameters of the student network via backpropagation when its output is not similar to the teacher’s.
The idea is that if we can get the smaller student network to produce similar outputs as the larger teacher network, it is essentially
performing just as well.</p>
<p>To be more specific, in this case when training the transformer as a language model, we start off with a sequence of words and we
mask some of those words and tell the transformer to try to predict the right word.</p>
<p><img src="/assets/images/word_masks.png" alt="Word Masking" /></p>
<p>After predicting a word, normally the loss function is the cross-entropy loss. You would dot a one-hot encoded vector <script type="math/tex">\mathbf{1}\{ y_j=k\}</script> with the log output of the
neural network <script type="math/tex">\log p(y_j=k | \mathbf{x};\theta)</script>, which is a log probability vector assigning each word in the vocabulary a probability.</p>
<p>However, for knowledge distillation you dot the output of the teacher network <script type="math/tex">q(y_j=k | \mathbf{x};\theta)</script> with the log output of the
student neural network <script type="math/tex">\log p(y_j=k | \mathbf{x};\theta)</script>. This means that you want the student not just to predict the right masked word,
but also to consider the other words the teacher was considering. We would hope that the teacher network has developed a
rich understanding of human language, and would thus assign high probability not just to the correct word but also to reasonable synonyms, and thus
we want to distill this knowledge to the student network.
<img src="/assets/images/lm_distill_loss.png" alt="Loss Functions" /></p>
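<p>As a toy sketch of the two loss functions (a hypothetical 5-word vocabulary; the probability values are made up for illustration):</p>

```python
import numpy as np

def cross_entropy(one_hot, log_p):
    # Standard loss: dot a one-hot target with the student's log probabilities.
    return -np.sum(one_hot * log_p)

def distillation_loss(teacher_p, log_p):
    # Distillation loss: dot the teacher's soft distribution with the
    # student's log probabilities instead of a one-hot vector.
    return -np.sum(teacher_p * log_p)

student_log_p = np.log(np.array([0.6, 0.2, 0.1, 0.05, 0.05]))
one_hot = np.array([1.0, 0.0, 0.0, 0.0, 0.0])        # only the correct word
teacher_p = np.array([0.7, 0.2, 0.05, 0.03, 0.02])    # also weights synonyms

hard_loss = cross_entropy(one_hot, student_log_p)
soft_loss = distillation_loss(teacher_p, student_log_p)
```

Note that when the teacher's distribution collapses to a one-hot vector, the distillation loss reduces to the ordinary cross-entropy loss.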
<h2 id="interpretation-more-training">Interpretation: More Training?</h2>
<p>Knowledge distillation also has an intuitive interpretation. When considering a particular model, we can say it has a certain capacity to represent functions in solution space.
Bigger models with more parameters are more flexible and have a higher capacity to learn more, and can thus represent more functions in solution space. Thus, transformers with more blocks,
higher vector dimensions or higher projection dimensions can represent more functions in solution space and can thus converge to better solutions in this
expanded solution space.</p>
<p><img src="/assets/images/og_solution_space.png" alt="Original Convergence" /></p>
<p>When considering a smaller network, we know that it can represent fewer functions in solution space, and if we train it the same way as
a larger network we know it converges to a different solution than the larger network since its performance is different. However, we know there is
an overlap between the solution space of the larger and smaller networks. If we hope to have successful knowledge distillation,
we imagine that there is a region in solution space that the larger neural network tends to converge to that has some overlap with the
solution space of the smaller network. Thus, by training the student network with the outputs of the teacher network, we hope to drive the
smaller student network to converge to a solution in the region where the teacher could converge.</p>
<p><img src="/assets/images/converged_solution_space.png" alt="Converged Solution Space" /></p>
<h2 id="approach-truncated-softmax">Approach: Truncated Softmax</h2>
<p>When performing knowledge distillation for a language model, we have to deal with memory constraints as GPU’s, TPU’s, or even CPU’s only
have so much RAM. If we save the outputs of the teacher network as data and load them into RAM, for each word we are loading a vector of
the same dimension as the vocabulary. In this case, for each word we had a 30,522-dimensional vector.</p>
<p><img src="/assets/images/sparse_vectors.png" alt="Sparse Vectors" /></p>
<p>This large vector made it impossible to have adequate batch sizes and sequence lengths, parameters crucial to training a neural network.
Moreover, each of these probability vectors was actually really low in information. In a 30,522 word vocabulary, when predicting a masked word
most of those words are going to have essentially zero probability. There are only so many reasonable words or synonyms that a well
trained teacher network would assign any considerable probability to. Thus it was best to save the top-k probable words and their probabilities
from the teacher network instead of the whole probability vector.</p>
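<p>The truncation step can be sketched as follows (<code>truncate_softmax</code> is a hypothetical helper name, and the Dirichlet draw is just a stand-in for a real teacher distribution over the 30,522-word vocabulary):</p>

```python
import numpy as np

def truncate_softmax(teacher_p, k=8):
    # Keep only the k most probable words and renormalize, so each masked
    # position stores k (index, probability) pairs instead of a full
    # vocabulary-sized (e.g. 30,522-dimensional) vector.
    idx = np.argsort(teacher_p)[-k:][::-1]   # indices of the top-k words
    probs = teacher_p[idx]
    return idx, probs / probs.sum()          # renormalized probabilities

rng = np.random.default_rng(0)
teacher_p = rng.dirichlet(np.ones(30522))    # stand-in for a teacher softmax output
idx, probs = truncate_softmax(teacher_p, k=8)
```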
<p><img src="/assets/images/top_k_vector.png" alt="Top-K Vector" /></p>Edgar Barrazaeab326@cornell.eduAfter learning about the Transformer and its incredible capacity to work with language, I was excited to integrate it into a tool for people to use. I had thought about using it for question answering, text summarization, or natural language generation, so I decided to play with BERT on my laptop. My computer was huffing and puffing, and running out of RAM. Given all the trouble a laptop was having, I couldn’t imagine what would happen if I tried to run BERT on a phone. This inspired me to consider training smaller transformer language models with similar levels of performance to the big models.Week 5: Fun Implementations2019-03-08T00:00:00+00:002019-03-08T00:00:00+00:00https://edy-barraza.github.io/week5<h2 id="week-5">Week 5</h2>
<p>After reading about wonderful neural net architectures for working with sequences of texts, I decided to try some implementations,
complementing my theoretical understanding with a practical understanding. I decided to give TensorFlow a try
since I had already had experience with the lovely PyTorch. Let me start you off with my RNN:</p>
<h2> RNN </h2>
<p>You can find my colab notebook for this implementation <a href="https://colab.research.google.com/drive/1cYqJqlm54Hu3KiHB94WAnfwOrwyVKVma">here!</a>.
Out of curiosity, I wanted to see the TensorBoard visualization, so let me share a 2-unit-wide visualization with you!</p>
<p><img src="/assets/images/rnn_graph.png" alt="Vanilla RNN Graph" /></p>
<h2> LSTM </h2>
<p><a href="https://colab.research.google.com/drive/1mL3arSzK8yU74hy81lKivHeDvya6z0fh">Here</a> is my google colab notebook for this implementaiton.
I also visualized a 2-unit wide version of my implementation out of curiosity. Take a look!</p>
<p><img src="/assets/images/LSTM_class_maingraph.png" alt="LSTM Graph1" /></p>
<p><img src="/assets/images/LSTM_class_auxilary_graph.png" alt="LSTM Graph2" /></p>
<h2> LSTM TensorFlow Implementation </h2>
<p>For a sanity check, I utilized TensorFlow’s implementation on the same task as my own implementation (Penn Treebank).
Thankfully, I got identical losses :) You can see the colab notebook <a href="https://colab.research.google.com/drive/1dZFOcHB2TqBcy2mvdkPERffo-V290MMa">here</a>.
Here’s the visualization:</p>
<p><img src="/assets/images/LSTM_premade_graph.png" alt="LSTM Premade Graph" /></p>Edgar Barrazaeab326@cornell.eduWeek 5 After reading about wonderful neural net architectures for working with sequences of texts, I decided to try some implementations, complementing my theoretical understanding with a practical understanding. I decided to give TensorFlow a try since I had already had experience with the lovely PyTorch. Let me start you off with my RNN:Week 4: Fun Architectures2019-02-28T00:00:00+00:002019-02-28T00:00:00+00:00https://edy-barraza.github.io/week4<h2 id="week-4">Week 4</h2>
<p>I’ll be sharing some wonderful neural network architectures for working with sequences of text! Here they are:</p>
<ul>
<li> Vanilla Recurrent Neural Net (RNN) </li>
<li> Gated Recurrent Unit (GRU) </li>
<li> Long Short-Term Memory (LSTM)</li>
</ul>
<p>Note that GRU’s and LSTM’s are also RNN’s. An RNN is a neural network architecture where the connections between
nodes form a time-dependent sequence. Moreover, an RNN utilizes an internal memory for processing sequences in order
to recognize patterns in the data. GRU’s and LSTM’s are RNN’s with more complicated architectures encoding inductive
biases, allowing them to capture more sophisticated patterns in data, unlike traditional feed-forward networks.</p>
<h2> Vanilla RNN </h2>
<p><img src="/assets/images/vanilla_rnn.png" alt="Vanilla RNN" /></p>
<p>\begin{equation} \tag{Hidden State}
h_t = \sigma(W_h h_{t-1} + W_x x_t)
\end{equation}</p>
<p>\begin{equation} \tag{Prediction}
\hat{y} = softmax(W_s h_t)
\end{equation}</p>
<p>Suppose we wish to give an RNN some text, in this case the lyrics to a touching song, so that it can predict the next sequence
of words, in this case so the RNN can keep giving us heartfelt lyrics. Using a vanilla RNN we would start off with a sequence of words
represented as a sequence of vectors. Let’s say we have the sentence: “I love you for so many reasons, which means I love you
for all seasons” (The Fuzz - I Love You For All Seasons). We would represent this as the sequence <script type="math/tex">X = {x_1,x_2,...x_n}</script>,
where <script type="math/tex">x_1</script> is a word vector representing “I”, <script type="math/tex">x_2</script> is a word vector representing “love”, and <script type="math/tex">x_t</script> is the word vector at the <script type="math/tex">t</script>’th time step.</p>
<p>We would give the first unit of the RNN the word vector <script type="math/tex">x_1</script>, and some arbitrary initial past hidden state <script type="math/tex">h_0</script> (usually just the zero vector).
It would create <script type="math/tex">h_1</script>, a hidden representation of our current word sequence “I”. We could then use that hidden representation
to predict the next word in our sequence, <script type="math/tex">\hat{y_1}</script>. We would pass <script type="math/tex">h_1</script> to the second unit, and it would
use the next word in the sequence <script type="math/tex">x_2</script> (“love”) to produce <script type="math/tex">h_2</script>, which is the hidden representation of our current word
sequence “I love”. We could then use <script type="math/tex">h_2</script> to predict the next word in our sequence <script type="math/tex">\hat{y_2}</script>. At any given time step
<script type="math/tex">t</script>, our RNN will take the representation of our past sequence <script type="math/tex">h_{t-1}</script>, the current word <script type="math/tex">x_t</script>, and produce a hidden representation of our current
sequence <script type="math/tex">h_t</script>, and can give us a prediction for the next word in our sequence <script type="math/tex">\hat{y_t}</script>.</p>
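<p>The per-time-step computation above can be sketched directly in NumPy (randomly initialized, untrained weights; the dimensions here are arbitrary choices of mine):</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(W_h, W_x, W_s, h_prev, x_t):
    # h_t = sigmoid(W_h h_{t-1} + W_x x_t);  y_hat = softmax(W_s h_t)
    h_t = 1.0 / (1.0 + np.exp(-(W_h @ h_prev + W_x @ x_t)))
    y_hat = softmax(W_s @ h_t)   # predicted distribution over the next word
    return h_t, y_hat

d_hidden, d_word, vocab = 16, 8, 100
rng = np.random.default_rng(0)
W_h = rng.normal(size=(d_hidden, d_hidden))
W_x = rng.normal(size=(d_hidden, d_word))
W_s = rng.normal(size=(vocab, d_hidden))

h = np.zeros(d_hidden)                      # arbitrary initial hidden state h_0
for x_t in rng.normal(size=(4, d_word)):    # a 4-word "sentence" of word vectors
    h, y_hat = rnn_step(W_h, W_x, W_s, h, x_t)
```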
<p>Thus, by feeding our RNN some lyrics to a song, a sequence of words, it could continuously produce another predicted word, and take that
as an input to itself to produce another word. With this continuous generation, we could end up with new sentences or paragraphs,
or essays or maybe even a book if we let the RNN run long enough!</p>
<p>Since we process data in a sequential manner, representing our previous sequence as a hidden state vector <script type="math/tex">h_{t-1}</script>,
the inductive bias encoded in this neural network architecture is that we can use what we have seen in the past to determine
what we might see in the future. This is further encoded in our matrices <script type="math/tex">W_h</script> and <script type="math/tex">W_x</script>. Since we multiply <script type="math/tex">W_h</script> with our past
hidden state <script type="math/tex">h_{t-1}</script>, <script type="math/tex">W_h</script> encodes what features of the past are important to determine the future. Since we multiply <script type="math/tex">W_x</script> with our
current word <script type="math/tex">x_t</script>, <script type="math/tex">W_x</script> encodes what features of the present are important to determine the future.</p>
<p>This model has the right inductive biases for dealing with sequences and is thus successful, but it has very few parameters and thus has
less flexibility than a model with more parameters. Moreover, Vanilla RNN’s have something called the vanishing/exploding gradient problem.</p>
<p>As time progresses the gradient of our loss functions tends to either zero, which means our model does not update during gradient descent and thus
no learning occurs, or the gradient of our loss function explodes to such large numbers that we cannot compute it, which also prevents
us from updating our model. Let me give you some insight into why this occurs. If we have a loss function <script type="math/tex">E</script>, then we need to compute the
following gradient to update our model for time <script type="math/tex">T</script>:</p>
<p>\begin{equation}
\frac{\partial E}{\partial W} = \sum_{t=1 }^{T }\frac{\partial E_t}{\partial W}
\end{equation}</p>
<p>This total gradient at time step <script type="math/tex">T</script> requires us to sum the gradient of the loss at each previous time step <script type="math/tex">t</script>. Now let’s expand the gradient at each time step <script type="math/tex">t</script>:
\begin{equation}
\frac{\partial E_t}{\partial W} = \sum_{k=1 }^{t }\frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t} {\partial h_k} \frac{\partial h_k} {\partial W}
\end{equation}</p>
<p>Let’s expand another crucial term:
\begin{equation}
\frac{\partial h_t} {\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j} {\partial h_{j-1}}
\end{equation}</p>
<p>We have an upper-bound for <script type="math/tex">\frac{\partial h_j} {\partial h_{j-1}}</script> given by the bounds of two matrix norms:
\begin{equation}
|| \frac{\partial h_j} {\partial h_{j-1}} || \leq ||W^T|| ||diag[f\prime(h_{j-1}) ] || \leq \beta_W \beta_h
\end{equation}</p>
<p>Thus we can say
\begin{equation}
|| \frac{\partial h_t} {\partial h_k} || = || \prod_{j=k+1}^{t} \frac{\partial h_j} {\partial h_{j-1}} || \leq (\beta_W \beta_h)^{t-k}
\end{equation}</p>
<p>If <script type="math/tex">\beta_W \beta_h</script> is less than one, <script type="math/tex">(\beta_W \beta_h)^{t-k}</script> and thus <script type="math/tex">\frac{\partial h_t} {\partial h_k}</script> will tend towards zero,
and if <script type="math/tex">(\beta_W \beta_h)</script> is greater than one, then <script type="math/tex">(\beta_W \beta_h)^{t-k}</script> and thus <script type="math/tex">\frac{\partial h_t} {\partial h_k}</script>
will tend towards an incredibly large number that we can’t compute, and thus our total gradient:
\begin{equation}
\frac{\partial E}{\partial W} = \sum_{t=1 }^{T }\sum_{k=1 }^{t } \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} (\prod_{j=k+1}^{t} \frac{\partial h_j} {\partial h_{j-1}}) \frac{\partial h_k} {\partial W}
\end{equation}</p>
<p>will tend towards either zero or an uncomputable number.</p>
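<p>A quick numeric illustration of this bound, with made-up values of <script type="math/tex">\beta_W \beta_h</script> just above and below one:</p>

```python
# The gradient norm is bounded by (beta_W * beta_h)^(t-k), which either
# collapses towards zero or explodes as the gap t-k between time steps grows.
for beta in (0.9, 1.1):                      # beta stands for beta_W * beta_h
    bounds = [beta ** gap for gap in (10, 100, 500)]
    print(f"beta={beta}:", bounds)
```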
<p>Fortunately, fantastic folks have devised other great models that mitigate the vanishing gradient problem, while incorporating more
inductive biases and flexibility into those models!</p>
<h2> Gated Recurrent Unit (GRU) </h2>
<p><img src="/assets/images/gru.png" alt="GRU" /></p>
<p>\begin{equation} \tag{Reset Gate}
r_t = \sigma(W_r x_t + U_r h_{t-1})
\end{equation}</p>
<p>\begin{equation} \tag{New Memory }
\tilde{h_t} = \tanh(r_t \odot U h_{t-1} + W x_t)
\end{equation}</p>
<p>\begin{equation} \tag{Update Gate}
z_t = \sigma(W_z x_t + U_z h_{t-1})
\end{equation}</p>
<p>\begin{equation} \tag{Hidden State}
h_t = (1-z_t)\odot \tilde{h_t} + z_t \odot h_{t-1}
\end{equation}</p>
<p>With GRU’s, we process sequences of text in the same sequential manner, but the model has some extra parameters
that help alleviate our vanishing gradient problem, all while encoding inductive biases we shall discuss!</p>
<h4>Reset Gate:</h4>
<p>The reset gate utilizes the current word in our sequence, along with the hidden state representing our past sequence, to determine
how much to consider the past sequence in forming a new memory of our sequence.</p>
<h4>New Memory:</h4>
<p>New memory is formed by taking our current word in the sequence and determining its importance in the new memory with the <script type="math/tex">W</script> matrix.
It then utilizes the reset gate and the <script type="math/tex">U</script> matrix to determine how important the past sequence is in forming a new memory.
By explicitly including a reset gate and the <script type="math/tex">U</script> matrix, we have a more expressive model that is better able to
utilize the important components of the past sequence while discarding irrelevant components.</p>
<h4>Update Gate:</h4>
<p>The update gate is utilized to determine how much our newly formed memory and our past sequence should contribute to
representing our new sequence.</p>
<h4>Hidden State:</h4>
<p>Our new hidden state for our current sequence is computed by taking a weighted sum of the newly formed memory, and our past
sequence’s hidden state. These weights are determined by the update gate. Since the update gate is computed with a sigmoid function,
our values are between zero and one. Thus <script type="math/tex">z_t</script> and <script type="math/tex">1-z_t</script> are between zero and one, and add up to one. Given these weights,
when features of the previous hidden state are given importance in determining the new hidden state, those features in the
newly formed memory are given less importance for determining the new hidden state.</p>
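<p>The four GRU equations above can be sketched directly in NumPy (random, untrained weights; no bias terms, matching the equations):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(params, h_prev, x_t):
    Wr, Ur, Wz, Uz, W, U = params
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(r * (U @ h_prev) + W @ x_t)  # new memory
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    return (1 - z) * h_tilde + z * h_prev          # new hidden state

d_h, d_x = 16, 8
rng = np.random.default_rng(0)
# Alternate x-shaped and h-shaped weight matrices: Wr, Ur, Wz, Uz, W, U.
params = [rng.normal(size=(d_h, d_x)) if i % 2 == 0 else rng.normal(size=(d_h, d_h))
          for i in range(6)]

h = np.zeros(d_h)
for x_t in rng.normal(size=(4, d_x)):   # a 4-word sequence of word vectors
    h = gru_step(params, h, x_t)
```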
<h2> Long Short-Term Memory (LSTM) </h2>
<p><img src="/assets/images/lstm.png" alt="lstm" /></p>
<p>\begin{equation} \tag{ Input Gate}
i_t = \sigma(W_i x_t + U_i h_{t-1})
\end{equation}</p>
<p>\begin{equation} \tag{Forget Gate }
f_t = \sigma(W_f x_t + U_f h_{t-1})
\end{equation}</p>
<p>\begin{equation} \tag{Output Gate }
o_t = \sigma(W_o x_t + U_o h_{t-1})
\end{equation}</p>
<p>\begin{equation} \tag{New Memory }
\tilde{c_t} = \tanh(W_c x_t + U_c h_{t-1})
\end{equation}</p>
<p>\begin{equation} \tag{Final Memory}
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c_t}
\end{equation}</p>
<p>\begin{equation} \tag{Hidden State}
h_t = o_t \odot \tanh(c_t)
\end{equation}</p>
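<p>Before walking through each gate, these equations can likewise be sketched in NumPy (random, untrained weights; no bias terms, matching the equations):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(params, h_prev, c_prev, x_t):
    Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc = params
    i = sigmoid(Wi @ x_t + Ui @ h_prev)        # input gate
    f = sigmoid(Wf @ x_t + Uf @ h_prev)        # forget gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)        # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)  # new memory
    c = f * c_prev + i * c_tilde               # final memory
    h = o * np.tanh(c)                         # hidden state
    return h, c

d_h, d_x = 16, 8
rng = np.random.default_rng(0)
# Alternate x-shaped and h-shaped weight matrices: Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc.
params = [rng.normal(size=(d_h, d_x)) if i % 2 == 0 else rng.normal(size=(d_h, d_h))
          for i in range(8)]

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(4, d_x)):   # a 4-word sequence of word vectors
    h, c = lstm_step(params, h, c, x_t)
```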
<h4>Input Gate: </h4>
<p>Uses the input word and past hidden state to determine how the new word in our sequence should influence our final new memory formation.</p>
<h4>Forget Gate: </h4>
<p>Determines how much the past memory should influence our new final memory formation. It can allow us to completely forget what
we learned before, or to continue to propagate our past memory!</p>
<h4>New Memory: </h4>
<p>We form a new memory utilizing our past hidden state representing our past sequence, and the new word we encountered in our sequence.</p>
<h4>Final Memory: </h4>
<p>For our final memory formation, we use the input gate to determine what’s important from our new memory, and the forget gate
to determine what’s important from our past memory to form a final memory!</p>
<h4>Output Gate: </h4>
<p>Our finalized memory is rich in information, so the output gate determines what aspects of this new memory are
important for representing a new hidden state for our current sequence.</p>
<h4>Hidden State: </h4>
<p>The new hidden state is the new representation of our current sequence!</p>Edgar Barrazaeab326@cornell.eduWeek 4 I’ll be sharing some wonderful neural network architectures for working with sequences of text! Here they are: Vanilla Recurrent Neural Net (RNN) Gated Recurrent Unit (GRU) Long Short-Term Memory (LSTM) Note that GRU’s and LSTM’s are also RNN’s. An RNN is a neural network architecture where the connections between nodes form a time-dependent sequence. Moreover, an RNN utilizes an internal memory for processing sequences in order to recognize patterns in the data. GRU’s and LSTM’s are RNN’s with more complicated architectures encoding inductive biases, allowing them to capture more sophisticated patterns in data, unlike traditional feed-forward networks. Vanilla RNNWeek 3: Refined Focus2019-02-22T00:00:00+00:002019-02-22T00:00:00+00:00https://edy-barraza.github.io/week3<h2 id="week-3">Week 3</h2>
<p>I have enjoyed a wonderful breadth of knowledge the past two weeks, and now it’s time to refine my plan and focus!
To accelerate my progression, I will be focusing on Natural Language Processing for at least the next week. Taking in a
breadth of knowledge provides me with a firm foundation, and having depth of knowledge will allow me to produce
a significant project. As I acquire this depth of knowledge, I will be able to add to my studies courses that will hopefully
help me develop a project that explores the boundary of current NLP approaches! I hope to share some exciting news with you all
soon!</p>Edgar Barrazaeab326@cornell.eduWeek 3 I have enjoyed a wonderful breadth of knowledge the past two weeks, and now it’s time to refine my plan and focus! To accelerate my progression, I will be focusing on Natural Language Processing for at least the next week. Taking in a breadth of knowledge provides me with a firm foundation, and having depth of knowledge will allow me to produce a significant project. As I acquire this depth of knowledge, I will be able to add to my studies courses that will hopefully help me develop a project that explores the boundary of current NLP approaches! I hope to share some exciting news with you all soon!Week 2: Getting A Groove2019-02-15T00:00:00+00:002019-02-15T00:00:00+00:00https://edy-barraza.github.io/week2<h2 id="week-2">Week 2</h2>
<p>After kicking this week off, it took some time but I feel that I’ve got a groove I can get with to get things done!
I was exposed to a lot of new knowledge this week, so let me share with you my friends, some of the highlights of what I enjoyed learning!
I will organize this for you with the following categories:</p>
<ul>
<li> Natural Language Processing: Word Embeddings </li>
<li> Reinforcement Learning: Multi-Armed Bandit Problems </li>
<li> Unsupervised Learning: AutoEncoders, K-Means Clustering </li>
</ul>
<p>Let’s get started!</p>
<h3> Natural Language Processing: Word Embeddings</h3>
<p>My focus for NLP this week was learning vector representations of words. The initial approach was creating count-based
matrices. The Word-Document matrix was created by looping through all available documents to create a matrix <script type="math/tex">X</script>,
where element <script type="math/tex">X_{i,j}</script> tells us if word <script type="math/tex">i</script> appears in document <script type="math/tex">j</script>. This approach is not ideal because of how
sparse the matrix is, even if it makes use of global statistics of the corpus.</p>
<p>The next attempt is a Word-Based Co-Occurrence Matrix. Here element <script type="math/tex">X_{i,j}</script> is created by looping through the corpus
and counting how many times each word <script type="math/tex">j</script> appears in the context of word <script type="math/tex">i</script>. This is another sparse matrix, and
is thus still not ideal. This idea puts us in the line of reasoning of using the context of words. We are interested
in doing so because of the distributional hypothesis.</p>
<p>The distributional hypothesis is the idea that a word’s meaning is given by the words that frequently appear close
to it. This gave rise to Word2Vec.</p>
<p>Word2Vec is composed of two approaches to creating word embeddings, along with two training methods. Word embeddings can
be generated using the Continuous Bag Of Words (CBOW) model, which aims to predict a center word given some surrounding
context, or the Skip-Gram model, which aims to predict the distribution of context words given a center word. These embeddings can
be trained using Negative Sampling, which defines an objective by sampling negative examples from a
noise distribution. These embeddings can also be trained using Hierarchical Softmax, which defines an objective using a
tree structure to compute probabilities for all words in our vocabulary.</p>
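<p>To make the Negative Sampling objective concrete, here is a sketch of the per-pair loss for a single (center, context) pair (a minimal NumPy version; the vectors are random stand-ins for learned embeddings, and the formulation follows the standard sigmoid-based objective as I understand it):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(v_center, u_context, u_negatives):
    # Reward the true (center, context) pair for scoring high,
    # and the randomly sampled negative pairs for scoring low.
    pos = -np.log(sigmoid(u_context @ v_center))
    neg = -np.sum(np.log(sigmoid(-(u_negatives @ v_center))))
    return pos + neg

d = 8
rng = np.random.default_rng(0)
v_center = rng.normal(size=d)            # "input" vector of the center word
u_context = rng.normal(size=d)           # "output" vector of a true context word
u_negatives = rng.normal(size=(5, d))    # 5 sampled negative words
loss = negative_sampling_loss(v_center, u_context, u_negatives)
```

An aligned context vector should give a lower loss than an opposed one, which is what training exploits.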
<p>These embeddings had strong performance, but folks wanted to make an improvement by taking into account the distributional
hypothesis AND utilizing global corpus statistics. This gave rise to GloVe: Global Vectors for Word Representations.
GloVe considers global corpus statistics and the context of words with a co-occurrence matrix and a least squares objective.</p>
<h3> Reinforcement Learning: Multi-Armed Bandit Problems</h3>
<p>We can define elements of reinforcement learning as</p>
<ul>
<li> Policy:
<ul>
<li>A policy is a mapping from perceived states of the environment to the actions meant to be taken in those states </li>
</ul>
</li>
<li> Reward Signal
<ul>
<li> During each time step, the environment sends the agent some number, which is the reward signal</li>
<li> The agent seeks to maximize the reward it receives in the long run</li>
</ul>
</li>
<li> Value Function
<ul>
<li>While the reward specifies what's good in the immediate sense, the value function specifies what's good in the long run </li>
<li>Value = amount of reward the agent can expect to accumulate over the future starting from its current state</li>
<li>Allows us to delay gratification and go to a low reward state if we believe we will get more reward in the long run</li>
</ul>
</li>
<li> Optional: Model Of The Environment
<ul>
<li> Something that mimics the environment</li>
<li>Allows us to make inferences about how the environment might behave</li>
<li>Model might help predict the next state or reward</li>
<li>Model-Based Methods: use models and planning</li>
<li>Model-Free Methods: explicitly trial-and-error learners </li>
</ul>
</li>
</ul>
<p>I started my reinforcement learning studies by learning about multi-armed bandits, which is a particular problem
where we only have one state, and have <script type="math/tex">k</script> possible actions that have some expected reward. The action selected at
some time <script type="math/tex">t</script> is <script type="math/tex">A_t</script> and returns some reward <script type="math/tex">R_t</script>.
The value of action <script type="math/tex">a</script> is <script type="math/tex">q_{*}(a) = \mathbb{E}[R_t | A_t=a]</script>.</p>
<p>We don’t really know <script type="math/tex">q_{*}(a)</script>, but by the law of large numbers, with enough samples we can get a pretty good
estimate <script type="math/tex">Q_t(a)</script>. Our principal problem in this setting is balancing exploiting the actions we currently estimate to be best
with exploring actions we wish to try out and see if they are better.</p>
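One standard way to strike that balance is an epsilon-greedy agent with sample-average estimates. Here is a minimal sketch; the Gaussian reward distributions, epsilon, and step counts are illustrative choices, not values from the post.

```python
import random

def run_bandit(q_true, steps=1000, eps=0.1, seed=0):
    """Sample-average epsilon-greedy agent on a k-armed bandit.

    q_true: true mean reward of each arm (rewards are Gaussian with
    unit variance). Returns the estimates Q of q_*(a).
    """
    rng = random.Random(seed)
    k = len(q_true)
    Q = [0.0] * k      # estimated action values Q_t(a)
    N = [0] * k        # times each action has been selected
    for _ in range(steps):
        if rng.random() < eps:                  # explore a random arm
            a = rng.randrange(k)
        else:                                   # exploit the current best
            a = max(range(k), key=lambda i: Q[i])
        r = rng.gauss(q_true[a], 1.0)           # observe reward R_t
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]               # incremental sample average
    return Q

Q = run_bandit([0.0, 1.0, 0.5], steps=5000)
```

The incremental update is just a running mean, so by the law of large numbers each Q[a] converges toward the corresponding q_*(a) as that arm keeps being sampled.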
<h3> Unsupervised Learning: AutoEncoders, K-Means Clustering </h3>
<p>Autoencoders are neural networks that seek to learn the function <script type="math/tex">h_{w,b}(x) \approx x</script>. This sounds trivial,
like learning the identity function, but it’s not, since the number of hidden units in the network is smaller than the number of input and output units.
Thus, we have essentially assigned the network a compression task, where it has to find correlations between the inputs.
We can also have a larger number of hidden units than input and output units, but enforce a sparsity constraint on the hidden
units. This means we want most of the hidden units to be inactive (output close to zero) most of the time. Either way, the network
is forced to learn correlations within the input data.</p>
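A minimal numerical sketch of the bottleneck idea: a one-hidden-layer linear autoencoder trained by gradient descent on toy data that actually lies near a low-dimensional subspace. All sizes, learning rates, and data here are illustrative assumptions, not anything from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 8 dimensions lying in a 2-D subspace,
# so a 3-unit bottleneck can reconstruct them well.
basis = rng.normal(size=(2, 8))
X = rng.normal(size=(200, 2)) @ basis

# Linear autoencoder h(x) = x @ W1 @ W2 with a 3-unit hidden layer.
W1 = rng.normal(scale=0.1, size=(8, 3))
W2 = rng.normal(scale=0.1, size=(3, 8))
lr = 0.01

def loss(X, W1, W2):
    """Mean squared reconstruction error."""
    R = X @ W1 @ W2 - X
    return (R ** 2).mean()

initial = loss(X, W1, W2)
for _ in range(500):
    H = X @ W1                       # encode into the bottleneck
    R = H @ W2 - X                   # reconstruction error
    gW2 = H.T @ R * (2 / X.size)     # gradient of the mean squared error
    gW1 = X.T @ (R @ W2.T) * (2 / X.size)
    W1 -= lr * gW1
    W2 -= lr * gW2
final = loss(X, W1, W2)
```

Because the bottleneck is narrower than the input, the network can only drive the reconstruction error down by exploiting the correlations in the data, which is exactly the compression task described above.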
<p>K-Means Clustering groups points around some centroid points. The number of centroids, <script type="math/tex">k</script>, is up to you.
Here’s the procedure:</p>
<ol>
<li>Initialize <script type="math/tex">k</script> cluster centroids randomly </li>
<li> Repeat the following until convergence:
<ul>
<li> For every point, assign it to the cluster defined by its closest centroid</li>
<li> For every centroid, compute a new centroid as the mean of all the points assigned to it </li>
</ul>
</li>
</ol>
Week 1: Syllabus & Orientation (2019-02-08) https://edy-barraza.github.io/week1
<h2 id="week-1">Week 1</h2>
<p>I just finished a fulfilling first week for the OpenAI Scholars program. Here’s a
sweet pic of our time in the office:</p>
<p><img src="/assets/images/group_photo.jpg" alt="group photo" /></p>
<p>Going into the office and speaking with the folks working there really helped me
change the way I think about the work I will be doing. They all set a benchmark for the
depth of knowledge I am seeking to attain in this field. I also learned from
speaking to fellow cohort members. They all have such varied backgrounds, so
hearing their project ideas helped me understand different possibilities within the
realm of AI, and helped me think of new ways I could utilize AI for others.</p>
<p>I seek to utilize AI to create resources for other people. I believe the most powerful
thing is when people are able to help other people. People strive to help
each other, but are not always able to due to circumstances out of their control. With
advances in AI, computers are now able to do a lot of things people can, and sometimes
they can do those things better! I wish to use AI to provide people with resources
they don’t have access to and that previously required another person to provide,
all for free!</p>
<p>To come closer to this goal, I will be studying Natural Language Processing
during the Scholars program, since people interact with each other using language.
I want to develop a system that’s responsive to human needs, so I will also be
studying Reinforcement Learning. In this age of big-data, we have access to more unlabelled text data than
we know what to do with, so to develop a system to take advantage of this abundance
I will also study Unsupervised Learning.</p>
<p>My goal for my project is to understand recent developments in natural language understanding
and question answering systems to contribute a new development, and to deploy
this project on the web for people to use. I’m excited to share my progress as time
passes. In the meantime, here’s <a href="/assets/pdfs/OpenAI_Scholars_Syllabus_Edgar_Barraza.pdf">my syllabus</a>
for the next few months.</p>