# Week 2: Getting A Groove

4 minute read

## Week 2

After kicking this week off, it took some time but I feel that I’ve got a groove I can get with to get things done! I was exposed to a lot of new knowledge this week, so let me share with you my friends, some of the highlights of what I enjoyed learning! I will organize this for you with the following categories:

• Natural Language Processing: Word Embeddings
• Reinforcement Learning: Multi-Armbed Bandit Problems
• Unsupervised Learning: AutoEncoders, K-Means Clustering

Let’s get started!

### Natural Language Processing: Word Embeddings

My focus for NLP this week was learn vector representations of words. The initial approach was creating count based matrices. The Word-Document matrix was created by looping through all available documents to create a matrix $X$, where element $X_{i,j}$ tells us if word $i$ appears in document $j$. This approach is not ideal because of how sparse the matrix is, even if it makes use of global statistics of the corpus.

The next attempt is is a Word-Based Co-Occurence Matrix. This matrix $X_{i,j}$ is created by looping through the corpus and counting how many times each word $j$ appears in the context of word $i$. This is another sparse matrix, and is thus still not ideal. This idea puts us in the line of reasoning of using the context of words. We are interested in doing so because of the distributional hypothesis.

The distributional hypothesis is the idea that the a word’s meaning is given by the words that frequently appear close by to it. This gave rise to Word2Vec

Word2Vec is composed of two approaches to creating word embeddings, followed by two training methods. Word embeddings can be generated using the Continuous Bag Of Words (CBOW) model, which aims to the predict a center word given some surrounding context, or the Skip-Gram Model, which aims to predict the distribution of context word given a center word. These embeddings can be trained using Negative Sampling, which defines an objective by sampling negative samples negative examples from a false corpus. These embeddings can also be trained using Hierarchical Soft-Max, which defines an objective using a tree structure to compute probabilities for all words in our vocabulary.

These embeddings had strong performance, but folks wanted to make an improvement by taking into account the distributional hypothesis AND utilizing global corpus statistics. This gave rise to GloVe: Global Vectors for Word Representations. GloVe considers global corpus statistics and the context of words with a co-occurence matrix and a least squares objective.

### Reinforcement Learning: Multi-Armbed Bandit Problems

We can define elements of reinforcement learning as

• Policy:
• A policy is a mapping from perceived states of the environemnt to the actions meant to be taken in those states
• Reward Signal
• During each time step, the environment sends the agent some # whih is the reward signal
• The agent seeks to maximize the reward it receives in the long run
• Value Function
• While the reward specifies what's good in the intermediate sense, the value function specifies what's good in the long run
• Value = amount of reqard the agent can expect to accumulate over the future starting from it's current state
• Allows us to delay gratification and go to a low reward state if we believe we will get more reward in the long run
• Optional: Model Of The Environment
• Something that mimics the environment
• Allows us to make inferences about how the environment might behave
• Model might help predict the next state or reward
• Model-Based Methods: Uses models and planning
• Explicitly trail and error learners

I started my reinforcement learning studies by learning about multi-armed bandits, which is a particular problem where we only have one state, and have $k$ possible actions that have some expected reward. The action selected at some time $t$ is $A_t$ and returns some reward $R_t$. The value of action $a$ is $q_{*}(a) = \mathbb{E}[R_t | A_t=a]$.

We don’t really know $q_{*}(a)$, but by the law of large numbers, with enough samples we can get a pretty good estimate $Q_t(a)$. Our principle problem in this setting is balancing exploiting methods we estimate to be the best strategy with exploring methods we wish to try out and see if they are successful.

### Unsupervised Learning: AutoEncoders, K-Means Clustering

Autoencoders are nueral networks that seek to seek to learn the funciton $h_{w,b}(x) \approx x$. This sounds trivial as an identity matrix, but it’s not since the number of hidden units in the network is less than the output of the network. Thus, we have essentially assigned the network a compression task, where it has to find correlations between the inputs. We can also have a larger number of hidden units than input and output units, but enforce a sparsity constraint on the hidden units. This means we want most of the hidden units to be inactive (output close to zero) most of the time. This will allow the network to learn correlations between the input of the data.

K-Means Clustering means to cluster points around some centroid points. The number of centroids you define is arbitray. Here’s the procedure

1. Initialize k cluster centroids randomly
2. Repeat the following until convergence:
3. For every point, assign it to the category defined by it's closest centroid
4. For every centroid, define a new centroid which is the mean of all the points assigned to that centroid

Updated: