Defining AI: Word embeddings
Hey there! It's been a while. After writing my thesis for the better part of a year, I've been getting a bit burnt out on writing - so unfortunately I had to take a break from writing for this blog. My thesis is almost complete though - more on this in the next post in the PhD update blog post series. Other higher-effort posts are coming (including the belated 2nd post on NLDL-2024), but in the meantime I thought I'd start a series on defining various AI-related concepts. Each post is intended to be relatively short in length to make them easier to write.
Normal scheduling will resume soon :-)
As you can tell by the title of this blog post, the topic for today is word embeddings.
AI models operate fundamentally on numerical values and mathematics - so naturally to process text one has to encode said text into a numerical format before it can be shoved through any kind of model.
This process of converting text into a numerical representation is called word embedding, and the resulting vector of numbers is a word's embedding!
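As a minimal illustration, you can think of an embedding as a lookup table from words to vectors. The words, vectors, and `embed` helper below are invented for demonstration, not the output of a real embedding model:

```python
# Toy embedding table: each word maps to a vector of numbers.
# These values are made up for illustration -- real embeddings
# are learned from large amounts of text.
embeddings = {
    "rain":  [0.8, 0.1, 0.3],
    "water": [0.7, 0.2, 0.4],
    "sun":   [0.1, 0.9, 0.5],
}

def embed(sentence):
    """Turn a sentence into a list of vectors, one per known word."""
    return [embeddings[word] for word in sentence.lower().split()
            if word in embeddings]

print(embed("Rain water"))  # → [[0.8, 0.1, 0.3], [0.7, 0.2, 0.4]]
```

Once text is a list of vectors like this, it can be fed into whatever model you like.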
As you might expect, there are many ways of doing this. Often, it involves looking at which other words a given word tends to appear next to. This can be framed, for example, as a task in which a model has to predict a word given the words immediately before and after it in a sentence; the output of the last layer before the output layer is then taken as the embedding (word2vec).
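To make that prediction task concrete, here's a sketch of how word2vec's CBOW variant frames it: slide a window over the sentence and pair each word with the words around it. The `cbow_pairs` helper is my own illustration of the task framing, not word2vec's actual code:

```python
def cbow_pairs(sentence, window=1):
    """For each word, collect the words immediately around it.
    word2vec (CBOW) trains a model to predict the target word
    from its context words."""
    words = sentence.lower().split()
    pairs = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

# Each tuple is (context words, word to predict):
print(cbow_pairs("the cat sat on the mat"))
```

Training a model on millions of pairs like these forces it to learn which words occur in similar contexts, and that knowledge ends up baked into the vectors.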
Other models use matrix math to calculate this instead, producing a dictionary file as an output (GloVe | paper). Still others use large models that are trained to predict randomly masked words, and process entire sentences at once (BERT and friends) - though these are computationally expensive since every bit of text you want to embed has to get pushed through the model.
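Since GloVe's output is effectively a big text file of word → vector entries (one word per line, followed by its values), loading it is straightforward. A rough sketch, where the filename is just a placeholder:

```python
def load_glove(path):
    """Parse a GloVe-style vectors file: one word per line,
    followed by space-separated floating-point values."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = [float(v) for v in values]
    return vectors

# vectors = load_glove("glove.example.txt")  # placeholder filename
# vectors["rain"]  # → list of floats
```

This dictionary-like simplicity is exactly why GloVe is so cheap to use at inference time compared to BERT-style models: embedding a word is a single lookup rather than a forward pass.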
Then there are contrastive learning approaches. More on contrastive learning later in the series if anyone's interested, but essentially it learns by comparing pairs of things. This can lead to a higher-level representation of the input text, which can increase performance in some circumstances, along with other fascinating side effects that I won't go into in this post. Chief among these is CLIP (blog post).
The idea here is that semantically similar words wind up having similar sorts of numbers in their numerical representation (we call this a vector). This is best illustrated with a diagram:

I used GloVe to embed some words (really easy to use, since it's literally just a dictionary), and then used cosine similarity to compute how alike the different words are. Once done, I plotted this in a heatmap.
As you can see, rain and water are quite similar (1 = identical; 0 = completely different), but rain and unrelated are not really alike at all.
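Cosine similarity itself is simple to compute: it's the dot product of the two vectors divided by the product of their lengths, which measures the angle between them rather than their magnitudes. A quick pure-Python sketch, using invented vectors rather than real GloVe embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Made-up vectors for illustration -- not actual GloVe output.
rain = [0.8, 0.1, 0.3]
water = [0.7, 0.2, 0.4]
unrelated = [-0.5, 0.9, -0.2]

print(cosine_similarity(rain, water))      # close to 1: very similar
print(cosine_similarity(rain, unrelated))  # much lower: dissimilar
```

Computing this for every pair of words gives you exactly the kind of similarity matrix shown in the heatmap above.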
That's about the long and short of word embeddings. As always with these things, you can go into an enormous amount of detail, but I have to cut it off somewhere.
Are there any AI-related concepts or questions you'd like answered? Leave a comment below and I'll write another post in this series to answer them.