Basic Overview of LLMs
Introduction
A language model is a type of AI used in NLP. It is a probability distribution over sequences of words: given a sequence of n words, it assigns a probability to the whole sequence. Language models are trained to predict the likelihood of a sequence of words occurring in a given context. They can be used for various tasks like text generation, machine translation, data analysis etc. They are one example of Generative AI.
Some examples of LLMs are ChatGPT and Bard.
Generative AI
Large Language Models
How do keyboard suggestions work? How do search suggestions work? They depend on how often we, or other people, have used a similar set of words. This is the goal of Language Modeling: assigning a probability to every sentence, for example by counting how many times other people have typed it.
But counting alone doesn't allow you to score new sentences.
So what about new sentences? Given the number of internet posts every day, won't we exhaust all possible combinations of words soon? If you consider all possible combinations of words in a dictionary, you get a huge number, and the majority of them will never be used. So we need to do more than just count sentences.
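To make the counting idea concrete, here is a minimal Python sketch (the tiny corpus and the `score_sentence` helper are made up for illustration): it scores a sentence by how often that exact sentence was seen, so any sentence not in the corpus gets a probability of zero.

```python
from collections import Counter

# A toy "corpus" of sentences people have already typed.
corpus = [
    "how are you",
    "how are you",
    "see you tomorrow",
]

counts = Counter(corpus)
total = sum(counts.values())

def score_sentence(sentence: str) -> float:
    """Probability estimate = how often this exact sentence was seen."""
    return counts[sentence] / total

print(score_sentence("how are you"))        # 0.666... (seen twice out of three)
print(score_sentence("how are you today"))  # 0.0 -- a new sentence scores zero
```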
Consider this sentence/paragraph/rhyme
Beneath the moon's glow, the river was flowing fast, Whispering and glowing, wondering if the night would forever last
Let's train our model so that it can predict the next word based on the previous word. Will it predict words properly? The word "the" alone appears thrice in the sentence → the model could predict any of the three words "moon's", "river" and "night" after the word "the", which will mess up the sentence we want it to predict. We can represent this as P(Xn|Xn-1) [probability of the nth word conditioned on the (n-1)th word].
So instead of one word before, how about two? We can map them as ["Beneath the" → "moon's", "the moon's" → "glow", ..., "would forever" → "last"]. Predicting the next word based on the previous 2 words works pretty well for this sentence in most cases, and is obviously better than predicting from one previous word. But it is still not good enough to be applicable everywhere. We can represent this as P(Xn|Xn-1, Xn-2) [probability of the nth word conditioned on the (n-1)th and (n-2)th words].
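Here is a small Python sketch of this counting idea on our rhyme (assuming we lowercase the words and strip punctuation): with one word of context, "the" is ambiguous, while two words of context pin the next word down.

```python
from collections import defaultdict

rhyme = ("beneath the moon's glow the river was flowing fast "
         "whispering and glowing wondering if the night would forever last").split()

# P(Xn|Xn-1): map each word to the words that followed it.
bigram = defaultdict(list)
for prev, nxt in zip(rhyme, rhyme[1:]):
    bigram[prev].append(nxt)

# P(Xn|Xn-1, Xn-2): map each pair of words to the word that followed it.
trigram = defaultdict(list)
for a, b, c in zip(rhyme, rhyme[1:], rhyme[2:]):
    trigram[(a, b)].append(c)

print(bigram["the"])                # ["moon's", 'river', 'night'] -- ambiguous
print(trigram[("beneath", "the")])  # ["moon's"] -- two words of context pin it down
```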
So how can we solve this? If you take the example rhyme, to predict the last word "last" you have to go 10 words back to reach "fast", so that you can rhyme with it; otherwise you could replace "last" with any other similar word. Here we are considering the 10 previous words to predict the next word. We represent this as P(Xn|Xn-1, Xn-2, ..., Xn-10) [probability of the nth word conditioned on the (n-1)th, (n-2)th, ..., (n-10)th words].
So we need to calculate functions like P(Xn|Xn-1, Xn-2, ..., Xn-10), or even longer ones, which are exceedingly complex. So we try to approximate them.
Function Approximation
There are various methods of function approximation like,
- Fourier Series
- Taylor Series
- Neural Network etc..
Here we will look at Neural Networks. One advantage of a neural network is that we don't need to know anything about the given function except input and output pairs. They consist of input variables, connections with weights (random weights at the start), activation functions and outputs. Let's take an example function.
Take any value x from the x axis, send it through the neural network, and get the output, say y. Check the error of y with respect to f(x): error = (f(x) - y)^2 [we square it so that larger errors are magnified and smaller errors are diminished]. The role of the neural network here is to make E = Σ(f(x) - y)^2 minimal by adjusting the weights according to the error. After updating the weights thousands of times, we get a pretty good approximation of the function.
Our goal is to make E = Σ(f(x) - y)^2 minimal, but we don't know this function in closed form. What if we know its gradient? We can follow the negative gradient to find the minimum point. This process is called gradient descent. (The gradient points in the direction of steepest ascent; by moving opposite to the gradient we get the steepest descent.) To train the network we use the gradient of the error function with respect to the weights.
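Below is a minimal sketch of this whole idea in Python with NumPy: a tiny one-hidden-layer network is trained by gradient descent to approximate an example function (sin(x) here, chosen arbitrarily) from input/output pairs, by repeatedly stepping the weights against the gradient of the squared error.

```python
import numpy as np

np.random.seed(0)

# Target function we pretend to only know through (x, f(x)) pairs.
f = np.sin
x = np.linspace(-3, 3, 200).reshape(-1, 1)
target = f(x)

# One hidden layer with random starting weights, tanh activation.
W1, b1 = np.random.randn(1, 16), np.zeros((1, 16))
W2, b2 = np.random.randn(16, 1), np.zeros((1, 1))
lr = 0.1

for step in range(5000):
    # Forward pass: x -> hidden -> y
    h = np.tanh(x @ W1 + b1)
    y = h @ W2 + b2

    # Squared error E = sum((f(x) - y)^2), averaged over the samples for the gradient.
    err = y - target
    grad_y = 2 * err / len(x)

    # Backward pass: gradients of E with respect to each weight.
    grad_W2 = h.T @ grad_y
    grad_b2 = grad_y.sum(axis=0, keepdims=True)
    grad_h = grad_y @ W2.T * (1 - h**2)
    grad_W1 = x.T @ grad_h
    grad_b1 = grad_h.sum(axis=0, keepdims=True)

    # Gradient descent step: move the weights opposite to the gradient.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

# Check how close the approximation is after training.
y = np.tanh(x @ W1 + b1) @ W2 + b2
print("squared error after training:", float(((target - y) ** 2).sum()))
```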
Further Explanation
Let's take our example statement and say you want to predict the last word. The first thing we have to do is convert the words to vectors that our neural network can understand; for this we use word embeddings. Now that our network understands words, it is ready to be trained. Note that in our example we don't need all the words from "forever" back to "fast"; we only need the previous three words and the word "fast". So instead of P(Xn|Xn-1, Xn-2, ..., Xn-10), we can use P(Xn|Xn-1, Xn-2, Xn-3, Xn-10), i.e. just those 4 words to predict the next word. How do we do that? In this neural network we use 2 layers, namely attention and prediction. Working: Words → Attention Layer → Prediction Layer → Next word. The attention layer chooses the relevant words from the given input words and feeds them to the prediction layer. The prediction layer uses the words from the attention layer and predicts the next word. We train both these layers with different sentences and their end words. We reduce the "attention" on words that lead to a wrong prediction and increase the "attention" on words that lead to the correct word. This combined network is called a Transformer.
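As a rough illustration of the word-embedding step, here is a toy lookup table that maps words to vectors (the vectors are random placeholders here; in a real model they are learned during training):

```python
import numpy as np

np.random.seed(0)

vocab = ["beneath", "the", "moon's", "glow", "river", "fast", "night", "forever", "last"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# Embedding table: one vector per word (random here, learned in a real model).
embed_dim = 8
embedding = np.random.randn(len(vocab), embed_dim)

def embed(words):
    """Turn a list of words into a matrix of vectors the network can work with."""
    return embedding[[word_to_id[w] for w in words]]

print(embed(["beneath", "the", "night"]).shape)  # (3, 8): one 8-dimensional vector per word
```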
Implementation of Attention:
The attention layer works with one word at a time. It assigns an attention score to every word based on how much each word relates to or influences that word. These attention scores are used to build a context vector. The words and context vectors are fed into the prediction layer and used to generate text.
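Here is a minimal NumPy sketch of that idea: attention scores computed as dot products between word vectors, softmax to normalize them, a context vector built as the weighted mix, and a small prediction layer that turns the context vector into probabilities over a toy vocabulary. This is a simplification of real transformer attention (no separate query/key/value projections, no multiple heads), just to show the flow.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

np.random.seed(1)
embed_dim, vocab_size = 8, 20

# Vectors for the previous words (these would come from the embedding step above).
prev_words = np.random.randn(5, embed_dim)   # 5 previous words
query = prev_words[-1]                        # the position we are predicting after

# Attention scores: how much each previous word relates to the current position.
scores = prev_words @ query                   # one score per previous word
weights = softmax(scores)                     # normalized attention scores

# Context vector: attention-weighted mix of the previous words.
context = weights @ prev_words                # shape: (embed_dim,)

# Prediction layer: turn the context vector into probabilities over the vocabulary.
W_pred = np.random.randn(embed_dim, vocab_size)
next_word_probs = softmax(context @ W_pred)

print(weights.round(2))          # which previous words got the most attention
print(next_word_probs.argmax())  # id of the predicted next word
```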
This helps with text generation. But what about facts, like the names of capitals, presidents, currencies etc.? That would mean the model has to memorize all those facts.
Therefore more capacity is required → we have to stack more layers in our transformer. For example, GPT-3 has 96 layers.
Today's LLMs have read most of the internet and most publicly available books. Training them takes only about a month with a few thousand GPUs.
Problems with LLMs
1) Since an LLM like ChatGPT is trained on a large amount of data, it can contain biases based on its data sources.
2) Training an LLM requires a large amount of data, which may include personal and sensitive information from the internet. Ensuring data privacy becomes a concern.
3) LLMs can be misused to spread misinformation/rumours, leading to political, ethical and societal issues. The technology can be misused for spreading propaganda, phishing attacks, or generating deepfake content.
4) LLM Hallucination
Hallucination in LLMs:
Output from an LLM that deviates from actual facts or logic is called a hallucination. LLMs are prone to making things up, with mistakes ranging from minor to serious.
Some types of hallucinations:-
- Sentence Contradiction - when the LLM contradicts its own previous statement
- Prompt Contradiction - when the LLM contradicts the given prompt
- Factual Contradiction - when the LLM contradicts a known fact
- Nonsensical Answers - randomly replying with things that aren't relevant or weren't asked for
Some causes of hallucinations:-
- Data quality, i.e., say you feed it all the data from Reddit and Twitter; is all the data from there 100% reliable?
- Generation method, i.e., the generator may select a higher-probability word over a more relevant word (see the small sketch after this list)
- Input context, i.e., your input prompt may confuse or mislead the LLM into generating a completely different output
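As a toy illustration of the generation-method point, here is a small sketch contrasting greedy decoding (always taking the single highest-probability word) with sampling from the model's distribution; the vocabulary and probabilities are made up:

```python
import numpy as np

np.random.seed(2)

vocab = ["paris", "london", "banana", "berlin"]
# Toy next-word distribution produced by some model (made-up numbers).
probs = np.array([0.40, 0.35, 0.15, 0.10])

# Greedy decoding: always pick the single highest-probability word,
# even when several plausible words have similar probability.
greedy = vocab[int(np.argmax(probs))]

# Sampling: draw a word according to the distribution instead.
sampled = np.random.choice(vocab, p=probs)

print("greedy:", greedy, "| sampled:", sampled)
```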
Conclusion:
This was just a small explanation of LLMs; I might write another blog post explaining them in detail.







