Hypercharge Your AI Interview
Wed 23 October 2024 by Andrew AthanWhat's In This Post?
In this article I present AI terminology. My hope is that you will find the information helpful in conversations with hiring managers, your colleagues, or practitioners and consultants delivering AI projects.
The definitions herein will also serve as a warm-up to upcoming content reviewing recent advancements in AI.
This document is not AI generated.
Please Support My Work
I would appreciate a like, follow, or connection request on my LinkedIn. Also, please consider leaving a thumbs up and/or a comment at the link to this post: LinkedIn Post.
The AI Lexicon
The terms below are presented such that the concepts build on each other. You can read it in order and it will make sense.
I plan to add to this lexicon over time. Please let me know if you'd like me to add specific terms or write a separate post on any specific topic. You can reach me in the comments at the LinkedIn post linked above.
-
AI
"Artificial Intelligence" is a heavily overloaded and overused term. It frequently refers to a very broad set of concepts, tools, capabilities, methodologies, etc. Beyond the obvious meaning from the component words (artificial, and intelligence), AI today frequently refers to the technology related to statistical models who's parameters are found by a machine learning (training) process with the input of large datasets. More on these terms below.
-
AGI: Artificial General Intelligence
A highly autonomous system that outperforms humans at most economically viable (digital) work.
This is paraphrased from Andrej Karpathy. Exactly how to define AGI is controversial. Loosely, the term refers to a computer system that has human level intelligence.
Manipulating the physical world requires robotic or biological (human) agents. Herein, we are only discussing intelligence in the information sphere. Arguably, however, an AGI could convince a human to perform tasks.
-
Model / Statistical Model
A formalized mathematical object that describes a (typically larger) set of data. This "description" consists of the model's parameters. Parameters are typically numbers that, in a sense, "configure" the model. On your stereo, when you set the volume knob to 5 out of 10, the knob itself is a model for the stereo's loudness, and its parameter is set at 50%.
Similarly, a mathematical object that is a model consists of the formulas (the knob) plus the parameters of those formulas (the knob settings). Some of those parameters remain unset, and are treated as inputs to the model when we use the model to do things like making predictions or sampling (generating new data). For example, a very simple model that describes how a student has performed on a set of exams is a flat horizontal line. It has one parameter: How high up it is. When the height above the x-axis is set to the average score the model is a simple average model.
That simple model obviously does not capture all of the features of the data. If the real scores are all similar to each other, or better yet, if they are all the same, it's not a bad model (in the latter case, it exactly describes the data). But if the real scores vary widely, the information about their variance is lost.
Let me also mention that AI models are frequently depicted using a "connected graph." These diagrams are known as "graphical models." This does not mean that they are models about graphics.
-
Graphical Model
"A graphical model is a probabilistic model in which dependencies and independencies of variables are represented by edges in a graph whose nodes are the variables." -- Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003
-
-
Generative Model / GenAI
What it means for a model to "describe" the data, depends on the model. For example, some models are well suited to creating new data that mimic the data set they "describe." These are generative models. Other models are not as good for such tasks.
The example model described above (under "Statistical Model") is not a very good generative model because it cannot create a "new" example test score that looks like one of the scores in the set of scores it describes, unless all the scores in the original set are all the same: It can only produce the average itself, which is arguably not very useful.
A better generative model might be a Gaussian (normal, bell curve) model. Gaussian distributions are completely described by two parameters: the average and the standard deviation.
Just like our simple flat line model, there is also a hidden element to the model: The formula that describes the model itself. It is this formula that has the parameters (variables) inside it. We use the formula to do things with the model.
We can generate a bunch of new test scores using the formula for a Gaussian "curve." The generated data (scores) would more closely mimic the characteristics of the original dataset. In other words, very few outliers will be generated, and the distribution of the scores will be similar to that of the original dataset. More on this below under "Sampling."
-
ML: Machine Learning (aka statistical learning)
A term encompassing a very broad set of methodologies, tools, and models grounded in the formalized math of multivariate statistics aimed at being able to process training data to derive model parameters in a mechanistic manner.
Machine Learning involves using a learning algorithm to find the parameters of a model. More on this below under "Learning Algorithm."
-
Deep Learning
A class of Machine Learning based on neural networks with many layers. "Layers" refers to the sometimes repetitive structure of large number of nodes in a neural network. More on this later.
-
Neural Network
Neural network models consist of layers of interconnected nodes, each node of which performs a simple function on its inputs and passes its output to the next neuron. Simulated neurons behave similarly to biological neurons, which pass signals to other neurons after modifying them through chemical processes and/or (as posited in recent research) quantum effects.
Each layer of a neural network may have a similar, or even identical, structure. Frequently, the "top" and "bottom" layers differ, because they serve specific functions, such as converting the inputs and outputs of the model.
Different structures (such as the ones named in the following bullet points) have been empirically found to be good at different tasks.
-
Model Architecture (in the context of Deep Learning)
The structure of a neural network, including the number of layers, the number of nodes per layer, and the types of activation functions used.
-
Convolutional Neural Network (CNN)
A class of Neural Networks that are particularly good at processing visual data. Convolutional networks apply the same function over a small region of an input image, in a sense "scanning" across the entire image.
-
Recurrent Neural Network (RNN)
A class of Neural Networks that are particularly good at processing sequential data. Recurrent networks have a "memory" that allows them to take as input not just the current input, but also the previous inputs. The outputs of some neurons can become the inputs to other, prior neurons.
-
Transformer
A class of Neural Networks that are particularly good at processing sequential data and doing sequence transduction (such as translation), natural language processing, etc. Transformers are the backbone of LLMs such as ChatGPT, Claude, and Gemini.
The seminal paper on Transformers is "Attention is All You Need," which introduced the Transformer architecture and the concept of self-attention, ushering in the present era of deep learning models and LLMs.
Transformers characteristically include multiple stacked layers each of which includes neurons set up to perform self-attention, followed by a feed-forward network (neurons that only have connections going forward).
-
Generative Pre-trained Transformer (GPT)
A class of Transformer models that are pre-trained on a large corpus of data, such as natural language, and are then fine-tuned on a specific task, such as translation or question answering.
You may have heard that LLMs such as ChatGPT, Claude, and Gemini are consist of transformers that have billions of parameters. The number of parameter directly relates to the number of neurons.
-
Diffusion Model
See "Diffusion Model" below. We need to define additional terms before we can discuss it in detail.
-
-
Learning Algorithm
A procedure by which the parameters of a probabilistic model can be found, such as "stochastic gradient descent." Typically, learning algorithms consist of a loop that repeats a fixed process, taking training data as inputs. During each step one or more of the training data are used to adjust the parameters step-wise toward a better fit. More on "fit" below.
The "learning" algorithm for the simple flat (average) model described above is "add up the scores and divide by the number of tests." It's a stretch to call this a "learning" algorithm. It's more like a formula which can derive the optimal model parameter in one step. It is a "closed form" solution which can be "read out" by plugging in the inputs to a single non-iterative formula.
Note that complex models may have a single optimal solution of the parameters (one that causes the model to be the "best" possible fit to the data). However, complex models frequently either do not have a closed form solution, or direct calculation of the solution is intractable. In such cases, we need a multi-step learning algorithm, typically involving a loop that "converges" on the optimal solution. This is the case for most machine learning models and training sets.
-
Learning Algorithm Classifications
As with many terms, "learning algorithm" can be used to refer to a class of related concepts or to specific fine-grained methodologies. Stochastic gradient descent is a learning algorithm, as is back propagation, in the sense that these are used during the learning process to converge on a solution. These are discussed below under "Loss Function."
However, those are fine-grained methodologies frequently used in the context of deep learning. We may also classify learning algorithms by the type of objective they optimize, by the type of data they use during the learning process, or by the way they score the fitness of a given set of parameters.
-
Reinforcement Learning
A class of learning algorithms that are based on training a model to behave like an agent in an environment. The agent learns by performing actions and receiving feedback in the form of rewards or punishments. The agent's parameters are adjusted to maximize the rewards.
-
(Self) Supervised Learning
A class of learning algorithms that are based on training a model using data that are labeled. When supervised learning is used to train a model that can tell cats and dogs apart, the training data consists of pictures of cats and dogs, where each picture is correctly labeled. The labels are used to score the output of the model during training, so that the parameters can be moved in the right direction.
Self-supervised learning is a type of supervised learning where the model learns to predict future inputs based on past outputs, labeling its own data.
Given millions of input sentences, a self-supervised learning algorithm for a model that predicts missing words would randomly replace words with blanks, and then use the model to predict the missing words, moving parameters such that more and more correct predictions are made.
-
-
Convergence
A property of a learning algorithm that causes it to calculate parameters that more and more closely approximate the optimal solution as the algorithm iterates.
A learning algorithm that does not converge is one that does not get any better at the task as it iterates. We avoid these.
-
Loss function
Typically, the fitness of a given set of parameters is scored using a "loss function," which evaluates the model's performance towards an objective. Loss functions are typically differentiable, which means we can determine whether we should increase or decrease the values of specific parameters in order to improve the loss.
Yes, this comes from calculus. Remember, neurons are mathematical objects. That is, they are described by a function. Gradient descent is a technique for finding the minimum of a function. Gradient descent learning algorithms use the gradient of the loss function to iteratively adjust the parameters of the model according to how the outputs of the neurons in the model would influence the loss function though a process known as back propagation.
The word "loss" is a bit of a misnomer. It's more like a score. However, it is a fitting name because loss functions are typically used to measure how far off a model is from its objective. This means that the value of the loss gets smaller as the model gets better.
For example, we might simply count how often a model correctly categorizes a picture as a dog vs cat. The "objective" is to categorize pictures into dogs vs cats, and the % of correct categorizations when categorizing pictures is the score (or loss).
-
Gradient Descent
In gradient descent, we move each parameter in the right direction to reduce the loss until it reaches an acceptable value. "Gradient" refers to the slope, or derivative, of the loss function.
-
Back Propagation
A technique used with specific learning algorithms in neural networks to calculate the gradients of the loss function with respect to the model parameters.
-
Stochastic Gradient Descent
A variant of gradient descent that uses a random sample of the training data in order to calculate the gradient. "Stochastic" refers to the fact that we are using a random sample of the data, rather than the entire data set and to the fact that the gradient we follow is probabilistically related to the true gradient.
-
-
Function Surface / Solution Surface
As we vary the parameters of a model, we can plot the loss function as a graph. Since it is multi-variate (has multiple dimensions) the shape it forms is called a surface. If there were three dimensions it would be a bumpy or smooth "skin."
In gradient descent we essentially start at some random point on this (sometimes hyper-dimensional) surface, and we try to move "down" any hill we may be on, hopefully finding the lowest possible point. The solution surface is the subset of that function surface that represents all of the solution or optimal parameter values.
Depending on the mathematical form of the loss function, the solution surface may be a flat plane, a curved (hyper-dimensional) surface, a lower dimensional manifold, or a single optimal point.
One difficulty with the function surface is that it may have many local minima. That is, there may be many points that appear to be optimal, but are not the global minimum. Gradient descent can get "stuck" in a local minimum, and never find the global minimum.
Regularization, dropout, and other techniques can help a model avoid local minima by "jumping" out of them.
-
Manifold
A manifold is a lower dimensional surface embedded in a higher dimensional space. For example, a sphere is a two dimensional surface embedded in three dimensional space. A circle is a one dimensional manifold embedded in a two dimensional space -- it is a slice through the sphere.
-
Training Set / Corpus
A set of data samples that are used to train a model.
By definition "models," whether statistical or not, aim to mimic something. A mathematical model such as a Gaussian (bell curve) parameterized by its mean and standard deviation mimics the samples from which its parameters were derived.
When those parameters are based on all the test scores from a class, those test scores are the "training set."
In more complex scenarios, such as training ChatGPT, Claud, Midjourney, etc. the training set is a very large "corpus" of multi-media materials.
-
Holdout Set / Test Set
A test set consists of data samples that are used to evaluate the performance of a model. It is frequently best to to use samples that are distinct from those used during training, particularly when the objective requires some amount of generalization.
If we were to test the model on the same data that we used to train it, we might find that the model performs very well. However, that does not mean the model will generalize well to new data. It may have "memorized" the training set instead of learning the underlying patterns in the data. This is known as "overfitting." More on this below under "Overfitting."
When training a model that does not have a tractable closed form optimal solution, we need to have some way to decide when the model is performing well enough to stop training. We can "hold out" some of the training set to instead use it as a test set.
By checking how the model performs on data it has never seen before, we can have a better idea about how well it will do on new data that even the model's developers have not yet seen.
-
Regression
Do you know how to find the line that most closely passes through a set of points? How about doing that even when those points do not lie on a line?
Regression in general refers to deriving a set of parameters from some data, where those parameters typically describe the relationship between the input data by "shaping" the model that we assume can best fit the data.
In the questions I posed above, we're assuming a line is the best model to fit the data (but it may very well not be). Irrespective of whether it is a good model, we may be able to sort all possible lines into a line that is the best possible (but still "bad") line.
-
Why is statistical math central to AI?
Regression, viewed in this way, is a form of compression. We are compressing the inputs set of points into the parameters of a line. We can't get back the original points, but we can use the line to make a guess about what a point in the set might be, if we are given partial information about the point (e.g., plug in the X coordinate and get out the Y coordinate).
One way to conceptualize what an AI model is, is to think of it as a compressed form of the training set; one that we can use to make guesses about new data, even if the new data was not in the training set.
-
How does "guessing" relate to asking questions?
When we ask a question, we are essentially asking the model to guess the answer. If everything about our question could be compressed into a single number X, then the Y may be a compressed form of everything we need to know about the answer.
Regressions may also be used as categorizers, in the sense that if we have selected a line such that it maximizes the distance between itself and all of the input points, while still passing through the "center" of the points, then we can use the line to categorize new points (are they on one side, or the other side, of the line?).
-
-
Generalization
Finding the two parameters for a line that passes through a cloud of points that are not co-linear is a regression problem. It is often done with a least squares error objective. In other words, we find the line that minimizes the sum of the squared distances between the points and the line.
Regression problems in machine learning have to do with predicting a dependent variable given an independent variable. If we know the formula for the optimal regression this can be as simple as plugging in the independent data into that formula. For co-linear points, there is such a formula: We take any two points and use them in the formula for a line. Then, we can plug in any one coordinate (independent data), and the formula spits out the other one (dependent data).
Statistical regression models, which are common in machine learning/AI contexts, are able to describe data that do not necessarily have such a simple relationship. The complex model, once its parameters are learned, can generate samples that "make sense" given the training data. The parameters learned during training allow it to do so, but those generated samples will not necessarily exist in the training set.
In that sense, a statistical regression "generalizes." We call this a regressive model because it is assumed to be an inexact representation of the training data. However, when a model has enough capacity, it can learn parameters that in fact allow it to exactly match its training data. This is not generally desirable, and is typically called "over fitting" because we usually aim for the model to learn patterns in the data, rather than what is effectively a copy of the data itself. More on this below.
-
Sampling
Once we have created a model, particularly a "generative" one, we then want to draw samples (make a guess) using it. That's a fancy way of saying that we want to use the model to generate new data. Sampling usually involves calculating a value using the model's formula, given a random input.
With a statistical model, such as a Gaussian, we don't generate samples that are certain. For example, if we've generate the Gaussian (bell curve) that fits a set of test scores, we can only use that model to ask how likely a certain score is, or we can use it to generate new scores. The new scores it generates will have likelihoods that match the Gaussian.
For example, if the mean is 50 and the standard deviation is 1, it's extremely unlikely the model would generate a score of 1000, but very likely it would generate a score between 49 and 51. However, it may be very unlikely to generate a score that comes directly from the training set.
Similarly, a generative model trained of cat pictures may never generate one of the pictures it was trained on. That's usually a good thing. It means the model has generalized well (so long as the picture looks like a cat).
-
Random Number Generator
To sample from a Gaussian, we must have a process that can generate random data, so that we can then shape (transform) that random data to conform to the Gaussian. Good sources of random data may be hard to come by. Some computer systems employ physical random number generators, such as measuring radioactive decay, avalanche diode noise, or radio noise. In fact, the word "noise" means random in this context. By the way, yes, noise comes in different flavors. Some noise is Gaussian, other noise is not.
Once we have a source of random data, we can feed it into a statistical model to generate an output sample that conforms to that model.
-
Diffusion Model
A class of generative models that inject increasing levels of noise into training examples during training. The model is then trained to find parameters that cause it to successfully "de-noise" (i.e., remove) the injected noise.
During sampling (generation), the model is given a sample that is often purely noise, which it then de-noises to generate a novel image. By conditioning the training on text labels, the model can be used to generate novel images based on text prompts.
-
Relationship Between Model That Generalizes and Training Set
As we've discussed, a model that is able to generalize well will be able to generate samples that are similar to but not exactly those in its training set. The characteristics of a model that can generalize are discussed in the following sections. Loosely speaking, a model must have enough capacity, enough parameters, and enough dimensions.
You may have been wondering why AI models are statistical in nature, rather than trying to learn an exact mapping function. The reason is that statistical models are very good at generating new data that "looks like" the data used to train it. A mapping function would be too rigid to do so.
For example, a statistical classifier, trained on a set of images of dogs and cats, will be able to make good guesses as to whether a new image it has never seen before is a dog or a cat. In this sense it is a regression model that is able to generalize outside its training set. By contrast, a mapping function would only be able to tell you whether it has seen this specific picture, and whether it was labeled as a cat during its training.
Is human creativity based on sampling the statistical model that is our brain? That's a provocative question.
-
Dimensionality
One of the ways to characterize a Machine learning model is to measure the dimensionality of its latent space. The "latent space" is the set of all possible combinations of the parameters that define the inputs to the model.
-
Latent Space
The set of all possible combinations of the parameters that define the inputs to the model--not the inputs themselves, but the parameters that define the inputs.
-
Encoder
A model that is used to transform input data from a high dimensional space into a lower dimensional ("latent") space. Encoders are frequently used in conjunction with decoders, and usually appear at the input of a generative model.
For example, an encoder for an image might take as input a picture and return the parameters that define a lower dimensional "embedding" of that image.
-
Embedding
A lower dimensional "encoding" of a data sample into the latent space of a model.
-
Decoder
A model that is used to transform data from a lower dimensional ("latent") space back into a higher dimensional space. Decoders are frequently used in conjunction with encoders, at the output of a generative model.
-
A Concrete Example of Latent Space and Embedding
Let's say we are encoding words into a latent space. We might find that the latent space consists of two dimensions, so that each word is represented by a point in a two dimensional space. We might find that the words "cat" and "dog" are near each other in the latent space, but that the word "fork" is far away from both. We might find that the "X" dimension is strongly correlated with whether the word is the name of an animal, and the "Y" dimension is strongly correlated with the presence of the letter "A" in the word.
In most natural language encoding schemes, the latent space is continuous, meaning that nearby points in the latent space represent words that are similar to each other; and there are many dimensions, so that the difference between words is a matter of moving a small amount in one direction or another, and the embedding of words can carry many semantic meanings even though the embedding is a single point (represented as a vector, i.e. a set of coordinates) in this high dimensional space.
-
Capacity
When we have a data set that is very complex, the model that describes that data must also be complex. We can't describe the entirety of the works of Shakespeare with one number
If our model is very complex, then it may be able to describe our data exactly. For example, if we can describe a squiggly line (i.e., we have more than two parameters and our model includes non-linear formulas) then the training process of a regression model might be able to learn the parameters for a super squiggly line that hits every point in our training set.
The degree to which a model can exactly learn independent training data samples is known as its capacity. In general, the fewer parameters and the lower the dimensionality of a model's latent space, the lower its capacity.
This may sound very abstract. You may be thinking, how does a squiggly line relate to dogs and cats. To resolve this conundrum, remember that during learning, concepts end up being represented by points in the latent space of the model. In other words, elements of the training set, such as concepts, become vectors.
Ideas are turned into geometric objects, and we can talk about the "distance" between ideas in a concrete way. This is radically powerful, particularly because we don't start out by telling a machine how to represent those concepts. Instead, the learning algorithm does that simply by processing enough examples.
Is that how an infant's brain works?
-
Arithmetization
An encoding of any possible statement as a number. This involves assigning a number to each word in the source language. E.g. We could assign a unique number to each word in English, and then write down all the works of Shakespeare using only those numbers. The long string of digits that results could be interpreted as a single number that inside it, contains his works.
Arithmetization is a form of compression, and it is a key concept in information theory. It was famously used by Claude Shannon to prove that there is a fundamental limit to the amount of information that can be transmitted through a communication channel, and by Godel to prove that there are fundamental limits to the power of any logical system.
Arithmetization is another way that we can turn information into numbers. I mention it here so that you can form better intuitions about how math can be used to represent elements of the human experience.
-
Over fitting (vs. Generalization)
If a model becomes "locked in" (that's not a technical term) to its training set, it will tend to act much like "when all you have is a hammer, everything looks like a nail."
If we want to use a statistical model to generate samples, we don't want it to only parrot or repeat examples from its training set. If a model has enough capacity it can learn to do just that. When this happens, we say that the model is overfit to its training set.
If you're trying to build a statistical model that predicts stock prices, you don't want it to just repeat the last pattern it saw. You want it to make a likely guess that it may never have seen before, but which is consistent with the current state of the world, and its training.
-
Regularization
A technique used to prevent overfitting by adding a penalty to the loss function for having large parameter values. A fascinating result from studying the theory of deep learning is that regularization happens anyway, without the need for explicit regularization techniques, when using stochastic gradient descent. Regardless, it is important to control for overfitting, and regularization is an effective technique for doing so.
-
L2 Regularization
A form of regularization that adds a penalty to the loss function for having large parameter values.
-
Early Stopping
A form of regularization that stops the training process before the model has a chance to overfit.
-
Dropout
A form of regularization that randomly drops out (i.e., sets to zero) a number of neurons in the network during each training step.
-
-
Natural Language
Text in the form of a language understood by, and evolved over time through cultural processes of people. A counterexample would be a programming language intended to be parsed by, and control the operation of, a machine.
Non natural language is often characterized by having a rigid syntax, and often being highly formalized. Mathematical notation and programming languages such as Python, are good examples of non natural language.
-
Vector Embeddings
The technique of representing data, including words, parts of words, punctuation, etc. as vectors in a high dimensional space. Vector embeddings may be prescribed, but more frequently, they are learned.
-
-
LLM: Large Language Model
E.g. ChatGPT, Claude, Gemini, Llama
LLMs derive their name from both the size of the training data set and the number of parameters, with some of the larger models ranging into the 100B (and soon even 1T) order of magnitude. Of course, the adjective "large," colloquially also refers to the time, resources and resultant capabilities of training an LLM. However, even smaller models, having parameters ranging into, or even somewhat below, the 10B range, are called LLMs. This is therefore somewhat of a context dependent term, referring to the non-trivial size of the model.
-
Foundational Model
The cost of training a high parameter count model is prohibitive. Foundational models are those that have been trained on the largest possible training sets by the largest organizations, and then made available for public use.
Typically, these models have reached record breaking levels of accuracy or performance scores, and may therefore serve as a foundation from which to build on.
-
Fine Tuning
A technique for adapting a foundational model to a specific task by additionally training it on a smaller dataset. Usually, the fine tunning data set comes from the same domain as the task, and is highly relevant to the task. For example, a foundational model trained on natural language might be fine-tuned on legal documents that come from the document store of a specific law firm.
-
Prompt Engineering
The techniques for best authoring prompts to achieve a desired outcome with an LLM. An example of prompt engineering is to ask a foundational model to generate a list of questions and answers, and then feed those results back into the model in order to improve its performance on a certain task.
-
Prompt Injection
A technique for injecting instructions into an LLM, often with the intent of influencing the behavior of the model. Retrieval Augmented Generation (RAG) is an example of prompt injection used for good. The term, however, is more frequently associated with malicious intent.
-
Direct Prompt Injection
- Simple Overwriting
- Appending Commands
- Instruction Reversal or Hijacking
-
Overriding instructions with new ones
- Input Escaping
- Exploiting structured input formats (e.g., JSON)
-
Role Confusion Exploitation / Roleplay and Persona Attacks
- Confusing or altering role-based instructions
- Convincing the model to take on a different role
-
Data Poisoning
- Injecting harmful or misleading training data
-
Meta Prompt Injection
- Exploiting multi-layered or hierarchical instructions
-
Output Manipulation via Repetitive or Loop Injection
- Forcing the model into repetitive responses
-
Prompt Shaping (Prompt Obfuscation)
- Providing inputs that obscure or shape the true intent
-
Negation or Misdirection
- Asking the model to provide opposite or indirect information
-
Contextual Coercion
- Leading the model with context to extract restricted information
-
ASCII/Unicode/Token Injection
- Using special characters or encodings to confuse token-based models
-
Ethical Override Injection
- Posing ethical dilemmas to bypass restrictions
-
RAG: Retrieval Augmented Generation
- Not all prompt injection is malicious. Some is simply a case of prompting the model to behave in a certain way. RAG is an example of this, and is described below under "RAG."
-
-
RAG: Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is a technique that analyzes the content of a prompt and then uses a retrieval system to find relevant data with which to augment the prompt. This sounds easier than it is, because the prompt needs to be analyzed in a way that is linguistically consistent with the data found by the retrieval system. RAG is a form of prompt injection.
It is important to note that RAG systems generally operate outside the LLM. They are bespoke systems separate from the LLM, and the LLM is used to process the output of the RAG system. RAG may employ the same, or another, LLM to analyze the prompt itself (rather than generate a response to the prompt). A smaller LLM fine-tuned on the specific task of prompt analysis for retrieval is sometimes used.
Frequently, the retrieval system employs a form of embedding of the prompt in order to do similarity analysis of the prompt with the data available to the retrieval system. This implies that the data in the retrieval system is also embedded, and that the embedding is the means by which the similarity analysis is performed. Thus, the data subject to retrieval must have been pre-processed and made ready for similarity search, such as through the use of vector embeddings and vector databases.
A technique known as "cross-attention" is often used to allow the LLM to pay attention to the most relevant portions of the retrieved data.
-
Graph Retrieval Augmented Generation
A form of RAG that uses a graph database to store and retrieve data.
-
-
Vector Databases
Databases that store data in the form of vectors, and use vector embeddings to perform similarity search. Vectors are basically a list of numbers, such as the (x,y) coordinates of a point in a two dimensional space. Therefore, it is possible to perform similarity analysis on vectors by measuring the distance between them. Vector databases index the vectors and use similarity metrics to search for and retrieve data.
-
Examples of Vector Databases
Examples include Chroma, Faiss, and Hologres.
-
-
Graph Databases
Databases that store data in the form of graphs, and use graph algorithms to perform similarity search. For example, a graph database might be used to store the relationships between concepts in a domain, such as the relationships between people and the companies they have worked for. Each company might be represented by a node, and the relationships between people and companies might be represented by a connection (line, edge, connection) between nodes.
Graph databases are a form of non-linear data structure, meaning that the data is not stored in a linear fashion, but rather in a network of nodes and connections between nodes. Graphs are reduced to some computable representation such as assigning a unique number (id) to each node, and representing the connections between two nodes as a pair of those ids.
-
Graph Embeddings
The technique of representing data as graphs and then using graph algorithms to perform similarity search.
-
Examples of Graph Databases
Graph databases that are publicly available for use. Examples include AnzoDB, Neo4j, JanusGraph, Dgraph, and TigerGraph.
-
-
Cross-Attention
A technique used in LLMs to allow a transformer model to pay attention to the most relevant portions of the retrieved data.
How To Reach Me
Please feel free to reach me in the comments of this LinkedIn post: LinkedIn post
Your comments are welcome!
© 2024 Andrew Athan