Introduction to Building AI Applications with Foundation Models

The Rise of AI Engineering

Trace how decades of advances in language models, self-supervision, and multimodality produced foundation models — and turned AI engineering into a discipline of its own.

Foundation models emerged from large language models, which, in turn, originated as just language models. While applications like ChatGPT and GitHub Copilot may seem to have come out of nowhere, they are the culmination of decades of technology advancements — with the first language models emerging in the 1950s.

This section traces the key breakthroughs that enabled the evolution from language models to AI engineering.

From Language Models to Large Language Models

Language models have been around for a while, but they've only been able to grow to the scale they are today with self-supervision. This section gives a quick overview of what language model and self-supervision mean.

If you're already familiar with these concepts, feel free to skip ahead.

Language Models

A language model encodes statistical information about one or more languages. Intuitively, this information tells us how likely a word is to appear in a given context. For example, given the context My favorite color is, a language model that encodes English should predict blue more often than car.

The statistical nature of languages was discovered centuries ago.

1905 — Sherlock Holmes

In The Adventure of the Dancing Men, Holmes used simple statistics of English to decode mysterious stick figures. Since the most common letter in English is E, Holmes deduced the most common stick figure must stand for E.

1951 — Claude Shannon

Used more sophisticated statistics to decipher enemy messages during WWII. His landmark paper "Prediction and Entropy of Printed English" introduced concepts, including entropy, that are still used in language modeling today.

In the early days, a language model involved only one language. Today, a language model can involve multiple languages.

Tokens

The basic unit of a language model is a token. A token can be a character, a word, or a part of a word (like -tion), depending on the model. For non-English languages, a single Unicode character can sometimes be represented as multiple tokens.

For example, GPT-4 — the model behind ChatGPT — breaks the phrase I can't wait to build AI applications into nine tokens. Note that the word can't is broken into two tokens, can and 't. You can see how different OpenAI models tokenize text on the OpenAI website.

Figure 1-1. An example of how GPT-4 tokenizes a phrase.

The process of breaking the original text into tokens is called tokenization. For GPT-4, an average token is approximately ¾ the length of a word. So, 100 tokens are approximately 75 words.

The set of all tokens a model can work with is the model's vocabulary. You can use a small number of tokens to construct a large number of distinct words, similar to how you can use a few letters in the alphabet to construct many words.

Mixtral 8x7B

Vocabulary size of 32,000 tokens.

GPT-4

Vocabulary size of 100,256 tokens.

The tokenization method and vocabulary size are decided by model developers.

Why do language models use tokens as their unit instead of words or characters? There are three main reasons:
  1. Compared to characters, tokens allow the model to break words into meaningful components. For example, cooking can be broken into cook and ing, with both components carrying some meaning of the original word.
  2. There are fewer unique tokens than unique words, which reduces the model's vocabulary size, making the model more efficient (as discussed in Chapter 2).
  3. Tokens also help the model process unknown words. For instance, a made-up word like chatgpting could be split into chatgpt and ing, helping the model understand its structure.
Tokens balance having fewer units than words while retaining more meaning than individual characters.
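
To see tokenization in action, here's a minimal sketch using OpenAI's open source tiktoken library (an assumption: the library is installed, e.g. via pip). The cl100k_base encoding corresponds to GPT-4-era models.

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("I can't wait to build AI applications")
print(len(token_ids))  # number of tokens in the phrase

# Inspect the split; note that "can't" is broken across tokens.
print([enc.decode([tid]) for tid in token_ids])

# The size of this encoding's vocabulary.
print(enc.n_vocab)
```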

Two Main Types of Language Models

There are two main types of language models. They differ based on what information they can use to predict a token.

Masked Language Model

Trained to predict missing tokens anywhere in a sequence, using context from both before and after the missing tokens. Essentially trained to fill in the blank.

Example: given My favorite __ is blue, predict color.

A well-known example is BERT (Devlin et al., 2018). Today, masked language models are commonly used for non-generative tasks like sentiment analysis, text classification, and code debugging — where understanding the overall context matters.

Autoregressive Language Model

Trained to predict the next token in a sequence, using only the preceding tokens. It predicts what comes next in My favorite color is __.

An autoregressive model can continually generate one token after another. Today, autoregressive language models are the models of choice for text generation and are far more popular than masked language models. (Sometimes referred to as causal language models.)

Technically, a masked language model like BERT can also be used for text generation — if you try really hard.

Figure 1-2. Autoregressive language model and masked language model.

In this book, unless explicitly stated, language model will refer to an autoregressive model.
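
To make the distinction concrete, here's a small sketch using Hugging Face's transformers pipelines, assuming the library is installed and the reference checkpoints (bert-base-uncased, gpt2) can be downloaded:

```python
from transformers import pipeline

# Masked LM: fill in a blank anywhere in the sequence (BERT).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("My favorite [MASK] is blue.")[0]["token_str"])  # e.g., "color"

# Autoregressive LM: continue the sequence left to right (GPT-2).
generate = pipeline("text-generation", model="gpt2")
print(generate("My favorite color is", max_new_tokens=5)[0]["generated_text"])
```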

Language Models as Completion Machines

The outputs of language models are open-ended. A language model can use its fixed, finite vocabulary to construct infinite possible outputs. A model that can generate open-ended outputs is called generative — hence the term generative AI.

You can think of a language model as a completion machine: given a text (prompt), it tries to complete that text.

Prompt (from user):       "To be or not to be"
Completion (from model):  ", that is the question."

Completions are predictions based on probabilities and are not guaranteed to be correct. This probabilistic nature is what makes language models both exciting and frustrating to use. We explore this further in Chapter 2.

As simple as it sounds, completion is incredibly powerful. Many tasks — translation, summarization, coding, and solving math problems — can be framed as completion tasks.

Translation

Prompt: "How are you" in French is …

Completion: Comment ça va

Spam Classification

Prompt: Is this email likely spam? Here's the email: <email content>. Answer:

Completion: Likely spam
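
As an illustration, here's how the spam-classification prompt above might be sent to a hosted model, sketched with OpenAI's Python client; the model name is a placeholder and the email content is made up:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

email_content = "You won a free cruise! Click here to claim your prize."

# The classification task, framed as a completion task.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute any available model
    messages=[{
        "role": "user",
        "content": f"Is this email likely spam? Here's the email: {email_content}. Answer:",
    }],
)
print(response.choices[0].message.content)  # e.g., "Likely spam"
```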

Completion isn't the same as engaging in a conversation. If you ask a completion machine a question, it can complete what you said by adding another question instead of answering. "Post-Training" on page 78 discusses how to make a model respond appropriately to a user's request.

Self-Supervision

Language modeling is just one of many ML algorithms. There are also models for object detection, topic modeling, recommender systems, weather forecasting, stock price prediction, and more. What's special about language models that made them the center of the scaling approach behind the ChatGPT moment?

The answer is that language models can be trained using self-supervision, while many other models require supervision. Self-supervision overcomes the data labeling bottleneck, allowing models to scale up.

Supervision vs. Self-Supervision

Supervision

You label examples to show the behaviors you want the model to learn, then train the model on these examples.

Example: to train a fraud detection model, you use transactions each labeled fraud or not fraud.

The success of AI models in the 2010s lay in supervision. AlexNet (Krizhevsky et al., 2012), the model that started the deep learning revolution, was supervised — trained on ImageNet to classify over 1 million images into 1,000 categories such as car, balloon, or monkey.

Self-Supervision

Instead of requiring explicit labels, the model infers labels from the input data. Language modeling is self-supervised because each input sequence provides both the labels (tokens to be predicted) and the context for predicting them.

Because text sequences are everywhere — books, blog posts, articles, Reddit comments — it's possible to construct a massive amount of training data, allowing language models to scale up to LLMs.

The labeling bottleneck. If it costs 5¢ for one person to label one image, it'd cost $50,000 to label a million images for ImageNet. With cross-checking by a second labeler, it'd cost twice as much. Scaling to 1 million categories would push the labeling cost alone to $50 million. The actual cost varies — Amazon SageMaker Ground Truth charges 8¢ per image for fewer than 50,000 images, dropping to 2¢ above 1 million (as of September 2024). And not all labeling is cheap: generating Latin translations is harder than tagging everyday objects, and the cost of labeling whether a CT scan shows signs of cancer would be astronomical.

For example, the sentence I love street food. gives six self-supervised training samples:

Input (context)                          Output (next token)
<BOS>                                    I
<BOS>, I                                 love
<BOS>, I, love                           street
<BOS>, I, love, street                   food
<BOS>, I, love, street, food             .
<BOS>, I, love, street, food, .          <EOS>

Table 1-1. Training samples from the sentence I love street food. for language modeling.

<BOS> and <EOS> mark the beginning and the end of a sequence. These markers are necessary for a language model to work with multiple sequences. Each marker is typically treated as one special token by the model. The end-of-sequence marker is especially important — it helps language models know when to end their responses (similar to how it's important for humans to know when to stop talking).
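
Here's a minimal sketch of how these (context, next token) pairs can be constructed, using a toy word-level tokenization for readability:

```python
# Toy word-level tokenization of "I love street food." for illustration;
# real models operate on subword tokens.
tokens = ["<BOS>", "I", "love", "street", "food", ".", "<EOS>"]

# Each position yields one training sample: all preceding tokens are
# the context, and the token at that position is the label.
samples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in samples:
    print(f"{' '.join(context):32} -> {target}")
```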

Self-supervision differs from unsupervised learning. In self-supervised learning, labels are inferred from the input data; in unsupervised learning, no labels are needed at all.

From Language Models to LLMs

Self-supervised learning means language models can learn from text sequences without requiring any labeling. Because text is everywhere, it's possible to construct massive training datasets — allowing language models to scale up to become LLMs.

LLM is hardly a scientific term. How large does a language model have to be to be considered large? What is large today might be considered tiny tomorrow.

A model's size is typically measured by its number of parameters. A parameter is a variable within an ML model that is updated through the training process. (Strictly speaking, a model's parameters comprise its weights and its biases, but today model weights is commonly used to refer to all parameters.) In general — though not always — the more parameters a model has, the greater its capacity to learn desired behaviors.
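
As a concrete illustration, this sketch counts the parameters of a small PyTorch model (an assumption: PyTorch is installed); the layer sizes are arbitrary:

```python
import torch.nn as nn

# A toy two-layer network. Each Linear layer contributes a weight
# matrix (in_features x out_features) plus a bias vector (out_features).
model = nn.Sequential(
    nn.Linear(512, 1024),  # 512 * 1024 weights + 1,024 biases
    nn.ReLU(),
    nn.Linear(1024, 512),  # 1024 * 512 weights + 512 biases
)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # 1,050,112
```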

June 2018 — GPT-1

117 million parameters. Considered large at the time.

February 2019 — GPT-2

1.5 billion parameters. 117 million was downgraded to "small."

Today

A model with 100 billion parameters is considered large. Perhaps one day, this size will be considered small.

Why do larger models need more data? It seems counterintuitive — if a model is more powerful, shouldn't it need fewer examples to learn from? But we're not trying to match a small model's performance with the same data; we're trying to maximize model performance. Larger models have more capacity to learn, and therefore need more training data to maximize it. You can train a large model on a small dataset, but it'd be a waste of compute — you could have achieved similar or better results with a smaller model.

From Large Language Models to Foundation Models

While language models are capable of incredible tasks, they are limited to text. As humans, we perceive the world not just via language but also through vision, hearing, touch, and more. Being able to process data beyond text is essential for AI to operate in the real world.

For this reason, language models are being extended to incorporate more data modalities. GPT-4V and Claude 3 can understand both images and text. Some models even understand videos, 3D assets, protein structures, and more.

Incorporating additional modalities (such as image inputs) into LLMs is viewed by some as a key frontier in AI research and development. — OpenAI, GPT-4V system card, 2023

While many people still call Gemini and GPT-4V LLMs, they're better characterized as foundation models. The word foundation signifies both the importance of these models in AI applications and the fact that they can be built upon for different needs.

A Breakthrough From the Old Structure of AI Research

For a long time, AI research was divided by data modalities. Each branch handled its own type of input, with little overlap.

NLP

Text-only. Translation, spam detection.

Computer Vision

Image-only. Object detection, image classification.

Audio

Audio-only. Speech recognition (STT), speech synthesis (TTS).

A model that can work with more than one data modality is also called a multimodal model. A generative multimodal model is also called a large multimodal model (LMM). If a language model generates the next token conditioned on text-only tokens, a multimodal model generates the next token conditioned on both text and image tokens — or whichever modalities the model supports.

Figure 1-3. A multimodal model can generate the next token using information from both text and visual tokens.

This book uses the term foundation models to refer to both large language models and large multimodal models.
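
From an application developer's perspective, conditioning on multiple modalities can be as simple as mixing text and image parts in one request. Here's a sketch using OpenAI's Python client, with the model name and image URL as placeholder assumptions:

```python
from openai import OpenAI

client = OpenAI()

# A single user message mixing a text part and an image part; the model
# conditions its output tokens on both.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```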

Just like language models, multimodal models need data to scale up. Self-supervision works for them too. OpenAI used a variant called natural language supervision to train CLIP (OpenAI, 2021). Instead of manually generating labels for each image, they found (image, text) pairs that co-occurred on the internet — yielding a dataset of 400 million pairs, 400× larger than ImageNet, with no manual labeling cost.

CLIP became the first model that could generalize to multiple image classification tasks without requiring additional training.

CLIP isn't a generative model — it wasn't trained to generate open-ended outputs. CLIP is an embedding model, trained to produce joint embeddings of texts and images. "Introduction to Embedding" on page 134 discusses embeddings in detail; for now, you can think of embeddings as vectors that aim to capture the meanings of the original data. Multimodal embedding models like CLIP are the backbones of generative multimodal models such as Flamingo, LLaVA, and Gemini (previously Bard).
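
For the curious, here's roughly what using CLIP's joint embeddings looks like with the Hugging Face transformers implementation; the image URL is a placeholder assumption:

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load OpenAI's CLIP (ViT-B/32 checkpoint) from the Hugging Face Hub.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The image URL is a placeholder; any image works.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
captions = ["a photo of a cat", "a photo of a dog"]

# Embed the image and both captions into the same (joint) space,
# then compare the image against each caption.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))  # higher = better match
```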

From Task-Specific to General-Purpose

Foundation models also mark the transition from task-specific models to general-purpose models. Previously, models were often developed for specific tasks, such as sentiment analysis or translation. A model trained for sentiment analysis wouldn't be able to do translation, and vice versa.

Foundation models, thanks to their scale and the way they are trained, are capable of a wide range of tasks. An LLM can do both sentiment analysis and translation. However, you can often tweak a general-purpose model to maximize its performance on a specific task.

Figure 1-4. The range of tasks in the Super-Natural-Instructions benchmark (Wang et al., 2022).

Adapting a Model to Your Needs

Imagine you're working with a retailer to build an application to generate product descriptions for their website. An out-of-the-box model might generate accurate descriptions but fail to capture the brand's voice or highlight the brand's messaging. The generated descriptions might even be full of marketing speak and clichés.

There are three common AI engineering techniques to adapt a model to your needs. The rest of the book will discuss all of them in detail.

Prompt Engineering

Craft detailed instructions with examples of the desirable outputs.

Retrieval-Augmented Generation (RAG)

Connect the model to a database (e.g., customer reviews) that it can leverage to generate better outputs.

Finetuning

Further train the model on a dataset of high-quality examples.

Adapting an existing powerful model to your task is generally a lot easier than building one from scratch — for example, ten examples and one weekend versus 1 million examples and six months. Foundation models make it cheaper to develop AI applications and reduce time to market. Exactly how much data is needed depends on the technique you use.

There are still many benefits to task-specific models — for example, they might be a lot smaller, making them faster and cheaper to use. Whether to build your own model or leverage an existing one is a classic buy-or-build question that teams will have to answer for themselves.
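
To make one of these techniques concrete, below is a deliberately simplified RAG sketch for the product-description scenario. Keyword overlap stands in for the embedding-based retrieval a real system would use, and the product and reviews are made up:

```python
# Toy review "database" for the retailer scenario; all data is made up.
reviews = [
    "The jacket is warm but runs small.",
    "Love the color, and the zipper feels sturdy.",
    "Shipping was slow but the fabric quality is great.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by words shared with the query (a stand-in for
    embedding-based retrieval)."""
    query_words = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(query_words & set(d.lower().split())))[:k]

product = "insulated winter jacket"
context = "\n".join(retrieve(product, reviews))
prompt = (
    f"Using these customer reviews:\n{context}\n\n"
    f"Write a product description for: {product}"
)
print(prompt)  # pass this augmented prompt to the model of your choice
```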

From Foundation Models to AI Engineering

AI engineering refers to the process of building applications on top of foundation models. People have been building AI applications for over a decade — a process often known as ML engineering or MLOps (short for ML operations). Why do we talk about AI engineering now?

If traditional ML engineering involves developing ML models, AI engineering leverages existing ones.

The availability and accessibility of powerful foundation models lead to three factors that, together, create ideal conditions for the rapid growth of AI engineering as a discipline.

Three Factors Driving the Growth of AI Engineering

Factor 1 — General-Purpose AI Capabilities

Foundation models are powerful not just because they can do existing tasks better — they can do more tasks. Applications previously thought impossible are now possible, and applications not thought of before are emerging. Even applications not thought possible today might be possible tomorrow. This vastly increases both the user base and the demand for AI applications.

Since AI can now write as well as humans (sometimes even better), it can automate or partially automate every task that requires communication — which is pretty much everything. AI is used to write emails, respond to customer requests, and explain complex contracts. Anyone with a computer has access to tools that can instantly generate customized, high-quality images and videos to create marketing materials, edit professional headshots, visualize art concepts, illustrate books, and more. AI is even used to synthesize training data, develop algorithms, and write code — all of which will help train even more powerful models in the future.

Factor 2 — Increased AI Investments

The success of ChatGPT prompted a sharp increase in investments in AI, both from venture capitalists and enterprises. As AI applications become cheaper to build and faster to go to market, returns on investment for AI become more attractive. Companies rush to incorporate AI into their products and processes.

Matt Ross, a senior manager of applied research at Scribd, told me that the estimated AI cost for his use cases has gone down two orders of magnitude from April 2022 to April 2023.

Goldman Sachs Research estimated that AI investment could approach $100 billion in the US and $200 billion globally by 2025. (For comparison, the entire US expenditure on public elementary and secondary schools is around $900 billion — only nine times the projected investment in AI in the US.)

AI is often mentioned as a competitive advantage. FactSet found that one in three S&P 500 companies mentioned AI in their earnings calls for Q2 2023 — three times more than the year earlier. According to WallStreetZen, companies that mentioned AI in their earnings calls saw their stock price increase more than those that didn't — an average 4.6% increase compared to 2.4%. It's unclear whether it's causation (AI makes these companies more successful) or correlation (companies are successful because they're quick to adapt to new technologies).

Factor 3 — Low Entrance Barrier to Building AI Applications

The model-as-a-service approach popularized by OpenAI and other model providers makes it easier to leverage AI to build applications. Models are exposed via APIs that receive user queries and return outputs — giving you access to powerful models via single API calls, without the infrastructure to host and serve them yourself.
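
Here's what such a single API call can look like at the HTTP level, sketched against OpenAI's chat completions endpoint; the model name and prompt are placeholders:

```python
import os
import requests

# One HTTPS request to a hosted model; no weights or serving
# infrastructure on your side.
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [
            {"role": "user", "content": "Draft a one-line product tagline."}
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```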

AI also makes it possible to build applications with minimal coding:

  • AI can write code for you, allowing people without a software engineering background to quickly turn their ideas into running applications and put them in front of their users.
  • You can work with these models in plain English instead of a programming language.

Anyone, and I mean anyone, can now develop AI applications.

Figure 1-5. The number of S&P 500 companies that mention AI in their earnings calls reached a record high in 2023. Data from FactSet.

Because of the resources it takes to develop foundation models, this process is possible only for big corporations (Google, Meta, Microsoft, Baidu, Tencent), governments (Japan, the UAE), and ambitious, well-funded startups (OpenAI, Anthropic, Mistral). In a September 2022 interview, Sam Altman, CEO of OpenAI, said the biggest opportunity for the vast majority of people will be to adapt these models for specific applications.

The Fastest-Growing Engineering Discipline

The world is quick to embrace this opportunity. AI engineering has rapidly emerged as one of the fastest — quite possibly the fastest — growing engineering disciplines. Tools for AI engineering are gaining traction faster than any previous software engineering tools.

Faster Than Bitcoin

Within just two years, four open-source AI engineering tools — AutoGPT, Stable Diffusion Web UI, LangChain, Ollama — have already garnered more GitHub stars than Bitcoin's repository.

Catching React and Vue

These tools are on track to surpass even the most popular web development frameworks, including React and Vue, in star count.

75% Monthly Profile Growth

A LinkedIn survey (Aug 2023) shows the number of professionals adding terms like Generative AI, ChatGPT, Prompt Engineering, and Prompt Crafting to their profiles increased on average 75% each month.

Figure 1-6. Open source AI engineering tools are growing faster than any other software engineering tools, according to their GitHub star counts.

Teaching AI to behave is the fastest-growing career skill. — ComputerWorld

Why the Term "AI Engineering"?

Many terms are used to describe the process of building applications on top of foundation models — ML engineering, MLOps, AIOps, LLMOps, and so on. Why did I choose to go with AI engineering for this book?

In short, because building applications on top of existing foundation models is distinct enough from traditional ML engineering, which centers on developing models, to deserve its own name. The rapidly expanding community of AI engineers has demonstrated remarkable creativity with an incredible range of exciting applications. The next section explores some of the most common application patterns.