The AI Engineering Stack
AI engineering's rapid growth induced an incredible amount of hype and FOMO. The number of new tools, techniques, models, and applications introduced every day can be overwhelming.
Instead of chasing the constantly shifting sands, let's look into the fundamental building blocks of AI engineering.
Where AI Engineering Comes From
To understand AI engineering, it's important to recognize that AI engineering evolved out of ML engineering. When a company starts experimenting with foundation models, it's natural that its existing ML team should lead the effort. Some companies treat AI engineering the same as ML engineering, as shown in Figure 1-12.

Figure 1-12. Many companies put AI engineering and ML engineering under the same umbrella.
Some companies have separate job descriptions for AI engineering, as shown in Figure 1-13.

Figure 1-13. Some companies have separate job descriptions for AI engineering, as shown in the job headlines on LinkedIn from December 17, 2023.
To best understand AI engineering and how it differs from traditional ML engineering, the following section breaks down different layers of the AI application building process and looks at the role each layer plays in AI engineering and ML engineering.
Three Layers of the AI Stack
There are three layers to any AI application stack. When developing an AI application, you'll likely start from the top layer and move down as needed:
Application Development
Model Development
Infrastructure
These three layers and examples of responsibilities for each layer are shown in Figure 1-14.

Figure 1-14. Three layers of the AI engineering stack.
To get a sense of how the landscape has evolved with foundation models, in March 2024, I searched GitHub for all AI-related repositories with at least 500 stars. Given the prevalence of GitHub, I believe this data is a good proxy for understanding the ecosystem. In my analysis, I also included repositories for applications and models, which are the products of the application development and model development layers, respectively. I found a total of 920 repositories. Figure 1-15 shows the cumulative number of repositories in each category month-over-month.

Figure 1-15. Cumulative count of repositories by category over time.
While the level of excitement and creativity around foundation models is unprecedented, many principles of building AI applications remain the same. For enterprise use cases, AI applications still need to solve business problems, and, therefore, it's still essential to map from business metrics to ML metrics and vice versa. You still need to do systematic experimentation. With classical ML engineering, you experiment with different hyperparameters. With foundation models, you experiment with different models, prompts, retrieval algorithms, sampling variables, and more. (Sampling variables are discussed in Chapter 2.) We still want to make models run faster and cheaper. It's still important to set up a feedback loop so that we can iteratively improve our applications with production data.
This means that much of what ML engineers have learned and shared over the last decade is still applicable. This collective experience makes it easier for everyone to begin building AI applications. However, built on top of these enduring principles are many innovations unique to AI engineering, which we'll explore in this book.
AI Engineering Versus ML Engineering
While the unchanging principles of deploying AI applications are reassuring, it's also important to understand how things have changed. This is helpful for teams that want to adapt their existing platforms for new AI use cases and developers who are interested in which skills to learn to stay competitive in a new market.
At a high level, building applications using foundation models today differs from traditional ML engineering in three major ways:
Pre-trained models replace training your own
Without foundation models, you have to train your own models for your applications. With AI engineering, you use a model someone else has trained for you. This means that AI engineering focuses less on modeling and training, and more on model adaptation.
Bigger models, more compute pressure
AI engineering works with models that are bigger, consume more compute resources, and incur higher latency than traditional ML engineering. This means more pressure for efficient training and inference optimization. A corollary is that many companies now need more GPUs and work with bigger compute clusters than they previously did, creating more need for engineers who know how to work with GPUs and big clusters.1
Open-ended outputs make evaluation harder
AI engineering works with models that can produce open-ended outputs. Open-ended outputs give models the flexibility to be used for more tasks, but they are also harder to evaluate. This makes evaluation a much bigger problem in AI engineering.
In short, AI engineering differs from ML engineering in that it's less about model development and more about adapting and evaluating models. Before we move on, let's clarify what model adaptation means. In general, model adaptation techniques can be divided into two categories, depending on whether they require updating model weights.
Prompt-Based Techniques
Adapt a model without updating its weights. You adapt the model by giving it instructions and context instead of changing the model itself.
Prompt engineering is easier to get started with and requires less data. Many successful applications have been built with just prompt engineering. Its ease of use lets you experiment with more models, increasing your chance of finding one that's unexpectedly good for your application.
However, prompt engineering alone might not be enough for complex tasks or applications with strict performance requirements.
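To make this concrete, here is a minimal sketch of prompt-based adaptation using the OpenAI Python SDK. The model name and the sentiment task are illustrative assumptions; the point is that the model's weights are never touched — only the input changes.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Few-shot prompting: the examples in the messages adapt the model's
# behavior without any weight updates.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
        {"role": "user", "content": "Review: I loved every minute of it."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "Review: The battery died after a week."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Review: Shipping was fast and the quality is great."},
    ],
)
print(response.choices[0].message.content)  # expected: "positive"
```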
Finetuning
Adapt a model by updating its weights: you change the model itself rather than just its inputs.
In general, finetuning techniques are more complicated and require more data, but they can improve quality, latency, and cost significantly.
Many things aren't possible without changing model weights, such as adapting a model to a new task it wasn't exposed to during training.
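As a point of contrast with prompting, here is a minimal finetuning sketch using Hugging Face's peft library with LoRA, a popular parameter-efficient finetuning technique. The base model name and target modules are illustrative assumptions; what matters is that, unlike prompting, the model's weights are updated.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA trains small low-rank adapters instead of all weights,
# but it still changes the model itself, unlike prompting.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # depends on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
# From here, train on your labeled data, e.g., via transformers.Trainer.
```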
Now, let's zoom into the application development and model development layers to see how each has changed with AI engineering, starting with what existing ML engineers are more familiar with.
Model Development
Model development is the layer most commonly associated with traditional ML engineering. It has three main responsibilities: modeling and training, dataset engineering, and inference optimization. Evaluation is also required, but because most people will come across it first in the application development layer, I'll discuss evaluation in the next section.
Modeling and training
Modeling and training refers to the process of coming up with a model architecture, training it, and finetuning it. Examples of tools in this category are Google's TensorFlow, Hugging Face's Transformers, and Meta's PyTorch.
Developing ML models requires specialized ML knowledge. It requires knowing different types of ML algorithms (such as clustering, logistic regression, decision trees, and collaborative filtering) and neural network architectures (such as feedforward, recurrent, convolutional, and transformer). It also requires understanding how a model learns, including concepts such as gradient descent, loss function, regularization, etc.
Pre-training refers to training a model from scratch — the model weights are randomly initialized. For LLMs, pre-training often involves training a model for text completion.
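In code, the text-completion objective is just next-token prediction: given a prefix, predict the next token. A toy sketch with random tensors standing in for a real model and corpus:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50_000, 8
logits = torch.randn(seq_len, vocab_size, requires_grad=True)  # stand-in for model outputs
tokens = torch.randint(0, vocab_size, (seq_len + 1,))          # stand-in for a text sequence

# At each position t, the model is trained to predict token t+1,
# so the targets are simply the inputs shifted by one.
loss = F.cross_entropy(logits, tokens[1:])
loss.backward()  # in pre-training, this loop runs over trillions of tokens
```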
Out of all training steps, pre-training is often the most resource-intensive by a long shot. For the InstructGPT model, pre-training accounts for up to 98% of the overall compute and data resources. Pre-training also takes a long time. A small mistake during pre-training can incur a significant financial loss and set the project back considerably.
Due to its resource-intensive nature, pre-training has become an art that only a few practice. Those with expertise in pre-training large models, however, are heavily sought after.2
Many people use post-training to refer to the process of training a model after the pre-training phase. Conceptually, post-training and finetuning are the same and can be used interchangeably. However, sometimes people use them differently to signify different goals.
- It's usually post-training when it's done by model developers. For example, OpenAI might post-train a model to make it better at following instructions before releasing it.
- It's finetuning when it's done by application developers. For example, you might finetune an OpenAI model (which might have been post-trained itself) to adapt it to your needs.
Pre-training and post-training make up a spectrum.3 Their processes and toolings are very similar. Their differences are explored further in Chapters 2 and 7.
Some people use the term training to refer to prompt engineering, which isn't correct. I read a Business Insider article where the author said she trained ChatGPT to mimic her younger self by feeding it her childhood journal entries.
Colloquially, the author's usage of the word training is correct, as she's teaching the model to do something. But technically, if you teach a model what to do via the context input into the model, you're doing prompt engineering. Similarly, I've seen people use the term finetuning when what they're actually doing is prompt engineering.
Dataset engineering
Dataset engineering refers to curating, generating, and annotating the data needed for training and adapting AI models.
One difference lies in the nature of the tasks. Traditional ML engineering mostly works with closed-ended tasks, where outputs come from a predefined set of values, so annotation is relatively straightforward. Foundation models work with open-ended tasks, which changes what data is needed and how it's labeled.
Another difference is that traditional ML engineering works more with tabular data, whereas foundation models work with unstructured data. In AI engineering, data manipulation is more about deduplication, tokenization, context retrieval, and quality control, including removing sensitive information and toxic data. Dataset engineering is the focus of Chapter 8.
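Two of those data manipulation steps — deduplication and tokenization — are easy to sketch. Here's a minimal example using exact hashing for dedup (production pipelines typically add near-duplicate detection such as MinHash) and a Hugging Face tokenizer:

```python
import hashlib
from transformers import AutoTokenizer

def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates by content hash."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

tokenizer = AutoTokenizer.from_pretrained("gpt2")
docs = dedupe(["the cat sat", "the cat sat", "on the mat"])
token_ids = [tokenizer(doc)["input_ids"] for doc in docs]
print(len(docs), token_ids)  # 2 unique docs, each as a list of token ids
```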
Inference optimization
Inference optimization means making models faster and cheaper. Inference optimization has always been important for ML engineering. Users never say no to faster models, and companies can always benefit from cheaper inference. However, as foundation models scale up to incur even higher inference cost and latency, inference optimization has become even more important.
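As one small example of what inference optimization looks like in practice, here is PyTorch's post-training dynamic quantization, which stores Linear-layer weights in int8 to reduce memory use and often latency. This is a toy sketch; LLM serving stacks rely on more specialized techniques.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

# Dynamic quantization: weights are stored in int8 and dequantized
# on the fly, trading a little accuracy for memory and speed.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
output = quantized(torch.randn(1, 512))
```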
A summary of how the importance of different categories of model development change with AI engineering is shown in Table 1-4.
Table 1-4. How different responsibilities of model development have changed with foundation models.
| Category | Building with traditional ML | Building with foundation models |
|---|---|---|
| Modeling and training | ML knowledge is required for training a model from scratch | ML knowledge is a nice-to-have, not a must-have4 |
| Dataset engineering | More about feature engineering, especially with tabular data | Less about feature engineering and more about data deduplication, tokenization, context retrieval, and quality control |
| Inference optimization | Important | Even more important |
Inference optimization techniques, including quantization, distillation, and parallelism, are discussed in Chapters 7 through 9.
Application Development
With traditional ML engineering, where teams build applications using their proprietary models, model quality is a differentiator. With foundation models, where many teams use the same model, differentiation must be gained through the application development process.
The application development layer consists of three responsibilities: evaluation, prompt engineering, and AI interface.
Evaluation
Evaluation is about mitigating risks and uncovering opportunities. Evaluation is necessary throughout the whole model adaptation process — to select models, to benchmark progress, to determine whether an application is ready for deployment, and to detect issues and opportunities for improvement in production.
While evaluation has always been important in ML engineering, it's even more important with foundation models. The challenges of evaluating foundation models are discussed in Chapter 3. To summarize, these challenges chiefly arise from foundation models' open-ended nature and expanded capabilities.
Closed-ended tasks, such as classification over a fixed set of labels, can be scored automatically against ground truth. Open-ended tasks can have many acceptable responses, which makes it much harder to verify whether a given response is correct.
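A toy sketch makes the difficulty obvious: exact-match scoring, which works fine for closed-ended outputs, breaks down as soon as outputs are open-ended.

```python
def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

# Closed-ended: one correct label, trivial to score automatically.
print(exact_match("Positive", "positive"))  # True

# Open-ended: a correct answer fails naive string matching.
print(exact_match("The capital of France is Paris.", "Paris"))  # False
```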
The existence of so many adaptation techniques also makes evaluation harder. A system that performs poorly with one technique might perform much better with another. When Google launched Gemini in December 2023, they claimed that Gemini is better than ChatGPT on the MMLU benchmark (Hendrycks et al., 2020). Google had evaluated Gemini using a prompt engineering technique called CoT@32. In this technique, Gemini was shown 32 examples, while ChatGPT was shown only 5 examples. When both were shown five examples, ChatGPT performed better, as shown in Table 1-5.
Table 1-5. Different prompts can cause models to perform very differently, as seen in Gemini's technical report (December 2023).
| | Gemini Ultra | Gemini Pro | GPT-4 | GPT-3.5 | PaLM 2-L | Claude 2 | Inflection-2 | Grok 1 | Llama-2 |
|---|---|---|---|---|---|---|---|---|---|
| MMLU performance | 90.04% CoT@32 | 79.13% CoT@8 | 87.29% CoT@32 (via API) | 70% 5-shot | 78.4% 5-shot | 78.5% 5-shot CoT | 79.6% 5-shot | 73.0% 5-shot | 68.0% |
| | 83.7% 5-shot | 71.8% 5-shot | 86.4% 5-shot (reported) | | | | | | |
Prompt engineering and context construction
Prompt engineering is about getting AI models to exhibit the desired behaviors from the input alone, without changing the model weights. The Gemini evaluation story highlights the impact of prompt engineering on model performance. By using a different prompt engineering technique, Gemini Ultra's performance on MMLU went from 83.7% to 90.04%.
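To illustrate, here is a minimal sketch of how a k-shot chain-of-thought prompt might be assembled. The exact template is an assumption; the point is that techniques like CoT@32 and 5-shot differ in the prompt, not in the model.

```python
def build_cot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a k-shot chain-of-thought prompt: k worked examples, then the query."""
    shots = "\n\n".join(
        f"Q: {q}\nA: Let's think step by step. {a}" for q, a in examples
    )
    return f"{shots}\n\nQ: {question}\nA: Let's think step by step."

# 5-shot vs. 32-shot is just len(examples); the model weights are unchanged.
prompt = build_cot_prompt(
    "What is 17 * 23?",
    [("What is 2 + 2?", "2 + 2 = 4. The answer is 4.")],
)
print(prompt)
```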
AI interface
AI interface means creating an interface for end users to interact with your AI applications. Before foundation models, only organizations with sufficient resources to develop AI models could develop AI applications. These applications were often embedded into the organizations' existing products. For example, fraud detection was embedded into Stripe, Venmo, and PayPal. Recommender systems were part of social networks and media apps like Netflix, TikTok, and Spotify.
With foundation models, anyone can build AI applications. You can serve your AI applications as standalone products or embed them into other products, including products developed by other people. For example, ChatGPT and Perplexity are standalone products, whereas GitHub Copilot is commonly used as a plug-in in VSCode, Grammarly as a browser extension for Google Docs, and Midjourney via its standalone web app or its Discord integration.
Here are some of the interfaces that are gaining popularity for AI applications:
- Standalone apps (e.g., ChatGPT, Perplexity)5
- Browser extensions (e.g., Grammarly)
- Integrations into chat apps (e.g., Midjourney in Discord)
- Plug-ins and APIs that embed AI into other products (e.g., GitHub Copilot in VSCode)
While the chat interface is the most commonly used, AI interfaces can also be voice-based (such as with voice assistants) or embodied (such as in augmented and virtual reality).
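Tools like Gradio (see footnote 5) make standalone chat interfaces nearly free to prototype. A minimal sketch, with an echo function as a placeholder for a real model call:

```python
import gradio as gr

def respond(message: str, history: list) -> str:
    # Placeholder: in a real app, call your model or a provider API here.
    return f"You said: {message}"

# gr.ChatInterface wraps any (message, history) -> reply function in a chat UI.
gr.ChatInterface(respond).launch()
```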
A summary of how the importance of different categories of app development changes with AI engineering is shown in Table 1-6.
Table 1-6. The importance of different categories in app development for AI engineering and ML engineering.
| Category | Building with traditional ML | Building with foundation models |
|---|---|---|
| AI interface | Less important | Important |
| Prompt engineering | Not applicable | Important |
| Evaluation | Important | More important |
AI Engineering Versus Full-Stack Engineering
The increased emphasis on application development, especially on interfaces, brings AI engineering closer to full-stack development.6 The rising importance of interfaces leads to a shift in the design of AI toolings to attract more frontend engineers.
Where AI toolings used to be Python-centric, JavaScript is now joining the stack: many model providers and frameworks ship JavaScript/TypeScript libraries alongside their Python ones, making it easier for frontend engineers to build with AI.
While many AI engineers come from traditional ML backgrounds, more are increasingly coming from web development or full-stack backgrounds. An advantage that full-stack engineers have over traditional ML engineers is their ability to quickly turn ideas into demos, get feedback, and iterate.

Figure 1-16. The new AI engineering workflow rewards those who can iterate fast. Image recreated from "The Rise of the AI Engineer" (Shawn Wang, 2023).
In traditional ML engineering, model development and product development are often disjointed processes, with ML engineers rarely involved in product decisions at many organizations. However, with foundation models, AI engineers tend to be much more involved in building the product.
Footnotes
- As the head of AI at a Fortune 500 company told me: his team knows how to work with 10 GPUs, but they don't know how to work with 1,000 GPUs. ↩
- And they are offered incredible compensation packages. ↩
- If you find the terms "pre-training" and "post-training" lacking in imagination, you're not alone. The AI research community is great at many things, but naming isn't one of them. We already talked about how "large language models" is hardly a scientific term because of the ambiguity of the word "large". And I really wish people would stop publishing papers with the title "X is all you need." ↩
- Many people would dispute this claim, saying that ML knowledge is a must-have. ↩
- Streamlit, Gradio, and Plotly Dash are common tools for building AI web apps. ↩
- Anton Bacaj told me that "AI engineering is just software engineering with AI models thrown in the stack." ↩