[{"data":1,"prerenderedAt":1472},["ShallowReactive",2],{"navigation_docs_en":3,"\u002Fen\u002Fai-engineering\u002Fintro\u002Fch011-the-rise-of-ai-engineering":46,"\u002Fen\u002Fai-engineering\u002Fintro\u002Fch011-the-rise-of-ai-engineering-surround":1467},[4],{"title":5,"icon":6,"path":7,"stem":8,"children":9,"page":45},"AI Engineering",null,"\u002Fen\u002Fai-engineering","en\u002F1.ai-engineering",[10],{"title":11,"icon":12,"path":13,"stem":14,"children":15,"page":45},"Introduction to Building AI Applications with Foundation Models","i-lucide-brain-circuit","\u002Fen\u002Fai-engineering\u002Fintro","en\u002F1.ai-engineering\u002F1.intro",[16,20,25,30,35,40],{"title":11,"path":17,"stem":18,"icon":19},"\u002Fen\u002Fai-engineering\u002Fintro\u002Fch01","en\u002F1.ai-engineering\u002F1.intro\u002Fch01","i-lucide-sparkles",{"title":21,"path":22,"stem":23,"icon":24},"The Rise of AI Engineering","\u002Fen\u002Fai-engineering\u002Fintro\u002Fch011-the-rise-of-ai-engineering","en\u002F1.ai-engineering\u002F1.intro\u002Fch011-the-rise-of-ai-engineering","i-lucide-history",{"title":26,"path":27,"stem":28,"icon":29},"Foundation Model Use Cases","\u002Fen\u002Fai-engineering\u002Fintro\u002Fch012-foundation-model-use-cases","en\u002F1.ai-engineering\u002F1.intro\u002Fch012-foundation-model-use-cases","i-lucide-layout-grid",{"title":31,"path":32,"stem":33,"icon":34},"Planning AI Applications","\u002Fen\u002Fai-engineering\u002Fintro\u002Fch013-planning-ai-applications","en\u002F1.ai-engineering\u002F1.intro\u002Fch013-planning-ai-applications","i-lucide-clipboard-list",{"title":36,"path":37,"stem":38,"icon":39},"The AI Engineering 
Stack","\u002Fen\u002Fai-engineering\u002Fintro\u002Fch014-the-ai-engineering-stack","en\u002F1.ai-engineering\u002F1.intro\u002Fch014-the-ai-engineering-stack","i-lucide-layers",{"title":41,"path":42,"stem":43,"icon":44},"Summary","\u002Fen\u002Fai-engineering\u002Fintro\u002Fch015-summary","en\u002F1.ai-engineering\u002F1.intro\u002Fch015-summary","i-lucide-flag",false,{"id":47,"title":21,"body":48,"description":1461,"extension":1462,"links":6,"meta":1463,"navigation":1464,"path":22,"seo":1465,"stem":23,"__hash__":1466},"docs_en\u002Fen\u002F1.ai-engineering\u002F1.intro\u002Fch011-the-rise-of-ai-engineering.md",{"type":49,"value":50,"toc":1444},"minimark",[51,80,85,99,103,108,126,129,171,178,183,194,217,227,243,250,267,275,340,344,347,423,426,435,444,448,458,465,474,481,484,516,527,530,536,549,553,619,646,653,734,742,753,759,763,769,779,794,829,843,847,854,864,885,904,908,911,937,948,957,963,987,993,1012,1016,1023,1034,1049,1053,1060,1066,1094,1105,1116,1120,1133,1144,1147,1151,1276,1285,1308,1312,1319,1364,1373,1390,1394,1400,1438],[52,53,54,58],"u-page-hero",{},[55,56,21],"template",{"v-slot:title":57},"",[55,59,60,77],{"v-slot:description":57},[61,62,63,64,68,69,72,73,76],"p",{},"Foundation models emerged from large language models, which, in turn, originated as just language models. 
While applications like ",[65,66,67],"strong",{},"ChatGPT"," and ",[65,70,71],{},"GitHub Copilot"," may seem to have come out of nowhere, they are the culmination of decades of technology advancements — with the first language models emerging in the ",[65,74,75],{},"1950s",".",[61,78,79],{},"This section traces the key breakthroughs that enabled the evolution from language models to AI engineering.",[81,82,84],"h2",{"id":83},"from-language-models-to-large-language-models","From Language Models to Large Language Models",[61,86,87,88,91,92,68,96,98],{},"Language models have been around for a while, but they've only been able to grow to the scale they are today with ",[65,89,90],{},"self-supervision",". This section gives a quick overview of what ",[93,94,95],"em",{},"language model",[93,97,90],{}," mean.",[100,101,102],"tip",{},"If you're already familiar with these concepts, feel free to skip ahead.",[104,105,107],"h3",{"id":106},"language-models","Language Models",[61,109,110,111,113,114,118,119,122,123,76],{},"A ",[93,112,95],{}," encodes statistical information about one or more languages. Intuitively, this information tells us how likely a word is to appear in a given context. For example, given the context ",[115,116,117],"code",{},"My favorite color is",", a language model that encodes English should predict ",[115,120,121],{},"blue"," more often than ",[115,124,125],{},"car",[61,127,128],{},"The statistical nature of languages was discovered centuries ago.",[130,131,132,154],"card-group",{},[133,134,137,138,147,148,151,152,76],"card",{"icon":135,"title":136},"i-lucide-book-open","1905 — Sherlock Holmes","In ",[93,139,140],{},[141,142,146],"a",{"href":143,"rel":144},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FThe_Adventure_of_the_Dancing_Men",[145],"nofollow","The Adventure of the Dancing Men",", Holmes used simple statistics of English to decode mysterious stick figures. 
Since the most common letter in English is ",[93,149,150],{},"E",", Holmes deduced the most common stick figure must stand for ",[93,153,150],{},[133,155,158,159,166,167,170],{"icon":156,"title":157},"i-lucide-radio","1951 — Claude Shannon","Used more sophisticated statistics to decipher enemies' messages during WWII. His landmark paper ",[93,160,161],{},[141,162,165],{"href":163,"rel":164},"https:\u002F\u002Fwww.kuenzigbooks.com\u002Fpages\u002Fbooks\u002F28623\u002Fc-e-shannon-claude-elwood\u002Fprediction-and-entropy-of-printed-english-bell-monograph",[145],"\"Prediction and Entropy of Printed English\""," introduced concepts like ",[65,168,169],{},"entropy"," that are still used for language modeling today.",[172,173,174,175,76],"note",{},"In the early days, a language model involved one language. Today, a language model can involve ",[65,176,177],{},"multiple languages",[179,180,182],"h4",{"id":181},"tokens","Tokens",[61,184,185,186,189,190,193],{},"The basic unit of a language model is a ",[93,187,188],{},"token",". A token can be a character, a word, or a part of a word (like ",[115,191,192],{},"-tion","), depending on the model. For non-English languages, a single Unicode character can sometimes be represented as multiple tokens.",[61,195,196,197,200,201,204,205,68,208,211,212,76],{},"For example, GPT-4 — the model behind ChatGPT — breaks the phrase ",[115,198,199],{},"I can't wait to build AI applications"," into nine tokens. Note that the word ",[115,202,203],{},"can't"," is broken into two tokens, ",[115,206,207],{},"can",[115,209,210],{},"'t",". You can see how different OpenAI models tokenize text on the ",[141,213,216],{"href":214,"rel":215},"https:\u002F\u002Fplatform.openai.com\u002Ftokenizer",[145],"OpenAI website",[61,218,219,224],{},[220,221],"img",{"alt":222,"src":223},"An example of how GPT-4 tokenizes a phrase",".\u002Fmedia\u002Ffig-1-1.png",[93,225,226],{},"Figure 1-1. 
An example of how GPT-4 tokenizes a phrase.",[61,228,229,230,233,234,239,240,76],{},"The process of breaking the original text into tokens is called ",[93,231,232],{},"tokenization",". For ",[141,235,238],{"href":236,"rel":237},"https:\u002F\u002Fhelp.openai.com\u002Fen\u002Farticles\u002F4936856-what-are-tokens-and-how-to-count-them",[145],"GPT-4, an average token is approximately ¾ the length of a word",". So, ",[65,241,242],{},"100 tokens are approximately 75 words",[61,244,245,246,249],{},"The set of all tokens a model can work with is the model's ",[93,247,248],{},"vocabulary",". You can use a small number of tokens to construct a large number of distinct words, similar to how you can use a few letters in the alphabet to construct many words.",[130,251,252,261],{},[133,253,256,257,260],{"icon":254,"title":255},"i-lucide-hash","Mixtral 8x7B","Vocabulary size of ",[65,258,259],{},"32,000"," tokens.",[133,262,256,264,260],{"icon":254,"title":263},"GPT-4",[65,265,266],{},"100,256",[61,268,269,270,274],{},"The ",[141,271,232],{"href":272,"rel":273},"https:\u002F\u002Fgithub.com\u002Fopenai\u002Ftiktoken\u002Fblob\u002Fmain\u002Ftiktoken\u002Fmodel.py",[145]," method and vocabulary size are decided by model developers.",[172,276,278,295,337],{"icon":277},"i-lucide-circle-help",[61,279,280,294],{},[65,281,282,283,285,286,289,290,293],{},"Why do language models use ",[93,284,188],{}," as their unit instead of ",[93,287,288],{},"word"," or ",[93,291,292],{},"character","?"," There are three main reasons:",[296,297,298,313,320],"ol",{},[299,300,301,302,305,306,68,309,312],"li",{},"Compared to characters, tokens allow the model to break words into meaningful components. 
For example, ",[115,303,304],{},"cooking"," can be broken into ",[115,307,308],{},"cook",[115,310,311],{},"ing",", with both components carrying some meaning of the original word.",[299,314,315,316,319],{},"Because there are fewer unique tokens than unique words, this reduces the model's vocabulary size, making the model ",[65,317,318],{},"more efficient"," (as discussed in Chapter 2).",[299,321,322,323,326,327,330,331,68,334,336],{},"Tokens also help the model process ",[65,324,325],{},"unknown words",". For instance, a made-up word like ",[115,328,329],{},"chatgpting"," could be split into ",[115,332,333],{},"chatgpt",[115,335,311],{},", helping the model understand its structure.",[61,338,339],{},"Tokens balance having fewer units than words while retaining more meaning than individual characters.",[179,341,343],{"id":342},"two-main-types-of-language-models","Two Main Types of Language Models",[61,345,346],{},"There are two main types of language models. They differ based on what information they can use to predict a token.",[130,348,349,389],{},[133,350,353,363,376],{"icon":351,"title":352},"i-lucide-square-dashed","Masked Language Model",[61,354,355,356,359,360,76],{},"Trained to predict missing tokens ",[65,357,358],{},"anywhere in a sequence",", using context from both before and after the missing tokens. Essentially trained to ",[65,361,362],{},"fill in the blank",[61,364,365,368,369,372,373,76],{},[93,366,367],{},"Example:"," given ",[115,370,371],{},"My favorite __ is blue",", predict ",[115,374,375],{},"color",[61,377,378,379,382,383,388],{},"A well-known example is ",[65,380,381],{},"BERT"," (",[141,384,387],{"href":385,"rel":386},"https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805",[145],"Devlin et al., 2018","). 
Today, masked language models are commonly used for non-generative tasks like sentiment analysis, text classification, and code debugging — where understanding the overall context matters.",[133,390,393,407],{"icon":391,"title":392},"i-lucide-arrow-right","Autoregressive Language Model",[61,394,395,396,399,400,403,404,76],{},"Trained to predict the ",[65,397,398],{},"next token"," in a sequence, using ",[65,401,402],{},"only the preceding tokens",". It predicts what comes next in ",[115,405,406],{},"My favorite color is __",[61,408,409,410,413,414],{},"An autoregressive model can continually generate one token after another. Today, autoregressive language models are the ",[65,411,412],{},"models of choice for text generation"," and are far more popular than masked language models. ",[93,415,416,417,422],{},"(Sometimes referred to as ",[141,418,421],{"href":419,"rel":420},"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fen\u002Ftasks\u002Flanguage_modeling",[145],"causal language models",".)",[172,424,425],{},"Technically, a masked language model like BERT can also be used for text generation — if you try really hard.",[61,427,428,432],{},[220,429],{"alt":430,"src":431},"Autoregressive language model and masked language model",".\u002Fmedia\u002Ffig-1-2.png",[93,433,434],{},"Figure 1-2. Autoregressive language model and masked language model.",[172,436,437,438,440,441,76],{},"In this book, unless explicitly stated, ",[93,439,95],{}," will refer to an ",[65,442,443],{},"autoregressive model",[179,445,447],{"id":446},"language-models-as-completion-machines","Language Models as Completion Machines",[61,449,450,451,454,455,76],{},"The outputs of language models are open-ended. A language model can use its fixed, finite vocabulary to construct infinite possible outputs. 
A model that can generate open-ended outputs is called ",[93,452,453],{},"generative"," — hence the term ",[65,456,457],{},"generative AI",[61,459,460,461,464],{},"You can think of a language model as a ",[93,462,463],{},"completion machine",": given a text (prompt), it tries to complete that text.",[466,467,472],"pre",{"className":468,"code":470,"language":471},[469],"language-text","Prompt (from user):       \"To be or not to be\"\nCompletion (from model):  \", that is the question.\"\n","text",[115,473,470],{"__ignoreMap":57},[475,476,477,480],"warning",{},[65,478,479],{},"Completions are predictions based on probabilities"," — not guaranteed to be correct. This probabilistic nature of language models makes them both exciting and frustrating to use. We explore this further in Chapter 2.",[61,482,483],{},"As simple as it sounds, completion is incredibly powerful. Many tasks — translation, summarization, coding, and solving math problems — can be framed as completion tasks.",[130,485,486,502],{},[133,487,490,496],{"icon":488,"title":489},"i-lucide-languages","Translation",[61,491,492,493],{},"Prompt: ",[115,494,495],{},"\"How are you\" in French is …",[61,497,498,499],{},"Completion: ",[115,500,501],{},"Comment ça va",[133,503,506,511],{"icon":504,"title":505},"i-lucide-mail","Spam Classification",[61,507,492,508],{},[115,509,510],{},"Is this email likely spam? Here's the email: \u003Cemail content>. Answer:",[61,512,498,513],{},[115,514,515],{},"Likely spam",[172,517,518,519,522,523,526],{},"Completion isn't the same as engaging in a conversation. If you ask a completion machine a question, it can complete what you said by adding ",[65,520,521],{},"another question"," instead of answering. ",[93,524,525],{},"\"Post-Training\" on page 78"," discusses how to make a model respond appropriately to a user's request.",[104,528,529],{"id":90},"Self-Supervision",[61,531,532,533],{},"Language modeling is just one of many ML algorithms. 
There are also models for object detection, topic modeling, recommender systems, weather forecasting, stock price prediction, and more. ",[65,534,535],{},"What's special about language models that made them the center of the scaling approach behind the ChatGPT moment?",[100,537,538,539,544,545,548],{},"The answer is that ",[65,540,541,542],{},"language models can be trained using ",[93,543,90],{},", while many other models require ",[93,546,547],{},"supervision",". Self-supervision overcomes the data labeling bottleneck, allowing models to scale up.",[179,550,552],{"id":551},"supervision-vs-self-supervision","Supervision vs. Self-Supervision",[130,554,555,602],{},[133,556,559,566,577],{"icon":557,"title":558},"i-lucide-tag","Supervision",[61,560,561,562,565],{},"You ",[65,563,564],{},"label examples"," to show the behaviors you want the model to learn, then train the model on these examples.",[61,567,568,570,571,289,574,76],{},[93,569,367],{}," to train a fraud detection model, you use transactions each labeled ",[115,572,573],{},"fraud",[115,575,576],{},"not fraud",[61,578,579,580,587,588,591,592,594,595,598,599,76],{},"The success of AI models in the 2010s lay in supervision. ",[65,581,582],{},[141,583,586],{"href":584,"rel":585},"https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2012\u002Ffile\u002Fc399862d3b9d6b76c8436e924a68c45b-Paper.pdf",[145],"AlexNet"," (Krizhevsky et al., 2012), the model that started the deep learning revolution, was supervised — trained on ",[65,589,590],{},"ImageNet"," to classify over 1 million images into 1,000 categories such as ",[115,593,125],{},", ",[115,596,597],{},"balloon",", or ",[115,600,601],{},"monkey",[133,603,605,612],{"icon":604,"title":529},"i-lucide-infinity",[61,606,607,608,611],{},"Instead of requiring explicit labels, the model ",[65,609,610],{},"infers labels from the input data",". 
Language modeling is self-supervised because each input sequence provides both the labels (tokens to be predicted) and the context for predicting them.",[61,613,614,615,618],{},"Because text sequences are everywhere — books, blog posts, articles, Reddit comments — it's possible to construct a ",[65,616,617],{},"massive amount of training data",", allowing language models to scale up to LLMs.",[475,620,621,634],{},[61,622,623,626,627,630,631,76],{},[65,624,625],{},"The labeling bottleneck."," If it costs 5¢ for one person to label one image, it'd cost ",[65,628,629],{},"$50,000"," to label a million images for ImageNet. With cross-checking by a second labeler, it'd cost twice as much. Scaling to 1 million categories would push the labeling cost alone to ",[65,632,633],{},"$50 million",[61,635,636,637,642,643,76],{},"The actual cost varies — ",[141,638,641],{"href":639,"rel":640},"https:\u002F\u002Faws.amazon.com\u002Fsagemaker\u002Fai\u002Fgroundtruth\u002F",[145],"Amazon SageMaker Ground Truth"," charges 8¢ per image for fewer than 50,000 images, dropping to 2¢ above 1 million (as of September 2024). And not all labeling is cheap. Generating Latin translations is harder than tagging everyday objects. 
The cost of labeling whether a CT scan shows signs of cancer would be ",[65,644,645],{},"astronomical",[61,647,648,649,652],{},"For example, the sentence ",[115,650,651],{},"I love street food."," gives six self-supervised training samples:",[654,655,656,669],"table",{},[657,658,659],"thead",{},[660,661,662,666],"tr",{},[663,664,665],"th",{},"Input (context)",[663,667,668],{},"Output (next token)",[670,671,672,683,693,703,713,722],"tbody",{},[660,673,674,680],{},[675,676,677],"td",{},[115,678,679],{},"\u003CBOS>",[675,681,682],{},"I",[660,684,685,690],{},[675,686,687,689],{},[115,688,679],{},", I",[675,691,692],{},"love",[660,694,695,700],{},[675,696,697,699],{},[115,698,679],{},", I, love",[675,701,702],{},"street",[660,704,705,710],{},[675,706,707,709],{},[115,708,679],{},", I, love, street",[675,711,712],{},"food",[660,714,715,720],{},[675,716,717,719],{},[115,718,679],{},", I, love, street, food",[675,721,76],{},[660,723,724,729],{},[675,725,726,728],{},[115,727,679],{},", I, love, street, food, .",[675,730,731],{},[115,732,733],{},"\u003CEOS>",[61,735,736],{},[93,737,738,739,741],{},"Table 1-1. Training samples from the sentence ",[115,740,651],{}," for language modeling.",[61,743,744,68,746,748,749,752],{},[115,745,679],{},[115,747,733],{}," mark the beginning and the end of a sequence. These markers are necessary for a language model to work with multiple sequences. Each marker is typically treated as one special token by the model. The end-of-sequence marker is especially important — it helps language models know ",[65,750,751],{},"when to end their responses"," (similar to how it's important for humans to know when to stop talking).",[172,754,755,758],{},[65,756,757],{},"Self-supervision differs from unsupervised learning."," In self-supervised learning, labels are inferred from the input data. 
In unsupervised learning, you don't need labels at all.",[179,760,762],{"id":761},"from-language-models-to-llms","From Language Models to LLMs",[61,764,765,766,76],{},"Self-supervised learning means language models can learn from text sequences without requiring any labeling. Because text is everywhere, it's possible to construct massive training datasets — allowing language models to scale up to become ",[65,767,768],{},"LLMs",[475,770,771,774,775,778],{},[65,772,773],{},"LLM is hardly a scientific term."," How large does a language model have to be to be considered ",[93,776,777],{},"large","? What is large today might be considered tiny tomorrow.",[61,780,781,782,785,786,793],{},"A model's size is typically measured by its number of parameters. A ",[93,783,784],{},"parameter"," is a variable within an ML model that is updated through the training process. ",[93,787,788,789,792],{},"(In school, parameters were taught as a combination of weights and biases. Today, we generally use ",[65,790,791],{},"model weights"," to refer to all parameters.)"," In general — though not always — the more parameters a model has, the greater its capacity to learn desired behaviors.",[795,796,798,802,808,812,818,822],"steps",{"level":797},"4",[179,799,801],{"id":800},"june-2018-gpt-1","June 2018 — GPT-1",[61,803,804,807],{},[65,805,806],{},"117 million parameters",". Considered large at the time.",[179,809,811],{"id":810},"february-2019-gpt-2","February 2019 — GPT-2",[61,813,814,817],{},[65,815,816],{},"1.5 billion parameters",". 117 million was downgraded to \"small.\"",[179,819,821],{"id":820},"today","Today",[61,823,824,825,828],{},"A model with ",[65,826,827],{},"100 billion parameters"," is considered large. 
Perhaps one day, this size will be considered small.",[172,830,831,834,835,838,839,842],{"icon":277},[65,832,833],{},"Why do larger models need more data?"," It seems counterintuitive — if a model is more powerful, shouldn't it need ",[93,836,837],{},"fewer"," examples to learn from? But we're not trying to match a small model's performance with the same data; we're trying to ",[65,840,841],{},"maximize"," model performance. Larger models have more capacity to learn, and therefore need more training data to maximize it. You can train a large model on a small dataset, but it'd be a waste of compute — you could have achieved similar or better results with a smaller model.",[81,844,846],{"id":845},"from-large-language-models-to-foundation-models","From Large Language Models to Foundation Models",[61,848,849,850,853],{},"While language models are capable of incredible tasks, they are limited to text. As humans, we perceive the world not just via language but also through ",[65,851,852],{},"vision, hearing, touch, and more",". Being able to process data beyond text is essential for AI to operate in the real world.",[61,855,856,857,68,860,863],{},"For this reason, language models are being extended to incorporate more data modalities. ",[65,858,859],{},"GPT-4V",[65,861,862],{},"Claude 3"," can understand images and texts. 
Some models even understand videos, 3D assets, protein structures, and more.",[172,865,866],{},[867,868,869],"blockquote",{},[61,870,871,872,875,876],{},"Incorporating additional modalities (such as image inputs) into LLMs is viewed by some as ",[65,873,874],{},"a key frontier in AI research and development",".\n— ",[93,877,878,879,884],{},"OpenAI, ",[141,880,883],{"href":881,"rel":882},"https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002FGPTV_System_Card.pdf",[145],"GPT-4V system card",", 2023",[61,886,887,888,895,896,899,900,903],{},"While many people still call Gemini and GPT-4V LLMs, they're better characterized as ",[141,889,892],{"href":890,"rel":891},"https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.07258",[145],[93,893,894],{},"foundation models",". The word ",[93,897,898],{},"foundation"," signifies both the importance of these models in AI applications and the fact that they can be ",[65,901,902],{},"built upon"," for different needs.",[104,905,907],{"id":906},"a-breakthrough-from-the-old-structure-of-ai-research","A Breakthrough From the Old Structure of AI Research",[61,909,910],{},"For a long time, AI research was divided by data modalities. Each branch handled its own type of input, with little overlap.",[130,912,913,921,929],{},[133,914,917,920],{"icon":915,"title":916},"i-lucide-file-text","NLP",[65,918,919],{},"Text-only."," Translation, spam detection.",[133,922,925,928],{"icon":923,"title":924},"i-lucide-eye","Computer Vision",[65,926,927],{},"Image-only."," Object detection, image classification.",[133,930,933,936],{"icon":931,"title":932},"i-lucide-mic","Audio",[65,934,935],{},"Audio-only."," Speech recognition (STT), speech synthesis (TTS).",[61,938,939,940,943,944,947],{},"A model that can work with more than one data modality is also called a ",[93,941,942],{},"multimodal model",". A generative multimodal model is also called a ",[65,945,946],{},"large multimodal model (LMM)",". 
If a language model generates the next token conditioned on text-only tokens, a multimodal model generates the next token conditioned on both text and image tokens — or whichever modalities the model supports.",[61,949,950,954],{},[220,951],{"alt":952,"src":953},"A multimodal model can generate the next token using information from both text and visual tokens",".\u002Fmedia\u002Ffig-1-3.png",[93,955,956],{},"Figure 1-3. A multimodal model can generate the next token using information from both text and visual tokens.",[172,958,959,960,962],{},"This book uses the term ",[65,961,894],{}," to refer to both large language models and large multimodal models.",[61,964,965,966,969,970,975,976,979,980,983,984,76],{},"Just like language models, multimodal models need data to scale up. Self-supervision works for them too. OpenAI used a variant called ",[93,967,968],{},"natural language supervision"," to train ",[141,971,974],{"href":972,"rel":973},"https:\u002F\u002Fopenai.com\u002Findex\u002Fclip\u002F",[145],"CLIP (OpenAI, 2021)",". Instead of manually generating labels for each image, they found ",[115,977,978],{},"(image, text)"," pairs that co-occurred on the internet — yielding a dataset of ",[65,981,982],{},"400 million pairs",", 400× larger than ImageNet, with ",[65,985,986],{},"no manual labeling cost",[100,988,989,990,76],{},"CLIP became the first model that could generalize to multiple image classification tasks ",[65,991,992],{},"without requiring additional training",[61,994,995,996,999,1000,1003,1004,1007,1008,1011],{},"CLIP isn't a generative model — it wasn't trained to generate open-ended outputs. CLIP is an ",[93,997,998],{},"embedding model",", trained to produce joint embeddings of texts and images. ",[93,1001,1002],{},"\"Introduction to Embedding\" on page 134"," discusses embeddings in detail; for now, you can think of embeddings as vectors that aim to capture the meanings of the original data. 
Multimodal embedding models like CLIP are the ",[65,1005,1006],{},"backbones"," of generative multimodal models such as ",[65,1009,1010],{},"Flamingo, LLaVA, and Gemini"," (previously Bard).",[104,1013,1015],{"id":1014},"from-task-specific-to-general-purpose","From Task-Specific to General-Purpose",[61,1017,1018,1019,1022],{},"Foundation models also mark the transition from task-specific models to ",[65,1020,1021],{},"general-purpose"," models. Previously, models were often developed for specific tasks, such as sentiment analysis or translation. A model trained for sentiment analysis wouldn't be able to do translation, and vice versa.",[100,1024,1025,1026,1029,1030,1033],{},"Foundation models, thanks to their scale and the way they are trained, are capable of a ",[65,1027,1028],{},"wide range of tasks",". An LLM can do both sentiment analysis and translation. However, you can often ",[65,1031,1032],{},"tweak"," a general-purpose model to maximize its performance on a specific task.",[61,1035,1036,1040],{},[220,1037],{"alt":1038,"src":1039},"The range of tasks in the Super-Natural-Instructions benchmark (Wang et al., 2022)",".\u002Fmedia\u002Ffig-1-4.png",[93,1041,1042,1043,1048],{},"Figure 1-4. The range of tasks in the Super-Natural-Instructions benchmark (",[141,1044,1047],{"href":1045,"rel":1046},"https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.07705",[145],"Wang et al., 2022",").",[179,1050,1052],{"id":1051},"adapting-a-model-to-your-needs","Adapting a Model to Your Needs",[61,1054,1055,1056,1059],{},"Imagine you're working with a retailer to build an application to generate product descriptions for their website. An out-of-the-box model might generate accurate descriptions but ",[65,1057,1058],{},"fail to capture the brand's voice"," or highlight the brand's messaging. The generated descriptions might even be full of marketing speech and clichés.",[61,1061,1062,1063],{},"There are three common AI engineering techniques to adapt a model to your needs. 
",[93,1064,1065],{},"The rest of the book will discuss all of them in detail.",[130,1067,1068,1077,1086],{},[133,1069,1072,1073,1076],{"icon":1070,"title":1071},"i-lucide-message-square","Prompt Engineering","Craft ",[65,1074,1075],{},"detailed instructions"," with examples of the desirable outputs.",[133,1078,1081,1082,1085],{"icon":1079,"title":1080},"i-lucide-database","Retrieval-Augmented Generation (RAG)","Connect the model to a ",[65,1083,1084],{},"database"," (e.g., customer reviews) that it can leverage to generate better outputs.",[133,1087,1090,1093],{"icon":1088,"title":1089},"i-lucide-sliders","Finetuning",[65,1091,1092],{},"Further train"," the model on a dataset of high-quality examples.",[172,1095,1096,1097,1100,1101,1104],{},"Adapting an existing powerful model to your task is generally a lot easier than building one from scratch — for example, ",[65,1098,1099],{},"ten examples and one weekend"," versus ",[65,1102,1103],{},"1 million examples and six months",". Foundation models make it cheaper to develop AI applications and reduce time to market. Exactly how much data is needed depends on the technique you use.",[475,1106,1107,1108,1111,1112,1115],{},"There are still many benefits to ",[65,1109,1110],{},"task-specific models"," — for example, they might be a lot smaller, making them faster and cheaper to use. Whether to build your own model or leverage an existing one is a classic ",[65,1113,1114],{},"buy-or-build question"," that teams will have to answer for themselves.",[81,1117,1119],{"id":1118},"from-foundation-models-to-ai-engineering","From Foundation Models to AI Engineering",[61,1121,1122,1125,1126,1129,1130],{},[93,1123,1124],{},"AI engineering"," refers to the process of ",[65,1127,1128],{},"building applications on top of foundation models",". People have been building AI applications for over a decade — a process often known as ML engineering or MLOps (short for ML operations). 
",[65,1131,1132],{},"Why do we talk about AI engineering now?",[100,1134,1135,1136,1139,1140,1143],{},"If traditional ML engineering involves ",[65,1137,1138],{},"developing"," ML models, AI engineering ",[65,1141,1142],{},"leverages"," existing ones.",[61,1145,1146],{},"The availability and accessibility of powerful foundation models lead to three factors that, together, create ideal conditions for the rapid growth of AI engineering as a discipline.",[104,1148,1150],{"id":1149},"three-factors-driving-the-growth-of-ai-engineering","Three Factors Driving the Growth of AI Engineering",[795,1152,1153,1157,1164,1179,1183,1186,1197,1216,1235,1239,1245,1252,1269],{"level":797},[179,1154,1156],{"id":1155},"factor-1-general-purpose-ai-capabilities","Factor 1 — General-Purpose AI Capabilities",[61,1158,1159,1160,1163],{},"Foundation models are powerful not just because they can do existing tasks better — they can do ",[65,1161,1162],{},"more tasks",". Applications previously thought impossible are now possible, and applications not thought of before are emerging. Even applications not thought possible today might be possible tomorrow. This vastly increases both the user base and the demand for AI applications.",[61,1165,1166,1167,1170,1171,1174,1175,1178],{},"Since AI can now write as well as humans (sometimes even better), it can automate or partially automate every task that requires communication — which is pretty much everything. AI is used to ",[65,1168,1169],{},"write emails, respond to customer requests, and explain complex contracts",". Anyone with a computer has access to tools that can instantly generate customized, high-quality images and videos to ",[65,1172,1173],{},"create marketing materials, edit professional headshots, visualize art concepts, illustrate books",", and more. 
AI is even used to ",[65,1176,1177],{},"synthesize training data, develop algorithms, and write code"," — all of which will help train even more powerful models in the future.",[179,1180,1182],{"id":1181},"factor-2-increased-ai-investments","Factor 2 — Increased AI Investments",[61,1184,1185],{},"The success of ChatGPT prompted a sharp increase in investments in AI, both from venture capitalists and enterprises. As AI applications become cheaper to build and faster to go to market, returns on investment for AI become more attractive. Companies rush to incorporate AI into their products and processes.",[61,1187,1188,1189,1192,1193,1196],{},"Matt Ross, a senior manager of applied research at ",[65,1190,1191],{},"Scribd",", told me that the estimated AI cost for his use cases has ",[65,1194,1195],{},"gone down two orders of magnitude"," from April 2022 to April 2023.",[61,1198,1199,1204,1205,1208,1209,1212,1213],{},[141,1200,1203],{"href":1201,"rel":1202},"https:\u002F\u002Fwww.goldmansachs.com\u002Finsights\u002Farticles\u002Fai-investment-forecast-to-approach-200-billion-globally-by-2025.html",[145],"Goldman Sachs Research"," estimated that AI investment could approach ",[65,1206,1207],{},"$100 billion"," in the US and ",[65,1210,1211],{},"$200 billion globally"," by 2025. ",[93,1214,1215],{},"(For comparison, the entire US expenditure on public elementary and secondary schools is around $900 billion — only nine times the projected investment in AI in the US.)",[61,1217,1218,1219,1224,1225,1230,1231,1234],{},"AI is often mentioned as a competitive advantage. FactSet found that ",[141,1220,1223],{"href":1221,"rel":1222},"https:\u002F\u002Finsight.factset.com\u002Fhighest-number-of-sp-500-companies-citing-ai-on-q2-earnings-calls-in-over-10-years",[145],"one in three S&P 500 companies"," mentioned AI in their earnings calls for Q2 2023 — three times more than the year earlier. 
According to WallStreetZen, companies that mentioned AI in their earnings calls saw their stock price ",[141,1226,1229],{"href":1227,"rel":1228},"https:\u002F\u002Fwww.theregister.com\u002F2023\u002F09\u002F18\u002Fmention_ai_in_earnings_calls\u002F",[145],"increase more than those that didn't"," — an average ",[65,1232,1233],{},"4.6% increase compared to 2.4%",". It's unclear whether it's causation (AI makes these companies more successful) or correlation (companies are successful because they're quick to adapt to new technologies).",[179,1236,1238],{"id":1237},"factor-3-low-entrance-barrier-to-building-ai-applications","Factor 3 — Low Entrance Barrier to Building AI Applications",[61,1240,269,1241,1244],{},[65,1242,1243],{},"model-as-a-service"," approach popularized by OpenAI and other model providers makes it easier to leverage AI to build applications. Models are exposed via APIs that receive user queries and return outputs — giving you access to powerful models via single API calls, without the infrastructure to host and serve them yourself.",[61,1246,1247,1248,1251],{},"AI also makes it possible to build applications with ",[65,1249,1250],{},"minimal coding",":",[1253,1254,1255,1262],"ul",{},[299,1256,1257,1258,1261],{},"AI can ",[65,1259,1260],{},"write code for you",", allowing people without a software engineering background to quickly turn their ideas into running applications and put them in front of their users.",[299,1263,1264,1265,1268],{},"You can work with these models in ",[65,1266,1267],{},"plain English"," instead of a programming language.",[867,1270,1271],{},[61,1272,1273],{},[65,1274,1275],{},"Anyone, and I mean anyone, can now develop AI applications.",[61,1277,1278,1282],{},[220,1279],{"alt":1280,"src":1281},"The number of S&P 500 companies that mention AI in their earnings calls reached a record high in 2023. Data from FactSet.",".\u002Fmedia\u002Ffig-1-5.png",[93,1283,1284],{},"Figure 1-5. 
The number of S&P 500 companies that mention AI in their earnings calls reached a record high in 2023. Data from FactSet.",[100,1286,1287,1288,594,1293,1298,1299,1304,1305,76],{},"Because of the resources it takes to develop foundation models, this process is possible only for big corporations (Google, Meta, Microsoft, Baidu, Tencent), governments (",[141,1289,1292],{"href":1290,"rel":1291},"https:\u002F\u002Fwww.nii.ac.jp\u002Fen\u002Fnews\u002Frelease\u002F2023\u002F1020.html",[145],"Japan",[141,1294,1297],{"href":1295,"rel":1296},"https:\u002F\u002Foreil.ly\u002FIUcVg",[145],"the UAE","), and ambitious, well-funded startups (OpenAI, Anthropic, Mistral). In a ",[141,1300,1303],{"href":1301,"rel":1302},"https:\u002F\u002Fgreylock.com\u002Fgreymatter\u002Fsam-altman-ai-for-the-next-era\u002F",[145],"September 2022 interview",", Sam Altman, CEO of OpenAI, said the ",[65,1306,1307],{},"biggest opportunity for the vast majority of people will be to adapt these models for specific applications",[104,1309,1311],{"id":1310},"the-fastest-growing-engineering-discipline","The Fastest-Growing Engineering Discipline",[61,1313,1314,1315,1318],{},"The world is quick to embrace this opportunity. AI engineering has rapidly emerged as one of the fastest — quite possibly ",[93,1316,1317],{},"the"," fastest — growing engineering disciplines. 
Tools for AI engineering are gaining traction faster than any previous software engineering tools.",[130,1320,1321,1330,1339],{},[133,1322,1325,1326,1329],{"icon":1323,"title":1324},"i-lucide-star","Faster Than Bitcoin","Within just two years, four open-source AI engineering tools — ",[65,1327,1328],{},"AutoGPT, Stable Diffusion Web UI, LangChain, Ollama"," — have already garnered more GitHub stars than Bitcoin.",[133,1331,1334,1335,1338],{"icon":1332,"title":1333},"i-lucide-trending-up","Catching React and Vue","These tools are on track to surpass even the most popular ",[65,1336,1337],{},"web development frameworks",", including React and Vue, in star count.",[133,1340,110,1343,1348,1349,594,1352,594,1354,1356,1357,1360,1361,76],{"icon":1341,"title":1342},"i-lucide-briefcase","75% Monthly Profile Growth",[141,1344,1347],{"href":1345,"rel":1346},"https:\u002F\u002Feconomicgraph.linkedin.com\u002Fcontent\u002Fdam\u002Fme\u002Feconomicgraph\u002Fen-us\u002FPDF\u002Ffuture-of-work-report-ai-august-2023.pdf",[145],"LinkedIn survey"," (Aug 2023) shows the number of professionals adding terms like ",[93,1350,1351],{},"Generative AI",[93,1353,67],{},[93,1355,1071],{},", and ",[93,1358,1359],{},"Prompt Crafting"," to their profiles increased on average ",[65,1362,1363],{},"75% each month",[61,1365,1366,1370],{},[220,1367],{"alt":1368,"src":1369},"Open source AI engineering tools are growing faster than any other software engineering tools, according to their GitHub star counts",".\u002Fmedia\u002Ffig-1-6.png",[93,1371,1372],{},"Figure 1-6. 
Open source AI engineering tools are growing faster than any other software engineering tools, according to their GitHub star counts.",[172,1374,1375],{},[867,1376,1377],{},[61,1378,1379,1380,875,1383],{},"Teaching AI to behave is the ",[65,1381,1382],{},"fastest-growing career skill",[93,1384,1385],{},[141,1386,1389],{"href":1387,"rel":1388},"https:\u002F\u002Fwww.computerworld.com\u002Farticle\u002F1637946\u002Fteaching-ai-to-behave-is-the-fastest-growing-career-skill.html",[145],"ComputerWorld",[104,1391,1393],{"id":1392},"why-the-term-ai-engineering","Why the Term \"AI Engineering\"?",[61,1395,1396,1397,1399],{},"Many terms are used to describe the process of building applications on top of foundation models — ML engineering, MLOps, AIOps, LLMOps, and so on. Why did I choose to go with ",[93,1398,1124],{}," for this book?",[1401,1402,1403,1419,1427],"accordion",{},[1404,1405,1407,1408,1411,1412,1415,1416,76],"accordion-item",{"icon":277,"label":1406},"Why not \"ML engineering\"?","Working with foundation models differs from working with traditional ML models in several important aspects (see ",[93,1409,1410],{},"\"AI Engineering Versus ML Engineering\" on page 39","). The term ",[93,1413,1414],{},"ML engineering"," isn't sufficient to capture this distinction. However, ",[65,1417,1418],{},"ML engineering is a great term to encompass both processes",[1404,1420,1422,1423,1426],{"icon":277,"label":1421},"Why not \"MLOps\", \"AIOps\", or \"LLMOps\"?","While there are operational components of the process, the focus is more on ",[65,1424,1425],{},"tweaking (engineering) foundation models"," to do what you want — not just operating them.",[1404,1428,1431,1432,1437],{"icon":1429,"label":1430},"i-lucide-users","I asked 20 practitioners and went with the people.","I surveyed 20 people who were developing applications on top of foundation models about what term they would use to describe what they were doing. 
",[65,1433,1434,1435,76],{},"Most preferred ",[93,1436,1124],{}," I decided to go with the people.",[100,1439,1440,1441],{},"The rapidly expanding community of AI engineers has demonstrated remarkable creativity with an incredible range of exciting applications. ",[65,1442,1443],{},"The next section will explore some of the most common application patterns.",{"title":57,"searchDepth":1445,"depth":1445,"links":1446},2,[1447,1452,1456],{"id":83,"depth":1445,"text":84,"children":1448},[1449,1451],{"id":106,"depth":1450,"text":107},3,{"id":90,"depth":1450,"text":529},{"id":845,"depth":1445,"text":846,"children":1453},[1454,1455],{"id":906,"depth":1450,"text":907},{"id":1014,"depth":1450,"text":1015},{"id":1118,"depth":1445,"text":1119,"children":1457},[1458,1459,1460],{"id":1149,"depth":1450,"text":1150},{"id":1310,"depth":1450,"text":1311},{"id":1392,"depth":1450,"text":1393},"Trace how decades of advances in language models, self-supervision, and multimodality produced foundation models — and turned AI engineering into a discipline of its own.","md",{},{"icon":24},{"title":21,"description":1461},"_pVe7W3yDo_jHuRYju6UOfb6tz6xa2RBg26ZfG-VVvA",[1468,1470],{"title":11,"path":17,"stem":18,"description":1469,"icon":19,"children":-1},"How the scaling of foundation models reshaped AI, lowered the barrier to building applications, and turned AI engineering into one of the fastest-growing disciplines in software.",{"title":26,"path":27,"stem":28,"description":1471,"icon":29,"children":-1},"A tour of industry-proven and promising use cases for foundation models — from coding and creative work to writing, education, chatbots, information aggregation, data organization, and workflow automation.",1778484800915]