
Model Deployment

Categories of Model Deployment

A core decision, one that affects both your users and the engineers building your solution, is how the system computes and delivers its predictions: online or in batch.

Online prediction means predictions are generated and returned as soon as the service receives a request for them. Batch prediction, in contrast, computes predictions asynchronously for many records at once. This choice between synchronous and asynchronous predictions will shape many subsequent design decisions.

The main advantage of online prediction is that it makes it easier to provide a real-time user experience. Suppose you deployed an AI model for customer churn and that the model makes predictions for all customers overnight (in batch). During the day, events such as a message to the call center (suggesting the customer is dissatisfied) or a new order (perhaps suggesting the customer is satisfied) may happen. Making the prediction closer to when it is needed lets the model use this newer information and potentially return a more reliable and valuable forecast.

When online prediction is the chosen way to deploy a model, the model is generally made available to other applications through API calls. In this handout, we are going to build an API that makes predictions using the model from the last class.

When is this decision made?

Remember the ML lifecycle from the last class:

Question 1

In which of these phases should the decision to deploy in batch or online be made?

Answer!

This decision depends on how the model will be used, and it is usually possible to get a sense of this during the planning phase. It can be rethought and changed later, but knowing the business problem (and the target variable to be predicted) generally already gives us an idea of which style of deployment will generate more value for the business.

What are APIs?

APIs (Application Programming Interfaces) allow developers to access data and services. They enable platforms, applications and systems to connect and interact with each other.

You can use APIs to:

  • Transcribe audio using a Google API
  • Make an app that interacts with ChatGPT
  • Let an app send data to your ML models and receive predictions back
  • And so forth!


Build an API

To build our API, we will use FastAPI. Follow the handout steps and also make use of the official FastAPI tutorial.

Tip!

Create a repository (public or private) in your own GitHub account to store your API.

It is not necessary to submit the activity for this class.

Use the environment (conda or venv) from the last class or create a new one for this class!

Install libs

Let's install the necessary libraries:

$ pip install fastapi
$ pip install uvicorn

Then, create a folder called src:

$ mkdir src

A simple API

Copy and paste this code in the src/main.py file:

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return "Model API is alive!"

Inside src, start the API with the command:

$ uvicorn main:app --host 0.0.0.0 --port 8900 --reload


To test, go to http://localhost:8900 in your browser!

One of the wonders of FastAPI is the automatically generated documentation. Go to http://localhost:8900/docs in your browser. You will see something like:

Click on "Try it out"!

An API that makes predictions

In the root folder of today's class, create a new folder called models:

$ mkdir models

We are going to store in this folder the pickle files of the models trained in the last class.

Exercise 2

Copy the ohe.pkl and model.pkl files from the last class activity to the models folder of today's class.

Now, your folder should have the following structure:
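For reference, a sketch of the expected layout at this point (the src/model.py file will be created in the next step):

models/
    model.pkl
    ohe.pkl
src/
    main.py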

Then, create the src/model.py file. In addition to importing the necessary libraries, this file must have two functions that open (and return) the models contained in the ohe.pkl and model.pkl files.

def load_model():
    # Your code here
    pass


def load_encoder():
    # Your code here
    pass

These functions will be imported and used in main.py!
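A possible sketch for src/model.py, assuming the files were saved with Python's pickle module and that the API runs from inside src (so the relative path ../models works). Adjust the paths, and the loading library if you used something else such as joblib:

import pickle


def load_model():
    # Assumed relative path: the models folder sits one level above src
    with open("../models/model.pkl", "rb") as f:
        return pickle.load(f)


def load_encoder():
    # Load the fitted one-hot encoder saved in the last class
    with open("../models/ohe.pkl", "rb") as f:
        return pickle.load(f)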

Here is a new (and incomplete) version of main.py:

from fastapi import FastAPI

# loader functions that you programmed!
from model import load_model, load_encoder


app = FastAPI()


@app.get("/")
async def root():
    """
    Route to check that API is alive!
    """
    return "Model API is alive!"


@app.post("/predict")
async def predict():
    """
    Route to make predictions!
    """
    # Load the models
    ohe = load_encoder()
    model = load_model()

    return {"prediction": "I can almost make predictions!"}

To make predictions, the predict route needs to receive information about the client (person). Looking at a row of the X table from the last class (before applying the encoder), an example of the required features in JSON format would be:

{
  "age": 42,
  "job": "entrepreneur",
  "marital": "married",
  "education": "primary",
  "balance": 558,
  "housing": "yes",
  "duration": 186,
  "campaign": 2
}

Let's represent the person/customer information using a class named Person. Here's an example with the first two fields:

from pydantic import BaseModel

class Person(BaseModel):
    age: int
    job: str

Now we can update the predict route to receive a person's information. Since we build a DataFrame from the request, remember to also import pandas (import pandas as pd) at the top of main.py!

@app.post("/predict")
async def predict(person: Person):
    """
    Route to make predictions!
    """
    ohe = load_encoder()
    model = load_model()

    df_person = pd.DataFrame([person.dict()])

    person_t = ohe.transform(df_person)
    pred = model.predict(person_t)[0]

    return {"prediction": str(pred)}

Exercise 4

Complete the remaining fields of the Person class in the main.py file. Remember to import BaseModel at the beginning of main.py!

Return to http://localhost:8900/docs in your browser and test the predict route, adding the JSON content you saw earlier!

Tip!

Scaling many online ML systems is conceptually simple, since incoming prediction requests can be distributed across several machines using a load balancer. But this is a problem for another day!

Improve route with example!

Let's add an example to the code so that the documentation page comes pre-populated with it, making it easier for the user to test the route.

from typing import Annotated
from fastapi import FastAPI, Body

@app.post("/predict")
async def predict(
    person: Annotated[
        Person,
        Body(
            examples=[
                {
                    "age": 42,
                    "job": "entrepreneur",
                    "marital": "married",
                    "education": "primary",
                    "balance": 558,
                    "housing": "yes",
                    "duration": 186,
                    "campaign": 2,
                }
            ],
        ),
    ],
):
    """
    Route to make predictions!
    """
    ohe = load_encoder()
    model = load_model()

    person_t = ohe.transform(pd.DataFrame([person.dict()]))
    pred = model.predict(person_t)[0]

    return {"prediction": str(pred)}

Call API from Python!

If another application needs to access the API, it can simply make an HTTP request.

See an example using the / route (to check if the API is alive):

import requests as req

print(req.get("http://localhost:8900/").text)

And for the predict route:

import requests as req

data = {
    "age": 42,
    "job": "entrepreneur",
    "marital": "married",
    "education": "primary",
    "balance": 558,
    "housing": "yes",
    "duration": 186,
    "campaign": 2,
}

resp = req.post("http://localhost:8900/predict", json=data)
print(f"Status code: {resp.status_code}")
print(f"Response: {resp.text}")

Add Authentication

Without proper authentication, our API would be vulnerable to unwanted access attempts and even malicious requests from unauthorized parties.

For simplicity, let's assume there is a single valid token ("abc123"), since a full authentication implementation would require database access and caching for performance.

Let's add a dependency to the predict route. When the route is called, the function that resolves the dependency will extract the token from the Authorization header and check whether it is valid.

The functions and the route (the Body example was removed for simplicity):

def get_username_for_token(token):
    if token == "abc123":
        return "pedro1"
    return None

async def validate_token(credentials: HTTPAuthorizationCredentials = Depends(bearer)):
    token = credentials.credentials

    username = get_username_for_token(token)
    if not username:
        raise HTTPException(status_code=401, detail="Invalid token")

    return {"username": username}

@app.post("/predict")
async def predict(person: Person,
                  user=Depends(validate_token)
                 ):
    # Code suppressed
    pass

The full code is:

from fastapi import FastAPI, HTTPException, Depends, Body
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from pydantic import BaseModel
from typing import Annotated
from model import load_model, load_encoder
import pandas as pd

app = FastAPI()

bearer = HTTPBearer()

def get_username_for_token(token):
    if token == "abc123":
        return "pedro1"
    return None


async def validate_token(credentials: HTTPAuthorizationCredentials = Depends(bearer)):
    token = credentials.credentials

    username = get_username_for_token(token)
    if not username:
        raise HTTPException(status_code=401, detail="Invalid token")

    return {"username": username}

class Person(BaseModel):
    age: int
    job: str
    marital: str
    education: str
    balance: int
    housing: str
    duration: int
    campaign: int

@app.get("/")
async def root():
    return "Model API is alive!"

@app.post("/predict")
async def predict(
    person: Annotated[
        Person,
        Body(
            examples=[
                {
                    "age": 42,
                    "job": "entrepreneur",
                    "marital": "married",
                    "education": "primary",
                    "balance": 558,
                    "housing": "yes",
                    "duration": 186,
                    "campaign": 2,
                }
            ],
        ),
    ],
    user=Depends(validate_token),
):
    ohe = load_encoder()
    model = load_model()

    person_t = ohe.transform(pd.DataFrame([person.dict()]))
    pred = model.predict(person_t)[0]

    return {
        "prediction": str(pred),
        "username": user["username"]
        }

Python

Call the API using Bearer Token Authentication from Python:

import requests as req

token = "abc123"

headers = {"Authorization": f"Bearer {token}"}

data = {
    "age": 42,
    "job": "entrepreneur",
    "marital": "married",
    "education": "primary",
    "balance": 558,
    "housing": "yes",
    "duration": 186,
    "campaign": 2,
}

resp = req.post("http://localhost:8900/predict",
                json=data,
                headers=headers)

print(resp.status_code)
print(resp.text)

Exercise 5

Edit and run the code above. Try a valid and an invalid token!
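For example, a quick check with an invalid token (a hypothetical value, anything other than "abc123") should get back the 401 raised by validate_token:

import requests as req

data = {
    "age": 42,
    "job": "entrepreneur",
    "marital": "married",
    "education": "primary",
    "balance": 558,
    "housing": "yes",
    "duration": 186,
    "campaign": 2,
}

# Hypothetical invalid token
headers = {"Authorization": "Bearer wrong-token"}

resp = req.post("http://localhost:8900/predict",
                json=data,
                headers=headers)

print(resp.status_code)  # expected: 401
print(resp.text)         # expected: {"detail":"Invalid token"}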

Loading Models at Startup

A performance issue with AI APIs is the time required to load models. Notice that, the way we did it, the models are loaded from disk every time the predict route is called.

We can configure the API so that the models are loaded once, when the API starts:

from contextlib import asynccontextmanager

ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    ml_models["ohe"] = load_encoder()
    ml_models["models"] = load_model()
    yield
    ml_models.clear()


app = FastAPI(lifespan=lifespan)

So, the predict route would now have:

    ohe = ml_models["ohe"]
    model = ml_models["models"]

Rather than:

    ohe = load_encoder()
    model = load_model()

Especially for larger models, this can represent a good performance improvement.
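Putting it together, the predict route would look something like this (a sketch, reusing the Person class and the validate_token dependency from the previous sections):

@app.post("/predict")
async def predict(person: Person, user=Depends(validate_token)):
    """
    Route to make predictions using the models loaded at startup.
    """
    ohe = ml_models["ohe"]
    model = ml_models["models"]

    person_t = ohe.transform(pd.DataFrame([person.dict()]))
    pred = model.predict(person_t)[0]

    return {"prediction": str(pred), "username": user["username"]}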

That is all for today!
