The role of a data scientist is evolving rapidly, with increasing expectations to go beyond traditional tasks like data cleaning and model building. Today, the demand for “full-stack data scientists” is growing—professionals who can manage the entire lifecycle of a data science project, including deploying machine learning (ML) models into production. This guide explores how to transition from developing models in Jupyter notebooks to deploying them as APIs using FastAPI and Docker.
What is a Full-Stack Data Scientist?
A full-stack data scientist is someone who can handle every stage of the data science lifecycle, from data preprocessing and model development to deployment and monitoring. While mastering every skill isn’t feasible for most, having a working knowledge of software engineering, ML Ops, and deployment tools is essential. By expanding their skill set, data scientists can make their work more impactful by ensuring their models are accessible and usable in real-world applications.
Step 1: Model Development
The first step in deploying an ML model is building it. For this example, we’ll create a simple linear regression model to predict salaries based on features like experience level, job title, and company size.
Preparing the Data
The dataset includes categorical variables that need encoding before being fed into the model. Ordinal encoding is used for features like experience level and company size, while dummy variables are created for job titles and employment types.
```python
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import joblib

# Ordinal-encode experience level and company size
encoder = OrdinalEncoder(categories=[['EN', 'MI', 'SE', 'EX']])
salary_data['experience_level_encoded'] = encoder.fit_transform(salary_data[['experience_level']])

# Company size categories assumed to be 'S', 'M', 'L'
size_encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
salary_data['company_size_encoded'] = size_encoder.fit_transform(salary_data[['company_size']])

# Create dummy variables for job titles and employment types
salary_data = pd.get_dummies(salary_data, columns=['job_title', 'employment_type'], drop_first=True)

# Split data into training and testing sets
# (drop the target and the raw, now-encoded categorical columns;
# any other non-numeric columns should be dropped or encoded as well)
X = salary_data.drop(columns=['salary_in_usd', 'experience_level', 'company_size'])
y = salary_data['salary_in_usd']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a linear regression model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

# Save the trained model
joblib.dump(regr, 'lin_regress.sav')
```
While this example produces a basic model with a weak fit (R² = 0.27), the focus here is on deployment rather than optimization.
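One way to check the R² is to score the model on the held-out test set created above; with scikit-learn this is a one-liner:

```python
# LinearRegression.score returns R^2 on the given data by default
r_squared = regr.score(X_test, y_test)
print(f"R^2 on test data: {r_squared:.2f}")
```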
Step 2: Creating an API with FastAPI
An API (Application Programming Interface) allows different software systems to communicate. Using FastAPI, we can expose our trained model as an endpoint that accepts user input and returns predictions.
FastAPI Script
The following script initializes a FastAPI application and sets up two endpoints:
- A `GET` endpoint for welcoming users.
- A `POST` endpoint for sending input data and receiving predictions.
```python
from fastapi import FastAPI
from pydantic import BaseModel
import pandas as pd
import joblib

app = FastAPI()

class PredictionFeatures(BaseModel):
    experience_level_encoded: float
    company_size_encoded: float
    job_title_Data_Scientist: int

# Load the trained model
model = joblib.load('lin_regress.sav')

@app.get("/")
def root():
    return {"message": "Welcome to the Salary Prediction API"}

@app.post("/predict")
def predict(features: PredictionFeatures):
    input_data = pd.DataFrame([features.dict()])
    # Align the request columns with the columns seen during training,
    # filling dummy columns not supplied in the request with 0
    # (feature_names_in_ is available in scikit-learn >= 1.0)
    input_data = input_data.reindex(columns=model.feature_names_in_, fill_value=0)
    prediction = model.predict(input_data)
    return {"predicted_salary": float(prediction[0])}
```
Save the script as main.py, then run the API with Uvicorn:

```bash
uvicorn main:app --reload
```

You can now access the API locally at http://127.0.0.1:8000, and FastAPI's auto-generated interactive docs at http://127.0.0.1:8000/docs.
Step 3: Testing the API
Using Python's `requests` library, you can test the API by sending sample input data via a POST request.
```python
import requests

url = "http://127.0.0.1:8000/predict"
data = {
    "experience_level_encoded": 3,
    "company_size_encoded": 2,
    "job_title_Data_Scientist": 1
}

response = requests.post(url, json=data)
print(response.json())
```
The API will return the predicted salary in JSON format.
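If you prefer testing from the command line, the same request can be sent with curl (assuming the server from the previous step is still running locally):

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
     -H "Content-Type: application/json" \
     -d '{"experience_level_encoded": 3, "company_size_encoded": 2, "job_title_Data_Scientist": 1}'
```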
Step 4: Deploying with Docker
What is Docker?
Docker simplifies deployment by packaging applications and their dependencies into containers that can run consistently across different environments.
Creating a Dockerfile
A Dockerfile defines the instructions for building a Docker image of your application.
```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Build the Docker image:
```bash
docker build -t salary-prediction-api .
```
Run the container:
```bash
docker run -p 8000:8000 salary-prediction-api
```
Your API is now accessible at http://localhost:8000.
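If the endpoint doesn't respond, a quick way to confirm the container is running and the port mapping took effect is:

```bash
# List running containers and their port mappings
docker ps
```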
Benefits of Using Docker
- Portability: Containers include all dependencies, ensuring consistent behavior across environments.
- Scalability: Easily deploy multiple instances of your application (see the example after this list).
- Efficiency: Containers are lightweight compared to virtual machines.
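As a quick illustration of the scalability point, a second container from the same image can be started on a different host port; the port mapping below is just an example:

```bash
# Run an additional container in the background, mapping host port 8001
# to the container's port 8000
docker run -d -p 8001:8000 salary-prediction-api
```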
Conclusion
Deploying machine learning models is a critical skill for modern data scientists aiming to expand their expertise into ML Ops and software engineering. By leveraging tools like FastAPI and Docker, you can transition from static Jupyter notebooks to fully functional APIs that make your models accessible to others.
Becoming a full-stack data scientist doesn’t mean mastering every skill but having enough knowledge to contribute across various stages of the data science lifecycle effectively.
With these tools in hand, you’re well on your way to delivering impactful solutions that bridge the gap between development and production.