Automated Machine Learning (AutoML): A Hands-On Tutorial

We all live in a world of technology. Within it, to put it mildly, Machine Learning (ML) has stood out by offering solutions that handle vast amounts of data and use that data to predict outcomes accurately and efficiently.

However, developing these ML models traditionally involves several complex steps, including:

Data preprocessing
Model selection
Hyperparameter tuning

All this can be quite challenging, particularly for individuals with less experience. What now, then? Well, the solution is Automated Machine Learning (AutoML). It aims to automate these steps, making ML accessible to non-experts and significantly speeding up the process.

In this blog post, we will explore the practical aspects of AutoML using a popular Python library called Auto-sklearn. We shall help you learn everything from

Setting up your environment
Preparing your data
Training your first model
Evaluating its performance.

Let us get started.

What is AutoML?

Automated Machine Learning (AutoML) is the process of automating tasks that would otherwise require considerable machine learning expertise. This includes steps such as data preprocessing, model selection, and hyperparameter optimization.

The beauty of AutoML lies in its capacity to make machine learning more accessible to non-experts.
Imagine being a business analyst with some understanding of data who wants to predict future sales but doesn’t know how to tune machine learning models.
AutoML tools can help here by simplifying the process down to feeding in the data and specifying what you want to predict.
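
To make the idea concrete, here is a minimal sketch of the kind of loop an AutoML tool automates: try several candidate configurations, score each one on held-out data, and keep the best. It is only an illustration in pure Python. The toy data, the `knn_predict` helper, and the candidate values of `k` are all made up for this example, and Auto-sklearn's actual search is considerably more sophisticated than this exhaustive loop.

```python
from collections import Counter

def knn_predict(train, point, k):
    """Predict a label for `point` by majority vote of its k nearest neighbours."""
    # `train` is a list of (feature_value, label) pairs with a single feature.
    neighbours = sorted(train, key=lambda item: abs(item[0] - point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, validation, k):
    """Fraction of validation points that a k-NN model classifies correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

def select_best_k(train, validation, candidates):
    """Score every candidate k and return the best one plus all scores."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # first candidate with the top score
    return best_k, scores

# Toy 1-D data: class 0 sits near 1.0, class 1 near 5.0.
# The point (4.8, 0) is a deliberately mislabelled, noisy sample.
train = [(1.0, 0), (1.2, 0), (1.4, 0), (4.8, 0), (5.0, 1), (5.2, 1), (5.4, 1)]
validation = [(1.1, 0), (4.85, 1), (5.3, 1), (1.3, 0)]

best_k, scores = select_best_k(train, validation, candidates=[1, 3, 5, 7])
print("scores per k:", scores)
print("selected k:", best_k)
```

Here `k=1` is fooled by the noisy sample and `k=7` always predicts the majority class, so the loop settles on a value in between. An AutoML tool makes the same kind of choice, but across many model families and settings at once.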

Commonly used AutoML tools include

Google AutoML
H2O AutoML
Auto-sklearn.

Each of these tools has its strengths. However, as mentioned earlier, we shall focus on Auto-sklearn, for two reasons:

It is compatible with the popular scikit-learn library.
It is easier to use.

Setting Up Your Environment

Before we jump into using AutoML, we need to set up our environment for Auto-sklearn. Let us begin:

Python Installation:

Step 1: Go to python.org

Step 2: Select Python 3.8 or later.

Step 3: Download and install it.

Install Necessary Packages

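You will need to install some specific Python libraries.
The easiest way to do this is with pip, Python’s package installer.
Open your terminal. Note that Auto-sklearn officially targets Linux; its documentation does not list Windows as a supported platform, so Windows users generally need a Linux environment such as WSL or a Docker container.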

Run the following commands:

bash

pip install numpy scipy scikit-learn

pip install auto-sklearn

Test your installation: To ensure everything is set up correctly, run the following commands in your Python interpreter:

python

import autosklearn.classification

print("Auto-sklearn installed successfully!")

If there are no errors, congratulations, you are ready to move to the next step.
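
If the import fails, it can help to see which of the packages are actually installed. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8) to report the Python version check and the installed version of each package. The helper names `installed_version` and `environment_report` are our own, not part of any of these libraries.

```python
import sys
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def environment_report(dist_names, min_python=(3, 8)):
    """Collect the Python version check and package versions in one dict."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in dist_names:
        report[name] = installed_version(name)
    return report

# Package names as they were passed to pip earlier in this tutorial.
report = environment_report(["numpy", "scipy", "scikit-learn", "auto-sklearn"])
for key, value in report.items():
    print(f"{key}: {value}")
```

A value of `None` next to a package name means pip did not install it into the interpreter you are currently running.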

Next Step: Exploring the Dataset

For this exploration, we will use the well-known Iris dataset.

This dataset includes data on various iris plants.
Our goal is to classify each iris plant into one of three species based on attributes like petal length and sepal width.

Let us load and explore this dataset:

python

import pandas as pd

from sklearn.datasets import load_iris

# Load Iris dataset

iris = load_iris()

data = pd.DataFrame(data=iris.data, columns=iris.feature_names)

data['target'] = iris.target

# Display the first few rows of the dataframe

print(data.head())

# Display basic statistical details

print(data.describe())

The code above prints the first few rows of our dataset along with some basic statistics.
This helps us understand the distribution of the features and labels.
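
One quick check worth adding at this stage is how many samples each species has, since a heavily imbalanced target would change how we read accuracy later. The sketch below counts labels with the standard library. The `class_counts` helper is our own, and the `labels` list is a stand-in for `data['target'].tolist()`, built from the known fact that Iris contains 50 samples of each species.

```python
from collections import Counter

# Species names in the order scikit-learn's load_iris() encodes them (0, 1, 2).
SPECIES = ["setosa", "versicolor", "virginica"]

def class_counts(labels, names=SPECIES):
    """Map each species name to the number of samples carrying its label."""
    counts = Counter(labels)
    return {names[label]: counts[label] for label in sorted(counts)}

# Stand-in for data['target'].tolist(): 50 samples per species.
labels = [0] * 50 + [1] * 50 + [2] * 50
print(class_counts(labels))
```

With the DataFrame from this tutorial, `data['target'].value_counts()` should give the same counts keyed by the numeric label.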

Preparing the Data

Before we can use the data to train our model, we need to prepare it.

This involves:

Handling any missing values
Encoding categorical features if necessary
Splitting the data into training and testing sets

For the Iris dataset, the data is already clean, so we will focus on splitting:

python

from sklearn.model_selection import train_test_split

# Prepare data

X = data.drop('target', axis=1)

y = data['target']

# Splitting the dataset into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This splits our data so that 80% (120 of the 150 samples) is used for training our model, and 20% (30 samples) is held back as a test set to evaluate the model.
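
The split above is purely random, so the three species are not guaranteed to appear in equal numbers in a 30-sample test set. scikit-learn's `train_test_split` accepts a `stratify=y` argument to keep class proportions the same in both parts. The pure-Python sketch below shows the idea behind stratification; the `stratified_split` function is our own illustration, not scikit-learn's implementation, and it will not reproduce the exact indices scikit-learn picks.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), keeping class proportions equal."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same share from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Stand-in for the Iris target column: 50 samples per species.
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels)

print("train size:", len(train_idx), "test size:", len(test_idx))
print("test samples per class:", dict(Counter(labels[i] for i in test_idx)))
```

Each species contributes exactly 10 samples to the test set here, which makes per-class results easier to compare.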

Running AutoML with Auto-sklearn

Now comes the exciting part: using Auto-sklearn to automatically find the best machine learning model for our data. Here is how to go about it:

python

import autosklearn.classification as auto_cls

# Create an AutoML classifier

automl = auto_cls.AutoSklearnClassifier(time_left_for_this_task=120, per_run_time_limit=30, memory_limit=1024)

# Fit the classifier to our training data

automl.fit(X_train, y_train)

# Print summary statistics of the search (e.g. number of runs and best validation score)

print(automl.sprint_statistics())

This code configures Auto-sklearn with a total time budget of 120 seconds, a limit of 30 seconds for each individual model run, and a memory limit of 1024 MB.
It then fits multiple models within these constraints.
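
To see why the time budget matters, consider a simplified search that evaluates candidates one after another until time runs out. The sketch below is our own illustration, not Auto-sklearn's implementation: `search_with_budget`, `FakeClock`, and the candidate scores are all hypothetical, and unlike Auto-sklearn it does not interrupt a single run that overshoots its own limit. A fake clock that advances 30 seconds per reading keeps the demo deterministic.

```python
import time

def search_with_budget(candidates, evaluate, total_budget_s, clock=time.monotonic):
    """Evaluate candidates in order until the time budget is used up.

    Returns (best_candidate, best_score, number_evaluated).
    """
    deadline = clock() + total_budget_s
    best_candidate, best_score, evaluated = None, float("-inf"), 0
    for candidate in candidates:
        if clock() >= deadline:
            break  # out of time: keep the best result found so far
        score = evaluate(candidate)
        evaluated += 1
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score, evaluated

class FakeClock:
    """Deterministic clock for the demo: each reading advances time by `step` seconds."""
    def __init__(self, step):
        self.now = 0.0
        self.step = step

    def __call__(self):
        current = self.now
        self.now += self.step
        return current

# Hypothetical validation scores for five candidate configurations.
scores = {"a": 0.90, "b": 0.95, "c": 0.93, "d": 0.99, "e": 0.80}
best, best_score, evaluated = search_with_budget(
    list(scores), scores.get, total_budget_s=120, clock=FakeClock(step=30)
)
print(best, best_score, evaluated)
```

Only three of the five candidates are evaluated before the budget runs out, so the strongest one ("d") is never tried. This is the trade-off behind `time_left_for_this_task`: a larger budget lets the search look at more candidates.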

Evaluating the Model

After training, it is essential to evaluate how well our model performs. We do this using the test set we created earlier:

python

from sklearn.metrics import accuracy_score, confusion_matrix

# Predict on the test data

y_pred = automl.predict(X_test)

# Calculate the accuracy and confusion matrix

accuracy = accuracy_score(y_test, y_pred)

conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy}")

print(f"Confusion Matrix:\n{conf_matrix}")

This gives us the accuracy of our model along with a confusion matrix.
The confusion matrix helps us understand where our model is making mistakes by showing which species are confused with one another.
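
To read the confusion matrix with confidence, it helps to see how it is built. The sketch below recomputes it by hand using the same layout scikit-learn uses (rows are true classes, columns are predicted classes) and derives accuracy and per-class recall from it. The helper functions are our own, and the labels are made up for illustration; they are not results from the model trained above.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Build a confusion matrix: rows are true classes, columns are predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Correct predictions lie on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def recall_per_class(matrix):
    """Share of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 1, 2, 2]

matrix = confusion_counts(y_true, y_pred, n_classes=3)
for row in matrix:
    print(row)
print("accuracy:", accuracy_from_matrix(matrix))
print("recall per class:", recall_per_class(matrix))
```

In this made-up example every mistake sits in the off-diagonal cells between classes 1 and 2. That is the kind of pattern to look for in the real matrix: it tells you which pair of species the model struggles to separate.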

Advanced Areas to Explore and Tips

While what we have covered in this blog post should give you a good start with AutoML, there is much more to explore. Some areas where you can further improve the performance of AutoML systems are:

Handling larger datasets
Optimizing run times
Integrating domain-specific knowledge

Some tips for effective use of AutoML:

Always ensure your data is well-prepped, as the quality of data fed into AutoML directly influences output quality.
Experiment with different settings for time and memory limits based on the computational resources you have.

Conclusion

To conclude, we have now seen how AutoML can significantly simplify the machine learning pipeline, from data preparation through to model evaluation. By using Auto-sklearn, we were able to train and evaluate a model with minimal code and effort.

As machine learning continues to evolve, AutoML tools will become increasingly valuable, allowing more users to leverage the power of machine learning. We encourage you to experiment with different datasets and AutoML configurations to see what you can achieve. Happy modeling!

Adarsh M

Future engineer and software developer. Curious about advanced tools and web technologies. Loves being part of the fast-moving technical world.