Machine Learning: Train a Random Forest for Classification¶

This tutorial demonstrates how to train a Random Forest¹ classifier using scikit-learn².

1. Acquire training data¶

The data used in this example comes from the QSAR group's biodegradation database on Kaggle³. Each molecule in the dataset is described by 41 descriptors, and the goal is to predict whether the molecule is biodegradable.

The dataset has been pre-processed to encode class labels as 0 and 1.
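For readers starting from a raw copy of the data, the encoding step might look like the sketch below. The label strings `RB` and `NRB` (ready / not ready biodegradable) and the choice of which one maps to 1 are assumptions for illustration; the tutorial's file already contains 0 and 1, so this step is not needed when using it.

```python
# Assumed raw label names and mapping; not confirmed by the tutorial's file.
LABEL_TO_INT = {"NRB": 0, "RB": 1}


def encode_labels(raw_labels):
    """Map string class labels to 0/1, failing loudly on unknown values."""
    encoded = []
    for label in raw_labels:
        key = label.strip().upper()
        if key not in LABEL_TO_INT:
            raise ValueError(f"unexpected class label: {label!r}")
        encoded.append(LABEL_TO_INT[key])
    return encoded


raw = ["RB", "NRB", "NRB", "RB"]
print(encode_labels(raw))  # prints [1, 0, 0, 1]
```

Raising on an unknown label, rather than silently skipping it, keeps a typo in the source data from shrinking or skewing the training set.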

Download the dataset here. For the purposes of this tutorial, it is referred to as "data_to_train_with.csv".

2. Upload the training data¶

Click the Dropbox button in the left sidebar to navigate to the Dropbox Page. Then click the Upload button:

[Figure: Dropbox Page with Upload]

When the browser's upload window appears, navigate to the downloaded file and select it. If the upload succeeds, the file appears in the Dropbox.

3. Copy the classification workflow from the bank¶

Click the Bank Workflows button in the left sidebar to navigate to the Bank Workflows Page. Search for the "Python ML Train Classification" workflow owned by the "Curators" account, and copy it to your account.

A diagram and detailed description of this workflow can be found here.

4. Create the ML job¶

Create a new job by clicking Create Job in the left sidebar. Give the job a descriptive name, such as "Python ML Tutorial". Then click the Actions Button and choose Select Workflow.

[Figure: Job Designer with Circles]

In the Select Workflow dialogue, search for "Python ML Train Classification" and select it.

5. Select the dataset¶

Once the ML Training workflow is selected, the Materials tab is replaced with a Dataset tab.

[Figure: Dataset Tab]

Click the Actions Button and choose Select Dataset. Select "data_to_train_with.csv" from the file explorer. A preview appears on the Dataset tab, confirming that the data has been loaded.

6. Configure the workflow¶

Open the Workflows Tab to view the training workflow. Two subworkflows are available: Set Up the Job and Machine Learning.

Warning: Do not modify the setup subworkflow

The Set Up the Job subworkflow is automatically configured during training. Modifying it can disrupt the Predict workflow.

Select the Machine Learning subworkflow. The following workflow units are visible:

Setup Packages and Variables: configures the job and downloads required packages via pip
Data Input: reads the training data from disk
Train Test Split: splits the data into training and testing sets
Data Standardize: scales each feature to mean 0 and standard deviation 1
Model Train and Predict: trains the model and generates predictions
ROC Curve Plot: draws a Receiver Operating Characteristic (ROC) curve ⁴
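The units above correspond roughly to a standard scikit-learn pipeline. The following is a minimal sketch of that sequence, not the code the workflow actually runs. The synthetic data from `make_classification`, the 80/20 split, and the hyperparameters are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data Input: synthetic stand-in for the real file (41 descriptors, binary class).
X, y = make_classification(
    n_samples=500, n_features=41, n_informative=10, random_state=0
)

# Train Test Split: hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Data Standardize: fit the scaler on the training rows only, then apply
# the same transformation to the test rows to avoid information leakage.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Model Train and Predict: fit the forest and score the held-out rows.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_std, y_train)
test_scores = model.predict_proba(X_test_std)[:, 1]  # probability of class 1

# ROC Curve Plot: these arrays are the points that would be drawn.
fpr, tpr, thresholds = roc_curve(y_test, test_scores)
auc = roc_auc_score(y_test, test_scores)
print(f"test ROC AUC: {auc:.3f}")
```

Tree ensembles are largely insensitive to feature scaling, so the standardization step matters less for a Random Forest than it would for other model types.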

6.1. Set the target column and problem category¶

Open the Important Settings portion of the workflow editor. Set target_column_name to "Class" and problem_category to "classification".
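To illustrate what these two settings control, the sketch below separates a target column from the descriptor columns. It is a simplified stand-in, not the platform's implementation: the descriptor names `d1` and `d2`, the sample rows, and the idea that any non-classification category would use float targets are assumptions.

```python
import csv
import io

target_column_name = "Class"          # value entered in Important Settings
problem_category = "classification"   # value entered in Important Settings

# Tiny in-memory stand-in for the CSV; column names d1 and d2 are hypothetical.
csv_text = "d1,d2,Class\n3.9,2.6,1\n4.1,1.2,0\n3.5,0.7,1\n"


def split_features_target(text, target, category):
    """Return (features, labels) with the target column removed from features."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows or target not in rows[0]:
        raise KeyError(f"target column {target!r} not found")
    # Assumption: classification targets are integer class labels.
    cast = int if category == "classification" else float
    features = [
        [float(value) for name, value in row.items() if name != target]
        for row in rows
    ]
    labels = [cast(row[target]) for row in rows]
    return features, labels


X, y = split_features_target(csv_text, target_column_name, problem_category)
print(X)  # prints [[3.9, 2.6], [4.1, 1.2], [3.5, 0.7]]
print(y)  # prints [1, 0, 1]
```

In this sketch the column lookup is case-sensitive, so "class" would not match a header named "Class".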

[Figure: Important Settings with target column name set]

7. Submit the job¶

Click the check mark in the Header Menu, at the upper right of the Job Designer, to save the job.

[Figure: Jobs Tab with ML Training Calculation Set Up]

The job can now be run.

8. Analyze the training results¶

After a few minutes, the job completes. The Results tab shows two calculated properties. The first, Machine Learning - Model Train and Predict, is the Predict workflow generated by the training job. Use it to make predictions on new data.
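Conceptually, predicting on new data means reusing both the fitted scaler and the fitted model from training. The sketch below shows one way to do this in plain scikit-learn by bundling the two into a pipeline and restoring it from a serialized copy. How the platform stores and reloads the trained model is not described here, so the use of `pickle` and the synthetic data are assumptions.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: the first 250 rows are labeled, the last 50 are "new".
X, y = make_classification(
    n_samples=300, n_features=41, n_informative=10, random_state=1
)
X_known, y_known, X_new = X[:250], y[:250], X[250:]

# Bundle scaling and the model so they are always applied together.
trained = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=1),
)
trained.fit(X_known, y_known)

# Round trip through bytes, standing in for saving and reloading the model.
restored = pickle.loads(pickle.dumps(trained))

# The restored pipeline scales the new rows with the training statistics.
predicted_classes = restored.predict(X_new)
print(predicted_classes[:10])
```

Keeping the scaler inside the pipeline ensures new rows are standardized with the means and standard deviations learned during training, not with statistics recomputed from the new data.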

The second result is Machine Learning - ROC Curve Plot, which contains the ROC curve used to assess the model.
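A ROC curve is commonly summarized by the area under it (AUC), which equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The pure-Python sketch below computes AUC from that definition; the labels and scores are made-up values for illustration, not output from the tutorial's job.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # illustrative predicted probabilities of class 1
print(roc_auc(labels, scores))  # prints 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a curve that bows toward the upper-left corner of the plot indicates a better classifier.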

[Figure: Results Tab Showcasing ROC Curve]

9. Video walkthrough¶

This tutorial is demonstrated in the following animation: