Machine Learning: Train a Random Forest for Classification
This tutorial demonstrates how to train a Random Forest [1] classifier using scikit-learn [2].
1. Acquire training data
The data used in this example come from the QSAR group's biodegradation database on Kaggle [3]. Each molecule in the dataset is described by 41 unique descriptors, and the goal is to predict whether the molecule is biodegradable.
The dataset has been pre-processed to encode class labels as 0 and 1.
Download the dataset here. For the purposes of this tutorial, it is referred to as "data_to_train_with.csv".
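
Before uploading, it can be worth confirming that the label column really contains only 0 and 1. The following is a minimal sketch using only the Python standard library. It assumes the label column is named "Class" (the name used in step 6.1); the descriptor column names and the sample rows are made up for illustration and are not taken from the real dataset.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for "data_to_train_with.csv". The descriptor
# column names are hypothetical; only "Class" comes from the tutorial.
sample_csv = io.StringIO(
    "descriptor_1,descriptor_2,Class\n"
    "3.919,2.6909,1\n"
    "4.170,2.1144,1\n"
    "3.932,3.2512,0\n"
)

rows = list(csv.DictReader(sample_csv))

# Count how many rows carry each label value (read as strings by csv).
label_counts = Counter(row["Class"] for row in rows)

# Every label should already be encoded as "0" or "1".
unexpected = set(label_counts) - {"0", "1"}
if unexpected:
    raise ValueError(f"Unexpected class labels: {sorted(unexpected)}")

print(dict(label_counts))
```

To check the downloaded file instead of the sample, replace the `io.StringIO(...)` object with `open("data_to_train_with.csv", newline="")`.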
2. Upload the training data
Click the Dropbox button in the left sidebar to navigate to the Dropbox Page. Then click the Upload button:

When the browser's upload window appears, navigate to the downloaded file and select it. If successful, the file appears in the dropbox.
3. Copy the classification workflow from the bank
Click the Bank Workflows button in the left sidebar to navigate to the Bank Workflows Page. Search for the "Python ML Train Classification" workflow owned by the "Curators" account, and copy it to your account.
A diagram and detailed description of this workflow can be found here.
4. Create the ML job
Create a new job by clicking Create Job in the left sidebar. Give the job a descriptive name, such as "Python ML Tutorial". Then click the Actions Button and choose Select Workflow.

In the Select Workflow dialogue, search for "Python ML Train Classification" and select it.
5. Select the dataset
Once the ML Training workflow is selected, the Materials tab is replaced with a Dataset tab.

Click the Actions Button and choose Select Dataset. Select "data_to_train_with.csv" from the file explorer. A preview appears on the dataset tab, confirming the data has been loaded.
6. Configure the workflow
Open the Workflows Tab to view the training workflow. Two subworkflows are available: Set Up the Job and Machine Learning.
Do not modify the setup subworkflow
The Set Up the Job subworkflow is automatically configured during training. Modifying it can disrupt the Predict workflow.
Select the Machine Learning subworkflow. The following workflow units are visible:
- Setup Packages and Variables: configures the job and downloads required packages via pip
- Data Input: reads the training data from disk
- Train Test Split: splits the data into training and testing sets
- Data Standardize: scales the data to mean 0 and standard deviation 1
- Model Train and Predict: handles model training and prediction
- ROC Curve Plot: draws a Receiver Operating Characteristic (ROC) curve [4]
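
The units listed above correspond to familiar scikit-learn steps. The sketch below is not the code the workflow actually runs; it is a rough standalone equivalent, assuming a synthetic dataset with 41 features in place of the real file and arbitrarily chosen values for the test fraction, the number of trees, and the random seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data Input: a synthetic stand-in with 41 descriptors and 0/1 labels
# (assumption: the real workflow reads the uploaded CSV instead).
X, y = make_classification(
    n_samples=500, n_features=41, n_informative=10, random_state=0
)

# Train Test Split: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Data Standardize: fit the scaler on the training rows only, then apply
# the same transformation to the test rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model Train and Predict: fit a random forest and predict the
# probability of the positive class (label 1) for each test row.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_scaled, y_train)
test_probabilities = model.predict_proba(X_test_scaled)[:, 1]

# ROC Curve Plot: summarized here by the area under the ROC curve.
auc = roc_auc_score(y_test, test_probabilities)
print(f"ROC AUC on the held-out rows: {auc:.3f}")
```

Fitting the scaler on the training split alone keeps information from the test rows out of the preprocessing step, which is why the split comes before the standardization.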
6.1. Set the target column and problem category
Open the Important Settings portion of the workflow editor. Set target_column_name to "Class" and problem_category to "classification".
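
As an illustration of what these two settings mean, the sketch below shows how a target column name is typically used to separate labels from features before training. It is an assumption about how such settings are commonly consumed, not the workflow's actual implementation; the descriptor column names and the rows are made up.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Values entered in the Important Settings portion of the editor.
target_column_name = "Class"
problem_category = "classification"

# Made-up rows; the descriptor column names are hypothetical.
data = pd.DataFrame(
    {
        "descriptor_1": [3.9, 4.2, 3.9, 3.0],
        "descriptor_2": [2.7, 2.1, 3.3, 2.7],
        "Class": [1, 1, 0, 0],
    }
)

# Everything except the target column is treated as an input feature.
features = data.drop(columns=[target_column_name])
target = data[target_column_name]

# This sketch only covers the classification case used in the tutorial.
if problem_category != "classification":
    raise ValueError(f"Unsupported problem category: {problem_category}")

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(features, target)
print(model.classes_)
```

If the name given in target_column_name does not match a column in the file exactly, including capitalization, a lookup like `data[target_column_name]` fails with a `KeyError`.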

7. Submit the job
Click the check mark in the Header Menu, at the upper right of the job designer, to save the job.

The job can now be run.
8. Analyze the training results
After a few minutes, the job completes. The Results tab shows two calculated properties. The first, Machine Learning - Model Train and Predict, is the predict workflow generated by the training job, which can be used for predictions on new data.
The second result is Machine Learning - ROC Curve Plot, containing the ROC curve for model assessment.
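
To help read the plot, the sketch below shows how the points of a ROC curve are derived from true labels and predicted scores using scikit-learn. The four labels and scores are made up for illustration and are unrelated to the tutorial's results.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of the positive class (1).
true_labels = [0, 0, 1, 1]
predicted_scores = [0.1, 0.4, 0.35, 0.8]

# Each (false positive rate, true positive rate) pair is one point on the
# ROC curve, obtained by cutting the scores at a different threshold.
false_positive_rate, true_positive_rate, thresholds = roc_curve(
    true_labels, predicted_scores
)

# The area under the curve condenses the whole plot into one number.
auc = roc_auc_score(true_labels, predicted_scores)

for fpr, tpr in zip(false_positive_rate, true_positive_rate):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"Area under the curve: {auc:.2f}")
```

An area of 0.5 corresponds to a model that ranks molecules no better than chance, while 1.0 means every biodegradable molecule is scored above every non-biodegradable one. A curve that hugs the upper-left corner of the plot therefore indicates a better model.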

9. Video walkthrough
This tutorial is demonstrated in the following animation: