Forest-Based Classification and Regression

[Figure: Forest-based Classification and Regression workflow diagram]


Creates models and generates predictions using an adaptation of Leo Breiman's random forest algorithm, which is a supervised machine learning method. Predictions can be performed for both categorical variables (classification) and continuous variables (regression). Explanatory variables are fields in the attribute table of the training features. The tool can be run to train a model and assess its performance, or to train a model and predict results for another dataset.

Analysis Type


Specifies the operation mode of the tool. The tool can be run to train a model only to assess performance, or to train a model and predict to features. The analysis types are as follows:

  • Train a model to assess model performance: A model will be trained and fit to the input data. Use this option to assess the accuracy of your model before generating predictions on a new dataset. The output of this option will be a feature service of your fitted training data, model diagnostics, and an optional table of variable importance.
  • Train a model and predict values: Predictions or classifications will be generated for features. Explanatory variables must be provided for both the training features and the features to be predicted. The output of this option will be a feature service of your predicted values, model diagnostics, and an optional table of variable importance.

Train a model to assess model performance


Use this mode to fit a model and investigate the fit.

With this choice, a model will be trained using an input layer. Use this option to assess the accuracy of your model before generating predictions on a new dataset. This option will output model diagnostics in the messages window and apply the model to your training data.

Train a model and predict values


Use this mode to fit a model and apply it to a dataset to generate predictions.

Predictions or classifications will be generated for features. The output of this option will be a feature service, model diagnostics, and an optional table of variable importance.

Choose training layer


The feature layer containing the variable to predict and the fields that will be used to generate the prediction.

In addition to choosing a layer from your map, you can select Choose Analysis Layer at the bottom of the drop-down list to browse your contents for a big data file share dataset or feature layer.

Choose a layer to predict values for


A feature layer representing locations where predictions will be made. This layer must contain a field corresponding to each explanatory variable used from the training features.

In addition to choosing a layer from your map, you can select Choose Analysis Layer at the bottom of the drop-down list to browse your contents for a big data file share dataset or feature layer.

Choose the field to predict


The field from the training features containing the values to be used to train the model. This field contains known (training) values of the variable that will be used to predict at unknown locations. If the values are categorical (for example, Maple, Pine, Oak), select the Categorical check box.

Choose one or more explanatory variables


One or more fields representing the explanatory variables that help predict the value or category of the variable to predict. Select the Categorical check box for any variable that represents classes or categories (such as land cover or presence or absence), and leave it unchecked for any variable that is continuous.

Number of trees


The number of trees to create in the model. More trees will generally result in more accurate predictions, but the model will take longer to calculate. The default number of trees is 100.
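
As a rough illustration of why the number of trees matters, the sketch below shows how a random forest generally combines the outputs of its individual trees: an average for regression and a majority vote for classification. The function names are hypothetical, and the tool's own adaptation of the algorithm may combine trees differently.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(tree_predictions):
    """Combine numeric predictions from individual trees by averaging them."""
    return mean(tree_predictions)


def aggregate_classification(tree_predictions):
    """Combine class predictions from individual trees by majority vote."""
    # most_common(1) returns a list with one (label, count) pair.
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical outputs of three trees for one feature.
print(aggregate_regression([10.0, 12.0, 14.0]))
print(aggregate_classification(["Maple", "Pine", "Maple"]))
```

With more trees, a single unusual tree has less influence on the averaged value or the winning vote, at the cost of building and evaluating every additional tree.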

Minimum leaf size


The minimum number of observations required to keep a leaf (that is, the terminal node on a tree without further splits). The default minimum for regression is 5 and the default for classification is 1. For very large datasets, increasing these numbers will decrease the run time of the tool.
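
One way to picture the effect of the minimum leaf size is to count how many split positions remain allowed in a node. The sketch below assumes the observations in a node are sorted along one variable and that a split is only allowed when both resulting leaves keep at least the minimum number of observations. The function name is hypothetical and this is not the tool's internal code.

```python
def valid_split_positions(n_observations, min_leaf_size):
    """Return the positions where a sorted node of n_observations can be cut
    so that both sides keep at least min_leaf_size observations."""
    return list(range(min_leaf_size, n_observations - min_leaf_size + 1))


# A node with 12 observations, using the two defaults named above.
print(valid_split_positions(12, 5))  # regression default
print(valid_split_positions(12, 1))  # classification default
```

A larger minimum leaves fewer candidate splits to evaluate and stops splitting earlier, which is consistent with the shorter run time on very large datasets.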

Maximum tree depth


The maximum number of splits that will be made down a tree. A large maximum depth allows more splits to be created, which may increase the chances of overfitting the model. The default is data driven and depends on the number of trees created and the number of variables included.
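
The following minimal sketch shows how a depth limit caps the size of a single tree. It assumes a toy regression tree on one explanatory variable that always splits at the median, which is a simplification chosen for brevity and not the split criterion used by the tool. The names build_tree and count_leaves are hypothetical.

```python
from statistics import mean


def build_tree(points, max_depth, min_leaf_size=5, depth=0):
    """Recursively split (x, y) points at the median x value.

    Splitting stops when the depth limit is reached or when a split would
    leave fewer than min_leaf_size points in a leaf (min_leaf_size >= 1).
    """
    if depth >= max_depth or len(points) < 2 * min_leaf_size:
        # Leaf: predict the mean of the observed y values.
        return {"value": mean(y for _, y in points), "n": len(points)}
    ordered = sorted(points)
    mid = len(ordered) // 2
    return {
        "threshold": ordered[mid][0],
        "left": build_tree(ordered[:mid], max_depth, min_leaf_size, depth + 1),
        "right": build_tree(ordered[mid:], max_depth, min_leaf_size, depth + 1),
    }


def count_leaves(node):
    """Count the terminal nodes of a tree built by build_tree."""
    if "value" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


points = [(x, 2.0 * x) for x in range(64)]
for max_depth in (2, 4, 6):
    tree = build_tree(points, max_depth, min_leaf_size=1)
    print(max_depth, count_leaves(tree))
```

Each extra level can double the number of leaves, so a deep tree can end up with leaves that describe only a handful of training features, which is where the overfitting risk comes from.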

Data available per tree (%)


Specifies the percentage of the features in the training layer used for each decision tree. The default is 100 percent of the data. Samples for each tree are taken randomly from two-thirds of the data specified.

Each decision tree in the forest is created using a random sample or subset (approximately two-thirds) of the training data available. Using a lower percentage of the input data for each decision tree increases the speed of the tool for very large datasets.
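
The sampling described above can be sketched as follows. This is one possible reading of the description, not the tool's implementation: the percentage first limits how much of the training data is available, and each tree then draws a random two-thirds of that portion. Sampling is done without replacement here for simplicity, and the function name and parameters are hypothetical.

```python
import random


def sample_for_tree(features, percent_available=100, rng=None):
    """Draw the training sample for one tree.

    percent_available limits how much of the training data can be used;
    the tree then receives a random two-thirds of that portion.
    """
    rng = rng or random.Random()
    n_available = max(1, round(len(features) * percent_available / 100))
    available = rng.sample(features, n_available)
    n_tree = max(1, round(n_available * 2 / 3))
    return rng.sample(available, n_tree)


features = list(range(300))  # stand-in for 300 training features
sample = sample_for_tree(features, percent_available=50, rng=random.Random(0))
print(len(sample))
```

Halving the percentage halves the number of features each tree has to process, which is why a lower value speeds up the tool on very large datasets.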

Number of randomly sampled variables


Specifies the number of explanatory variables used to create each decision tree.

Each of the decision trees in the forest is created using a random subset of the explanatory variables specified. Increasing the number of variables used in each decision tree will increase the chances of overfitting your model, particularly if there are one or a few dominant variables. A common practice is to use the square root of the total number of explanatory variables if the variable to predict is categorical, or to divide the total number of explanatory variables by 3 if the variable to predict is numeric.
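
The two rules of thumb mentioned above, and the random selection of variables, can be sketched in a few lines. The function names and the example variable names are hypothetical, and the rounding choices are assumptions, since the text does not say how fractional results are handled.

```python
import math
import random


def sqrt_rule(n_variables):
    """Square root of the number of explanatory variables, at least 1."""
    return max(1, round(math.sqrt(n_variables)))


def one_third_rule(n_variables):
    """One third of the number of explanatory variables, at least 1."""
    return max(1, n_variables // 3)


def pick_variables(variables, k, rng=None):
    """Randomly choose k explanatory variables without repetition."""
    rng = rng or random.Random()
    return rng.sample(variables, k)


variables = ["elevation", "slope", "aspect", "rainfall", "landcover",
             "soil_type", "distance_to_road", "population", "temperature"]
print(sqrt_rule(len(variables)), one_third_rule(len(variables)))
print(pick_variables(variables, sqrt_rule(len(variables)), random.Random(42)))
```

Because each subset is drawn at random, a dominant variable is left out of some trees, which gives the remaining variables a chance to contribute.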

Choose how explanatory fields are matched


Specifies how the explanatory variables in the training layer are matched to the corresponding fields in the prediction layer. Only the variables used in training will be included in the table.
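
Conceptually, the matching is a lookup from each training field to the field that holds the same variable in the prediction layer. The sketch below uses a plain dictionary for that lookup. The structure, the function name, and the field names are all hypothetical and do not represent the tool's actual input format.

```python
def match_fields(prediction_record, field_matching):
    """Rename the fields of a prediction record to the training field names.

    field_matching maps each training field name to the corresponding
    prediction field name. Fields not used in training are dropped.
    """
    missing = [pred for pred in field_matching.values()
               if pred not in prediction_record]
    if missing:
        raise KeyError(f"Prediction layer is missing fields: {missing}")
    return {train: prediction_record[pred]
            for train, pred in field_matching.items()}


# Hypothetical field names for illustration only.
field_matching = {"elevation": "elev_m", "landcover": "lc"}
record = {"elev_m": 120, "lc": "forest", "id": 7}
print(match_fields(record, field_matching))
```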

Number of runs for validation


Specifies the percentage (between 0 percent and 50 percent) of features in the training layer to reserve as the test dataset for validation. The model will be trained without this random subset of data, and the observed values for those features will be compared to the predicted value. The default is 10 percent.
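
The holdout described above can be sketched as a random split of the training features. This is a minimal illustration under the stated limits (0 to 50 percent, default 10), not the tool's implementation, and the function name is hypothetical.

```python
import random


def split_for_validation(features, percent=10, rng=None):
    """Randomly reserve a percentage of features as a test set.

    Returns (training_features, test_features).
    """
    if not 0 <= percent <= 50:
        raise ValueError("percent must be between 0 and 50")
    rng = rng or random.Random()
    shuffled = list(features)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * percent / 100)
    return shuffled[n_test:], shuffled[:n_test]


features = list(range(200))  # stand-in for 200 training features
train, test = split_for_validation(features, percent=10, rng=random.Random(1))
print(len(train), len(test))
```

The model would then be trained on the first list only, and its predictions for the second list compared with the observed values to estimate accuracy.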

Result layer name


This is the name of the layer that will be created in My Content and added to the map. The default name is based on the tool name and the input layer name. If the layer already exists, you will be asked to provide another name.

The results returned will depend on the type of analysis. If you are training to assess model fit, results will contain a layer of training data fit to the model and result info assessing the model fit. If you are training and predicting, results will contain a layer of the training data fit to the model, a layer of predicted results, and result info assessing the model fit.

Using the Save result in drop-down box, you can specify the name of a folder in My Content where the result will be saved.