What are the main parameters of the random forest model?

  • max_depth: the maximum depth of each tree, i.e. the longest allowed path from the root node to a leaf
  • min_samples_split: the minimum number of observations a node must contain before it can be split
  • max_leaf_nodes: the maximum number of leaf nodes per tree, which caps how far each tree can grow
  • min_samples_leaf: the minimum number of samples that must remain in each leaf node
  • n_estimators: the number of trees in the forest
  • max_samples: the number (or fraction) of samples drawn from the original dataset to train each individual tree
  • max_features: the maximum number of features considered when searching for the best split at each node
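
As a rough illustration, the parameters listed above can be set when constructing a forest. The sketch below assumes scikit-learn's RandomForestClassifier and its spellings of the parameter names (min_samples_split, max_samples); the synthetic dataset and the specific values are arbitrary choices for demonstration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (an assumption, not from the article).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=5,           # longest allowed root-to-leaf path
    min_samples_split=10,  # samples a node needs before it may be split
    min_samples_leaf=4,    # samples that must remain in every leaf
    max_leaf_nodes=20,     # cap on the number of leaves per tree
    max_samples=0.8,       # fraction of rows drawn for each tree (needs bootstrap=True, the default)
    max_features="sqrt",   # features considered at each split
    random_state=0,
)
model.fit(X, y)

# Inspect the fitted trees to confirm the limits were respected.
deepest = max(tree.get_depth() for tree in model.estimators_)
most_leaves = max(tree.get_n_leaves() for tree in model.estimators_)
print(len(model.estimators_), deepest, most_leaves)
```

The printed values show how many trees were built and confirm that no tree exceeds the depth and leaf limits that were set.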

Selecting the depth of the trees in a random forest

The deeper a tree, the more information it can extract from the data. There is a limit to this, however: although a random forest is fairly robust against overfitting, very deep trees can still learn the noise in the data and overfit on it. There is no hard rule of thumb for choosing the depth, but the literature suggests a few ways to constrain tree growth and prevent overfitting:

  • limit the maximum depth of a tree
  • limit the number of test nodes
  • limit the minimum number of objects at a node required to split
  • do not split a node when at least one of the resulting subsample sizes is below a given threshold
  • stop developing a node if splitting it does not sufficiently improve the fit
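
One way to act on these tips is a small cross-validated search. The sketch below assumes scikit-learn, where the tips map roughly onto max_depth (depth limit), max_leaf_nodes (number of test nodes), min_samples_split (objects required to split), min_samples_leaf (minimum subsample size) and min_impurity_decrease (minimum improvement in fit). This mapping, the synthetic data and the grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Noisy synthetic data (flip_y randomly relabels 10% of samples); illustrative only.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

# Candidate growth limits; None means the depth is unrestricted.
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5],
    "min_impurity_decrease": [0.0, 0.01],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation, scored by accuracy by default
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The best combination depends on the data, so the grid should be adapted rather than copied.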

How many trees do we need in a random forest?

The number of trees in a random forest is set by n_estimators. Adding trees reduces the variance of the ensemble's predictions and so helps against overfitting, although the gains level off while the training cost keeps growing. There is no fixed rule of thumb for choosing the number of trees; it is tuned to the data, typically starting from the square of the number of features (n) and then adjusting until the results stop improving.
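
A common way to tune the number of trees is to grow the forest in steps and watch a validation metric. The sketch below assumes scikit-learn and uses the out-of-bag (OOB) score with warm_start=True so that each step adds trees to the existing forest instead of retraining from scratch. The dataset and the tree counts are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# warm_start keeps the trees already fitted; oob_score evaluates each sample
# using only the trees that did not see it during training.
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_scores = {}
for n_trees in [25, 50, 100, 200]:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)  # only the additional trees are trained
    oob_scores[n_trees] = forest.oob_score_

for n_trees, score in oob_scores.items():
    print(n_trees, round(score, 3))
```

Once the OOB score stops changing noticeably between steps, adding more trees mostly increases training time.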
