# An introduction to decision trees

Decision trees are one of the major data structures in machine learning. Their inner workings rely on heuristics which, while satisfying to our intuitions, give remarkably good results in practice (notably when they are aggregated into "random forests"). Their tree structure also makes them readable by human beings, contrary to other "black box" approaches such as neural networks.

In this introduction, we will see how decision trees work and provide a few theoretical justifications along the way. We will also (briefly) talk about their extension to random forests. To get started, you should already be familiar with the general setting of supervised learning.

## Training of a decision tree

A decision tree models a hierarchy of tests on the values of its input, which
are called *attributes* or *features*. At the end of these tests, the predictor
outputs a numerical value or chooses a label from a pre-defined set of options.
The former is called *regression*, the latter *classification*. For instance,
the following tree could be used to decide if a French student finds a lecture
interesting or not, *i.e.* classification in the binary set {*oui*, *non*}
(French booleans ;p), based on the boolean attributes {*motivation*,
*surprenant*, *difficile*} and a discrete attribute *durée* \(\in\) {1 h, 2
h, 3 h}. We could also have made the last one a continuous attribute, but we'll
come back to that.

The input to our decision is called an *instance* and consists of values for
each attribute. It is usually denoted by \((\bfx, y)\) where \(y\) is
the attribute the tree should learn to predict and \(\bfx = (x_1, \ldots,
x_m)\) are the \(m\) other attributes. A decision tree is learned on a
collection of instances \(T = \{(\bfx, y)\}\) called the *training set*.

| date | motivation | durée (h) | difficile | surprenant | température (K) |
|---|---|---|---|---|---|
| 11/03 | oui | 1 | non | oui | 271 |
| 14/04 | oui | 4 | oui | oui | 293 |
| 03/01 | non | 3 | oui | non | 297 |
| 25/12 | oui | 2 | oui | non | 291 |
| ... | ... | ... | ... | ... | ... |
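As a concrete (and purely illustrative) sketch, a training set like the one above could be represented in Python as a list of pairs \((\bfx, y)\), with \(\bfx\) a dict of attribute values. The labels \(y\) below are invented for the example, since the table does not show the target column:

```python
# Each instance is a pair (x, y): x maps attribute names to values,
# y is the label to predict. The labels here are made up for illustration.
training_set = [
    ({"motivation": "oui", "duree": 1, "difficile": "non",
      "surprenant": "oui", "temperature": 271}, "oui"),
    ({"motivation": "oui", "duree": 4, "difficile": "oui",
      "surprenant": "oui", "temperature": 293}, "non"),
    ({"motivation": "non", "duree": 3, "difficile": "oui",
      "surprenant": "non", "temperature": 297}, "non"),
    ({"motivation": "oui", "duree": 2, "difficile": "oui",
      "surprenant": "non", "temperature": 291}, "oui"),
]
```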

Given our observations \(T = \{ (\bfx, y) \}\), we now want to build a decision tree that predicts \(y\) given new instances \(\bfx\). As of today (TN: 2011) there are essentially two families of algorithms to do this: Quinlan trees and CART trees. Both follow a divide-and-conquer strategy that we can sum up, for discrete attributes, by the following pseudocode:

```
DecisionTree(T: training set):
    if "stop condition":
        return leaf(T)
    choose the "best" attribute i between 1 and m
    for each value v of i:
        T[v] = {(x, y) from T such that x_i = v}
        t[v] = DecisionTree(T[v])
    return node(i, {v: t[v]})
```

In what follows, we use uppercase \(T\) for training sets and lowercase
\(t\) for trees. A tree is either a leaf `leaf(T)`, where we save a
subset \(T\) of the original training set, or a node `node(i, {v: t[v]})`
that tests attribute \(i\) and has a child tree \(t_v\) for each
potential value \(v\) of the attribute. The two statements with quotation
marks correspond to heuristics specific to each algorithm:

- *Stop condition*: the condition used to decide when it is time to produce an answer rather than grow the tree further to better fit the data. For instance, the condition \(|T| = 1\) produces detailed trees with an excellent score on the training set (for each instance \((\bfx, y) \in T\), the tree predicts exactly \(y\) given \(\bfx\)), but such trees are quite deep and, due to overfitting, unlikely to generalize well outside of it.
- *Best attribute*: the goal of this heuristic is to evaluate locally which attribute yields the "most information" on (or is the "most correlated" to) the result to predict. We will see examples of such criteria below.
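To make the recursion concrete, here is a minimal Python sketch of the pseudocode for discrete attributes. The two heuristics are filled in with deliberately simple placeholder choices (stopping when labels are pure or attributes run out, and picking the attribute minimizing a weighted majority error), which are assumptions for illustration, not the criteria discussed later:

```python
from collections import Counter

def leaf(T):
    """Leaf storing a subset of the training set; predicts its majority label."""
    return ("leaf", Counter(y for _, y in T).most_common(1)[0][0])

def node(i, children):
    """Internal node testing attribute i, one child tree per observed value."""
    return ("node", i, children)

def majority_error(T):
    """Fraction of T misclassified by always predicting the majority label."""
    return 1 - Counter(y for _, y in T).most_common(1)[0][1] / len(T)

def decision_tree(T, attributes):
    # Stop condition (a simple choice): all labels agree, or no attribute left.
    if len({y for _, y in T}) == 1 or not attributes:
        return leaf(T)
    # "Best" attribute (a placeholder criterion): minimize the weighted
    # majority error of the subsets T[v] induced by splitting on i.
    def split_score(i):
        total = 0.0
        for v in {x[i] for x, _ in T}:
            Tv = [(x, y) for x, y in T if x[i] == v]
            total += len(Tv) / len(T) * majority_error(Tv)
        return total
    best = min(attributes, key=split_score)
    children = {}
    for v in {x[best] for x, _ in T}:
        Tv = [(x, y) for x, y in T if x[best] == v]
        children[v] = decision_tree(Tv, [a for a in attributes if a != best])
    return node(best, children)

def predict(t, x):
    """Follow the tests down to a leaf (assumes x's values were seen in training)."""
    if t[0] == "leaf":
        return t[1]
    _, i, children = t
    return predict(children[x[i]], x)
```

Instances are dicts keyed by attribute name, so `x[i]` plays the role of \(x_i\) in the pseudocode.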

When learning a continuous attribute \(x_i \in \mathbb{R}\), we adapt the
above pseudocode by selecting a *split value* \(v\) corresponding to a test
\(x_i \leq v\). We will denote by `node(i, v, t[≤], t[>])` the
corresponding node constructor. In the particular case where all attributes are
continuous, the resulting decision tree is a binary tree.

## Regression

... translation in progress...

### Choosing a split criterion

... translation in progress...

### Why minimize variance?

... translation in progress...

## Classification

... translation in progress...

### Measuring leaf homogeneity

... translation in progress...

### Choosing a split criterion

... translation in progress...

## Extensions

... translation in progress...

### Bagging

... translation in progress...

### Random forests

... translation in progress...

## To go further

This article is a translation of *Une introduction aux arbres de décision*, a report I wrote in 2011 after my internship at INRIA with Marc Lelarge. Warm thanks to Marc for supporting this work, for our discussions, and for the cool topics we worked on :-)