In this post, we’re going to finish the work started in the previous one and
finally classify the quantized version of the
Wrist-worn Accelerometer Dataset.
There are many ways to classify datasets with numerical features, but
the Decision Tree algorithm is one of the most intuitively understandable and simple approaches in
terms of its underlying implementation. We are going to build a *Decision Tree*
classifier using the **NumPy** library and generalize it to a
*Random Forest*, an ensemble of randomly generated trees that is less sensitive to noise in the data.

TL;DR: As always, here is a link to the repository with solutions implementing Decision Tree and Random Forest classifiers, as well as their training using the accelerometer dataset.


### Decision Trees

**A decision tree** is a classifier expressed as a recursive partition of the analyzed dataset.
Mathematically, it represents a function that takes as input a vector of attribute
values and returns a “decision” — a single output value. The tree reaches its decision by
performing a sequence of tests on a given observation attributes. Each internal node in the tree
corresponds to a test of the value of one of the observation’s attributes $A_i$, and the
branches from the node are labeled with the possible values of the attribute $A_i = v_{ik}$.
Every leaf node in the tree specifies a value to be returned by the function.

However, in this post we’re describing a slightly simplified version of this general concept.
**Figure 1** shows a graphical representation of the decision tree classifier we’re going to implement.
This specific tree was trained on the Wine Dataset.
The dataset is exceptionally tiny and contains only **178 instances**, each belonging to one of **3 classes**.
These properties make it an ideal candidate for a smoke or sanity test
when developing machine learning code. Even the simplest classifier should show excellent performance
on it, and the training process is very fast. In our case, each node has only two branches
and makes its decision by comparing an attribute like *Magnesium* with a threshold value.
Therefore, even for continuous attributes, each node performs a binary classification task.

*Fig 1. An example of decision tree trained on Wine Dataset*

The figure gives a clue as to why the considered classification method contains the word “tree” in its name. The decision process follows the branches of a directed acyclic graph, choosing a branch at each node depending on the values of the classified observation’s attributes.

Note: The pseudo-code below is neither a standard definition of the Decision Tree algorithm nor a novel implementation. It is intended to provide a *concise* and *intuitively clear* explanation of the generic idea, which I would have liked to have when I started working on my implementation. Here is a link to one of the original papers on decision trees.

Now that we’ve seen a visual representation of a trained classifier, the question is: how can we formally represent the training process? As already mentioned, we’re discussing one of the purest versions of the Decision Tree algorithm, as the pseudo-code shows.

```
def decision_tree(D: Dataset, depth: int) -> Node:
    if depth >= MAX_DEPTH:
        return majority_class(D)
    if empty(D):
        return None
    best_split, best_quality = None, None
    for feature in features(D):
        for value in unique(feature):
            split = select_samples_less_or_equal(value, dataset=D)
            quality = compute_quality(split, D, classes(D))
            if better(quality, best_quality):
                best_split, best_quality = split, quality
    D_left, D_right = apply_split(D, best_split)
    left_child = decision_tree(D_left, depth + 1)
    right_child = decision_tree(D_right, depth + 1)
    return create_node(left_child, right_child)
```

We’re building a decision tree by recursively splitting the dataset into two subsets
based on the values of its attributes. One of the crucial points of the algorithm is the call
to the `compute_quality()` function, which estimates
how “good” (in terms of classification accuracy) the partition is. Note that
we also **limit the depth** of the tree. Otherwise, we could get a very deep
tree with just a few observations per node. Therefore, as soon as we reach
the maximal depth, we return the most frequent class of the subset.

There are several possible quality metrics. One of these metrics is called the Gini impurity score and is defined by the formula:

$$ Gini(D, C) = N \times \left[ 1 - \sum_{c_j \in C}\left(\frac{1}{N}\sum_{y_i \in D}{I(y_i = c_j)}\right)^2 \right] $$

where $N$ is the total number of samples in the dataset $D$, and $I$ denotes an
indicator variable which is equal to `1` when the $i$th instance’s class is equal
to $c_j$, and is equal to `0` otherwise.

As it follows from the definition, the Gini score shows *how impure*
the split is: the higher the score, the worse a specific split, and the more
different classes are represented in the child nodes after the split. Therefore,
this metric is the opposite of quality, and our goal is to achieve the
lowest possible Gini score when splitting a parent node.
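As a concrete illustration, the score above takes only a few lines of NumPy; this is a minimal sketch, and the function name is mine rather than taken from the post’s repository:

```python
import numpy as np

def gini_impurity(y: np.ndarray) -> float:
    """Size-weighted Gini impurity: N * (1 - sum of squared class frequencies)."""
    n = len(y)
    if n == 0:
        return 0.0
    # count how many samples fall into each class
    _, counts = np.unique(y, return_counts=True)
    frequencies = counts / n
    return n * (1.0 - np.sum(frequencies ** 2))
```

A pure subset yields `0`, while a mixed one yields a positive value; when comparing candidate splits, the scores of the two child subsets are summed and the split with the lowest total wins.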

In the next section, we’re going to build a simple decision tree implementation that works with continuous attributes.

### Decision Tree Implementation with Numpy

The snippet below shows a simple implementation of the Decision Tree learning algorithm. There are a few improvements in the Python script that are not present in the pseudo-code discussed above.

The code is mostly self-explanatory, but there are a couple of tips. First of
all, reaching the maximally allowed depth is not the only “shortcut”
condition considered. Lines **87-90** also check whether all observations belong
to a single class and whether the node contains too few observations to be split
again. Second, the implementation doesn’t split the original 2D array into
smaller arrays but operates on indexes instead, to be more memory efficient.
Finally, the tree learning function provides an opportunity to use only a subset
of features to build the tree. Why do we need such functionality? Isn’t it better to always use
all available attributes of the data? The reason is described in the next section.
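Since the full script is not reproduced here, below is a minimal, self-contained sketch of the same idea; the helper names and the dictionary-based node representation are mine. It operates on row indexes rather than array copies, checks the shortcut conditions, and can restrict the split search to a random subset of features:

```python
import numpy as np

def gini(y: np.ndarray) -> float:
    """Size-weighted Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return len(y) * (1.0 - np.sum(p ** 2))

def build_tree(X, y, idx, depth=0, max_depth=5, min_samples=2, n_features=None):
    """Recursively build a tree over the row indexes `idx` (no array copies)."""
    labels = y[idx]
    # shortcut conditions: max depth, a single class, or too few samples
    if depth >= max_depth or len(np.unique(labels)) == 1 or len(idx) < min_samples:
        values, counts = np.unique(labels, return_counts=True)
        return {'leaf': True, 'class': values[np.argmax(counts)]}
    # optionally restrict the search to a random subset of features
    features = np.arange(X.shape[1])
    if n_features is not None:
        features = np.random.choice(features, n_features, replace=False)
    best = None
    for f in features:
        for threshold in np.unique(X[idx, f]):
            left = idx[X[idx, f] <= threshold]
            right = idx[X[idx, f] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            # lower summed impurity of the children means a better split
            score = gini(y[left]) + gini(y[right])
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:
        values, counts = np.unique(labels, return_counts=True)
        return {'leaf': True, 'class': values[np.argmax(counts)]}
    _, f, threshold, left, right = best
    return {'leaf': False, 'feature': f, 'threshold': threshold,
            'left': build_tree(X, y, left, depth + 1, max_depth, min_samples, n_features),
            'right': build_tree(X, y, right, depth + 1, max_depth, min_samples, n_features)}

def predict_one(node, x):
    """Follow the tests from the root down to a leaf."""
    while not node['leaf']:
        node = node['left'] if x[node['feature']] <= node['threshold'] else node['right']
    return node['class']
```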

### Ensemble Methods and Random Forests

Ok, now we have an implemented classifier and can use it to classify various
datasets. However, what happens if we train a tree on various subsets
of the original dataset? **Figure 2** shows the decision trees trained on
three distinct random splits of the dataset.

*Fig 2. Group of decision trees trained using different dataset splits*

Although the root nodes are almost identical, the trees differ a bit in their leaves and inner nodes. It means that you’re going to get different classification results with every new random generator seed. Moreover, the classifier’s accuracy changes significantly from one tree to another. The reason follows directly from the tree’s construction process. When taking various subsets of the original dataset, we’re effectively changing the list of feature values considered when building the nodes. Therefore, with every run we get slightly different Gini scores and different best splits.

How can we alleviate this issue? One possible solution is to build
several trees and average their predictions. The theoretical foundation of this result
comes from the theorem about weak learners.
Briefly speaking, the idea is that if one has a classifier that
performs a bit better than a random one (e.g., better than a fair
coin in the case of binary classification), then a sufficient
number of such weak classifiers would be able to predict the classes with
almost perfect accuracy. Strictly speaking, to apply the theorem the dataset
samples should meet the requirement of being **independent and identically
distributed** random variables, and the distribution the samples
come from shouldn’t change much. Nevertheless, this approach works reasonably
well in practice if each classifier’s training process is a bit different from
the others.
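The effect is easy to simulate. In the toy illustration below (not part of the original code), each weak learner is assumed to be independently correct with probability 0.6, and we measure how often a majority vote is right:

```python
import numpy as np

rng = np.random.default_rng(0)

def majority_vote_accuracy(p_correct: float, n_learners: int, n_trials: int = 10_000) -> float:
    """Empirical probability that a majority of independent weak learners,
    each correct with probability p_correct, votes for the right class."""
    # number of correct votes in each trial follows a binomial distribution
    correct_votes = rng.binomial(n_learners, p_correct, size=n_trials)
    return float(np.mean(correct_votes > n_learners / 2))

single = majority_vote_accuracy(0.6, 1)      # roughly 0.6
ensemble = majority_vote_accuracy(0.6, 501)  # very close to 1.0
```

Note that the simulation draws the votes independently, which mirrors the i.i.d. requirement of the theorem; correlated learners gain much less from voting.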

So now it is time to answer the question we stated in the previous
section: *Why build a tree using only a subset of features?* To make
the trees as different from each other as possible. In this case, every tree
makes different kinds of mistakes, and when averaged they could give us
better accuracy than any of them alone.

The idea of joining a group of weak learners into a single strong one
can be applied not only to Decision Trees but to any kind of “weak” classifier.
When this approach is applied to a group of trees, the strong
learner is called a **Random Forest**.

The snippet below shows a simple wrapper built on top of our decision tree implementation which creates a bunch of decision trees and provides a few convenience methods for making predictions.

Lines **28-36** create an array of decision trees. Every tree is trained on
a bootstrapped sample of the original dataset. Lines **46-66** define the method
called `predict_proba()`, which returns a matrix with class occurrence
frequencies based on the trees’ predictions. The other methods and utilities serve
convenience purposes. The standalone implementation of the classifier and
the helper utilities can be found here.
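In the same spirit, here is a minimal sketch of the bootstrapping wrapper; the class and helper names are mine, and a trivial threshold “stump” stands in for the full tree so that the example stays self-contained:

```python
import numpy as np

class RandomForestSketch:
    """Bootstrap-aggregated ensemble over any single-learner fit/predict pair."""

    def __init__(self, fit_tree, predict_tree, n_trees=10, seed=0):
        self.fit_tree = fit_tree          # (X, y) -> fitted model
        self.predict_tree = predict_tree  # (model, X) -> class labels
        self.n_trees = n_trees
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        n = len(X)
        # every learner sees its own bootstrapped sample (drawn with replacement)
        self.trees = []
        for _ in range(self.n_trees):
            idx = self.rng.integers(0, n, size=n)
            self.trees.append(self.fit_tree(X[idx], y[idx]))
        return self

    def predict_proba(self, X):
        # class occurrence frequencies across the learners' votes
        votes = np.stack([self.predict_tree(t, X) for t in self.trees])
        return np.stack([np.mean(votes == c, axis=0) for c in self.classes_], axis=1)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

def fit_stump(X, y):
    """Toy weak learner: threshold halfway between the per-class means of feature 0."""
    a, b = X[y == 0, 0], X[y == 1, 0]
    if len(a) == 0 or len(b) == 0:  # a bootstrap sample may miss a class entirely
        return X[:, 0].mean()
    return (a.mean() + b.mean()) / 2

def predict_stump(threshold, X):
    return (X[:, 0] > threshold).astype(int)
```

The wrapper only needs the two callables, so any single-tree implementation (like the one described above) can be plugged in instead of the stump.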

Now we’re ready to join everything described in this post together and apply our DIY classifiers to the Wrist-worn Accelerometer Dataset.

### Accelerometer Dataset Classification

As we know from the previous post, the accelerometer dataset is not prepared to be directly fed into a classical supervised learning algorithm that expects an array with samples $X$ and an array with targets $y$. Therefore, the first step is to apply the dataset quantization algorithm. Then, we need to convert targets from strings into numerical values. Next, we split the quantized dataset into the training and validation subsets. Finally, we’re training an instance of a Random Forest classifier on the training subset and checking its performance on the validation subset.
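The data-preparation part of these steps can be sketched as follows (the quantization step itself comes from the previous post and is omitted; the helper names here are mine):

```python
import numpy as np

def encode_labels(targets):
    """Map string targets to integer class ids (np.unique sorts the names)."""
    class_names, y = np.unique(targets, return_inverse=True)
    return class_names, y

def train_val_split(X, y, val_ratio=0.2, seed=1):
    """Shuffle row indexes and split them into training and validation parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_ratio)
    val, train = idx[:n_val], idx[n_val:]
    return X[train], y[train], X[val], y[val]
```

After these two calls, the training subset goes into the Random Forest’s `fit` and the validation subset is used only for measuring accuracy.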

The snippet below shows all these steps. An interested reader could also check this Jupyter notebook that contains the same steps, and can be used as a playground to investigate separate steps of our pipeline.

A question one might ask when using an ensemble classifier is: how many weak
learners do we need to achieve the best possible performance? One way to
find out is to add classifiers to the ensemble one after another and check the
ensemble’s performance. **Figure 3** shows the dependency between the
logarithm of the number of trees $N$ in the ensemble and the validation accuracy measured
in percent. The $N$ value varies from $1$ to $1000$. Each orange dot
reflects the accuracy for a specific $N$ value, and the blue curve is a polynomial
approximation of these discrete measurements.

*Fig 3. Relationship between ensemble accuracy and its size*

We’re getting approximately **47%** accuracy on the validation subset with
**14** classes, so we can claim that our classifier successfully grasps the
relationships between samples and targets, performing much better than a random guess.

As the curve shows, we get a significant accuracy increase when going from a single tree to several dozen trees. However, the ensemble accuracy is bounded: it slowly stops increasing when the number of trees reaches several hundred. To get better results, we could try to randomize the training process even more, add more features, or gather extra data.

### Conclusion

Decision trees and their ensembles are intuitively clear yet powerful machine learning techniques. There are many improvements to the original algorithm and a bunch of great libraries that allow training tree ensembles in parallel and achieving higher accuracy, like the excellent XGBoost library. Nevertheless, even a naïve implementation shows decent results and proves that it is not too challenging to implement a machine learning classifier from scratch.