Руководство для начинающих по алгоритму случайного леса

На мой взгляд, большинство руководств по машинному обучению недостаточно удобны для начинающих.

Last month, I wrote an introduction to Neural Networks for complete beginners. This post will adopt the same strategy, meaning it again assumes ZERO prior knowledge of machine learning. Мы узнаем, что такое случайные леса и как они работают с нуля.

Готовы? Давайте погрузимся.

1. Деревья решений ?

Случайный лес ??? на самом деле просто набор деревьев решений ?, связанных вместе (о-о-о, вот почему он называетсяforest). We need to talk about trees before we can get into forests.

Look at the following dataset:

Если бы я сказал вам, что появилась новая точка с Икс xx координата 1 11, как вы думаете, какого цвета это будет?

Blue, right?

You just evaluated a decision tree in your head:

Это простое дерево решений с однимdecision node that tests x < 2 x < 2x<2. If the test passes ( x < 2 x < 2x<2), we take the left branchи выбираем синий.Если тест не пройден (x≥2x\geq 2x≥2), мы берем правыйbranch and pick Green.

Decision Trees are often used to answer that kind of question: given a labelled dataset, how should we classify new samples?

Labelled: Our dataset is labelled because each point has a class (color): blue or green.

Classify: To classify a new datapoint is to assign a class (color) to it.

Вот набор данных, который теперь имеет 3 класса вместо 2:

Наше старое дерево решений уже не так хорошо работает. ( Икс , у ) (х, у) (х, у),

Если Икс ≥ 2 x \geq 2x≥2, мы все еще можем уверенно классифицировать его как зеленый.
Если Икс

We need to add another decision node to our decision tree:

Довольно просто, не так ли?Это основная идея деревьев решений.

2. Training a Decision Tree

Давайте начнем тренировать дерево решений!Мы снова будем использовать набор данных 3 класса:

2.1 Training a Decision Tree: The Root Node

Our first task is to determine the root decision node in our tree. Which feature ( x xx or y yy) will it test on, and what will the test threshold be? For example, the root node in our tree from earlier used the x xx feature with a test threshold of 222:

Интуитивно нам нужен узел решения, который делает «хорошее» разделение, где «хорошее» можно приблизительно определить какseparating different classes as much as possible, Корневой узел выше делает «хорошее» разделение:all the greens are on the right, and no greens are on the left.

Таким образом, наша цель теперь состоит в том, чтобы выбрать корневой узел, который даст нам «наилучшее» возможное разделение.But how do we quantify how good a split is?Это сложно, я написалan entire blog post about one way to do this using a metric called Gini Impurity. ← Рекомендую прочитать прямо сейчаспрежде чем вы продолжите — мы будем использовать эти концепции позже в этом посте.

Welcome back!

Hopefully, you just read my Gini Impurity post, Если вы этого не сделали, вот очень короткий TL; DR: мы можем использовать примесь Джини для вычисления значения, называемогоGini Gain for any split. A better split has higher Gini Gain.

Вернемся к проблеме определения нашего корневого узла решения. Теперь, когда у нас есть способ оценить разбиение, все, что нам нужно сделать, это найти наилучшее возможное разбиение! Ради простоты мы просто собираемсяtry every possible split and use the best one (the one with the highest Gini Gain). This is not the fastest way to find the best split, but it is the easiest to understand.

Trying every split means trying

Every feature ( x xx or y yy).
Все «уникальные» пороги.We only need to try thresholds that produce different splits.

For example, here are the thresholds we might select if we wanted to use the x xx coordinate:

Давайте сделаем пример расчета коэффициента Джини для Икс знак равно 0,4 х = 0,4х=0,4 разделить.

Split	Left Branch	Right Branch
x = 0.4 x = 0.4x=0.4

First, we calculate the Gini Impurity of the whole dataset:

г я н я т я а л знак равно ∑ я знак равно 1 3 п ( я ) * ( 1 − п ( я ) ) знак равно 3 * ( 1 3 * 2 3 ) знак равно 2 3 \begin{align} G_{initial} &= \sum_{i=1}^3 p(i) * (1 - p(i)) \\ &= 3 * (\frac{1}{3} * \ frac{2}{3}) \\ &= \boxed{\frac{2}{3}} \\ \end{aligned}Ginitial=i=1∑3p(i)∗(1−p (i))=3∗(31∗32)=32

Then, we calculate the Gini Impurities of the two branches:

г л е ф т знак равно 0 * 1 + 1 * 0 + 0 * 1 знак равно 0 G_{left} = 0 * 1 + 1 * 0 + 0 * 1 = \boxed{0}Gleft=0∗1+1∗0+0∗1=0 г р я г час т знак равно 3 8 * 5 8 + 2 8 * 6 8 + 3 8 * 5 8 знак равно двадцать один 32 \begin{align} G_{right} &= \frac{3}{8} * \frac{5}{8} + \frac{2}{8} * \frac{6}{8} + \frac{ 3}{8} * \frac{5}{8} \\ &= \boxed{\frac{21}{32}} \\ \end{aligned}Gright=83∗85+82∗ 86+83∗85=3221

Finally, we calculate Gini Gain by subtracting the weighted branch impurities from the original impurity:

Усиление знак равно г я н я т я а л − 1 9 г л е ф т − 8 9 г р я г час т знак равно 2 3 − 1 9 * 0 − 8 9 * двадцать один 32 знак равно 0,083 \begin{align} \text{Gain} &= G_{initial} - \frac{1}{9} G_{left} - \frac{8}{9} G_{right} \\ &= \frac{2 }{3} - \frac{1}{9} * 0 - \frac{8}{9} * \frac{21}{32} \\ &= \boxed{0,083} \\ \end{aligned}Усиление =Ginitial−91Gleft−98Gright=32−91∗0−98∗3221=0.083

Смущен тем, что только что произошло? Я сказал вам, что вы должны были прочитатьmy Gini Impurity postЭто объяснит все эти вещи с Джини.

We can calculate Gini Gain for every possible split in the same way:

Split	Left Branch	Right Branch	Gini Gain
x = 0.4 x = 0.4x=0.4			0.083 0.0830.083
x = 0.8 x = 0.8x=0.8			0.048 0.0480.048
x = 1.1 x = 1.1x=1.1			0.133 0.1330.133
x = 1.3 x = 1.3x=1.3			0.233 0.2330.233
x = 2 x = 2x=2			0.333 0.3330.333
x = 2.4 x = 2.4x=2.4			0.191 0.1910.191
x = 2.8 x = 2.8x=2.8			0.083 0.0830.083
y = 0.8 y = 0.8y=0.8			0.083 0.0830.083
y = 1.2 y = 1.2y=1.2			0.111 0.1110.111
y = 1.8 y = 1.8y=1.8			0.233 0.2330.233
y = 2.1 y = 2.1y=2.1			0.233 0.2330.233
y = 2.4 y = 2.4y=2.4			0.111 0.1110.111
y = 2.7 y = 2.7y=2.7			0.048 0.0480.048
y = 2.9 y = 2.9y=2.9			0.083 0.0830.083

Попробовав все пороги для обоих Икс хх и у yy, мы обнаружили, что Икс знак равно 2 Разделение x = 2x=2 имеет самый высокий коэффициент Джини, поэтому мы заставим наш корневой узел принятия решений использовать функцию xxx с порогом 222. Вот что у нас есть:

Making progress!

2.2: Training a Decision Tree: The Second Node

Пришло время сделать наш второй узел решения.Давайте (произвольно) перейдем к левой ветке.Теперь мы используем только те точки данных, которые берут левую ветвь. (i.e. the datapoints satisfying x < 2 x < 2x<2), specifically the 3 blues and 3 reds.

To build our second decision node, we just do the same thing! We try every possible split for the 6 datapoints we have and realize that y = 2 y = 2y=2 is the best split. We make that into a decision node and now have this:

Наше дерево решений почти готово…

2.3 Training a Decision Tree: When to Stop?

Давайте продолжим и попробуем создать третий узел принятия решений. На этот раз мы будем использовать правую ветвь от корневого узла. Единственными точками данных в этой ветви являются 3 зеленых.

Again, we try all the possible splits, but they all

Are equally good.
Иметь прирост Джини 0 (примесь Джини уже была 0 и не может быть ниже).

Добавлять здесь узел решений не имеет смысла, потому что это не улучшит наше дерево решений, поэтому мы сделаем этот узелleaf node and slap the Green label on it. This means that мы будем классифицировать любую точку данных, которая достигает этого узла, как зеленую.

Если мы продолжим с двумя оставшимися узлами, произойдет то же самое: мы сделаем нижний левый узел нашим синим листовым узлом, а нижний правый узел сделаем нашим красным листовым узлом, Это приведет нас к окончательному результату:

Как только все возможные ветви в нашем дереве решений заканчиваются конечными узлами, все готово.Мы обучили дерево решений!

3. Случайные леса ?????

Наконец-то мы готовы поговорить о случайных лесах. Помните, что я говорил ранее?

A Random Forest is actually just a bunch of Decision Trees bundled together.

Это правда, но это небольшое упрощение.

3.1 Bagging

Consider the following algorithm to train a bundle of decision trees given a dataset of n nn points:

Sample, with replacement, n nn training examples from the dataset.
Train a decision tree on the n nn samples.
Repeat t tt times, for some t tt.

To make a prediction using this model with t tt trees, we aggregate the predictions from the individual decision trees and either

Take the majority vote if our trees produce class labels (like colors).
Take the average if our trees produce numerical values (e.g. when predicting temperature, price, etc).

This technique is called bagging, or bootstrap aggregating. The sampling with replacement we did is known as a bootstrap sample.

Деревья решений в мешках очень близки к Random Forests — им не хватает только одного…

3.2 Бэггинг → Случайный лес

Bagged decision trees have only one parameter: t tt, the number of trees.

Random Forests have a second parameter that controls how many features to try when finding the best split. Our simple dataset for this tutorial only had 2 22 features ( x xx and yyy), but most datasets will have far more (hundreds or thousands).

Suppose we had a dataset with p pp features. Instead of trying all features every time we make a new decision node, we only try a subset of the features, usually of size p. We do this primarily to inject randomness that makes individual trees more unique and reduces correlation between trees, что повышает общую производительность леса. Этот метод иногда называют какfeature bagging.

4. Now What?

Это введение в Random Forests для начинающих! Краткий обзор того, что мы сделали:

Introduced decision trees, the building blocks of Random Forests.
Learned how to train decision trees by iteratively making the best split possible.
Defined Gini Impurity, метрика, используемая для количественной оценки того, насколько «хорошо» разделение.
Saw that a random forest = a bunch of decision trees.
Understood how bagging combines predictions from multiple trees.
Learned that feature bagging is the difference between bagged decision trees and a random forest.

A few things you could do from here:

Поэкспериментируйте с scikit-learnDecisionTreeClassifier and RandomForestClassifier classes on real datasets.
Попробуйте написать простую реализацию дерева решений или случайного леса с нуля. Я буду рад дать рекомендации или проверить код! Простоtweet at me or email me.
Read about Gradient Boosted Decision Trees and play with XGBoost, a powerful gradient boosting library.
Read about ExtraTrees, расширение Random Forests, или поиграйте с scikit-learn’sExtraTreesClassifier class.

That concludes this tutorial. I like writing about Machine Learning (but also other topics), so subscribe if you want to get notified about new posts.

Thanks for reading!