Это 31-й день моего участия в августовском испытании обновлений. Узнайте подробности события:Испытание августовского обновления
Заметки о машинном обучении Эндрю Нг - (13) Обучение без присмотра
В этой статье в основном представлены неконтролируемое обучение и алгоритм кластеризации K-средних.
Clustering
Unsupervised Learning: Introduction
In supervised learning, we are given a set of labeled training set ( ( given)) to fit a hypothesis to it:
In contrast, in the unsupervised learning problems, we are given data that does not have any labels associated with it ( only, no ):
The slide above shows that here's a set of points add in no labels. So, in unsupervised learning what we do is we give this sort of unlabeled training set to an algorithm and we just ask the algorithm find some structure in the data for us.
Given this data set one type of structure we might have an algorithm find is that it looks like this data set has points grouped into two separate clusters (drawn in green) and so an algorithm that finds clusters like the ones I've just circled is called a clustering algorithm.
K-Means Algorithm
In the clustering problem we are given an unlabeled data set and we would like to have an algorithm automatically group the data into coherent subsets or into coherent clusters for us.
The K-Means Algorithm is by far the most popular and widely used clustering algorithm.
Given an unlabeled data set to group (cluster) them into K (let's say two, for example) clusters, what the K-Means Algorithm
do is:
- Randomly initialize two points, called the cluster centroids.
- Cluster assignment step: going through each of the examples, for each of data dots depending on which cluster centroid it's closer to, one point is going to be assigned to one of the K cluster centroids (color them into the same color, illustratingly).
- Move centroid step: take the K cluster centroids, and we are going to move them to the average of the points colored the same colour.
We keep on move centroid then redo a cluster assignment. Doing this looply until the cluster centroids not change any further (i.e. the K-Means has converged). And, it's done a good job finding the clusters in the data.
K-means algoithm
Input:
-
(number of clusters)
-
(training set)
(drop convention)
Output:
- subsets
Algorithm:
K-Means for non-separated clusters
K-Means can do well in both well-separated cluster and non-separated clusters:
K-means Optimization Objective
Notations:
- = index of cluster () to which example is currently assigned.
- = cluster centroid ().
- = cluster centroid of cluster to which example has been assigned.
For example, let's say we have a that is currently assigned to cluster , so in this case, our and , where cluster centroid of cluster is notated as .
Optimization objective:
Our cost function is also called distortion.
In the K-means algorithm, what it doing in our cluster assignment step is minimize wit (holding fixed), and the move centroid step is minimizing the wit .
Random initialization
There is a way to random initialization our cluster centroids:
- Should have
- Randomly pick training examples.
- Set equal to these examples.
i.e. we pick distinct random integers from , set .
Local optima
Beacuse of the random initialization, we sometimes get a good cluster (get a global optima, like the one at the right top of the picture) and we may also get a bad one (get a local optima, like the two at the right bottom of the picture).
To avoid the K-means algorithm stop at a loacal optima, we can try this:
Choosing the Number of Clusters
The most common way to choose the number of clusters is actually to choose it by hand.
Choosing the Elbow number is a capable way to choose the value of :
However the elbow method is not always make sense for a condition like this:
We can hardly find a elbow of it.
So, there is another thought to make it:
Sometimes, you are running K-means to get clusters to use for some later/downstream purpose. Elaluate K-means based on a metric for how well it performs for that later purpose.
T-shirt sizing for example:
In particular, what we can do is, think about this from the perspective of the T-shirt business and ask: "Well if I have five segments, then how well will my T-shirts fit my customers and so, how many T-shirts can I sell? How happy will my customers be?" What really makes sense, from the perspective of the T-shirt business, in terms of whether, I want to have Goer T-shirt sizes so that my T-shirts fit my customers better. Or do I want to have fewer T-shirt sizes so that I make fewer sizes of T-shirts. And I can sell them to the customers more cheaply. And so, the t-shirt selling business, that might give you a way to decide, between three clusters versus five clusters.