Data Mining Notes Technical Chapter-Chapter5 Classification Algorithm KNN


Build a classifier based on relative distance measurements

Among such classifiers, the most representative is the KNN (k-nearest-neighbor) algorithm.

Reference: "Principles of Statistical Learning Li Hang"

1. Introduction to KNN algorithm

Data set: $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$

where $x_i \in \mathcal{X} \subseteq \mathbb{R}^n$ is the feature vector of the instance, $y_i \in \mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$ is the category of the instance, and $i = 1, 2, \ldots, N$.

Output: the class $y$ to which the instance $x$ belongs.

  • According to the given distance metric, find the $k$ points in the training set $T$ that are nearest to $x$; the neighborhood of $x$ covering these $k$ points is written as $N_k(x)$;
  • In $N_k(x)$, determine the category $y$ of $x$ according to the classification decision rule (such as majority voting); a brief code sketch of both steps follows this list.
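
Below is a minimal sketch of the two steps above, using a brute-force search with Euclidean distance and majority voting; the function and variable names are illustrative, not taken from the reference.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # distance metric: Euclidean distance from the query point x to every training instance
    distances = np.linalg.norm(X_train - x, axis=1)
    # the k nearest training points form the neighborhood N_k(x)
    neighbor_idx = np.argsort(distances)[:k]
    # classification decision rule: majority vote among the k neighbors
    votes = Counter(y_train[i] for i in neighbor_idx)
    return votes.most_common(1)[0][0]

# tiny usage example with made-up data
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [5.8, 6.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```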

Model

In the k-nearest-neighbor method, once the training set, the distance metric (such as Euclidean distance), the value of k, and the classification decision rule (such as majority voting) are determined, the class of any new input instance is uniquely determined. This is equivalent to dividing the feature space into subspaces according to these elements and fixing the class of every point in each subspace. This fact is seen most clearly in the nearest-neighbor (k = 1) algorithm.

In the feature space, for each training instance point $x_i$, all points that are closer to $x_i$ than to any other training instance point form a region called a cell. Each training instance point has a cell, and the cells of all training instance points constitute a partition of the feature space. The nearest-neighbor algorithm takes the class $y_i$ of the instance $x_i$ as the class label of all points in its cell, so the category of every point in each cell is determined.

Distance metric

In a feature space, the distance between two instance points reflects how similar the two instance points are.

Euclidean distance is generally used; more generally, the $L_p$ distance (also called the Minkowski distance) can be used.
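
For reference, the $L_p$ distance between two instances $x_i, x_j \in \mathbb{R}^n$ is

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}, \quad p \ge 1,$$

where $p = 2$ gives the Euclidean distance and $p = 1$ gives the Manhattan distance.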

Choice of k value

The choice of k value will have a significant impact on the results of the k-nearest neighbor method.

If a smaller value of k is chosen, it is equivalent to making the prediction with training examples in a smaller neighborhood. The approximation error of "learning" is reduced, since only training examples close (similar) to the input instance play a role in the prediction. The disadvantage is that the estimation error of "learning" increases: the prediction becomes very sensitive to the neighboring instance points, and if a neighboring point happens to be noise, the prediction will be wrong. In other words, decreasing k makes the overall model more complex and prone to overfitting.

If a larger value of k is chosen, it is equivalent to making the prediction with training examples in a larger neighborhood. The advantage is that the estimation error of learning is reduced, but the disadvantage is that the approximation error of learning increases: training examples far from (dissimilar to) the input instance also influence the prediction, which can make it wrong. Increasing k means the overall model becomes simpler.

In practice, k is usually given a relatively small value, and cross-validation is typically used to select the optimal k.
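
A hedged sketch of selecting k by cross-validation, assuming scikit-learn is available (the data set and parameter range here are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, -1.0
for k in range(1, 16):                                   # try relatively small k values
    model = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(model, X, y, cv=5).mean()    # mean 5-fold CV accuracy
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```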

Classification rules

The KNN classification decision rule is majority voting.

The loss function is the 0-1 loss function, and the classification function is:

$$f: \mathbb{R}^n \rightarrow \{c_1, c_2, \ldots, c_K\}$$

Then, for the set $N_k(x)$ of the $k$ nearest training instance points, if the predicted class is $c_j$, the error (misclassification) rate is:

$$\frac{1}{k}\sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k}\sum_{x_i \in N_k(x)} I(y_i = c_j)$$

To minimize the misclassification rate, the rate of correct classification must be maximized, so the majority-voting rule is exactly equivalent to empirical risk minimization.
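
Equivalently, the majority-voting decision rule can be written with the notation above as

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad j = 1, 2, \ldots, K.$$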

kd tree

When implementing KNN, the main consideration is how to perform fast k-nearest-neighbor search over the training data. The simplest implementation is a linear scan, but when the feature space has high dimensionality or the training set is large, this computation is very time-consuming. To improve search efficiency, a special data structure, the kd tree, can be used to store the training data. The kd tree is a tree data structure that stores points of a k-dimensional space for fast retrieval. In essence, the kd tree is a binary tree that represents a partition of the k-dimensional space (note that the k in "kd tree" refers to the dimensionality of the space, not the k of KNN).
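
A minimal sketch of kd-tree construction (the node structure and function names are illustrative, not from the reference): points are split at the median along coordinate axes that alternate with depth, so each node corresponds to an axis-aligned region of the k-dimensional space.

```python
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])

def build_kd_tree(points, depth=0):
    if not points:
        return None
    k = len(points[0])                         # dimensionality of the space
    axis = depth % k                           # cycle through the splitting axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                     # the median point becomes the splitting node
    return Node(
        point=points[mid],
        axis=axis,
        left=build_kd_tree(points[:mid], depth + 1),
        right=build_kd_tree(points[mid + 1:], depth + 1),
    )

# small 2-D example: the root splits on the x-axis at the median point
tree = build_kd_tree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.point)  # (7, 2)
```

Searching such a tree for nearest neighbors then only needs to visit branches whose regions could contain a closer point, which is what makes kd-tree search faster than a linear scan on low-dimensional data.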