A Complete Guide to K Nearest Neighbor (KNN) in Machine Learning

Swagata Ashwani
2 min read · Jun 2, 2022

K Nearest Neighbor, or K-NN, is a supervised, non-parametric learning algorithm, which means it makes no assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs its work only at classification time.

KNN classifies new data points based on the K most similar records in the dataset.

Let’s take a look at the steps in the KNN algorithm:

  1. Choose a K, let’s say K= 3
  2. Choose a data point you want to classify
  3. Calculate the distance of all data points from this data point.
  4. Find the K nearest neighbours
  5. Pick the majority class

In the example above, we are trying to find which class the ? point should belong to: class blue or class orange. With K=3, we calculate the distance of all data points from ?, and take the nearest 3 data points. From the example, we can see that the majority of those neighbors belong to class orange, hence we classify the ? data point as class orange as well.
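A minimal from-scratch sketch of these five steps (assuming NumPy; the toy dataset, class labels, and function name below are made up for illustration):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 3: distance of every training point from the new point (Euclidean)
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 4: indices of the K nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 5: majority class among those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features, classes "blue" and "orange"
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array(["blue", "blue", "orange", "orange", "orange"])

print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))  # -> orange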

The KNN classifier class in Python (scikit-learn):

from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)

From the above class, we need to decide on two things:

n_neighbors: How to choose K?
metric: How to calculate distances?
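
A short usage sketch of this class, assuming scikit-learn is installed (the toy data mirrors the from-scratch example above and is purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array(["blue", "blue", "orange", "orange", "orange"])

# metric='minkowski' with p=2 is the ordinary Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)
knn.fit(X_train, y_train)

print(knn.predict([[4.5, 5.0]]))  # -> ['orange']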

How to choose K

Choosing the right number of neighbors (K) is very important.

Low K (like K=1): predictions based on only one data sample could be greatly impacted by noise in the data (outliers, mislabeling)

Large K: more robust to noise, but the nearest “neighborhood” can get too inclusive, breaking the “locality”, and a class with only a few samples in it will always be outvoted by the other classes

Rule of thumb in selecting K:
K = √n, where n is the number of data points
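
A quick sketch of this rule of thumb (with the common refinement of rounding to an odd K to avoid ties in binary classification; n below is an illustrative sample count):

import math

n = 400                   # number of training samples (illustrative)
k = round(math.sqrt(n))   # rule of thumb: K = sqrt(n)
if k % 2 == 0:
    k += 1                # prefer an odd K to avoid ties between two classes
print(k)                  # -> 21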

How to calculate distances

Another important factor for KNN is deciding the distance metric.

Possible distance metrics:

  1. Euclidean/Manhattan: These metrics are used for real-valued features
    For features of similar type: p = 2, Euclidean
    For features of mixed type (lengths, ages, salaries): p = 1, Manhattan (taxi-cab)
  2. Hamming: This metric is used for binary-valued features
    Number of positions where the values of two vectors differ
  3. Jaccard: This metric is used for Boolean-valued features
    Ratio of shared values to the total number of values of two vectors
  4. Cosine similarity: This metric is used for high-dimensional feature spaces
    The angle between two vectors (similarity irrespective of their magnitudes)
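
A small sketch of these metrics using SciPy's distance helpers (assuming SciPy is installed; the example vectors are made up):

from scipy.spatial import distance

a, b = [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]
print(distance.euclidean(a, b))   # real-valued features, p = 2
print(distance.cityblock(a, b))   # Manhattan (taxi-cab), p = 1

u, v = [1, 0, 1, 1], [1, 1, 0, 1]
print(distance.hamming(u, v))     # fraction of positions where the vectors differ
print(distance.jaccard(u, v))     # Jaccard dissimilarity for Boolean vectors
print(distance.cosine(a, b))      # cosine distance = 1 - cosine similarity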
