A Complete Guide to K Nearest Neighbor (KNN) in Machine Learning

Swagata Ashwani
2 min read · Jun 2, 2022

K Nearest Neighbor, or K-NN, is a supervised, non-parametric learning algorithm, which means it makes no assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs its work only at classification time.

KNN classifies new data points based on the K most similar records in the dataset.

Let’s take a look at the steps in the KNN algorithm:

  1. Choose a K, let’s say K= 3
  2. Choose a data point you want to classify
  3. Calculate the distance of all data points from this data point.
  4. Find the K nearest neighbours
  5. Pick the majority class

In the example above, we are trying to find which class the ? point should belong to: class blue or class orange. With K=3, we calculate the distance of all data points from ?, and take the nearest 3 data points. From the example, we can see that the majority of those neighbors belong to class orange, hence we classify the ? data point as class orange as well.
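A minimal from-scratch sketch of these five steps (assuming NumPy; the toy dataset, class labels, and function name below are made up for illustration):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 3: distance of every training point from the new point (Euclidean)
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 4: indices of the K nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 5: majority class among those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features, classes "blue" and "orange"
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array(["blue", "blue", "orange", "orange", "orange"])

print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))  # -> orange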

The KNN classifier class in Python (scikit-learn):

from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)

From the above class, we need to decide on two things:

n_neighbors: How to choose K?
metric: How to calculate distances?
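
A short usage sketch of this class, assuming scikit-learn is installed (the toy data mirrors the from-scratch example above and is purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array(["blue", "blue", "orange", "orange", "orange"])

# metric='minkowski' with p=2 is the ordinary Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)
knn.fit(X_train, y_train)

print(knn.predict([[4.5, 5.0]]))  # -> ['orange']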

How to choose K

Choosing the right number of neighbors (K) is very important.

Low K (like K=1): predictions based on only one data sample could be greatly impacted by noise in the data (outliers, mislabeling)

Large K: more robust to noise, but the nearest “neighborhood” can get too inclusive, breaking the “locality”, and a class with only a few samples in it will always be outvoted by the other classes

Rule of thumb in selecting K:
K = √n, where n is the number of data points
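
A quick sketch of this rule of thumb (with the common refinement of rounding to an odd K to avoid ties in binary classification; n below is an illustrative sample count):

import math

n = 400                   # number of training samples (illustrative)
k = round(math.sqrt(n))   # rule of thumb: K = sqrt(n)
if k % 2 == 0:
    k += 1                # prefer an odd K to avoid ties between two classes
print(k)                  # -> 21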

How to calculate distances

Another important factor for KNN is deciding the distance metric.

Possible distance metrics:

  1. Euclidean/Manhattan: These metrics are used for real-valued features
    For features of similar type: p = 2, Euclidean
    For features of mixed type (lengths, ages, salaries): p = 1, Manhattan (taxi-cab)
  2. Hamming: This metric is used for binary-valued features
    Number of positions where the values of two vectors differ
  3. Jaccard: This metric is used for Boolean-valued features
    Ratio of shared values to the total number of values of two vectors
  4. Cosine similarity: This metric is used for high-dimensional feature spaces
    The angle between two vectors (similarity irrespective of their magnitudes)
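
A small sketch of these metrics using SciPy's distance helpers (assuming SciPy is installed; the example vectors are made up):

from scipy.spatial import distance

a, b = [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]
print(distance.euclidean(a, b))   # real-valued features, p = 2
print(distance.cityblock(a, b))   # Manhattan (taxi-cab), p = 1

u, v = [1, 0, 1, 1], [1, 1, 0, 1]
print(distance.hamming(u, v))     # fraction of positions where the vectors differ
print(distance.jaccard(u, v))     # Jaccard dissimilarity for Boolean vectors
print(distance.cosine(a, b))      # cosine distance = 1 - cosine similarity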
