K-Nearest Neighbors (KNN)

Supervised learning model - training data has labels

Used for classification

Weighted K-Nearest Neighbors

Closer neighbors have a stronger influence on the prediction, since each neighbor’s vote is weighted by its distance

Algorithm

  1. Find the k nearest points to our new data point (k is a hyperparameter here)
  2. Take the weighted average of their labels based on their Euclidean distance
    • Could also use other kinds of distance (e.g. Manhattan distance, Hamming distance, cosine similarity), but Euclidean is most common
  3. Assign that label (see the sketch below)
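A minimal from-scratch sketch of these three steps in Python/NumPy. The function name and the inverse-distance weighting scheme are illustrative choices (other weighting schemes exist); with discrete labels, the "weighted average" becomes a weighted vote:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new with distance-weighted k-NN (Euclidean distance)."""
    # Step 1: Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 2: weight each neighbor's vote by inverse distance
    # (small epsilon avoids division by zero if a training point coincides with x_new)
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)
    # Step 3: assign the label with the largest total weight
    return max(votes, key=votes.get)
```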

Boundary conditions

  • What if k = 1?
    • Model will just say whatever is closest
    • Will almost be memorizing things; won’t be able to generalize
  • What if k = n (the number of training points)?
    • Treats all points as neighbors
    • Soon, more and more irrelevant points will contribute, and we will just predict the mode of the data (see the demo below)
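A quick way to see both extremes, assuming scikit-learn is available; the synthetic dataset and its parameters are arbitrary and only serve to illustrate the behavior:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Arbitrary synthetic dataset just to illustrate the two extremes
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, len(X_tr)):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:>3}: train acc={model.score(X_tr, y_tr):.2f}, "
          f"test acc={model.score(X_te, y_te):.2f}")
# k=1 memorizes the training set (perfect train accuracy, weaker test accuracy);
# k=len(X_tr) pulls every prediction toward the majority class.
```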

How to choose k

Try multiple values of k, graph the error rate, and pick the k with the lowest error or the point where the error curve flattens out (elbow method)

Or use a heuristic, e.g. k ≈ √n, where n is the number of data points
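A sketch of the elbow approach using scikit-learn cross-validation; the range of candidate k values and the 5-fold CV are arbitrary example choices:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def error_rate_vs_k(X, y, k_values=range(1, 31)):
    """Cross-validated error rate for each candidate k; plot and look for the elbow."""
    errors = []
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k, weights="distance")
        accuracy = cross_val_score(model, X, y, cv=5).mean()
        errors.append(1 - accuracy)
    plt.plot(list(k_values), errors, marker="o")
    plt.xlabel("k")
    plt.ylabel("cross-validated error rate")
    plt.show()
    return errors
```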

Spherical K-Nearest Neighbors

You can also predefine a sphere size (radius) and then let every training point inside the sphere vote

Saves you the trouble of ranking all the points to find the exact k nearest
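scikit-learn ships a radius-based variant of this idea as RadiusNeighborsClassifier (there it is called radius neighbors rather than spherical). A minimal usage sketch; the radius value and the outlier handling are example choices, and X_train, y_train, X_test are assumed to already exist:

```python
from sklearn.neighbors import RadiusNeighborsClassifier

# Every training point within `radius` of a query votes; queries with no
# neighbors inside the sphere fall back to the most frequent class.
model = RadiusNeighborsClassifier(radius=1.5, weights="distance",
                                  outlier_label="most_frequent")
model.fit(X_train, y_train)        # X_train, y_train assumed defined
predictions = model.predict(X_test)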

Pros and Cons of K-Nearest Neighbors

Pros:

  • Can be highly accurate: used in large-scale production systems in industry

Cons:

  • Instance-based model: must be able to store the entire training set, which could be extremely large
  • k must be tuned correctly
  • Sensitive to outliers
  • Searching for the nearest neighbors is very time-consuming (naively, linear in the number of training points for every query)