Wednesday, November 12, 2014

The power of k-nearest-neighbor searches

I came across the k-nearest-neighbor (k-NN) algorithm recently. Altough it's a relatively simple algorithm, its power still amazed me.

K-NN can be used to classify sets of data when the algorithm is fed with some examples of classifications.

In order to demonstrate that, I have written a perl script. The script creates two csv files (known.csv and unknown) and a third file: correct.txt. The k-NN algorithm will use known.csv to train its understanding of a classification. Then, it tries to guess a classification for each record in unknown.csv. For comparing purposes, correct.txt contains the classification for each record in unknown.csv.

Structure of known.csv

known.csv is a csv file in which each record consists of 11 numbers. The first number is the classifaction for the record. It is a integer between 1 and 4 inclusively. The remaining 10 numbers are floats between 0 and 1.

Structure of unknown.csv

In unknown.csv, each record consists of 10 floats between 0 and 1. They correspond to the remaining 10 numbers in known.csv. The classification for the records in unkown.csv is missing in the file - it is the task of the k-NN algorithm to determine this classification. However, for each record in unknown.csv, the correct classification is found in correct.txt

Values for the floats

A record's classification determines the value-ranges for the floats in the record, according to the following graphic:
The four classifications are represented by the four colors red, green, blue and orange. The Perl script generated eight floats. For range for the first float for the red classification is [0.15,0.45], the range for the first float for the green classification is [0.35,0.65] etc. Similarly, the range for the second value of the red classification is [0.55,0.85] and so on.

To make things a bit more complicated, two random values in the range [0,1] are added to the eight values resulting in 10 values. These two values can either be at the beginning, at the end or one at the beginning and the other at the end.


When I let the Perl script create 1000 unknown and known records, the following r script guesses more than 985 classifications right, most of the time.
library(FNN) known <- read.csv("known.csv" , header=FALSE) unknown <- read.csv("unknown.csv", header=FALSE) labels <- known[, 1] known <- known[,-1] results <- (0:4)[knn(known, unknown, labels, k = 10, algorithm="cover_tree")] write(results, file="guessed.txt", ncolumns=1)
Source code on github
I find that rather impressive

No comments:

Post a Comment