Distance measure for categorical attributes for k-Nearest Neighbor
For a class project, I'm working on the Kaggle competition "Don't Get Kicked!".
The project is to classify test data as good or bad buys of cars. There are 34 features, and the data is highly skewed. I made the following choices:
- Since the data is highly skewed (out of 73,000 instances, 64,000 are labeled bad buy and 9,000 are good buy), and since building a decision tree would overfit the data, I chose to use kNN (k-nearest neighbors).
After trying out kNN, I plan to try out perceptron and SVM techniques if kNN doesn't yield good results. Is my understanding of overfitting correct?
- Since some features are numeric, I can directly use the Euclidean distance as a measure, but the other attributes are categorical. To use these features aptly, I need to come up with my own distance measure. I read about Hamming distance, but I'm still unclear on how to merge the two distance measures so that each feature gets equal weight (see the sketch below).
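One common way to give every feature equal weight is to min-max scale the numeric columns first, then add one unit of distance per feature: a squared difference for numeric features and a 0/1 mismatch (Hamming) for categorical ones. A minimal sketch under that assumption; the example rows and column indices are hypothetical, not from the actual dataset:

```python
import numpy as np

def mixed_distance(a, b, numeric_idx, categorical_idx):
    """Distance mixing numeric and categorical features so each feature
    contributes equally: squared difference for (pre-scaled) numeric
    features, 0/1 mismatch (Hamming) for categorical ones."""
    num_part = sum((a[i] - b[i]) ** 2 for i in numeric_idx)
    cat_part = sum(a[i] != b[i] for i in categorical_idx)
    # Each numeric term lies in [0, 1] if the column was min-max scaled,
    # so it is on the same footing as each 0/1 categorical mismatch.
    return np.sqrt(num_part + cat_part)

# Hypothetical rows: [price, odometer, color, make]
x = [0.30, 0.55, "red", "ford"]
y = [0.25, 0.80, "blue", "ford"]
print(mixed_distance(x, y, numeric_idx=[0, 1], categorical_idx=[2, 3]))
```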
- Is there a way to find a good approximate value of k? I understand that it depends a lot on the use case and varies per problem. But if I'm taking a simple vote among the neighbors, how should I set the value of k? I'm currently trying out various values, such as 2, 3, 10, etc.
I researched around and found these links, but they were not helpful:
a) Metric for nearest neighbor, which says that finding your own distance measure is equivalent to "kernelizing", but I couldn't make sense of it.
b) Distance-independent approximation of kNN, which talks about R-trees, M-trees, etc., which I believe don't apply to my case.
c) Finding nearest neighbors using the Jaccard coefficient.
Please let me know if you need more information.
Since your data is unbalanced, you should either sample an equal number of good/bad instances (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
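A minimal sketch of the first option, randomly downsampling the majority class to the size of the minority class (a hypothetical helper, not a RapidMiner feature):

```python
import random

def downsample(records, labels, majority_label, seed=0):
    """Randomly drop majority-class records until both classes have the
    same count. Note this discards information, as mentioned above."""
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    kept = rng.sample(majority, len(minority)) + minority
    rng.shuffle(kept)
    return [records[i] for i in kept], [labels[i] for i in kept]
```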
You should use cross-validation to avoid overfitting. You might be using the term "overfitting" incorrectly here, though.
You should normalize the features so that they all have the same weight in the distance. By normalize I mean force the values to be between 0 and 1: to normalize something, subtract its minimum and divide by its range.
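For example, a minimal min-max scaler for one column (plain Python, no library assumed):

```python
def min_max_normalize(column):
    """Rescale a numeric column to [0, 1]: subtract the minimum and
    divide by the range (guarding against constant columns)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature carries no signal
    return [(v - lo) / (hi - lo) for v in column]

print(min_max_normalize([5, 10, 20]))  # [0.0, 0.333..., 1.0]
```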
The way to find the optimal value of k is to try every possible value of k (while cross-validating) and choose the value of k with the highest accuracy. If a merely "good" value of k is fine, you can use a genetic algorithm or similar to find it. Or you can try k in steps of 5 or 10, see which k leads to the best accuracy (say it's 55), then try steps of 1 near that "good value" (i.e., 50, 51, 52, ...); this may not find the optimum, but it's usually close. A sketch of this coarse-then-fine search follows below.
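A sketch of that coarse-then-fine search using scikit-learn's GridSearchCV (an assumption on my part; the question mentions RapidMiner, not Python). The synthetic data stands in for your normalized feature matrix and labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; replace with your own normalized features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Coarse pass: try k in steps of 5, scored by 5-fold cross-validation.
coarse = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(5, 56, 5))}, cv=5)
coarse.fit(X, y)
best = coarse.best_params_["n_neighbors"]

# Fine pass: steps of 1 around the coarse winner.
fine = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(max(1, best - 4), best + 5))},
                    cv=5)
fine.fit(X, y)
print(fine.best_params_, fine.best_score_)
```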