cluster analysis - K-modes clustering in R for categorical data with NAs


> dat <- data.frame(x=sample(letters[1:3],20,TRUE),y=sample(letters[7:9],20,TRUE),stringsAsFactors=FALSE)
> dat[c(1:5,9,17,20),1] <- NA; dat[c(8,11),2] <- NA
> dat
      x    y
1  <NA>    h
2  <NA>    i
3  <NA>    g
4  <NA>    h
5  <NA>    i
6     c    h
7     c    g
8     a <NA>
9  <NA>    g
10    c    g
11    b <NA>
12    a    g
13    a    g
14    a    g
15    b    i
16    a    g
17 <NA>    h
18    a    i
19    a    g
20 <NA>    g

I'm trying to cluster categorical data using klaR::kmodes, but I have trouble dealing with these NAs.

A workaround I came up with is treating the NAs as a new category:

> dat[c(1:5,9,17,20),1] <- "NA"; dat[c(8,11),2] <- "NA"
> (cl <- kmodes(dat,modes=dat[c(6,7),]))
K-modes clustering with 2 clusters of sizes 11, 9

Cluster modes:
  x    y
1 "NA" "h"
2 "a"  "g"

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 1  1  1  1  1  1  2  2  1  2  1  2  2  2  1  2  1  2  2  1

Within cluster simple-matching distance by cluster:
[1] 10  4

Available components:
[1] "cluster"    "size"       "modes"      "withindiff" "iterations" "weighted"

This is flawed: kmodes by default uses the simple-matching distance to determine the dissimilarity of two objects, so an NA in one row and an NA in another row are counted as a match.
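To make the problem concrete, here is a minimal base R sketch of the simple-matching dissimilarity applied to a few rows of dat after the recoding above. The helper `simple_matching` is a hypothetical name of mine and not a klaR function; it only illustrates the idea of counting mismatching attributes.

```r
# Simple-matching dissimilarity: the number of attributes on which two
# observations disagree. `simple_matching` is a hypothetical helper,
# not part of klaR.
simple_matching <- function(a, b) sum(a != b)

# Rows 1 and 4 of dat after recoding the missing x values as the string "NA"
row1 <- c(x = "NA", y = "h")
row4 <- c(x = "NA", y = "h")
# Row 6 of dat, which has an observed x value
row6 <- c(x = "c", y = "h")

d_14 <- simple_matching(row1, row4)  # both attributes "match"
d_16 <- simple_matching(row1, row6)  # x differs, y matches
print(c(d_14 = d_14, d_16 = d_16))
```

Rows 1 and 4 come out as identical even though nothing is known about their x values, which is exactly the behaviour I'm worried about.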

Another thought is to treat every NA as different, i.e. in my data there are 8 NAs in x, so can I treat them as 8 different categories?

> dat[c(1:5,9,17,20),1] <- paste("NA",1:8,sep=""); dat[c(8,11),2] <- paste("NA",1:2,sep="")
> (cl <- kmodes(dat,modes=dat[c(6,7),]))
K-modes clustering with 2 clusters of sizes 10, 10

Cluster modes:
  x   y
1 "c" "h"
2 "a" "g"

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 1  1  2  1  1  1  1  2  2  1  1  2  2  2  1  2  1  2  2  2

Within cluster simple-matching distance by cluster:
[1] 13  5

Available components:
[1] "cluster"    "size"       "modes"      "withindiff" "iterations" "weighted"
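As far as the pairwise distance goes, giving every NA its own label should behave like a simple-matching distance in which a missing value never matches anything. The sketch below checks this on rows 1 and 4 in base R; `simple_matching_na` is a hypothetical helper of mine, not something klaR provides, and I'm assuming nothing about how kmodes itself handles real NAs.

```r
# Hypothetical helper (not part of klaR): simple matching in which a
# missing value never matches anything, including another missing value.
simple_matching_na <- function(a, b) sum(is.na(a) | is.na(b) | a != b)

# Rows 1 and 4 of dat with real NAs in x
row1 <- c(x = NA, y = "h")
row4 <- c(x = NA, y = "h")

# The same rows after the unique-label recoding (row 1 gets "NA1", row 4 "NA4")
row1_lab <- c(x = "NA1", y = "h")
row4_lab <- c(x = "NA4", y = "h")

d_na  <- simple_matching_na(row1, row4)  # x counted as a mismatch
d_lab <- sum(row1_lab != row4_lab)       # plain simple matching on the labels
print(c(d_na = d_na, d_lab = d_lab))
```

Both distances agree here, so the recoding at least removes the spurious NA-to-NA matches from the distance.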

Any comments or new solutions are appreciated.

