cluster analysis - K-modes clustering in R for categorical data with NAs
    > dat <- data.frame(x=sample(letters[1:3],20,TRUE),y=sample(letters[7:9],20,TRUE),stringsAsFactors=FALSE)
    > dat[c(1:5,9,17,20),1] <- NA;dat[c(8,11),2] <- NA
    > dat
          x    y
    1  <NA>    h
    2  <NA>    i
    3  <NA>    g
    4  <NA>    h
    5  <NA>    i
    6     c    h
    7     c    g
    8     a <NA>
    9  <NA>    g
    10    c    g
    11    b <NA>
    12    a    g
    13    a    g
    14    a    g
    15    b    i
    16    a    g
    17 <NA>    h
    18    a    i
    19    a    g
    20 <NA>    g
I'm trying to cluster categorical data using klaR::kmodes, but I have trouble dealing with these NAs.
A workaround I came up with is treating the NAs as a new category:
    > dat[c(1:5,9,17,20),1] <- "NA";dat[c(8,11),2] <- "NA"
    > (cl <- kmodes(dat,modes=dat[c(6,7),]))
    K-modes clustering with 2 clusters of sizes 11, 9

    Cluster modes:
      x    y
    1 "NA" "h"
    2 "a"  "g"

    Clustering vector:
     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
     1  1  1  1  1  1  2  2  1  2  1  2  2  2  1  2  1  2  2  1

    Within cluster simple-matching distance by cluster:
    [1] 10  4

    Available components:
    [1] "cluster"    "size"       "modes"      "withindiff" "iterations" "weighted"
But this is flawed: since kmodes by default uses the simple-matching distance to determine the dissimilarity of two objects, an NA in one row will count as a match with an NA in another row.
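To make the concern concrete, here is a minimal sketch of how I understand the simple-matching dissimilarity: the number of attributes on which two records differ. `simple_matching` is a hypothetical helper written only for illustration, not a klaR function, and the two records are made up.

```r
# Simple-matching dissimilarity: count the attributes on which two
# records differ. Hypothetical helper, not part of klaR.
simple_matching <- function(a, b) sum(a != b)

# Two records whose missing x value was recoded as the string "NA"
r1 <- c(x = "NA", y = "h")
r2 <- c(x = "NA", y = "g")

# The recoded strings compare equal, so only y contributes
d_same_label <- simple_matching(r1, r2)

# With a distinct label per missing value, x contributes as well
r1u <- c(x = "NA1", y = "h")
r2u <- c(x = "NA2", y = "g")
d_unique_label <- simple_matching(r1u, r2u)

print(c(same_label = d_same_label, unique_label = d_unique_label))
```

With a shared "NA" label the two records look closer than they would if their missing values were treated as unknown and possibly different.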
Another thought is to treat every NA as different, i.e. in my data there are 8 NAs in x, so can I treat them as 8 different categories?
    > dat[c(1:5,9,17,20),1] <- paste("NA",1:8,sep=""); dat[c(8,11),2] <- paste("NA",1:2,sep="")
    > (cl <- kmodes(dat,modes=dat[c(6,7),]))
    K-modes clustering with 2 clusters of sizes 10, 10

    Cluster modes:
      x   y
    1 "c" "h"
    2 "a" "g"

    Clustering vector:
     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
     1  1  2  1  1  1  1  2  2  1  1  2  2  2  1  2  1  2  2  2

    Within cluster simple-matching distance by cluster:
    [1] 13  5

    Available components:
    [1] "cluster"    "size"       "modes"      "withindiff" "iterations" "weighted"
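The row indices in that recoding are hard-coded. A more general version could locate the missing values with `is.na()` instead. This is a minimal sketch assuming character columns; `label_nas_uniquely` is a hypothetical helper, and `toy` is a made-up data frame used only for illustration.

```r
# Hypothetical helper (not part of klaR): give every NA in a character
# vector its own label, e.g. "NA1", "NA2", ...
label_nas_uniquely <- function(v, prefix = "NA") {
  idx <- which(is.na(v))
  v[idx] <- paste0(prefix, seq_along(idx))
  v
}

# Small made-up data frame, for illustration only
toy <- data.frame(x = c(NA, "c", NA, "a"),
                  y = c("h", NA, "g", "g"),
                  stringsAsFactors = FALSE)

# Apply the helper column by column, keeping the data frame structure
toy[] <- lapply(toy, label_nas_uniquely)
print(toy)
```

Labels are numbered per column, which mirrors the manual recoding of x and y.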
Any comments or new solutions are appreciated.