Performing a simple lookup using 2 data frames in R
In R, I have 2 data frames a and b as follows.
Data frame a:
    name  age  city     gender  income  company  ...
    jxx   21   chicago  m       20k     xyz      ...
    cxx   25   newyork  m       30k     pqr      ...
    cxx   26   chicago  m       NA      zzz      ...
Data frame b:
    age  city     gender  avg income  avg height  avg weight  ...
    21   chicago  m       30k         ...         ...         ...
    25   newyork  m       40k         ...         ...         ...
    26   chicago  m       50k         ...         ...         ...
I want to fill the missing values in data frame a from data frame b.
For example, for the third row in data frame a I can substitute the avg income from data frame b instead of the exact income. I don't want to merge these 2 data frames; instead I want to perform a look-up operation using the age, city, and gender columns.
    library(data.table);

    ## generate data
    set.seed(5L);
    nk <- 6L; pa <- 0.8; pb <- 0.2;
    ## unique key combinations shared by both tables
    keydf <- unique(data.frame(age=sample(18:65,nk,TRUE),
        city=sample(c('chicago','newyork'),nk,TRUE),
        gender=sample(c('m','f'),nk,TRUE),stringsAsFactors=FALSE));
    no <- nrow(keydf)-1L;
    ## a drops the first key row, b drops the second; both are then row-shuffled
    af <- cbind(keydf[-1L,],name=sample(paste0(letters,letters,letters),no,TRUE),
        income=sample(c(NA,paste0(seq(20L,90L,10L),'k')),no,TRUE,c(pa,rep((1-pa)/8,8L))),
        stringsAsFactors=FALSE)[sample(seq_len(no)),];
    bf <- cbind(keydf[-2L,],
        `avg income`=sample(c(NA,paste0(seq(20L,90L,10L),'k')),no,TRUE,c(pb,rep((1-pb)/8,8L))),
        stringsAsFactors=FALSE)[sample(seq_len(no)),];
    at <- as.data.table(af);
    bt <- as.data.table(bf);
    at;
    ##    age    city gender name income
    ## 1:  50 newyork      f  ooo     NA
    ## 2:  23 chicago      m  sss     NA
    ## 3:  62 newyork      m  vvv     NA
    ## 4:  51 chicago      f  fff    90k
    ## 5:  31 chicago      m  xxx     NA
    bt;
    ##    age    city gender avg income
    ## 1:  62 newyork      m         NA
    ## 2:  51 chicago      f        60k
    ## 3:  31 chicago      m        50k
    ## 4:  27 newyork      m         NA
    ## 5:  23 chicago      m        60k
I generated some random test data for demonstration purposes. I'm quite happy with the result I got from seed 5, which covers many cases:
- one row in a which doesn't join with b (50/newyork/f).
- one row in b which doesn't join with a (27/newyork/m).
- two rows in a which join with b and should result in a replacement of the NA in a with a non-NA value from b (23/chicago/m and 31/chicago/m).
- one row in a which joins with b but has NA in b, so it shouldn't affect the NA in a (62/newyork/m).
- one row in a which would join with b but has a non-NA value in a, so it shouldn't take the value from b (I assumed you want this behavior) (51/chicago/f). The value in a (90k) differs from the value in b (60k), so we can verify this behavior.
I also intentionally scrambled the rows of a and b to ensure that we join them correctly, regardless of incoming row order.
    ## data.table solution
    keys <- c('age','city','gender');
    ## fill only the rows of a whose income is NA, looking the value up in b
    at[is.na(income),income:=bt[.SD,on=keys,`avg income`]];
    at;
    ##    age    city gender name income
    ## 1:  50 newyork      f  ooo     NA
    ## 2:  23 chicago      m  sss    60k
    ## 3:  62 newyork      m  vvv     NA
    ## 4:  51 chicago      f  fff    90k
    ## 5:  31 chicago      m  xxx    50k
In the above I filter for NA values in a first, then join in the j argument on the key columns, and assign the source column to the target column in place using the data.table := syntax.
Note that in the data.table world X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used bt[.SD] instead of (the more natural expectation of) .SD[bt]. We need a left join on .SD because the result of the join index expression is assigned in place to the target column, and so the RHS of the assignment must be a full vector correspondent to the target column.
You can repeat the in-place assignment line for each column you want to replace.
    ## base R solution
    keys <- c('age','city','gender');
    ## inner join on the keys, keeping only the row indexes of a (ai) and b (bi)
    m <- merge(cbind(af[keys],ai=seq_len(nrow(af))),cbind(bf[keys],bi=seq_len(nrow(bf))))[c('ai','bi')];
    m;
    ##   ai bi
    ## 1  2  5
    ## 2  5  3
    ## 3  4  2
    ## 4  3  1
    ## joined pairs whose income in a is NA
    mi <- which(is.na(af$income[m$ai]));
    af$income[m$ai[mi]] <- bf$`avg income`[m$bi[mi]];
    af;
    ##   age    city gender name income
    ## 2  50 newyork      f  ooo   <NA>
    ## 5  23 chicago      m  sss    60k
    ## 3  62 newyork      m  vvv   <NA>
    ## 6  51 chicago      f  fff    90k
    ## 4  31 chicago      m  xxx    50k
I guess I was feeling a little bit creative here, so for a base R solution I did something that's a little unusual, and which I've never done before. I column-bound a synthesized row index column onto the key-column subset of each of the a and b data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.
For the modification, I precompute the subset of the join pairs for which the row in a satisfies the replacement condition, e.g. that its income value is NA for the income replacement. We can then subset the join pair table to those rows, and do a direct assignment from b to a to carry out the replacement.
As before, you can repeat the assignment line for every column you want to replace.
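Repeating the assignment for several columns can also be written as a loop over a target-to-source column map. This is a sketch under stated assumptions: the toy data frames below, the height and avg height columns, and the colmap name are all hypothetical and only serve to illustrate the idea.

```r
# Hypothetical toy data; 'height' and 'avg height' are made-up columns.
af <- data.frame(age = c(21L, 25L, 26L),
                 city = c("chicago", "newyork", "chicago"),
                 gender = c("m", "m", "m"),
                 income = c("20k", NA, NA),
                 height = c(NA, 170, NA),
                 stringsAsFactors = FALSE)
bf <- data.frame(age = c(26L, 21L, 25L),
                 city = c("chicago", "chicago", "newyork"),
                 gender = c("m", "m", "m"),
                 `avg income` = c("50k", "30k", NA),
                 `avg height` = c(175, 180, 172),
                 stringsAsFactors = FALSE, check.names = FALSE)
keys <- c("age", "city", "gender")

# Precompute the matched row-index pairs once (inner join on the keys).
m <- merge(cbind(af[keys], ai = seq_len(nrow(af))),
           cbind(bf[keys], bi = seq_len(nrow(bf))))[c("ai", "bi")]

# Map each target column in af to its source column in bf (assumed names).
colmap <- c(income = "avg income", height = "avg height")
for (target in names(colmap)) {
  # Joined pairs whose target value in af is currently NA.
  mi <- which(is.na(af[[target]][m$ai]))
  af[[target]][m$ai[mi]] <- bf[[colmap[[target]]]][m$bi[mi]]
}
af
```

The join is computed only once, so adding another column to fill means adding one entry to the map rather than another merge.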