dataframe - Performing simple lookup using 2 data frames in R

In R, I have 2 data frames, a and b, as follows:

data-frame a:

    name      age    city       gender   income    company   ...
    jxx       21     chicago    m        20k       xyz       ...
    cxx       25     newyork    m        30k       pqr       ...
    cxx       26     chicago    m        NA        zzz       ...

data-frame b:

    age    city       gender    avg income  avg height  avg weight   ...
    21     chicago    m         30k         ...         ...          ...
    25     newyork    m         40k         ...         ...          ...
    26     chicago    m         50k         ...         ...          ...

I want to fill in the missing values in data frame a using data frame b.

For example, for the third row in data frame a, I can substitute the avg income from data frame b in place of the exact income. I don't want to merge these 2 data frames; instead, I want to perform a look-up operation using the age, city, and gender columns.

    library(data.table);

    ## generate data
    set.seed(5L);
    nk <- 6L; pa <- 0.8; pb <- 0.2;
    keydf <- unique(data.frame(age=sample(18:65,nk,TRUE),city=sample(c('chicago','newyork'),nk,TRUE),gender=sample(c('m','f'),nk,TRUE),stringsAsFactors=FALSE));
    no <- nrow(keydf)-1L;
    af <- cbind(keydf[-1L,],name=sample(paste0(letters,letters,letters),no,TRUE),income=sample(c(NA,paste0(seq(20L,90L,10L),'k')),no,TRUE,c(pa,rep((1-pa)/8,8L))),stringsAsFactors=FALSE)[sample(seq_len(no)),];
    bf <- cbind(keydf[-2L,],`avg income`=sample(c(NA,paste0(seq(20L,90L,10L),'k')),no,TRUE,c(pb,rep((1-pb)/8,8L))),stringsAsFactors=FALSE)[sample(seq_len(no)),];
    at <- as.data.table(af);
    bt <- as.data.table(bf);
    at;
    ##    age    city gender name income
    ## 1:  50 newyork      f  ooo     NA
    ## 2:  23 chicago      m  sss     NA
    ## 3:  62 newyork      m  vvv     NA
    ## 4:  51 chicago      f  fff    90k
    ## 5:  31 chicago      m  xxx     NA
    bt;
    ##    age    city gender avg income
    ## 1:  62 newyork      m         NA
    ## 2:  51 chicago      f        60k
    ## 3:  31 chicago      m        50k
    ## 4:  27 newyork      m         NA
    ## 5:  23 chicago      m        60k

I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, as it covers many cases:

One row in a doesn't join with b (50/newyork/f).
One row in b doesn't join with a (27/newyork/m).
Two rows join, and should result in a replacement of the NA in a with a non-NA value from b (23/chicago/m and 31/chicago/m).
One row joins but has NA in b, so it shouldn't affect the NA in a (62/newyork/m).
One row joins, but has a non-NA value in a, so it shouldn't take the value from b (I assumed you want this behavior) (51/chicago/f). The value in a (90k) differs from the value in b (60k), so we can verify this behavior.

And I intentionally scrambled the rows of a and b to ensure that we join them correctly, regardless of incoming row order.
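The join cases listed above can be checked directly. The following is a minimal sketch, assuming the two printed tables are typed in literally instead of being regenerated; the names `only_a`, `only_b`, and `both` are made up for this illustration.

```r
library(data.table)

# The two tables as printed above, entered by hand.
at <- data.table(
  age    = c(50L, 23L, 62L, 51L, 31L),
  city   = c("newyork", "chicago", "newyork", "chicago", "chicago"),
  gender = c("f", "m", "m", "f", "m"),
  name   = c("ooo", "sss", "vvv", "fff", "xxx"),
  income = c(NA, NA, NA, "90k", NA)
)
bt <- data.table(
  age    = c(62L, 51L, 31L, 27L, 23L),
  city   = c("newyork", "chicago", "chicago", "newyork", "chicago"),
  gender = c("m", "f", "m", "m", "m"),
  `avg income` = c(NA, "60k", "50k", NA, "60k")
)
keys <- c("age", "city", "gender")

# Anti-join: rows of at that have no partner in bt.
only_a <- at[!bt, on = keys]
# Anti-join the other way: rows of bt that have no partner in at.
only_b <- bt[!at, on = keys]
# Inner join: only the key combinations present in both tables.
both <- at[bt, on = keys, nomatch = 0L]

print(only_a)
print(only_b)
print(both)
```

The inner join puts income and avg income side by side, which makes it easy to see which replacements should and should not happen.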

    ## data.table solution
    keys <- c('age','city','gender');
    at[is.na(income),income:=bt[.SD,on=keys,`avg income`]];
    at;
    ##    age    city gender name income
    ## 1:  50 newyork      f  ooo     NA
    ## 2:  23 chicago      m  sss    60k
    ## 3:  62 newyork      m  vvv     NA
    ## 4:  51 chicago      f  fff    90k
    ## 5:  31 chicago      m  xxx    50k

In the above I filter for NA values in a first, then join in the j argument on the key columns, and assign the source column to the target column in-place using the data.table := syntax.
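Here is the same filter, join, and assign sequence as a self-contained sketch. It assumes the printed tables are entered by hand, since the output of set.seed() plus sample() can differ between R versions; the comments map each part of the call to the steps just described.

```r
library(data.table)

# The two tables as printed above, entered by hand.
at <- data.table(
  age    = c(50L, 23L, 62L, 51L, 31L),
  city   = c("newyork", "chicago", "newyork", "chicago", "chicago"),
  gender = c("f", "m", "m", "f", "m"),
  name   = c("ooo", "sss", "vvv", "fff", "xxx"),
  income = c(NA, NA, NA, "90k", NA)
)
bt <- data.table(
  age    = c(62L, 51L, 31L, 27L, 23L),
  city   = c("newyork", "chicago", "chicago", "newyork", "chicago"),
  gender = c("m", "f", "m", "m", "m"),
  `avg income` = c(NA, "60k", "50k", NA, "60k")
)
keys <- c("age", "city", "gender")

# i:  keep only the rows of at whose income is missing.
# j:  look those rows (.SD) up in bt on the key columns and pull `avg income`;
#     := then writes the looked-up values into income in place.
at[is.na(income), income := bt[.SD, on = keys, `avg income`]]

print(at)
```

This sketch relies on the key combinations in bt being unique, so that the lookup returns exactly one value per filtered row of at.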

Note that in the data.table world X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used bt[.SD] instead of (the more natural expectation of) .SD[bt]. We need a left join on .SD because the result of the join index expression will be assigned in-place to the target column, and so the RHS of the assignment must be a full vector correspondent to the target column.
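A small sketch of the join direction may help here. The tables `x` and `y` and their columns `k`, `xv`, and `yv` are hypothetical and exist only for this illustration.

```r
library(data.table)

# Two toy tables that overlap on k = 2 and k = 3 only.
x <- data.table(k = c(1L, 2L, 3L), xv = c("a", "b", "c"))
y <- data.table(k = c(2L, 3L, 4L), yv = c("B", "C", "D"))

# x[y]: one row per row of y; the unmatched y row (k = 4) gets NA in xv.
xy <- x[y, on = "k"]
# y[x]: one row per row of x; the unmatched x row (k = 1) gets NA in yv.
yx <- y[x, on = "k"]

print(xy)
print(yx)
```

Because y[x] always returns one row per row of x (when the keys in y are unique), its columns line up with x and can be assigned back into x.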

You can repeat the in-place assignment line for each column you want to replace.
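If there are many columns, the repetition can be put in a loop. This is a sketch of one way to do it, not the approach used above: it swaps the := line for data.table's set() so that column names can be passed as strings. The toy data, the height column, and the `pairs` mapping are assumptions made for this example.

```r
library(data.table)

# Hypothetical toy data modelled on the question's tables.
at <- data.table(
  age    = c(21L, 25L, 26L),
  city   = c("chicago", "newyork", "chicago"),
  gender = "m",
  income = c("20k", NA, NA),
  height = c(NA, 175, NA)
)
bt <- data.table(
  age    = c(21L, 25L, 26L),
  city   = c("chicago", "newyork", "chicago"),
  gender = "m",
  `avg income` = c("30k", "40k", "50k"),
  `avg height` = c(170, 172, 174)
)
keys <- c("age", "city", "gender")

# Target column in at -> source column in bt (hypothetical mapping).
pairs <- c(income = "avg income", height = "avg height")

for (target in names(pairs)) {
  src  <- pairs[[target]]
  rows <- which(is.na(at[[target]]))          # rows that need filling
  vals <- bt[at[rows], on = keys][[src]]      # one looked-up value per such row
  set(at, i = rows, j = target, value = vals) # in-place assignment
}

print(at)
```

As with the single-column version, this assumes the key combinations in bt are unique.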

    ## base R solution
    keys <- c('age','city','gender');
    m <- merge(cbind(af[keys],ai=seq_len(nrow(af))),cbind(bf[keys],bi=seq_len(nrow(bf))))[c('ai','bi')];
    m;
    ##   ai bi
    ## 1  2  5
    ## 2  5  3
    ## 3  4  2
    ## 4  3  1
    mi <- which(is.na(af$income[m$ai]));
    af$income[m$ai[mi]] <- bf$`avg income`[m$bi[mi]];
    af;
    ##   age    city gender name income
    ## 2  50 newyork      f  ooo   <NA>
    ## 5  23 chicago      m  sss    60k
    ## 3  62 newyork      m  vvv   <NA>
    ## 6  51 chicago      f  fff    90k
    ## 4  31 chicago      m  xxx    50k

I guess I was feeling a little bit creative here, so for a base R solution I did something that's a little unusual, and which I've never done before. I column-bound a synthesized row index column onto the key-column subset of each of the a and b data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.

For the modification, I precompute the subset of the join pairs for which the row in a satisfies the replacement condition, e.g. that its income value is NA for the income replacement. We can then subset the join pair table for those rows, and do a direct assignment from b to a to carry out the replacement.

As before, you can repeat the assignment line for every column you want to replace.
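The base R steps can also be wrapped in a small helper so the pair table is computed once and reused for every column. This is only a sketch: the function name `fill_from_lookup`, the `pairs` argument, and the toy data with a height column are all invented for this illustration.

```r
# Hypothetical helper: fill NA values in a from matching rows of b.
# pairs maps target column names in a to source column names in b.
fill_from_lookup <- function(a, b, keys, pairs) {
  # Inner-join the key columns only, carrying each table's row numbers along.
  m <- merge(cbind(a[keys], ai = seq_len(nrow(a))),
             cbind(b[keys], bi = seq_len(nrow(b))))[c("ai", "bi")]
  for (target in names(pairs)) {
    src <- pairs[[target]]
    # Joined pairs whose row in a is missing the target value.
    mi <- which(is.na(a[[target]][m$ai]))
    # Direct assignment from b to a for exactly those pairs.
    a[[target]][m$ai[mi]] <- b[[src]][m$bi[mi]]
  }
  a
}

# Hypothetical toy data modelled on the question's tables.
af <- data.frame(
  age    = c(21L, 25L, 26L),
  city   = c("chicago", "newyork", "chicago"),
  gender = "m",
  income = c("20k", NA, NA),
  height = c(NA, 175, NA),
  stringsAsFactors = FALSE
)
bf <- data.frame(
  age    = c(21L, 25L, 26L),
  city   = c("chicago", "newyork", "chicago"),
  gender = "m",
  `avg income` = c("30k", "40k", "50k"),
  `avg height` = c(170, 172, 174),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

filled <- fill_from_lookup(af, bf, c("age", "city", "gender"),
                           c(income = "avg income", height = "avg height"))
print(filled)
```

Unlike the data.table version, this returns a modified copy and leaves af itself untouched.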
