python - Pandas dataframe grouping by multiple columns and dropping duplicate rows
I am trying to do a task (in bioinformatics, TCGA data) using a dataframe of the following form:
    df = pd.DataFrame({
        'id': ['tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001',
               'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002',
               'tcga-ab-0003', 'tcga-ab-0002'],
        'reference': ['hg19', 'hg18', 'hg19', 'grch37', 'hg18', 'hg19',
                      'grch37', 'hg19', 'grch37', 'grch37'],
        'sampletype': ['tumor', 'tumor', 'normal', 'normal', 'tumor', 'normal',
                       'normal', 'tumor', 'tumor', 'tumor'],
    })
Which looks like:

                 id reference sampletype
    0  tcga-ab-0001      hg19      tumor
    1  tcga-ab-0001      hg18      tumor
    2  tcga-ab-0001      hg19     normal
    3  tcga-ab-0001    grch37     normal
    4  tcga-ab-0002      hg18      tumor
    5  tcga-ab-0002      hg19     normal
    6  tcga-ab-0002    grch37     normal
    7  tcga-ab-0002      hg19      tumor
    8  tcga-ab-0003    grch37      tumor
    9  tcga-ab-0002    grch37      tumor
I am trying to match pairs of rows if they have the same 'reference' and a different 'sampletype'. The result would be a new dataframe of the following form:
    tumor                                        normal
    index  id            reference  sampletype   index  id            reference  sampletype
    0      tcga-ab-0001  hg19       tumor        2      tcga-ab-0001  hg19       normal
    7      tcga-ab-0002  hg19       tumor        5      tcga-ab-0002  hg19       normal
    9      tcga-ab-0002  grch37     tumor        6      tcga-ab-0002  grch37     normal
Now I want to drop duplicate ids, with priority according to the list [grch37, hg19, hg18]. If, for example, both hg19 and hg18 exist for the same id, I keep hg19. The result should be the following:
    tumor                                        normal
    index  id            reference  sampletype   index  id            reference  sampletype
    0      tcga-ab-0001  hg19       tumor        2      tcga-ab-0001  hg19       normal
    9      tcga-ab-0002  grch37     tumor        6      tcga-ab-0002  grch37     normal
Is there a way to do this with groupby or another pandas function?
Thanks!
It is still not 100% clear to me what the desired output is, but this seems to do the trick based on my understanding.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'id': ['tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001',
               'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002',
               'tcga-ab-0003', 'tcga-ab-0002', 'tcga-ab-0001', 'tcga-ab-0001'],
        'reference': ['hg19', 'hg18', 'hg19', 'grch37', 'hg18', 'hg19',
                      'grch37', 'hg19', 'grch37', 'grch37', 'grch37', 'grch37'],
        'sampletype': ['tumor', 'tumor', 'normal', 'normal', 'tumor', 'normal',
                       'normal', 'tumor', 'tumor', 'tumor', 'normal', 'tumor'],
    })
This is longer than the original example and tests having redundant candidate rows.
                  id reference sampletype
    0   tcga-ab-0001      hg19      tumor
    1   tcga-ab-0001      hg18      tumor
    2   tcga-ab-0001      hg19     normal
    3   tcga-ab-0001    grch37     normal
    4   tcga-ab-0002      hg18      tumor
    5   tcga-ab-0002      hg19     normal
    6   tcga-ab-0002    grch37     normal
    7   tcga-ab-0002      hg19      tumor
    8   tcga-ab-0003    grch37      tumor
    9   tcga-ab-0002    grch37      tumor
    10  tcga-ab-0001    grch37     normal
    11  tcga-ab-0001    grch37      tumor
Now create a temp df that may have "redundant" rows.
    # build the sorted df: first-level filtering
    # keep only (id, reference) groups that contain both a tumor and a normal sample
    df_2 = (df.groupby(['id', 'reference'])
              .filter(lambda x: set(x.sampletype) == {'tumor', 'normal'})
              .drop_duplicates(['id', 'reference', 'sampletype'])
              .sort_values(['id', 'reference', 'sampletype']))

    # after dropping dups and sorting, the sampletype column must alternate: normal, tumor, normal...
    # break into 2 pieces for a horizontal concat
    left = df_2.iloc[np.arange(0, df_2.shape[0], 2)].copy()
    right = df_2.iloc[np.arange(1, df_2.shape[0], 2)].copy()

    # reindex by id so pd.concat can match rows
    left['old_index'] = left.index.values
    left.index = left['id']
    right['old_index'] = right.index.values
    right.index = right['id']
    right.columns = [c + '_2' for c in right.columns]  # rename the right side columns so we can groupby(['id'])

    # horizontal concat
    temp = pd.concat([left, right], axis=1)  # possible duplicates for each unique (id, reference) tuple
    temp.index = np.arange(temp.shape[0])
    temp

                 id reference sampletype  old_index          id_2 reference_2 sampletype_2  old_index_2
    0  tcga-ab-0001    grch37     normal          3  tcga-ab-0001      grch37        tumor           11
    1  tcga-ab-0001      hg19     normal          2  tcga-ab-0001        hg19        tumor            0
    2  tcga-ab-0002    grch37     normal          6  tcga-ab-0002      grch37        tumor            9
    3  tcga-ab-0002      hg19     normal          5  tcga-ab-0002        hg19        tumor            7
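The alternating-row split above relies on every (id, reference) group contributing exactly one normal and one tumor row after de-duplication. As a rough sketch of an alternative, the pairing can also be written as an explicit join between the tumor rows and the normal rows. This assumes the question's column names and a default integer index; `pair_samples` and the `_tumor` / `_normal` suffixes are hypothetical names chosen for illustration, not part of pandas or of the original code.

```python
import pandas as pd


def pair_samples(df):
    """Pair tumor and normal rows that share the same id and reference.

    Hypothetical helper; assumes columns 'id', 'reference', 'sampletype'
    and a default (unnamed) integer index.
    """
    # Keep the original row labels as a column and drop repeated samples.
    d = (df.reset_index()
           .rename(columns={'index': 'old_index'})
           .drop_duplicates(['id', 'reference', 'sampletype']))
    tumor = d[d['sampletype'] == 'tumor']
    normal = d[d['sampletype'] == 'normal']
    # An inner join keeps only (id, reference) keys present on both sides.
    return tumor.merge(normal, on=['id', 'reference'],
                       suffixes=('_tumor', '_normal'))


# The 10-row frame from the question.
df = pd.DataFrame({
    'id': ['tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001',
           'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002',
           'tcga-ab-0003', 'tcga-ab-0002'],
    'reference': ['hg19', 'hg18', 'hg19', 'grch37', 'hg18', 'hg19',
                  'grch37', 'hg19', 'grch37', 'grch37'],
    'sampletype': ['tumor', 'tumor', 'normal', 'normal', 'tumor', 'normal',
                   'normal', 'tumor', 'tumor', 'tumor'],
})

pairs = pair_samples(df)
print(pairs[['old_index_tumor', 'old_index_normal', 'id', 'reference']])
```

On the question's data this pairs original rows 0/2, 7/5 and 9/6, which is the intermediate table the question asks for. The join does not depend on sort order, so it should keep working if a group ever has an unexpected number of rows.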
If I understand correctly, you want to keep only 1 row for each id, choosing them in order of priority = ['grch37', 'hg19', 'hg18']
    # second-level filtering using the priority list
    priority = ['grch37', 'hg19', 'hg18']

    # rank each row by the position of its reference in the priority list (0 = best)
    rank = temp['reference'].map({ref: i for i, ref in enumerate(priority)})

    # for each id, keep the row whose reference has the best (lowest) rank
    final = temp.loc[rank.groupby(temp['id']).idxmin()]
    final.index = np.arange(final.shape[0])
This yields my understanding of the final desired output. Note: it is different from the original example since I expanded the input df.
    final

                 id reference sampletype  old_index          id_2 reference_2 sampletype_2  old_index_2
    0  tcga-ab-0001    grch37     normal          3  tcga-ab-0001      grch37        tumor           11
    1  tcga-ab-0002    grch37     normal          6  tcga-ab-0002      grch37        tumor            9
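Since the question phrases the last step as "dropping duplicate ids", here is a sketch of the same priority selection written as a sort followed by `drop_duplicates`. It assumes a paired frame with 'id' and 'reference' columns and pandas 1.1 or later (for the `key` argument of `sort_values`); `keep_best_reference`, `tumor_index` and `normal_index` are hypothetical names used only for this illustration.

```python
import pandas as pd


def keep_best_reference(pairs, priority=('grch37', 'hg19', 'hg18')):
    """Keep one row per id, preferring references earlier in `priority`.

    Hypothetical helper; references missing from `priority` map to NaN
    and therefore sort last.
    """
    rank = {ref: i for i, ref in enumerate(priority)}
    return (pairs
            # Best-ranked references first; mergesort is stable, so ties
            # keep their original relative order.
            .sort_values('reference', key=lambda s: s.map(rank), kind='mergesort')
            # The first row seen for each id is now its best reference.
            .drop_duplicates('id')
            .sort_values('id'))


# The three pairs from the question's intermediate table.
pairs = pd.DataFrame({
    'id': ['tcga-ab-0001', 'tcga-ab-0002', 'tcga-ab-0002'],
    'reference': ['hg19', 'hg19', 'grch37'],
    'tumor_index': [0, 7, 9],
    'normal_index': [2, 5, 6],
})

best = keep_best_reference(pairs)
print(best)
```

With the question's three pairs this keeps the hg19 pair (rows 0 and 2) for tcga-ab-0001 and the grch37 pair (rows 9 and 6) for tcga-ab-0002, which matches the result table in the question.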