python - Pandas dataframe grouping by multiple columns and dropping duplicate rows -


I am trying to do a task (in bioinformatics, TCGA data) using a DataFrame of the following form:

import pandas as pd

df = pd.DataFrame({
    'id': ['tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001',
           'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002',
           'tcga-ab-0003', 'tcga-ab-0002'],
    'reference': ['hg19', 'hg18', 'hg19', 'grch37',
                  'hg18', 'hg19', 'grch37', 'hg19',
                  'grch37', 'grch37'],
    'sampletype': ['tumor', 'tumor', 'normal', 'normal',
                   'tumor', 'normal', 'normal', 'tumor',
                   'tumor', 'tumor'],
})

which looks like:

             id reference sampletype
0  tcga-ab-0001      hg19      tumor
1  tcga-ab-0001      hg18      tumor
2  tcga-ab-0001      hg19     normal
3  tcga-ab-0001    grch37     normal
4  tcga-ab-0002      hg18      tumor
5  tcga-ab-0002      hg19     normal
6  tcga-ab-0002    grch37     normal
7  tcga-ab-0002      hg19      tumor
8  tcga-ab-0003    grch37      tumor
9  tcga-ab-0002    grch37      tumor

I am trying to match pairs of rows if they have the same 'reference' and a different 'sampletype'. The result should be a new DataFrame of the following form:

                 tumor                                    normal
index            id reference sampletype    index            id reference sampletype
0      tcga-ab-0001      hg19      tumor        2  tcga-ab-0001      hg19     normal
7      tcga-ab-0002      hg19      tumor        5  tcga-ab-0002      hg19     normal
9      tcga-ab-0002    grch37      tumor        6  tcga-ab-0002    grch37     normal

Now I want to drop duplicate ids, with priority according to the list [grch37, hg19, hg18]. If, for example, both hg19 and hg18 exist for the same id, I keep hg19. The result should be the following:

                 tumor                                    normal
index            id reference sampletype    index            id reference sampletype
0      tcga-ab-0001      hg19      tumor        2  tcga-ab-0001      hg19     normal
9      tcga-ab-0002    grch37      tumor        6  tcga-ab-0002    grch37     normal

Is there a way to do this with groupby or another pandas function?

Thanks!

It is still not 100% clear to me what the desired output is, but this seems to do the trick based on my understanding.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001',
           'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002',
           'tcga-ab-0003', 'tcga-ab-0002', 'tcga-ab-0001', 'tcga-ab-0001'],
    'reference': ['hg19', 'hg18', 'hg19', 'grch37',
                  'hg18', 'hg19', 'grch37', 'hg19',
                  'grch37', 'grch37', 'grch37', 'grch37'],
    'sampletype': ['tumor', 'tumor', 'normal', 'normal',
                   'tumor', 'normal', 'normal', 'tumor',
                   'tumor', 'tumor', 'normal', 'tumor'],
})

This is longer than the original example, and it tests having redundant candidate rows.

              id reference sampletype
0   tcga-ab-0001      hg19      tumor
1   tcga-ab-0001      hg18      tumor
2   tcga-ab-0001      hg19     normal
3   tcga-ab-0001    grch37     normal
4   tcga-ab-0002      hg18      tumor
5   tcga-ab-0002      hg19     normal
6   tcga-ab-0002    grch37     normal
7   tcga-ab-0002      hg19      tumor
8   tcga-ab-0003    grch37      tumor
9   tcga-ab-0002    grch37      tumor
10  tcga-ab-0001    grch37     normal
11  tcga-ab-0001    grch37      tumor

Now create a temp df that may still have "redundant" rows.

##
# create a df with sorting and first-level filtering
##
df_2 = (df.groupby(['id', 'reference'])
          .filter(lambda x: set(x.sampletype) == {'tumor', 'normal'})
          .drop_duplicates(['id', 'reference', 'sampletype'])
          .sort_values(['id', 'reference', 'sampletype']))
# after dropping dups and sorting, the sampletype column must alternate: normal, tumor, normal...

# break into 2 pieces for a horizontal concat
left = df_2.iloc[np.arange(0, df_2.shape[0], 2)].copy()
right = df_2.iloc[np.arange(1, df_2.shape[0], 2)].copy()

# reindex by id so pd.concat can match rows
left['old_index'] = left.index.values
left.index = left['id']
right['old_index'] = right.index.values
right.index = right['id']
right.columns = [c + '_2' for c in right.columns]  # rename right side columns so we can groupby(['id'])

# horizontal concat
temp = pd.concat([left, right], axis=1)

# temp may still hold several rows per id, one for each unique (id, reference) tuple
temp.index = np.arange(temp.shape[0])

temp

             id reference sampletype  old_index          id_2 reference_2  \
0  tcga-ab-0001    grch37     normal          3  tcga-ab-0001      grch37
1  tcga-ab-0001      hg19     normal          2  tcga-ab-0001        hg19
2  tcga-ab-0002    grch37     normal          6  tcga-ab-0002      grch37
3  tcga-ab-0002      hg19     normal          5  tcga-ab-0002        hg19

  sampletype_2  old_index_2
0        tumor           11
1        tumor            0
2        tumor            9
3        tumor            7
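In case the alternating-row split feels fragile, the same pairing could probably also be sketched with a self-merge on 'id' and 'reference'. This is only a sketch, assuming one tumor row and one normal row per (id, reference) is what is wanted; the helper name `pair_samples` and the `_tumor` / `_normal` suffixes are made up for illustration and are not part of the approach above.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0001',
           'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002', 'tcga-ab-0002',
           'tcga-ab-0003', 'tcga-ab-0002', 'tcga-ab-0001', 'tcga-ab-0001'],
    'reference': ['hg19', 'hg18', 'hg19', 'grch37',
                  'hg18', 'hg19', 'grch37', 'hg19',
                  'grch37', 'grch37', 'grch37', 'grch37'],
    'sampletype': ['tumor', 'tumor', 'normal', 'normal',
                   'tumor', 'normal', 'normal', 'tumor',
                   'tumor', 'tumor', 'normal', 'tumor'],
})


def pair_samples(frame):
    """Hypothetical helper: pair one tumor row with one normal row per (id, reference)."""
    # keep the original row labels as a column, then drop redundant candidate rows
    rows = (frame.reset_index()
                 .rename(columns={'index': 'old_index'})
                 .drop_duplicates(['id', 'reference', 'sampletype']))
    tumor = rows[rows['sampletype'] == 'tumor']
    normal = rows[rows['sampletype'] == 'normal']
    # the default inner merge keeps only (id, reference) keys present on both sides
    return tumor.merge(normal, on=['id', 'reference'],
                       suffixes=('_tumor', '_normal'))


pairs = pair_samples(df)
print(pairs)
```

The merge does not rely on the rows alternating after a sort, so an (id, reference) group with only one sample type simply drops out of the result.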

If I understand correctly, you want to keep 1 row for each id, choosing them in order of priority = ['grch37', 'hg19', 'hg18'].

##
# second level of filtering using the priority list
##
priority = ['grch37', 'hg19', 'hg18']
g = temp.groupby(['id'])

def filter_2(grp, priority=priority):
    # rank each row by the position of its reference in the priority list (0 = most preferred)
    ranks = grp['reference'].map(priority.index).values
    pos = np.argsort(ranks)[0]
    idx = grp.index[pos]
    return grp.loc[idx, :]

# one best row per id, collected back into a DataFrame
final = pd.DataFrame([filter_2(grp) for _, grp in g])
final.index = np.arange(final.shape[0])

This yields my understanding of the final desired output. Note: it is different from the original example, since I expanded the input df.

final

             id reference sampletype  old_index          id_2 reference_2  \
0  tcga-ab-0001    grch37     normal          3  tcga-ab-0001      grch37
1  tcga-ab-0002    grch37     normal          6  tcga-ab-0002      grch37

  sampletype_2  old_index_2
0        tumor           11
1        tumor            9
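The priority step could arguably also be written without a per-group function, by turning 'reference' into an ordered categorical and keeping the first row per id after sorting. The sketch below is just one way to do it, not a claim about the best one. The first four rows are copied from the `temp` table above; the last two rows, with the id 'hypothetical-id', are invented purely to show hg19 winning over hg18.

```python
import pandas as pd

priority = ['grch37', 'hg19', 'hg18']

candidates = pd.DataFrame({
    'id': ['tcga-ab-0001', 'tcga-ab-0001', 'tcga-ab-0002', 'tcga-ab-0002',
           'hypothetical-id', 'hypothetical-id'],  # last two rows are made up
    'reference': ['grch37', 'hg19', 'grch37', 'hg19', 'hg18', 'hg19'],
})

# an ordered categorical sorts by position in `priority`, not alphabetically
ranked = candidates.assign(
    reference=pd.Categorical(candidates['reference'],
                             categories=priority, ordered=True))

# after sorting, the first row of each id is its highest-priority reference
best = (ranked.sort_values(['id', 'reference'])
              .drop_duplicates('id')
              .reset_index(drop=True))
print(best)
```

Plain string sorting would put hg18 before hg19, which is why the categorical with an explicit category order is used here.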
