python - Looping over a dataframe and fetching related data from another dataframe: pandas


I have a dataframe holding transaction data for customers, with columns mailid, txn_date, city. I have to consider customers who transacted on or after 01-Jan-2016, and for each such mailid fetch its transaction data from the base file for the last 12 months (txn_date between the customer's last txn date and that date minus a 365-day timedelta) and find the name of the most-transacted city.

Sample base dataframe:

    #df
    mailid  txn_date    city
    satya   2015-07-21  a
    satya   2015-08-11  b
    satya   2016-05-11  c
    xyz     2016-06-01  f
    satya   2016-06-01  a
    satya   2016-06-01  b
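For reproducibility, a minimal sketch that builds this sample frame; the city value a in the first and fifth rows is inferred from the counts discussed below, and txn_date is parsed as datetime so the date comparisons work:

    import pandas as pd

    # hypothetical construction of the sample base dataframe
    df = pd.DataFrame({
        'mailid':   ['satya', 'satya', 'satya', 'xyz', 'satya', 'satya'],
        'txn_date': ['2015-07-21', '2015-08-11', '2016-05-11',
                     '2016-06-01', '2016-06-01', '2016-06-01'],
        'city':     ['a', 'b', 'c', 'f', 'a', 'b'],
    })
    df['txn_date'] = pd.to_datetime(df['txn_date'])  # datetime dtype for date arithmetic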

As I need customers active on or after 2016-01-01, I did:

    d = df[['mailid', 'txn_date']][df['txn_date'] >= '2016-01-01']

Now for each mailid in d I have to fetch its last 12 months of transaction data from the base dataframe df and calculate the max transacted city. I am using a loop like:

    import pandas as pd
    from datetime import timedelta

    x = d.groupby(['mailid'])['txn_date'].max().reset_index()  # finding the last transacted date to work out the 12-month window
    x['max_city'] = 'n'  # giving default value 'n'
    for idx, row in x.iterrows():
        g = row[1].date()
        h = g - timedelta(days=365)  # getting the date 12 months back
        y = df[(df['mailid'] == row[0]) & (df['txn_date'] >= str(h))]
        y = y.sort_values(['txn_date'], ascending=True)  # sorting because I want to consider the last txn when the count for 2 or more cities is the same
        c = y.groupby(['mailid', 'city']).size().reset_index()
        v = c.groupby(['mailid'])[0].max().reset_index()
        dca = pd.merge(y, c, on=['mailid', 'city'], how='left')
        dcb = pd.merge(dca, v, on=['mailid', 0])
        m = dcb.drop_duplicates(['mailid'], keep='last')
        x.loc[idx, 'max_city'] = m['city'].unique()[0]  # write back to x; assigning to the iterrows copy would be lost

Output:

    mailid  max_city
    satya   b       # in the last 12 months (2016-06-01 back to 2015-06-01) txns were a=2, b=2; as b is the last txn, b is considered the max city
    xyz     f

Though the code works (I am sure it is un-organised and uses no proper naming conventions, as I was practicing) for a small chunk of data, the loop will hit the main base dataframe df once for each customer present in dataframe x.

So my main concern is: if df has 100 million rows and x has 6 million rows, the loop will execute 6 million times, hitting df each time to fetch the matching mailid's data and doing the operations to find the max transacted city.

If in 1 minute it can calculate 3 mailids' max city, then 6 million will take 2 million minutes... a serious problem...

So I need suggestions from you guys on how to optimize this scenario, thereby hitting the main base dataframe fewer times and applying a more convenient pandas way of doing it (which I am not able to do yet)...

Please suggest! Thanks in advance.

You can use groupby and apply functionality more efficiently.

Group by both city and mailid to get the maximum date and total number of transactions, then sort by max date:

    g = d.groupby(['mailid', 'city'])['txn_date'].agg(['count', 'max']).sort_values('max', ascending=False)

Then group by mailid and take the city at the index of the highest count:

    g.groupby(level='mailid')['count'].agg(lambda x: x.idxmax()[1])

Resulting in:

    mailid
    satya    a
    xyz      f
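For reference, a fuller sketch along the same lines, assuming the sample df built above; it adds the per-customer 12-month window that the two-liner leaves out (last_txn and t are names introduced here for illustration):

    import pandas as pd

    # last transaction date per customer, restricted to customers active on/after 2016-01-01
    last_txn = (df[df['txn_date'] >= '2016-01-01']
                .groupby('mailid')['txn_date'].max()
                .rename('last_txn'))

    # keep only each customer's transactions within 365 days of their last txn
    t = df.merge(last_txn, left_on='mailid', right_index=True)
    t = t[t['txn_date'] >= t['last_txn'] - pd.Timedelta(days=365)]

    # count per (mailid, city), sort so the most recent city comes first,
    # then take the city at the index of the highest count per customer
    g = (t.groupby(['mailid', 'city'])['txn_date']
          .agg(['count', 'max'])
          .sort_values('max', ascending=False))
    max_city = g.groupby(level='mailid')['count'].agg(lambda s: s.idxmax()[1])
    print(max_city)

This touches df only twice (one filter, one merge) instead of once per customer, which is where the speedup over the row-wise loop comes from.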

BTW, in your example you have transactions for satya in both a and b on 2016-06-01. How did you decide that b was the right answer?
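If the intended rule is that on a count tie the city of the most recent transaction wins, one way to make that deterministic is to sort by count and then by max date before picking one row per customer; a sketch assuming the grouped frame g from the answer above (ranked is a name introduced here):

    # ascending sort: the last row per mailid has the highest count,
    # and among equal counts the latest max txn date
    ranked = g.reset_index().sort_values(['mailid', 'count', 'max'])
    max_city = (ranked.drop_duplicates('mailid', keep='last')
                      .set_index('mailid')['city'])

For satya this yields b, matching the loop's output, though note that a and b tie on both count and latest date in the sample, so the final pick there falls back to row order.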

