Python - looping over a DataFrame and fetching related data from another DataFrame (pandas)
I have a dataframe holding customers' transaction data, with columns mailid, txn_date, city. The situation: I have to consider customers active since 01Jan2016, and for each such mailid fetch its txn data from the base file, considering only the last 12 months of data (txn dates between the customer's last txn date and that date minus a 365-day timedelta), and find out the most-transacted (max) city name.
Sample base dataframe:

    #df
    mailid  txn_date    city
    satya   2015-07-21  a
    satya   2015-08-11  b
    satya   2016-05-11  c
    xyz     2016-06-01  f
    satya   2016-06-01  a
    satya   2016-06-01  b
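For reference, a minimal sketch that reconstructs this sample frame (assuming txn_date is meant to be a datetime column, which the date comparisons below rely on):

    import pandas as pd

    # Build the sample base frame; txn_date is parsed to datetime64 so
    # comparisons like df['txn_date'] >= '2016-01-01' behave as expected.
    df = pd.DataFrame({
        'mailid': ['satya', 'satya', 'satya', 'xyz', 'satya', 'satya'],
        'txn_date': pd.to_datetime(['2015-07-21', '2015-08-11', '2016-05-11',
                                    '2016-06-01', '2016-06-01', '2016-06-01']),
        'city': ['a', 'b', 'c', 'f', 'a', 'b'],
    })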
As I need customers active since 2016-01-01, I did:

    d = df[['mailid', 'txn_date']][df['txn_date'] >= '2016-01-01']

Now for each mailid in d I have to fetch its last 12 months of transaction data from the base dataframe df and calculate the most-transacted city. I am using a loop like:
    import pandas as pd
    from datetime import timedelta

    x = d.groupby(['mailid'])['txn_date'].max().reset_index()  # last transacted date, to derive the 12-month window
    x['max_city'] = 'N'  # giving default value 'N'
    for idx, row in x.iterrows():
        g = row[1].date()
        h = g - timedelta(days=365)  # start of the last-12-month window
        y = df[(df['mailid'] == row[0]) & (df['txn_date'] >= str(h))]
        y = y.sort_values(['txn_date'], ascending=True)  # sorting because I want the last txn to win when two or more cities tie on count
        c = y.groupby(['mailid', 'city']).size().reset_index()
        v = c.groupby(['mailid'])[0].max().reset_index()
        dca = pd.merge(y, c, on=['mailid', 'city'], how='left')
        dcb = pd.merge(dca, v, on=['mailid', 0])
        m = dcb.drop_duplicates(['mailid'], keep='last')
        x.loc[idx, 'max_city'] = m['city'].unique()[0]  # write back via .loc; assigning to `row` would not update x

Output:
    mailid  max_city
    satya   b       # in the last 12 months (2016-06-01 back to 2015-06-01) txn counts are a=2, b=2; b is the last txn, so b is considered the max city
    xyz     f

Though the code works for small chunks of data (I am sure it is unorganised and uses no proper naming convention; I was practicing), the loop hits the main base dataframe df once for every customer present in dataframe x.
So my main concern: what if df has 100 million rows and x has 6 million rows? Then the loop will execute 6 million times, hitting df each time to fetch that mailid's matching data and run the whole operation to find the most-transacted city.
If it can calculate the max city for 3 mailids in one minute, 6 million will take 2 million minutes (roughly 3.8 years)... which is a serious problem...
So I need suggestions from you guys on how to optimize this scenario, i.e. hit the main base dataframe fewer times and apply a more convenient pandas approach (which I have not been able to work out yet)...

Please suggest! Thanks in advance.
You can use groupby and apply functionality to do this more efficiently.

Group by both mailid and city, taking the maximum date and the total number of transactions, then sort by the max date:
    g = d.groupby(['mailid', 'city'])['txn_date'].agg(['count', 'max']).sort_values('max', ascending=False)
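On the sample data above, and assuming d here still carries the city column (e.g. d = df[df['txn_date'] >= '2016-01-01']), g would look something like this; the relative order of rows sharing the same max date is not guaranteed:

                 count        max
    mailid city
    satya  a         1 2016-06-01
           b         1 2016-06-01
    xyz    f         1 2016-06-01
    satya  c         1 2016-05-11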
Then group by mailid and take the city from the index entry with the highest count:

    # idxmax returns the (mailid, city) index label of the highest count;
    # [1] picks out the city level.
    g.groupby(level='mailid')['count'].agg(lambda x: x.idxmax()[1])

which gives:
    mailid
    satya    a
    xyz      f

BTW, in your example you have transactions for satya in both a and b on 2016-06-01. How did you decide that b was the right answer?
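If the intended rule is "on a count tie, the most recent transaction wins", as the question's expected output suggests, one way to make that deterministic is a stable sort before taking each customer's last row. A sketch under that assumption (not the answer's original code):

    import pandas as pd

    # Count txns and latest txn date per (mailid, city).
    g = d.groupby(['mailid', 'city'])['txn_date'].agg(['count', 'max'])

    # Two stable sorts: by latest date, then by count. Within each count,
    # rows keep ascending date order, so the last row per customer has the
    # highest count and, on count ties, the latest date.
    g = g.sort_values('max', kind='mergesort').sort_values('count', kind='mergesort')

    # The last row per customer is the winner; recover mailid -> city pairs.
    max_city = g.groupby(level='mailid').tail(1).reset_index()[['mailid', 'city']]

Exact ties (same count and same latest date, like satya's a and b here) still fall back to the residual row order, so they would need an explicit rule, which is what the question above is getting at.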