python - looping over a dataframe and fetching related data from another dataframe: pandas
I have a dataframe holding transaction data of customers, with columns mailid, txn_date, city. The situation: I have to consider customers who transacted on or after 01Jan2016, and for each such mailid I have to fetch its transaction data from the base file, keep only the last 12 months of data (txn_date between the last txn_date and a -365 days timedelta), and find the name of the city with the most transactions.
Sample base dataframe:
#df
mailid  txn_date    city
satya   2015-07-21  a
satya   2015-08-11  b
satya   2016-05-11  c
xyz     2016-06-01  f
satya   2016-06-01  a
satya   2016-06-01  b
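For reference, a minimal sketch that reconstructs this sample (parsing txn_date as datetime is an assumption, but the date comparisons below rely on it):

import pandas as pd

# Sample base dataframe from the post; txn_date is parsed to datetime
# so that comparisons and timedeltas behave as dates, not strings.
df = pd.DataFrame({
    'mailid': ['satya', 'satya', 'satya', 'xyz', 'satya', 'satya'],
    'txn_date': pd.to_datetime(['2015-07-21', '2015-08-11', '2016-05-11',
                                '2016-06-01', '2016-06-01', '2016-06-01']),
    'city': ['a', 'b', 'c', 'f', 'a', 'b'],
})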
As I need customers active on or after 2016-01-01, I did:
d = df[['mailid', 'txn_date']][df['txn_date'] >= '2016-01-01']
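(A slightly more idiomatic equivalent, doing the row filter and column selection in one step, would be d = df.loc[df['txn_date'] >= '2016-01-01', ['mailid', 'txn_date']].)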
Now, for each mailid in d, I have to fetch its last 12 months of transaction data from the base dataframe df and work out the most transacted city. I am using a loop like:
from datetime import timedelta

x = d.groupby(['mailid'])['txn_date'].max().reset_index()  # last transacted date per mailid, to anchor the 12-month window
x['max_city'] = 'N'  # giving a default value 'N'
for idx, row in x.iterrows():
    g = row[1].date()
    h = g - timedelta(days=365)  # getting the date 12 months back
    y = df[(df['mailid'] == row[0]) & (df['txn_date'] >= str(h))]
    y = y.sort_values(['txn_date'], ascending=True)  # sorting because I want to consider the last txn when the counts of two or more cities are the same
    c = y.groupby(['mailid', 'city']).size().reset_index()
    v = c.groupby(['mailid'])[0].max().reset_index()
    dca = pd.merge(y, c, on=['mailid', 'city'], how='left')
    dcb = pd.merge(dca, v, on=['mailid', 0])
    m = dcb.drop_duplicates(['mailid'], keep='last')
    x.loc[idx, 'max_city'] = m['city'].unique()[0]
Output:

mailid  max_city
satya   b
xyz     f

(For satya, in the last 12 months, 2016-06-01 back to 2015-06-01, there are txns in a=2 and b=2; the last txn was in b, so b is considered the max city.)
Though this code works (I am sure it is un-organised and uses no proper naming conventions, as I am still practicing) on a small chunk of data, the loop hits the main base dataframe df once for each customer present in dataframe x.
So my main concern is: if df has 100 million rows and x has 6 million rows, the loop will execute 6 million times, hitting df each time to fetch the matching mailid's data and running the operations to find the max transacted city.
If it can calculate the max city for 3 mailids in 1 minute, 6 million will take 2 million minutes... a serious problem.
So I need suggestions on how to optimize this scenario: hitting the main base dataframe fewer times and using a more convenient, pandas-idiomatic approach (which I have not been able to work out yet).
Please suggest! Thanks in advance.
You can use groupby and apply functionality to do this more efficiently.
Group by both mailid and city, taking the maximum date and the total number of transactions, and sort by the max date (note that d here must still contain the city column, so keep city when building it):
g = d.groupby(['mailid', 'city'])['txn_date'].agg(['count', 'max']).sort_values('max', ascending=False)
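On the sample above, g ends up with one row per (mailid, city) pair holding its transaction count and latest date, most recent first; roughly (the relative order of rows sharing the same max date may vary):

             count        max
mailid city
satya  a         1 2016-06-01
       b         1 2016-06-01
xyz    f         1 2016-06-01
satya  c         1 2016-05-11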
Then group by mailid and take the city from the index of the highest count (idxmax returns the (mailid, city) index label of the first maximum, so [1] picks out the city):
g.groupby(level='mailid')['count'].agg(lambda x: x.idxmax()[1])
which outputs:

mailid
satya    a
xyz      f
BTW, in your example you have transactions for satya in both a and b on 2016-06-01. How did you decide that b is the right answer?
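For what it's worth, here is a minimal sketch (not part of the original answer) that makes the tie rule explicit, assuming it is "most transactions, then most recent transaction", and that also applies the question's per-customer 12-month window without a Python-level loop:

import pandas as pd

# Per-customer 12-month window: keep each mailid's transactions that fall
# within 365 days of that mailid's own most recent transaction.
last = df.groupby('mailid')['txn_date'].transform('max')
recent = df[df['txn_date'] >= last - pd.Timedelta(days=365)]

# Rank each (mailid, city) pair by count, breaking ties by recency: after a
# stable sort on txn_date, a higher positional index means a later transaction.
tmp = recent.sort_values('txn_date', kind='mergesort').reset_index(drop=True).reset_index()
ranked = (tmp.groupby(['mailid', 'city'])
             .agg(count=('index', 'size'), recency=('index', 'max'))
             .reset_index()
             .sort_values(['count', 'recency'], ascending=False))
max_city = ranked.drop_duplicates('mailid')[['mailid', 'city']]

On the sample this should give b for satya (a and b tie on count and on last date, but b's last transaction appears later in the data, matching the question's "consider the last txn" rule) and f for xyz.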