pandas - Most efficient way to aggregate tabular data meeting certain conditions in Python in O(1) time?
Let's say we have a table with a bunch of data in long format (each row is one data point). For instance, say we have a table of people's SAT scores, with columns for state, city, school, gender, race, and person. The goal is to find a way to pull out and average the data points corresponding to groupings of the data.

For instance, say I want the average SAT scores associated with males in Texas, or with white females in NYC. What's the best way to do this in Python (I suppose it's similar to `tapply` in R)? Boolean indexing on a pandas data structure? Can it be done in O(1) time? The dumbest possible way would be to crawl through the table row by row, check whether each row meets the conditions, and collect the numbers that do, but I can't help thinking there's a better way; I don't want this to become unnecessarily slow for bigger datasets.
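For concreteness, here is a minimal sketch of the kind of query I mean, with a made-up DataFrame (the column names and values are just illustrative):

    import pandas as pd

    # Made-up long-format table of SAT scores; one row per person.
    df = pd.DataFrame({
        "state":  ["TX", "TX", "NY", "NY"],
        "city":   ["Austin", "Dallas", "NYC", "NYC"],
        "gender": ["M", "M", "F", "F"],
        "race":   ["white", "black", "white", "white"],
        "score":  [1200, 1350, 1400, 1280],
    })

    # Boolean indexing: average score for males in Texas.
    tx_males = df[(df["state"] == "TX") & (df["gender"] == "M")]
    print(tx_males["score"].mean())

    # groupby is the closer analogue of R's tapply: average per (state, gender).
    print(df.groupby(["state", "gender"])["score"].mean())

This is the sort of query I mean; the question is whether it can be done faster than scanning every row.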
You cannot do this in O(1) time.
This problem has more than one complexity associated with it. I can think of three: preprocessing, insertion, and lookup.
You can do it in O(n log(n)) preprocessing + O(log(n)) lookup time, with O(log(n)) insertion, as follows:
    from collections import deque

    # Each row is assumed to be a dict mapping column name -> value.
    datatable = deque(your_data_table)
    column_names = ['col1', 'col2']  # ...etc.
Keep an updated data structure that has, for each column, a sorted list of pointers to the rows of the main table. I.e. if you have a data table with 6 columns, you need 6 sorted arrays whose values point to the row numbers of the main table. This structure should take about the same amount of memory as the original data table.
    overhead_structure = {}
    for column_name in column_names:
        # Sorted list of (value, row_index) pairs for this column.
        overhead_structure[column_name] = sorted(
            (row[column_name], row_index)
            for row_index, row in enumerate(datatable))
With this new overhead of sorted metadata, you can find the rows that satisfy a criterion by doing a binary search on each of the columns you have conditions on.
    from bisect import bisect_left

    # target_value is a placeholder for whatever value you are searching
    # this column for.
    pairs = overhead_structure[column_name]
    overhead_index = bisect_left(pairs, (target_value,))
    real_index = pairs[overhead_index][1]  # row number in the main table
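To answer a query like "the average score for males in Texas", one way (a sketch under the same assumptions as above, with hypothetical column names and values) is to collect the matching row numbers for each condition with this kind of binary search and then intersect them:

    from bisect import bisect_left, bisect_right

    def rows_matching(column_name, value):
        # All (value, row_index) pairs with this exact value form one contiguous
        # run in the sorted column, so two bisects bracket it in O(log n).
        pairs = overhead_structure[column_name]
        lo = bisect_left(pairs, (value,))
        hi = bisect_right(pairs, (value, float('inf')))
        return {row_index for _, row_index in pairs[lo:hi]}

    # Hypothetical query: average 'score' where state == 'TX' and gender == 'M'.
    matches = rows_matching('state', 'TX') & rows_matching('gender', 'M')
    scores = [datatable[i]['score'] for i in matches]
    average = sum(scores) / len(scores) if scores else None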
Also, with this new overhead of sorted metadata, each time you add a row to the main table you have to cram a pointer into each of the metadata sorts; you have to figure out where the row belongs in each column's sort.
    # new_row is assumed to be a dict mapping column name -> value.
    new_row_index = len(datatable)
    for column_name in column_names:
        pairs = overhead_structure[column_name]
        overhead_index = bisect_left(pairs, (new_row[column_name],))
        pairs.insert(overhead_index, (new_row[column_name], new_row_index))
    datatable.append(new_row)
This is the fastest way I can think of off the top of my head. I suspect there is a more robust and better way though, because SQL databases have to deal with this under the hood for SELECT statements, and years of smart people working hard have made those queries very optimized.
Edit:

Using arbitrary data types will give you headaches on insertion, because a native array has a fixed length. Using a doubly linked list datatype would make insertion time go down and create more memory overhead, but would keep the other complexities as I estimated them. That's why I wrote the pseudo-code with Python's `deque`.