python - PySpark: Calculate grouped-by AUC
Spark version: 1.6.0
I am trying to compute the AUC (area under ROC) grouped by the field id, given the following data:
# within each key-value pair:
#   key   = "id"
#   value = a list of (score, label) pairs
data = sc.parallelize(
    [('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
     ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])
The BinaryClassificationMetrics class can calculate the AUC given a list of (score, label) pairs.
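For example, on a single (ungrouped) RDD it would be used roughly like this (a minimal sketch; the scoreAndLabels variable name is just illustrative):

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# an RDD of (score, label) pairs for one model
scoreAndLabels = sc.parallelize([(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])
metrics = BinaryClassificationMetrics(scoreAndLabels)
print(metrics.areaUnderROC)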
I want to compute the AUC per key (i.e. for id1, id2). How do I "map" this class over the RDD by key?
Update
I tried wrapping BinaryClassificationMetrics in a function:
def auc(scoreAndLabels):
    return BinaryClassificationMetrics(scoreAndLabels).areaUnderROC
and then mapped the wrapper function over each value:

data.groupByKey()\
    .mapValues(auc)
But the list of (score, label) pairs is in fact of type ResultIterable inside mapValues(), while BinaryClassificationMetrics expects an RDD.
Is there an approach for converting the ResultIterable to an RDD so that the auc function can be applied? Or any other workaround for computing the group-by AUC (without importing third-party modules such as scikit-learn)?
Instead of using BinaryClassificationMetrics, you can use sklearn.metrics.auc and map it over each RDD element's value; you will get one AUC value per key:
from sklearn.metrics import auc

data = sc.parallelize([
    ('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
    ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])

result_aucs = data.map(lambda x: (x[0] + '_auc', auc(*zip(*x[1]))))
result_aucs.collect()

Out[1]: [('id1_auc', 0.15000000000000002), ('id2_auc', 0.15000000000000002)]
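If you would rather avoid third-party modules (as the question asks), one sklearn-free sketch is to compute the ROC AUC per key directly inside mapValues() using the rank-sum (Mann-Whitney U) formulation. The roc_auc helper below is hypothetical, not part of any Spark API:

def roc_auc(score_label_pairs):
    # accepts a plain list or the ResultIterable produced by groupByKey()
    pairs = list(score_label_pairs)
    pos = sum(1 for _, label in pairs if label == 1.0)
    neg = len(pairs) - pos
    if pos == 0 or neg == 0:
        return float('nan')  # AUC is undefined when only one class is present
    # rank the scores in ascending order, averaging ranks for tied scores
    ranked = sorted(pairs, key=lambda p: p[0])
    rank_sum = 0.0
    i = 0
    while i < len(ranked):
        j = i
        while j + 1 < len(ranked) and ranked[j + 1][0] == ranked[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        rank_sum += avg_rank * sum(1 for _, label in ranked[i:j + 1] if label == 1.0)
        i = j + 1
    # Mann-Whitney U statistic normalised by the number of (pos, neg) pairs
    return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)

per_key_auc = data.mapValues(roc_auc)  # or .groupByKey().mapValues(roc_auc) on ungrouped data

Because the computation is ordinary Python applied to each value, no nested RDD and no conversion of the ResultIterable to an RDD is needed.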