python - PySpark: Calculate grouped-by AUC
Spark version: 1.6.0
I am trying to compute the AUC (area under ROC) grouped by the field id, given the following data:
# within each key-value pair:
#   key   = "id"
#   value = a list of (score, label) pairs
data = sc.parallelize(
    [('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
     ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])
The BinaryClassificationMetrics class can calculate the AUC given a list of (score, label) pairs.
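For example, on a single (ungrouped) RDD it would be used roughly like this (a minimal sketch; the scoreAndLabels variable name is just illustrative):

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# an RDD of (score, label) pairs for one model
scoreAndLabels = sc.parallelize([(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])
metrics = BinaryClassificationMetrics(scoreAndLabels)
print(metrics.areaUnderROC)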
I want to compute the AUC per key (i.e. for id1, id2). How do I "map" this class over the RDD by key?
Update
I tried wrapping BinaryClassificationMetrics in a function:
def auc(scoreAndLabels):
    return BinaryClassificationMetrics(scoreAndLabels).areaUnderROC
and then mapped the wrapper function over each value:

data.groupByKey()\
    .mapValues(auc)
But the list of (score, label) pairs is in fact of type ResultIterable inside mapValues(), while BinaryClassificationMetrics expects an RDD.
Is there an approach for converting the ResultIterable to an RDD so that the auc function can be applied? Or any other workaround for computing the group-by AUC (without importing third-party modules such as scikit-learn)?
Instead of using BinaryClassificationMetrics, you can use sklearn.metrics.auc and map it over each RDD element's value; you will get one AUC value per key:
from sklearn.metrics import auc

data = sc.parallelize([
    ('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
    ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])

result_aucs = data.map(lambda x: (x[0] + '_auc', auc(*zip(*x[1]))))
result_aucs.collect()

Out[1]: [('id1_auc', 0.15000000000000002), ('id2_auc', 0.15000000000000002)]
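If you would rather avoid third-party modules (as the question asks), one sklearn-free sketch is to compute the ROC AUC per key directly inside mapValues() using the rank-sum (Mann-Whitney U) formulation. The roc_auc helper below is hypothetical, not part of any Spark API:

def roc_auc(score_label_pairs):
    # accepts a plain list or the ResultIterable produced by groupByKey()
    pairs = list(score_label_pairs)
    pos = sum(1 for _, label in pairs if label == 1.0)
    neg = len(pairs) - pos
    if pos == 0 or neg == 0:
        return float('nan')  # AUC is undefined when only one class is present
    # rank the scores in ascending order, averaging ranks for tied scores
    ranked = sorted(pairs, key=lambda p: p[0])
    rank_sum = 0.0
    i = 0
    while i < len(ranked):
        j = i
        while j + 1 < len(ranked) and ranked[j + 1][0] == ranked[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        rank_sum += avg_rank * sum(1 for _, label in ranked[i:j + 1] if label == 1.0)
        i = j + 1
    # Mann-Whitney U statistic normalised by the number of (pos, neg) pairs
    return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)

per_key_auc = data.mapValues(roc_auc)  # or .groupByKey().mapValues(roc_auc) on ungrouped data

Because the computation is ordinary Python applied to each value, no nested RDD and no conversion of the ResultIterable to an RDD is needed.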