python - PySpark: Calculate grouped-by AUC


  • Spark version: 1.6.0

I am trying to compute the AUC (area under ROC) grouped by the field id, given the following data:

# Within each key-value pair:
#   key:   "id"
#   value: list of (score, label)
data = sc.parallelize(
    [('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
     ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])

The BinaryClassificationMetrics class can calculate the AUC given an RDD of (score, label) pairs.
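For reference, this is how the class is used on a flat (un-grouped) RDD; a minimal sketch, assuming the Spark 1.6 import path pyspark.mllib.evaluation:

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# A flat RDD of (score, label) pairs, ignoring the id grouping
score_and_labels = sc.parallelize([(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])
print(BinaryClassificationMetrics(score_and_labels).areaUnderROC)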

I want to compute the AUC per key (i.e. id1, id2). How do I "map" this class over the RDD by key?

Update

I tried wrapping BinaryClassificationMetrics in a function:

def auc(score_and_labels):
    return BinaryClassificationMetrics(score_and_labels).areaUnderROC

and then mapped the wrapper function over each of the values:

data.groupByKey()\
    .mapValues(auc)

But the list of (score, label) is in fact of type ResultIterable inside mapValues(), while BinaryClassificationMetrics expects an RDD. (Creating an RDD inside mapValues is not an option either, since the SparkContext is only available on the driver, not inside worker tasks.)

Is there an approach for converting the ResultIterable to an RDD so that the auc function can be applied? Or is there any other workaround for computing the group-by AUC (without importing third-party modules such as scikit-learn)? One such workaround is sketched below.
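One workaround that needs no third-party modules is to compute the AUC with a plain Python function applied via mapValues, since mapValues hands the function an ordinary iterable of (score, label) pairs. A minimal sketch (the roc_auc helper below is written for illustration and is not part of PySpark):

def roc_auc(score_and_labels):
    # Plain-Python ROC AUC: the fraction of (positive, negative) pairs
    # in which the positive example gets the higher score (ties count 0.5)
    pairs = list(score_and_labels)
    pos = [s for s, l in pairs if l == 1.0]
    neg = [s for s, l in pairs if l == 0.0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes present
    concordant = sum(1.0 if p > n else 0.5 if p == n else 0.0
                     for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))

result = data.groupByKey().mapValues(roc_auc)
result.collect()

Counting concordant pairs is quadratic in the group size, which is fine for modest groups; for the sample data above, each group yields 0.25.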

Instead of using BinaryClassificationMetrics you can use sklearn.metrics.auc and map it over each RDD element's value; you'll get an AUC value per key:

from sklearn.metrics import auc

data = sc.parallelize(
    [('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
     ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])

result_aucs = data.map(lambda x: (x[0] + '_auc', auc(*zip(*x[1]))))
result_aucs.collect()

Out[1]: [('id1_auc', 0.15000000000000002), ('id2_auc', 0.15000000000000002)]
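Note that sklearn.metrics.auc(x, y) computes the area under an arbitrary curve by the trapezoidal rule, so the call above integrates the labels against the scores rather than computing a ROC AUC (which is why it returns 0.15 here). If what you want per key is the ROC AUC, scikit-learn's roc_auc_score takes the labels and scores directly; a minimal sketch of the same per-key mapping:

from sklearn.metrics import roc_auc_score

# roc_auc_score(y_true, y_score) expects the labels first, the scores second
result_aucs = data.map(
    lambda x: (x[0] + '_auc',
               roc_auc_score([label for score, label in x[1]],
                             [score for score, label in x[1]])))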
