Precision and recall on PySpark DecisionTree model diverge from manual results
I trained a DecisionTree model on a PySpark DataFrame. The resulting DataFrame is simulated below:
rdd = sc.parallelize([(0., 1.), (0., 0.), (0., 0.), (1., 1.),
                      (1., 0.), (1., 0.), (1., 1.), (1., 1.)])
df = sqlContext.createDataFrame(rdd, ["prediction", "target_index"])
df.show()

+----------+------------+
|prediction|target_index|
+----------+------------+
|       0.0|         1.0|
|       0.0|         0.0|
|       0.0|         0.0|
|       1.0|         1.0|
|       1.0|         0.0|
|       1.0|         0.0|
|       1.0|         1.0|
|       1.0|         1.0|
+----------+------------+
So let's calculate a metric, recall:
from pyspark.mllib.evaluation import MulticlassMetrics

metricsp = MulticlassMetrics(df.rdd)
print metricsp.recall()

0.625
OK. Let's try to confirm that this is correct:
tp = df[(df.target_index == 1) & (df.prediction == 1)].count()
tn = df[(df.target_index == 0) & (df.prediction == 0)].count()
fp = df[(df.target_index == 0) & (df.prediction == 1)].count()
fn = df[(df.target_index == 1) & (df.prediction == 0)].count()
print "True Positives:", tp
print "True Negatives:", tn
print "False Positives:", fp
print "False Negatives:", fn
print "Total", df.count()

True Positives: 3
True Negatives: 2
False Positives: 2
False Negatives: 1
Total 8
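As an extra sanity check (a sketch of mine, not part of the original question), the same four counts can be read off MulticlassMetrics.confusionMatrix(), whose columns are predicted classes ordered by label ascending and whose rows are the actual classes:

# For this data the array should come out as [[TN, FP], [FN, TP]] = [[2., 2.], [1., 3.]].
print metricsp.confusionMatrix().toArray()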
Now calculate recall from these counts:
r = float(tp) / (tp + fn)
print "recall", r

recall 0.75
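For completeness (another sketch of mine, reusing the counts above), precision can be checked by hand the same way, and it does not match 0.625 either:

p = float(tp) / (tp + fp)
print "precision", p   # 3 / (3 + 2) = 0.6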
And the results differ. What am I doing wrong?
BTW, all the functions of the metrics class give the same result:
print metricsp.recall()
print metricsp.precision()
print metricsp.fMeasure()

0.625
0.625
0.625
The problem is that you are using MulticlassMetrics to process the output of a binary classifier. From the docs:
recall()
    Returns recall (equals precision for multiclass classifier because sum of all false positives is equal to sum of all false negatives)
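In other words, when no label is given the metrics are aggregated over all classes, which for a confusion table boils down to plain accuracy. A minimal sketch (my addition, reusing the tp/tn/fp/fn counts computed in the question) that reproduces the 0.625:

# Label-free recall == precision == fMeasure == accuracy here:
accuracy = float(tp + tn) / (tp + tn + fp + fn)
print accuracy   # (3 + 2) / 8 = 0.625, matching metricsp.recall()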
To get the correct results, use recall(label=1):
>>> print metricsp.recall(label=1)
0.75
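precision() and fMeasure() accept the same label argument, so the per-class values for the positive class can be read the same way. A small sketch (my addition, assuming the same metricsp object and a float label matching the column type):

>>> print metricsp.precision(label=1.0)          # TP / (TP + FP) = 3/5 = 0.6
>>> print metricsp.fMeasure(label=1.0, beta=1.0) # harmonic mean of 0.6 and 0.75, ~0.667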
BTW, the headers in your df.show() output seem jumbled up; it should be:
+----------+------------+
|prediction|target_index|
+----------+------------+
|       0.0|         1.0|
|       0.0|         0.0|
|       0.0|         0.0|
|       1.0|         1.0|
|       1.0|         0.0|
|       1.0|         0.0|
|       1.0|         1.0|
|       1.0|         1.0|
+----------+------------+