Precision and recall on PySpark DecisionTree model diverge from manual results
I trained a DecisionTree model on a PySpark DataFrame. The resulting DataFrame is simulated below:
rdd = sc.parallelize([(0., 1.), (0., 0.), (0., 0.), (1., 1.),
                      (1., 0.), (1., 0.), (1., 1.), (1., 1.)])
df = sqlContext.createDataFrame(rdd, ["prediction", "target_index"])
df.show()

+----------+------------+
|prediction|target_index|
+----------+------------+
|       0.0|         1.0|
|       0.0|         0.0|
|       0.0|         0.0|
|       1.0|         1.0|
|       1.0|         0.0|
|       1.0|         0.0|
|       1.0|         1.0|
|       1.0|         1.0|
+----------+------------+
So let's calculate a metric, recall:
from pyspark.mllib.evaluation import MulticlassMetrics

metricsp = MulticlassMetrics(df.rdd)
print metricsp.recall()

0.625
OK. Let's try to confirm that this is correct:
tp = df[(df.target_index == 1) & (df.prediction == 1)].count()
tn = df[(df.target_index == 0) & (df.prediction == 0)].count()
fp = df[(df.target_index == 0) & (df.prediction == 1)].count()
fn = df[(df.target_index == 1) & (df.prediction == 0)].count()
print "True Positives:", tp
print "True Negatives:", tn
print "False Positives:", fp
print "False Negatives:", fn
print "Total", df.count()

True Positives: 3
True Negatives: 2
False Positives: 2
False Negatives: 1
Total 8
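As an extra sanity check (a sketch of mine, not part of the original question), the same four counts can be read off MulticlassMetrics.confusionMatrix(), whose columns are predicted classes ordered by label ascending and whose rows are the actual classes:

# For this data the array should come out as [[TN, FP], [FN, TP]] = [[2., 2.], [1., 3.]].
print metricsp.confusionMatrix().toArray()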
Now calculate recall from these counts:
r = float(tp) / (tp + fn)
print "recall", r

recall 0.75
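For completeness (another sketch of mine, reusing the counts above), precision can be checked by hand the same way, and it does not match 0.625 either:

p = float(tp) / (tp + fp)
print "precision", p   # 3 / (3 + 2) = 0.6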
And the results differ. What am I doing wrong?
BTW, all the functions of the metrics class give the same result:
print metricsp.recall()
print metricsp.precision()
print metricsp.fMeasure()

0.625
0.625
0.625
The problem is that you are using MulticlassMetrics to process the output of a binary classifier. From the docs:
recall()
    Returns recall (equals precision for multiclass classifier because sum of all false positives is equal to sum of all false negatives)
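In other words, when no label is given the metrics are aggregated over all classes, which for a confusion table boils down to plain accuracy. A minimal sketch (my addition, reusing the tp/tn/fp/fn counts computed in the question) that reproduces the 0.625:

# Label-free recall == precision == fMeasure == accuracy here:
accuracy = float(tp + tn) / (tp + tn + fp + fn)
print accuracy   # (3 + 2) / 8 = 0.625, matching metricsp.recall()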
To get the correct results, use recall(label=1):
>>> print metricsp.recall(label=1)
0.75
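precision() and fMeasure() accept the same label argument, so the per-class values for the positive class can be read the same way. A small sketch (my addition, assuming the same metricsp object and a float label matching the column type):

>>> print metricsp.precision(label=1.0)          # TP / (TP + FP) = 3/5 = 0.6
>>> print metricsp.fMeasure(label=1.0, beta=1.0) # harmonic mean of 0.6 and 0.75, ~0.667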
BTW, the headers in your df.show() output seem jumbled up; it should be:
+----------+------------+
|prediction|target_index|
+----------+------------+
|       0.0|         1.0|
|       0.0|         0.0|
|       0.0|         0.0|
|       1.0|         1.0|
|       1.0|         0.0|
|       1.0|         0.0|
|       1.0|         1.0|
|       1.0|         1.0|
+----------+------------+