scala - Using reduceByKey to group a list of values
I want to group a list of values per key, and was doing something like this:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).groupByKey().collect.foreach(println)

(red,CompactBuffer(zero, two))
(yellow,CompactBuffer(one))
But I noticed a Databricks blog post recommending against using groupByKey for large datasets.

Is there a way to achieve the same result using reduceByKey?

I tried it, but it concatenates the values. By the way, in my case both the key and the value are of String type.

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).reduceByKey(_ ++ _).collect.foreach(println)

(red,zerotwo)
(yellow,one)
Use aggregateByKey:

import scala.collection.mutable.ListBuffer

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(ListBuffer.empty[String])(
    (numList, num) => { numList += num; numList },
    (numList1, numList2) => { numList1.appendAll(numList2); numList1 })
  .mapValues(_.toList)
  .collect()

Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))
See this answer for details on aggregateByKey, and this link for the rationale behind using a mutable collection such as ListBuffer.
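For comparison, here is a minimal sketch (not from the linked answer) of the same aggregation with an immutable List; every append and merge allocates a new list, which is the overhead the mutable ListBuffer version avoids:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(List.empty[String])(
    (acc, v) => acc :+ v,          // builds a new List on every append
    (acc1, acc2) => acc1 ++ acc2)  // builds a new List on every merge
  .collect()

The result is the same Array of (String, List[String]) pairs, but each += in the ListBuffer version mutates the buffer in place instead of copying it.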
Edit:

"Is there a way to achieve the same result using reduceByKey?"

Doing this with reduceByKey would be worse in performance; please see the comments from @zero323 for details.
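For illustration only (this sketch is not from the original answer), one way it could be done with reduceByKey is to wrap each value in a single-element List first, so that ++ concatenates lists rather than Strings; the extra list allocated on every merge is part of why this is slower:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .mapValues(v => List(v))   // ("red", "zero") becomes ("red", List("zero"))
  .reduceByKey(_ ++ _)       // list concatenation instead of String concatenation
  .collect()
  .foreach(println)

This prints (red,List(zero, two)) and (yellow,List(one)); the ordering may vary.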