scala - Using ReduceByKey to group list of values


I want to group a list of values per key, and I am doing it like this:

    sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
      .groupByKey()
      .collect
      .foreach(println)

    (red,CompactBuffer(zero, two))
    (yellow,CompactBuffer(one))

But I noticed a blog post from Databricks that recommends not using groupByKey on large datasets.

Avoid GroupByKey

Is there a way to achieve the same result using reduceByKey?

I tried this, but it concatenates all the values. By the way, in my case both the key and the value are of type String.

    sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
      .reduceByKey(_ ++ _)
      .collect
      .foreach(println)

    (red,zerotwo)
    (yellow,one)

Use aggregateByKey:

    import scala.collection.mutable.ListBuffer

    sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
      .aggregateByKey(ListBuffer.empty[String])(
        // append one value to the buffer within a partition
        (numList, num) => { numList += num; numList },
        // merge two buffers coming from different partitions
        (numList1, numList2) => { numList1.appendAll(numList2); numList1 })
      .mapValues(_.toList)
      .collect()

    scala> Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))

See this answer for more details on aggregateByKey, and this link for the rationale behind using a mutable collection, ListBuffer.
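To make the roles of the two functions passed to aggregateByKey more concrete, here is a minimal sketch in plain Scala that mimics the two phases over in-memory "partitions". The helper `aggregateByKeyLocal` and the way the data is split into partitions are assumptions made for illustration; they are not part of Spark. One difference worth noting: Spark takes the zero value as a plain value, while this sketch takes a function so that every key gets a fresh buffer.

```scala
import scala.collection.mutable.ListBuffer

object AggregateSketch {
  // Hypothetical local stand-in for aggregateByKey; not a Spark API.
  def aggregateByKeyLocal[K, V, U](partitions: Seq[Seq[(K, V)]])(zero: () => U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Phase 1: inside each partition, fold every value into a per-key accumulator.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero()), v))
      }
    }
    // Phase 2: merge the per-partition accumulators that share a key.
    perPartition.foldLeft(Map.empty[K, U]) { (merged, partial) =>
      partial.foldLeft(merged) { case (acc, (k, u)) =>
        acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }

  // Assumed split of the example data into two partitions.
  val partitions: Seq[Seq[(String, String)]] = Seq(
    Seq(("red", "zero"), ("yellow", "one")),
    Seq(("red", "two"))
  )

  val grouped: Map[String, List[String]] =
    aggregateByKeyLocal[String, String, ListBuffer[String]](partitions)(
      () => ListBuffer.empty[String])(
      (buf, v) => buf += v,   // seqOp: add a value to the buffer
      (b1, b2) => b1 ++= b2   // combOp: merge two buffers
    ).map { case (k, buf) => (k, buf.toList) }
}
```

With this split, `AggregateSketch.grouped` maps red to List(zero, two) and yellow to List(one). In Spark the order of values within a key depends on how the data is partitioned, so it should not be relied on.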

Edit:

Is there a way to achieve the same result using reduceByKey?

The above is actually worse in performance; please see the comments by @zero323 for details.
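The reduceByKey variant that the edit refers to is not shown here. One plausible form, offered only as an assumption, wraps each value in a one-element List before reducing, so that the reduce function joins lists instead of strings. The sketch below shows the idea in plain Scala; `reduceByKeyLocal` is a hypothetical stand-in for Spark's reduceByKey over an in-memory sequence. On an RDD, the equivalent calls would be `mapValues(List(_))` followed by `reduceByKey(_ ::: _)`.

```scala
object ReduceSketch {
  // Hypothetical local stand-in for reduceByKey; not a Spark API.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  val pairs: Seq[(String, String)] =
    Seq(("red", "zero"), ("yellow", "one"), ("red", "two"))

  // Reducing the raw strings concatenates them, as observed in the question.
  val concatenated: Map[String, String] = reduceByKeyLocal(pairs)(_ + _)

  // Wrap every value in a one-element List so the reduce step joins lists.
  val wrapped: Seq[(String, List[String])] =
    pairs.map { case (k, v) => (k, List(v)) }

  val grouped: Map[String, List[String]] = reduceByKeyLocal(wrapped)(_ ::: _)
}
```

This produces the grouped lists, but it allocates a new list for every single value and the combined lists are as large as the original values, so the amount of data to shuffle is not reduced. That is likely the reason it performs no better than groupByKey.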

