scala - Using reduceByKey to group a list of values
I want to group a list of values per key, and was doing something like this:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).groupByKey().collect.foreach(println)

(red,CompactBuffer(zero, two))
(yellow,CompactBuffer(one))
But I noticed a Databricks blog post recommending against using groupByKey for large datasets.

Is there a way to achieve the same result using reduceByKey?

I tried it, but it concatenates the values. By the way, in my case both the key and the value are of String type.

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two"))).reduceByKey(_ ++ _).collect.foreach(println)

(red,zerotwo)
(yellow,one)
Use aggregateByKey:

import scala.collection.mutable.ListBuffer

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(ListBuffer.empty[String])(
    (numList, num) => { numList += num; numList },
    (numList1, numList2) => { numList1.appendAll(numList2); numList1 })
  .mapValues(_.toList)
  .collect()

Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))
See this answer for details on aggregateByKey, and this link for the rationale behind using a mutable collection such as ListBuffer.
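For comparison, here is a minimal sketch (not from the linked answer) of the same aggregation with an immutable List; every append and merge allocates a new list, which is the overhead the mutable ListBuffer version avoids:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(List.empty[String])(
    (acc, v) => acc :+ v,          // builds a new List on every append
    (acc1, acc2) => acc1 ++ acc2)  // builds a new List on every merge
  .collect()

The result is the same Array of (String, List[String]) pairs, but each += in the ListBuffer version mutates the buffer in place instead of copying it.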
Edit:

"Is there a way to achieve the same result using reduceByKey?"

Doing this with reduceByKey would be worse in performance; please see the comments from @zero323 for details.
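For illustration only (this sketch is not from the original answer), one way it could be done with reduceByKey is to wrap each value in a single-element List first, so that ++ concatenates lists rather than Strings; the extra list allocated on every merge is part of why this is slower:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .mapValues(v => List(v))   // ("red", "zero") becomes ("red", List("zero"))
  .reduceByKey(_ ++ _)       // list concatenation instead of String concatenation
  .collect()
  .foreach(println)

This prints (red,List(zero, two)) and (yellow,List(one)); the ordering may vary.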