list - PySpark collect_set or collect_list with groupby
How can I use collect_set or collect_list on a DataFrame after a groupby? For example: df.groupby('key').collect_set('values'). This gives an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'
You need to use agg. Example:
    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F

    sc = SparkContext("local")
    sqlContext = HiveContext(sc)

    # Three rows for the same id, with some null codes and names
    df = sqlContext.createDataFrame([
        ("a", None, None),
        ("a", "code1", None),
        ("a", "code2", "name2"),
    ], ["id", "code", "name"])

    df.show()

    +---+-----+-----+
    | id| code| name|
    +---+-----+-----+
    |  a| null| null|
    |  a|code1| null|
    |  a|code2|name2|
    +---+-----+-----+

Note that in the above you have to create a HiveContext. See https://stackoverflow.com/a/35529093/690430 for dealing with different Spark versions.

    # collect_set and collect_list are aggregate functions, so pass them to agg
    (df
      .groupby("id")
      .agg(F.collect_set("code"),
           F.collect_list("name"))
      .show())

    +---+-----------------+------------------+
    | id|collect_set(code)|collect_list(name)|
    +---+-----------------+------------------+
    |  a|   [code1, code2]|           [name2]|
    +---+-----------------+------------------+
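To make it clearer what the two aggregates do with the rows above, here is a rough plain-Python sketch of the same grouping. It is not Spark code and not how Spark implements it. The helper name group_collect is made up for illustration. It only mirrors the visible behaviour: nulls are skipped, collect_set keeps unique values, and collect_list keeps every non-null value. The sorted() call is my addition so the result is predictable; Spark itself does not guarantee the order of elements in collect_set.

```python
from collections import defaultdict

# Same rows as the DataFrame above: (id, code, name), with None standing in for null.
rows = [
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
]


def group_collect(rows):
    """Hypothetical helper emulating
    groupby("id").agg(collect_set("code"), collect_list("name"))."""
    codes = defaultdict(set)   # like collect_set: unique values only
    names = defaultdict(list)  # like collect_list: duplicates are kept
    for key, code, name in rows:
        code_set = codes[key]   # touch both dicts so every key appears,
        name_list = names[key]  # even if all of its values are null
        if code is not None:    # nulls are skipped by both aggregates
            code_set.add(code)
        if name is not None:
            name_list.append(name)
    # sorted() is only here for a stable result in this sketch
    return {key: (sorted(codes[key]), names[key]) for key in codes}


result = group_collect(rows)
print(result)
```

With the three sample rows this gives one entry for id "a", holding the two codes and the single non-null name, which matches the table printed by show() above.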