list - PySpark collect_set or collect_list with groupby


How can I use collect_set or collect_list on a DataFrame after a groupby? Example: df.groupby('key').collect_set('values'). I get an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'

You need to use agg. Example:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F

    sc = SparkContext("local")

    sqlContext = HiveContext(sc)

    df = sqlContext.createDataFrame([
        ("a", None, None),
        ("a", "code1", None),
        ("a", "code2", "name2"),
    ], ["id", "code", "name"])

    df.show()

    +---+-----+-----+
    | id| code| name|
    +---+-----+-----+
    |  a| null| null|
    |  a|code1| null|
    |  a|code2|name2|
    +---+-----+-----+

Note that in the above I have to create a HiveContext. See https://stackoverflow.com/a/35529093/690430 for dealing with different Spark versions.

    (df
      .groupby("id")
      .agg(F.collect_set("code"),
           F.collect_list("name"))
      .show())

    +---+-----------------+------------------+
    | id|collect_set(code)|collect_list(name)|
    +---+-----------------+------------------+
    |  a|   [code1, code2]|           [name2]|
    +---+-----------------+------------------+
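Notice that the null values do not show up in either result column: both collect_set and collect_list skip nulls, and collect_set also drops duplicates. As a rough illustration of that behaviour, here is a plain-Python sketch that needs no Spark at all. The helper name collect_by_key is made up for this example and is not part of PySpark, and the sorted() call is only there to make the printed output stable (Spark itself does not guarantee the order of elements in a collect_set result).

```python
from collections import defaultdict

# Same rows as the DataFrame above: (id, code, name); None stands in for null.
rows = [
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
]


def collect_by_key(rows):
    """Hypothetical helper imitating
    groupby("id").agg(collect_set("code"), collect_list("name"))."""
    codes = defaultdict(set)    # set: duplicates collapse
    names = defaultdict(list)   # list: duplicates are kept
    for key, code, name in rows:
        # Touch both dicts so a key with only nulls still yields empty results.
        codes[key]
        names[key]
        if code is not None:    # nulls are skipped, as in Spark
            codes[key].add(code)
        if name is not None:
            names[key].append(name)
    # sorted() is for stable display only; it is not something Spark does.
    return {key: (sorted(codes[key]), names[key]) for key in codes}


result = collect_by_key(rows)
print(result)  # {'a': (['code1', 'code2'], ['name2'])}
```

If you need to keep the nulls, you would have to replace them with a placeholder value before aggregating, since the aggregation itself discards them.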
