Java Spark: convert multiple rows into a single row with multiple collections
I'm looking for ideas on how to solve the following situation. My use case is Java Spark, but I'm interested in ideas regardless of language, because I've run out of them. I have unstructured data like this:
98480|PERSON|TOM|GREER|1982|12|27
98480|PHONE|CELL|732|201|6789
98480|PHONE|HOME|732|123|9876
98480|ADDR|RES|102|JFK BLVD|PISCATAWAY|NJ|08854
98480|ADDR|OFF|211|EXCHANGE PL|JERSEY CITY|NJ|07302
98481|PERSON|LIN|JASSOY|1976|09|15
98481|PHONE|CELL|908|398|3389
98481|PHONE|HOME|917|363|2647
98481|ADDR|RES|111|JOURNAL SQ|JERSEY CITY|NJ|07704
98481|ADDR|OFF|365|DOWNTOWN NEWYORK|NEWYORK CITY|NY|10001
I'm trying to convert these into one row per personId, carrying the person data together with a collection of phone and addr entries, like this:
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
|personId|type  |firstName|lastName|year|month|day|Phone                                                               | addr                                                                                                                  |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
|98481 |PERSON|LIN |JASSOY |1976|09 |15 |[[PHONE, HOME, 917, 363, 2647], [PHONE, CELL, 908, 398, 3389]] | [[ADDR, OFF, 365, DOWNTOWN NEWYORK, NEWYORK CITY, NY, 10001], [ADDR, RES, 111, JOURNAL SQ, JERSEY CITY, NJ, 07704]] |
|98480 |PERSON|TOM |GREER |1982|12 |27 |[[PHONE, HOME, 732, 123, 9876], [PHONE, CELL, 732, 201, 6789]] | [[ADDR, RES, 102, JFK BLVD, PISCATAWAY, NJ, 08854], [ADDR, OFF, 211, EXCHANGE PL, JERSEY CITY, NJ, 07302]] |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
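Independent of Spark, the shape of the transformation I'm after is just a group-by on the first field with per-type collections; here is a plain-Python sketch over a tiny inlined sample (illustration only, not my actual code):

```python
from collections import defaultdict

# tiny inlined sample of the pipe-delimited input
lines = [
    "98480|PERSON|TOM|GREER|1982|12|27",
    "98480|PHONE|CELL|732|201|6789",
    "98480|PHONE|HOME|732|123|9876",
    "98480|ADDR|RES|102|JFK BLVD|PISCATAWAY|NJ|08854",
    "98481|PERSON|LIN|JASSOY|1976|09|15",
    "98481|PHONE|CELL|908|398|3389",
]

# one record per personId: person fields plus collected Phone/addr rows
records = defaultdict(lambda: {"person": None, "Phone": [], "addr": []})
for line in lines:
    person_id, rec_type, *fields = line.split("|")
    rec = records[person_id]
    if rec_type == "PERSON":
        rec["person"] = fields                    # firstName, lastName, year, month, day
    elif rec_type == "PHONE":
        rec["Phone"].append([rec_type] + fields)  # one entry per phone row
    elif rec_type == "ADDR":
        rec["addr"].append([rec_type] + fields)   # one entry per addr row

print(records["98480"]["Phone"])
```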
I tried this with the code below:
Dataset<Row> dataset = groupedDataset
.agg(collect_set(struct(phoneRow.col("type").as("collType"), phoneRow.col("phoneType").as("phoneType"),
phoneRow.col("areaCode").as("areaCode"), phoneRow.col("phoneMiddle").as("phoneMiddle"),
phoneRow.col("ext").as("ext"), addressRow.col("type").as("collType"),
addressRow.col("addrType").as("addrType"), addressRow.col("addr1").as("rowType"),
addressRow.col("addr2").as("addr2"), addressRow.col("city").as("city"),
addressRow.col("state").as("state"), addressRow.col("zipCode").as("zipCode"))).as("addrPhone"));
Still looking for a way to solve the above.
Update:
I was able to get the expected output, but I'm not sure how efficient it is; it looks like a lot of boilerplate code with joins and intermediate DataFrames. This is just sample data I'm using to learn Spark, but the actual data I'll be processing will need many more complex transformations, and this code doesn't look efficient.
Here is the updated code:
Dataset<Row> groupedPhoneDataSet = groupedDataset.agg(collect_set(struct(phoneRow.col("type").as("phColType"),
phoneRow.col("phoneType").as("phoneType"), phoneRow.col("areaCode").as("areaCode"),
phoneRow.col("phoneMiddle").as("phoneMiddle"), phoneRow.col("ext").as("ext"))).as("phoneRec"));
Dataset<Row> groupedAddrDataSet = groupedDataset
.agg(collect_set(struct(addressRow.col("type").as("addrColType"),
addressRow.col("addrType").as("addrType"), addressRow.col("addr1").as("addr1"),
addressRow.col("addr2").as("addr2"), addressRow.col("city").as("city"),
addressRow.col("state").as("state"), addressRow.col("zipCode").as("zipCode"))).as("addrRec"));
Dataset<Row> finalDataSet = groupedAddrDataSet
.join(groupedPhoneDataSet,
groupedAddrDataSet.col("personId").equalTo(groupedPhoneDataSet.col("personId")))
.select(groupedPhoneDataSet.col("personId"), groupedPhoneDataSet.col("type"),
groupedPhoneDataSet.col("firstName"), groupedPhoneDataSet.col("lastName"),
groupedPhoneDataSet.col("year"), groupedPhoneDataSet.col("month"),
groupedPhoneDataSet.col("day"), col("phoneRec"), col("addrRec"));
Is there a way to achieve this without creating so many DataFrames?

If you're OK with creating multiple DataFrames: split each record type into its own DataFrame, group each by personId, then join all three DataFrames on personId. My attempt is below; let me know whether it solves your problem.
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.{col, collect_list, struct}

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Leads Processing Job").setMaster("local[1]")
    val sparkContext = new org.apache.spark.SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)

    val df = sqlContext.read.option("delimiter", "|").format("csv").load("data.csv")
    df.printSchema()

    // PERSON rows keep their fields as top-level columns
    val df_person = df.where("_c1 = 'PERSON'")
      .select(col("_c0").as("personId"), col("_c1").as("type"),
        col("_c2").as("firstName"), col("_c3").as("lastName"),
        col("_c4").as("year"), col("_c5").as("month"),
        col("_c6").as("day"))

    val df_address = df.where("_c1 = 'ADDR'")
    val df_phone = df.where("_c1 = 'PHONE'")

    // ADDR rows: one struct per row, collected per personId
    val df_addr_f = df_address
      .withColumn("addr", struct(col("_c1"), col("_c2"),
        col("_c3"), col("_c4"), col("_c5"), col("_c6")))
      .groupBy(col("_c0").as("personId")).agg(collect_list(col("addr")).as("addr"))

    // PHONE rows: same pattern
    val df_phone_f = df_phone.groupBy(col("_c0").as("personId"))
      .agg(collect_list(struct(col("_c1"), col("_c2"),
        col("_c3"), col("_c4"), col("_c5"))).as("Phone"))

    val final_df = df_person.join(df_addr_f, "personId").join(df_phone_f, "personId")
    final_df.show(false)
  }
}
It produces the output below:
+--------+------+---------+--------+----+-----+---+-----------------------------------------------------------------------------------------------------+--------------------------------------------------------------+
|personId|type |firstName|lastName|year|month|day|addr |Phone |
+--------+------+---------+--------+----+-----+---+-----------------------------------------------------------------------------------------------------+--------------------------------------------------------------+
|98480 |PERSON|TOM |GREER |1982|12 |27 |[[ADDR, RES, 102, JFK BLVD, PISCATAWAY, NJ], [ADDR, OFF, 211, EXCHANGE PL, JERSEY CITY, NJ]] |[[PHONE, CELL, 732, 201, 6789], [PHONE, HOME, 732, 123, 9876]]|
|98481 |PERSON|LIN |JASSOY |1976|09 |15 |[[ADDR, RES, 111, JOURNAL SQ, JERSEY CITY, NJ], [ADDR, OFF, 365, DOWNTOWN NEWYORK, NEWYORK CITY, NY]]|[[PHONE, CELL, 908, 398, 3389], [PHONE, HOME, 917, 363, 2647]]|
+--------+------+---------+--------+----+-----+---+-----------------------------------------------------------------------------------------------------+--------------------------------------------------------------+
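One caveat worth noting about this join-based version (my addition, not part of the original answer): `join(other, "personId")` defaults to an inner join, so a person with no PHONE rows or no ADDR rows is dropped entirely. In plain-Python terms, the difference between the two join flavors looks like this (98482 is a made-up person with no phone rows):

```python
# hypothetical person 98482 has no phone rows at all
person = {"98480": ["TOM", "GREER"], "98482": ["ANN", "NOPHONE"]}
phone = {"98480": [["PHONE", "CELL", "732", "201", "6789"]]}

# inner join: only personIds present on BOTH sides survive
inner = {pid: (p, phone[pid]) for pid, p in person.items() if pid in phone}

# left join: keep every person, defaulting to an empty phone list
left = {pid: (p, phone.get(pid, [])) for pid, p in person.items()}

print(sorted(inner), sorted(left))
```

In Spark, passing "left" as the join type keeps such rows (the collection column is then null rather than an empty array).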
IIUC, you can read the data as lines of text, do some data manipulation, and then use collect_list or collect_set to get the desired result:
from pyspark.sql.functions import expr, substring_index
# read the files into dataframe with a single column named `value`
df = spark.read.text('/path/to/file/')
Split each line into two columns: personId (the first field) and an ArrayType column data (the remaining fields):

df1 = df.withColumn('personId', substring_index('value', '|', 1)) \
    .selectExpr('personId', 'split(substr(value, length(personId)+2), "[|]") as data')
#+--------+--------------------+
#|personId| data|
#+--------+--------------------+
#| 98480|[PERSON, TOM, GRE...|
#| 98480|[PHONE, CELL, 732...|
#| 98480|[PHONE, HOME, 732...|
#| 98480|[ADDR, RES, 102, ...|
#| 98480|[ADDR, OFF, 211, ...|
#| 98481|[PERSON, LIN, JAS...|
#| 98481|[PHONE, CELL, 908...|
#| 98481|[PHONE, HOME, 917...|
#| 98481|[ADDR, RES, 111, ...|
#| 98481|[ADDR, OFF, 365, ...|
#+--------+--------------------+
Then use groupby + collect_list (or collect_set). Note that collect_list/collect_set skip NULL entries. Below, collect_list creates three columns based on the value of data[0]:

(1) if data[0] is 'PHONE' or 'ADDR', convert data into a StructType, so the result is an array of structs;

(2) if data[0] is 'PERSON', keep data as an ArrayType, take the first element of the collected array (named d1), and then use selectExpr to turn d1 into 6 columns:

df1.groupby('personId') \
    .agg(
        expr("collect_list(IF(data[0] = 'PERSON', data, NULL))[0] as d1"),
        expr("""
            collect_list(
              IF(data[0] = 'PHONE'
              , (data[0] as phColType,
                 data[1] as phoneType,
                 data[2] as areaCode,
                 data[3] as phoneMiddle,
                 data[4] as ext)
              , NULL)
            ) AS Phone"""),
        expr("""
            collect_list(
              IF(data[0] = 'ADDR'
              , (data[0] as addrColType,
                 data[1] as addrType,
                 data[2] as addr1,
                 data[3] as addr2,
                 data[4] as city,
                 data[5] as state,
                 data[6] as zipCode)
              , NULL)
            ) AS Addr""")
    ).selectExpr(
        'personId',
        'd1[0] as type',
        'd1[1] as firstName',
        'd1[2] as lastName',
        'd1[3] as year',
        'd1[4] as month',
        'd1[5] as day',
        'Phone',
        'Addr'
    ).show(truncate=False)
Result (Phone and Addr are both arrays of structs):
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|personId|type |firstName|lastName|year|month|day|Phone |Addr |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|98481 |PERSON|LIN |JASSOY |1976|09 |15 |[[PHONE, CELL, 908, 398, 3389], [PHONE, HOME, 917, 363, 2647]]|[[ADDR, RES, 111, JOURNAL SQ, JERSEY CITY, NJ, 07704], [ADDR, OFF, 365, DOWNTOWN NEWYORK, NEWYORK CITY, NY, 10001]]|
|98480 |PERSON|TOM |GREER |1982|12 |27 |[[PHONE, CELL, 732, 201, 6789], [PHONE, HOME, 732, 123, 9876]]|[[ADDR, RES, 102, JFK BLVD, PISCATAWAY, NJ, 08854], [ADDR, OFF, 211, EXCHANGE PL, JERSEY CITY, NJ, 07302]] |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
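As a side note on the `collect_list(IF(cond, x, NULL))` pattern above: because collect_list ignores NULL entries, it behaves like a filter followed by a collect. A plain-Python equivalent of what each aggregate produces for one personId group (sample rows inlined for illustration):

```python
# rows for one personId after the split step: data = [type, field1, field2, ...]
rows = [
    ["PERSON", "TOM", "GREER", "1982", "12", "27"],
    ["PHONE", "CELL", "732", "201", "6789"],
    ["PHONE", "HOME", "732", "123", "9876"],
    ["ADDR", "RES", "102", "JFK BLVD", "PISCATAWAY", "NJ", "08854"],
]

# collect_list(IF(data[0] = 'PHONE', data, NULL)) keeps only PHONE rows
phone = [r for r in rows if r[0] == "PHONE"]
addr = [r for r in rows if r[0] == "ADDR"]

# collect_list(IF(data[0] = 'PERSON', data, NULL))[0] is the single PERSON row,
# whose elements d1[0]..d1[5] are then split out with selectExpr
d1 = [r for r in rows if r[0] == "PERSON"][0]
rec_type, first_name, last_name, year, month, day = d1

print(first_name, last_name, len(phone), len(addr))
```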