Python 3.x: How to match two columns of a dataframe (containing arrays) against two columns of a CSV (dataframe/dict)
I have a dataframe like this:
df = spark.createDataFrame([
[["Apple"],['iPhone EE','iPhone 11', 'iPhone 11 Pro']],
[["Acer"],['Iconia Talk S','liquid Z6 Plus']],
[["Casio"],['Casio G\'zOne Brigade']],
[["Alcatel"[,[]],
[["HTC", "Honor"].["Play 4", "Play 7"]]
]).toDF("brand","type")
and a CSV like this:

Apple;iPhone EE
Apple;iPhone 11 Pro
Apple;iPhone XS
Acer;liquid Z6 Plus
Acer;Acer Predator 8
Casio;Casio G'zOne Ravine
Alcatel;3L
HTC;Play 4
Honor;Play 7
I need to create a new boolean column match that is true if the combination of brand and type matches a row in the CSV, and false otherwise.
Expected output:
Brand | Type | Match
-------------------------------------------------------------
Apple | [iPhone EE, iPhone 11, iPhone 11 Pro] | True
Acer | [Iconia Talk S, liquid Z6 Plus] | True
Casio | [Casio G'zOne Brigade] | False
Alcatel | [] | False
HTC, Honor | [Play 4, Play 7] | True
Update: brand is also an array type. The CSV file is just a starting point; it can be converted into a dataframe or a dictionary (or whatever fits best).
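Before any Spark code, the expected Match values can be sanity-checked in plain Python. This is only a sketch: the CSV content is inlined as a string, and the names `known` and `match` are illustrative, not part of the question's code.

```python
import csv
import io

# The semicolon-separated CSV from the question, inlined for illustration
csv_text = """Apple;iPhone EE
Apple;iPhone 11 Pro
Apple;iPhone XS
Acer;liquid Z6 Plus
Acer;Acer Predator 8
Casio;Casio G'zOne Ravine
Alcatel;3L
HTC;Play 4
Honor;Play 7"""

# Build {brand: set of known types} from the CSV rows
known = {}
for brand, typ in csv.reader(io.StringIO(csv_text), delimiter=';'):
    known.setdefault(brand, set()).add(typ)

def match(brands, types):
    """True if any listed type is a known type for any listed brand."""
    allowed = set().union(*(known.get(b, set()) for b in brands)) if brands else set()
    return bool(allowed & set(types))

print(match(["Apple"], ["iPhone EE", "iPhone 11", "iPhone 11 Pro"]))  # True
print(match(["Casio"], ["Casio G'zOne Brigade"]))                     # False
print(match(["Alcatel"], []))                                         # False
print(match(["HTC", "Honor"], ["Play 4", "Play 7"]))                  # True
```

This reproduces the expected-output table above, including the multi-brand HTC/Honor row and the empty Alcatel row.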
What is the best way to accomplish this?

This might be useful:
>>> import pyspark.sql.functions as F
>>> df = spark.createDataFrame([
... ["Apple",['iPhone EE','iPhone 11', 'iPhone 11 Pro']],
... ["Acer",['Iconia Talk S','liquid Z6 Plus']],
... ["Casio",['Casio G\'zOne Brigade']],
... ["Alcatel",[]]
... ]).toDF("brand","type")
>>> df.show(df.count(), False)
+-------+-------------------------------------+
|brand |type |
+-------+-------------------------------------+
|Apple |[iPhone EE, iPhone 11, iPhone 11 Pro]|
|Acer |[Iconia Talk S, liquid Z6 Plus] |
|Casio |[Casio G'zOne Brigade] |
|Alcatel|[] |
+-------+-------------------------------------+
>>> file_df = spark.read.csv('/home/chai/brand.csv', header='true')
>>> file_df.show(file_df.count(), False)
+-------+-------------------+
|brand |types |
+-------+-------------------+
|Apple |iPhone EE |
|Apple |iPhone 11 Pro |
|Apple |iPhone XS |
|Acer |liquid Z6 Plus |
|Acer |Acer Predator 8 |
|Casio |Casio G'zOne Ravine|
|Alcatel|3L |
+-------+-------------------+
>>> file_df = file_df.groupBy('brand').agg(F.collect_list('types').alias('new'))
>>> file_df.show(file_df.count(), False)
+-------+-------------------------------------+
|brand |new |
+-------+-------------------------------------+
|Casio |[Casio G'zOne Ravine] |
|Alcatel|[3L] |
|Acer |[liquid Z6 Plus, Acer Predator 8] |
|Apple |[iPhone EE, iPhone 11 Pro, iPhone XS]|
+-------+-------------------------------------+
>>> def test(row_dict):
... new_dict = dict()
... for i in row_dict.get('type'):
... if i in row_dict.get('new'):
... new_dict['flag'] = 'True'
... else:
... new_dict['flag'] = 'False'
... if len(row_dict.get('type')) == 0 and len(row_dict.get('new')) > 0:
... new_dict['flag'] = 'False'
... new_dict['brand'] = row_dict.get('brand')
... new_dict['type'] = row_dict.get('type')
... new_dict['new'] = row_dict.get('new')
... return new_dict
...
>>> def row_to_dict(row):
... return row.asDict(recursive=True)
>>> all = df.join(file_df, on='brand', how='left')  # combine df with the grouped CSV per brand
>>> rdd = all.rdd.map(row_to_dict)
>>> rdd1 = rdd.map(test)
>>> final_df = spark.createDataFrame(rdd1)
>>> final_df.show(final_df.count(), False)
+-------+-----+-------------------------------------+-------------------------------------+
|brand |flag |new |type |
+-------+-----+-------------------------------------+-------------------------------------+
|Apple |True |[iPhone EE, iPhone 11 Pro, iPhone XS]|[iPhone EE, iPhone 11, iPhone 11 Pro]|
|Acer |True |[liquid Z6 Plus, Acer Predator 8] |[Iconia Talk S, liquid Z6 Plus] |
|Casio |False|[Casio G'zOne Ravine] |[Casio G'zOne Brigade] |
|Alcatel|False|[3L] |[] |
+-------+-----+-------------------------------------+-------------------------------------+
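One caveat: `test` overwrites `flag` on every loop iteration, so it effectively reports whether the *last* element of `type` appears in `new` (which happens to give the expected result for this data). An any-match version of the same row-level check, shown here on plain dicts rather than Spark rows, would be:

```python
def row_match(row_dict):
    """True if any entry of 'type' appears in the 'new' list (any-match)."""
    types = row_dict.get('type') or []
    known = row_dict.get('new') or []
    return any(t in known for t in types)

print(row_match({'brand': 'Apple',
                 'type': ['iPhone EE', 'iPhone 11', 'iPhone 11 Pro'],
                 'new': ['iPhone EE', 'iPhone 11 Pro', 'iPhone XS']}))  # True
print(row_match({'brand': 'Alcatel', 'type': [], 'new': ['3L']}))      # False
```

The `or []` guards also cover rows where the left join found no CSV entry and `new` is None.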
You can try the following to set this flag. Method 1: using a broadcast join plus array_intersect:
from pyspark.sql.functions import collect_set, size, array_intersect, broadcast, expr, flatten, collect_list, array_join
df_list = spark.read.csv("/path/to/csv_list", sep=';').toDF('brand_name','type')
df1 = df_list.groupby('brand_name').agg(collect_set('type').alias('types'))
df_new = df.join(broadcast(df1), expr("array_contains(brand, brand_name)"), "left") \
.groupby('brand', 'Type') \
.agg(flatten(collect_list('types')).alias('types')) \
.select(array_join('brand', ', ').alias('brand'), 'Type', (size(array_intersect('type', 'types'))>0).alias("Match"))
df_new.show(5,0)
+----------+-------------------------------------+-----+
|brand |Type |Match|
+----------+-------------------------------------+-----+
|Alcatel |[] |false|
|HTC, Honor|[Play 4, Play 7] |true |
|Casio |[Casio G'zOne Brigade] |false|
|Acer |[Iconia Talk S, liquid Z6 Plus] |true |
|Apple |[iPhone EE, iPhone 11, iPhone 11 Pro]|true |
+----------+-------------------------------------+-----+
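In Method 1, the HTC/Honor row joins two `brand_name` rows, so `collect_list('types')` yields a list of arrays and `flatten` concatenates them into one. Stripped of Spark, that aggregation step amounts to the following (with illustrative data):

```python
from itertools import chain

# After the array_contains join, the HTC/Honor row collects one
# types-array per matched brand_name:
collected = [['Play 4'], ['Play 7']]

# flatten(collect_list(...)) concatenates them into a single array
types = list(chain.from_iterable(collected))
print(types)  # ['Play 4', 'Play 7']
```

`array_intersect` then compares this combined list against the row's `type` array.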
Method 2: using a Map:
from pyspark.sql.functions import arrays_overlap, array, lit, col, create_map, collect_set, monotonically_increasing_id, first, explode, array_join, expr

dict1 = df1.rdd.collectAsMap()
map1 = create_map([t for k, v in dict1.items() for t in [lit(k), array(*map(lit, v))]])

df_new = df.withColumn('id', monotonically_increasing_id()) \
    .withColumn('brand', explode('brand')) \
    .withColumn('Match', arrays_overlap('type', map1[col('brand')])) \
    .groupby('id') \
    .agg(
        array_join(collect_set('brand'), ', ').alias('brand'),
        first('Type').alias('Type'),
        expr("sum(int(Match)) > 0 as Match")
    )
df_new.show(5,0)
+---+----------+-------------------------------------+-----+
|id |brand     |Type                                 |Match|
+---+----------+-------------------------------------+-----+
|0  |Apple     |[iPhone EE, iPhone 11, iPhone 11 Pro]|true |
|1  |Acer      |[Iconia Talk S, liquid Z6 Plus]      |true |
|3  |Alcatel   |[]                                   |false|
|2  |Casio     |[Casio G'zOne Brigade]               |false|
|4  |HTC, Honor|[Play 4, Play 7]                     |true |
+---+----------+-------------------------------------+-----+
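The comprehension passed to `create_map` in Method 2 flattens `dict1` into an alternating key, value sequence, which is the argument shape `create_map` expects. Without the `lit`/`array` column wrappers, it reduces to this (example data assumed):

```python
# A small stand-in for dict1 = df1.rdd.collectAsMap()
dict1 = {'Apple': ['iPhone EE', 'iPhone XS'], 'Acer': ['liquid Z6 Plus']}

# Interleave each key with its value list: [k1, v1, k2, v2, ...]
flat = [t for k, v in dict1.items() for t in [k, v]]
print(flat)  # ['Apple', ['iPhone EE', 'iPhone XS'], 'Acer', ['liquid Z6 Plus']]
```

`map1[col('brand')]` then looks up each exploded brand in that map and `arrays_overlap` checks it against the row's `type` array.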
I have to apologize: brand is also an array type. Can you help me adjust Method 1? -- Added an update to reflect your latest changes, including the csv list. -- I get an error at df_new = df.join(broadcast(df1), expr("array_contains(brand, brand_name)"), "left"):
cannot resolve array_contains(...) due to data type mismatch: input to function array_contains should have been array followed by a value with same element type, but it's [array, array].
It looks like they are both the same array type. I also checked printSchema(), and both are array type. -- How did you set up df1? If you followed the steps in my post, we read df_list from the csv file, and after the groupby, brand_name should be StringType. Unless you have some other post-processing steps? -- Changing the order fixed it... duh. expr("array_contains(brand, brand_name)") -- I had used expr("array_contains(brand_name, brand)")
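The error in these comments comes down to argument order: Spark's `array_contains(column, value)` takes the array column first and the value to look for second. A plain-Python stand-in (not the Spark API) makes the asymmetry clear:

```python
def array_contains(arr, value):
    """Illustrative stand-in for Spark's array_contains: array first, value second."""
    return value in arr

# Correct order: is the string brand_name inside the brand array?
print(array_contains(['HTC', 'Honor'], 'HTC'))    # True
print(array_contains(['HTC', 'Honor'], 'Casio'))  # False

# Swapping the arguments asks whether a string contains a whole list,
# which is the kind of type mismatch the Spark error message reports:
# array_contains('HTC', ['HTC', 'Honor'])  -> TypeError in this analogue
```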