Python 3.x: How to match two columns of a dataframe (containing arrays) against two columns of a CSV (dataframe/dict)


I have a dataframe like this:

df = spark.createDataFrame([
  [["Apple"],['iPhone EE','iPhone 11', 'iPhone 11 Pro']],
  [["Acer"],['Iconia Talk S','liquid Z6 Plus']],
  [["Casio"],['Casio G\'zOne Brigade']],
  [["Alcatel"],[]],
  [["HTC", "Honor"],["Play 4", "Play 7"]]
]).toDF("brand","type")
and a CSV like this:

Apple;iPhone EE
Apple;iPhone 11 Pro
Apple;iPhone XS
Acer;liquid Z6 Plus
Acer;Acer Predator 8
Casio;Casio G'zOne Ravine
Alcatel;3L
HTC;Play 4
Honor;Play 7

I need to create a new boolean column match that is True if the combination of brand and type matches a row in the CSV, and False otherwise.
Expected output:

    Brand      | Type                                  | Match
    -------------------------------------------------------------
    Apple      | [iPhone EE, iPhone 11, iPhone 11 Pro] | True
    Acer       | [Iconia Talk S, liquid Z6 Plus]       | True
    Casio      | [Casio G'zOne Brigade]                | False
    Alcatel    | []                                    | False
    HTC, Honor | [Play 4, Play 7]                      | True
Update: brand is also an array type.
The CSV file is just a starting point. It can be converted to a dataframe or a dictionary (or whatever fits best).
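
For illustration, a minimal sketch of both options, assuming the semicolon-separated file lives at a hypothetical path /path/to/brands.csv and a SparkSession named spark:

from pyspark.sql.functions import collect_set

# read the semicolon-separated csv into a dataframe (no header row assumed)
csv_df = spark.read.csv("/path/to/brands.csv", sep=";").toDF("brand", "type")

# or collapse it into a plain dict: brand -> list of types
brand_dict = (csv_df.groupBy("brand")
              .agg(collect_set("type").alias("types"))
              .rdd.collectAsMap())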

How can this best be accomplished?

This might be useful:

>>> import pyspark.sql.functions as F

>>> df = spark.createDataFrame([
...   ["Apple",['iPhone EE','iPhone 11', 'iPhone 11 Pro']],   
...   ["Acer",['Iconia Talk S','liquid Z6 Plus']],   
...   ["Casio",['Casio G\'zOne Brigade']],
...   ["Alcatel",[]]
... ]).toDF("brand","type")
>>> df.show(df.count(), False)
+-------+-------------------------------------+
|brand  |type                                 |
+-------+-------------------------------------+
|Apple  |[iPhone EE, iPhone 11, iPhone 11 Pro]|
|Acer   |[Iconia Talk S, liquid Z6 Plus]      |
|Casio  |[Casio G'zOne Brigade]               |
|Alcatel|[]                                   |
+-------+-------------------------------------+

>>> file_df = sqlcontext.read.csv('/home/chai/brand.csv', header='true')
>>> file_df.show(file_df.count(), False)
+-------+-------------------+
|brand  |types              |
+-------+-------------------+
|Apple  |iPhone EE          |
|Apple  |iPhone 11 Pro      |
|Apple  |iPhone XS          |
|Acer   |liquid Z6 Plus     |
|Acer   |Acer Predator 8    |
|Casio  |Casio G'zOne Ravine|
|Alcatel|3L                 |
+-------+-------------------+

>>> file_df = file_df.groupBy('brand').agg(F.collect_list('types').alias('new'))
>>> file_df.show(file_df.count(), False)
+-------+-------------------------------------+
|brand  |new                                  |
+-------+-------------------------------------+
|Casio  |[Casio G'zOne Ravine]                |
|Alcatel|[3L]                                 |
|Acer   |[liquid Z6 Plus, Acer Predator 8]    |
|Apple  |[iPhone EE, iPhone 11 Pro, iPhone XS]|
+-------+-------------------------------------+

>>> def test(row_dict):
...     # flag is 'True' when any type in the row also appears in the
...     # list collected from the csv for that brand, else 'False'
...     new_dict = dict()
...     new_dict['flag'] = 'False'
...     for i in row_dict.get('type'):
...             if i in row_dict.get('new'):
...                     new_dict['flag'] = 'True'
...                     break
...     new_dict['brand'] = row_dict.get('brand')
...     new_dict['type'] = row_dict.get('type')
...     new_dict['new'] = row_dict.get('new')
...     return new_dict
... 
>>> def row_to_dict(row):
...     return row.asDict(recursive=True)
... 
>>> # 'all' below is assumed to be the two dataframes joined on brand,
>>> # since its definition was missing from the original answer
>>> all = df.join(file_df, on='brand', how='left')
>>> rdd = all.rdd.map(row_to_dict)
>>> rdd1 = rdd.map(test)
>>> final_df = sqlcontext.createDataFrame(rdd1)
>>> final_df.show(final_df.count(), False)
+-------+-----+-------------------------------------+-------------------------------------+
|brand  |flag |new                                  |type                                 |
+-------+-----+-------------------------------------+-------------------------------------+
|Apple  |True |[iPhone EE, iPhone 11 Pro, iPhone XS]|[iPhone EE, iPhone 11, iPhone 11 Pro]|
|Acer   |True |[liquid Z6 Plus, Acer Predator 8]    |[Iconia Talk S, liquid Z6 Plus]      |
|Casio  |False|[Casio G'zOne Ravine]                |[Casio G'zOne Brigade]               |
|Alcatel|False|[3L]                                 |[]                                   |
+-------+-----+-------------------------------------+-------------------------------------+
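
In essence, the flag computed per row by test() is just an any-overlap check between the row's type list and the list collected from the csv; a minimal pure-Python equivalent (not part of the original answer):

def has_match(types, csv_types):
    # True when any device type from the dataframe row also appears
    # in the types collected from the csv for that brand
    return any(t in csv_types for t in types)

print(has_match(['iPhone EE', 'iPhone 11'], ['iPhone EE', 'iPhone XS']))  # True
print(has_match([], ['3L']))                                              # False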
Method 1: you can try join + groupby to set this flag. Because brand is an array, the join condition uses array_contains, so a single row can match several brand_name values from the csv; the matched type lists are then collected and flattened back into one row:

from pyspark.sql.functions import collect_set, size, array_intersect, broadcast, expr, flatten, collect_list, array_join

df_list = spark.read.csv("/path/to/csv_list", sep=';').toDF('brand_name','type')

df1 = df_list.groupby('brand_name').agg(collect_set('type').alias('types'))
  
df_new = df.join(broadcast(df1), expr("array_contains(brand, brand_name)"), "left") \
    .groupby('brand', 'Type') \
    .agg(flatten(collect_list('types')).alias('types')) \
    .select(array_join('brand', ', ').alias('brand'), 'Type', (size(array_intersect('type', 'types'))>0).alias("Match"))

df_new.show(5,0)
+----------+-------------------------------------+-----+                        
|brand     |Type                                 |Match|
+----------+-------------------------------------+-----+
|Alcatel   |[]                                   |false|
|HTC, Honor|[Play 4, Play 7]                     |true |
|Casio     |[Casio G'zOne Brigade]               |false|
|Acer      |[Iconia Talk S, liquid Z6 Plus]      |true |
|Apple     |[iPhone EE, iPhone 11, iPhone 11 Pro]|true |
+----------+-------------------------------------+-----+
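
As a side note on the Match expression: array_intersect returns the common elements of the two arrays, so size(...) > 0 is true exactly when the row's type list shares at least one entry with the types collected from the csv. A minimal illustrative sketch (not part of the original answer):

from pyspark.sql.functions import array_intersect, size

demo = spark.createDataFrame([(['a', 'b'], ['b', 'c'])], ['type', 'types'])
# true when the two arrays overlap in at least one element
demo.select((size(array_intersect('type', 'types')) > 0).alias('Match')).show()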
Method 2: using a map column (create_map):

from pyspark.sql.functions import arrays_overlap, array, lit, col, create_map, collect_set, monotonically_increasing_id, first, explode, array_join, expr

dict1 = df1.rdd.collectAsMap()
map1 = create_map([t for k,v in dict1.items() for t in [lit(k), array(*map(lit,v))]])

# explode brand, look up each brand in the map column, then aggregate back per row
df_new = df.withColumn('id', monotonically_increasing_id()) \
    .withColumn('brand', explode('brand')) \
    .withColumn('Match', arrays_overlap('type', map1[col('brand')])) \
    .groupby('id') \
    .agg(
        array_join(collect_set('brand'), ', ').alias('brand'),
        first('Type').alias('Type'),
        expr("sum(int(Match)) > 0 as Match")
    )

df_new.show(5,0)
+---+----------+-------------------------------------+-----+
|id |brand     |Type                                 |Match|
+---+----------+-------------------------------------+-----+
|0  |Apple     |[iPhone EE, iPhone 11, iPhone 11 Pro]|true |
|1  |Acer      |[Iconia Talk S, liquid Z6 Plus]      |true |
|3  |Alcatel   |[]                                   |false|
|2  |Casio     |[Casio G'zOne Brigade]               |false|
|4  |HTC, Honor|[Play 4, Play 7]                     |true |
+---+----------+-------------------------------------+-----+
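
To make the create_map line easier to follow: create_map expects a flat, interleaved list of key and value columns, which is what the nested comprehension produces. For a hypothetical two-brand dict it expands to the equivalent of:

from pyspark.sql.functions import create_map, lit, array

dict1 = {"Apple": ["iPhone EE"], "Acer": ["liquid Z6 Plus"]}  # hypothetical example
# equivalent hand-written form: [key1, value1, key2, value2]
map1 = create_map(
    lit("Apple"), array(lit("iPhone EE")),
    lit("Acer"), array(lit("liquid Z6 Plus")),
)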

我必须道歉。品牌也是数组类型。您能帮助我更改方法1吗?添加了更新以反映您的最新更改,包括csv列表。我在
df_new=df.join(广播(df1),expr(“数组包含(品牌,品牌名称)”,“左”)
无法解析数组包含(…)由于数据类型不匹配:函数array_contains的输入应该是array后跟一个具有相同元素类型的值,但它是[array,array]
看起来它们都是相同类型的
array
。我还检查了
printSchema()
,它们都是
array
类型。您是如何设置df1的?如果您按照我帖子中的步骤进行操作,我们将从csv文件中读取df_列表,在groupby之后,
brand_name
应该是StringType吗?除非你有其他的后处理步骤?更改订单修复了它…duh
expr(“array\u contains(brand,brand\u name)”)
我使用了
expr(“array\u contains(brand\u name,brand)”)
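
For reference, array_contains expects the array column first and the value to look for second, which is why swapping the arguments resolved the error; a minimal sketch, assuming a SparkSession named spark:

from pyspark.sql.functions import expr

demo = spark.createDataFrame([(["HTC", "Honor"], "HTC")], ["brand", "brand_name"])
# array column first, value second
demo.select(expr("array_contains(brand, brand_name)").alias("hit")).show()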