Warning: file_get_contents(/data/phpspider/zhask/data//catemap/9/blackberry/2.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Python PySpark查找一列中的模式是否存在于另一列中_Python_Apache Spark_Dataframe_Pyspark - Fatal编程技术网

Python PySpark查找一列中的模式是否存在于另一列中

Python PySpark查找一列中的模式是否存在于另一列中,python,apache-spark,dataframe,pyspark,Python,Apache Spark,Dataframe,Pyspark,我有两个pyspark数据帧。一个包含FullAddress字段(如col1),另一个数据框在其中一列(如col2)中包含城市/城镇/郊区的名称。我想比较col2和col1,如果有匹配的话返回col2 此外,郊区名称可以是郊区名称的列表 包含完整地址的Dataframe1 我想要这样的 +-----------------------------------------------------------++-----------+ |FullAddress

我有两个pyspark数据帧。一个包含FullAddress字段(如col1),另一个数据框在其中一列(如col2)中包含城市/城镇/郊区的名称。我想比较col2和col1,如果有匹配的话返回col2

此外,郊区名称可以是郊区名称的列表

包含完整地址的Dataframe1

我想要这样的

+-----------------------------------------------------------++-----------+
|FullAddress                                                |suburb      |
+-----------------------------------------------------------++-----------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia               |NORTH RYDE  |
| HAY STREET HAYMARKET 2000, NSW, Australia                 |HAYMARKET   |
| SMART STREET FAIRFIELD 2165, NSW, Australia               |NULL        |
|CLARENCE STREET SYDNEY 2000, NSW, Australia                |SYDNEY      |
+-----------------------------------------------------------++-----------+

有两个
数据帧
-

数据帧1:
包含完整地址的数据帧

数据框2:
数据框
包含基本数据-
邮政编码
地区
&
城市/城镇/郊区

df_City_Town_Suburb = sqlContext.createDataFrame([(2000,'Sydney','Dawes Point, Haymarket, Millers Point, Sydney, The Rocks'),
                                             (2001,'Sydney','Sydney'),(2113,'Sydney','North Ryde')],
                                             ('Postcode','District','City_Town_Suburb'))
df_City_Town_Suburb.show(truncate=False)

+--------+--------+--------------------------------------------------------+
|Postcode|District|City_Town_Suburb                                        |
+--------+--------+--------------------------------------------------------+
|2000    |Sydney  |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2001    |Sydney  |Sydney                                                  |
|2113    |Sydney  |North Ryde                                              |
+--------+--------+--------------------------------------------------------+
该问题的目的是从
DataFrame 2
中为
DataFrame 1
提取适当的
。虽然OP没有明确指定可以连接两个数据帧的
,但是
邮政编码
似乎只是合理的选择

# Importing requisite functions
from pyspark.sql.functions import col,regexp_extract,split,udf
from pyspark.sql.types import StringType
让我们将
数据帧1
创建为
df
。在这个
DataFrame
中,我们需要提取
Postcode
。在澳大利亚,所有邮政编码都是,因此我们使用从
字符串
列中提取4位数字

df = sqlContext.createDataFrame([('BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia ',),
                                 ('HAY STREET HAYMARKET 2000, NSW, Australia',),
                                 ('SMART STREET FAIRFIELD 2165, NSW, Australia',),
                                 ('CLARENCE STREET SYDNEY 2000, NSW, Australia',)],
                                 ('FullAddress',))
df = df.withColumn('Postcode', regexp_extract('FullAddress', "(\\d{4})" , 1 ))
df.show(truncate=False)
+---------------------------------------------+--------+
|FullAddress                                  |Postcode|
+---------------------------------------------+--------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |2113    |
|HAY STREET HAYMARKET 2000, NSW, Australia    |2000    |
|SMART STREET FAIRFIELD 2165, NSW, Australia  |2165    |
|CLARENCE STREET SYDNEY 2000, NSW, Australia  |2000    |
+---------------------------------------------+--------+
现在,我们已经提取了
邮政编码
,我们已经创建了
来连接两个
数据帧
。让我们创建
数据帧2
,我们需要从中提取相应的
数据帧

df_City_Town_Suburb = sqlContext.createDataFrame([(2000,'Sydney','Dawes Point, Haymarket, Millers Point, Sydney, The Rocks'),
                                             (2001,'Sydney','Sydney'),(2113,'Sydney','North Ryde')],
                                             ('Postcode','District','City_Town_Suburb'))
df_City_Town_Suburb.show(truncate=False)

+--------+--------+--------------------------------------------------------+
|Postcode|District|City_Town_Suburb                                        |
+--------+--------+--------------------------------------------------------+
|2000    |Sydney  |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2001    |Sydney  |Sydney                                                  |
|2113    |Sydney  |North Ryde                                              |
+--------+--------+--------------------------------------------------------+
将两个
数据帧
连接-

df = df.join(df_City_Town_Suburb.select('Postcode','City_Town_Suburb'), ['Postcode'],how='left')
df.show(truncate=False)
+--------+---------------------------------------------+--------------------------------------------------------+
|Postcode|FullAddress                                  |City_Town_Suburb                                        |
+--------+---------------------------------------------+--------------------------------------------------------+
|2113    |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |North Ryde                                              |
|2165    |SMART STREET FAIRFIELD 2165, NSW, Australia  |null                                                    |
|2000    |HAY STREET HAYMARKET 2000, NSW, Australia    |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2000    |CLARENCE STREET SYDNEY 2000, NSW, Australia  |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
+--------+---------------------------------------------+--------------------------------------------------------+
使用函数将列
City\u Town\u郊区
拆分为一个数组-

df = df.select('Postcode','FullAddress',split(col("City_Town_Suburb"), ",\s*").alias("City_Town_Suburb"))
df.show(truncate=False)
+--------+---------------------------------------------+----------------------------------------------------------+
|Postcode|FullAddress                                  |City_Town_Suburb                                          |
+--------+---------------------------------------------+----------------------------------------------------------+
|2113    |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |[North Ryde]                                              |
|2165    |SMART STREET FAIRFIELD 2165, NSW, Australia  |null                                                      |
|2000    |HAY STREET HAYMARKET 2000, NSW, Australia    |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
|2000    |CLARENCE STREET SYDNEY 2000, NSW, Australia  |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
+--------+---------------------------------------------+----------------------------------------------------------+
最后创建一个数组,检查数组的每个元素
City\u Town\u郊区
是否存在于列
FullAddress
中。如果存在一个,我们立即返回它,否则
None
返回

def suburb(FullAddress,City_Town_Suburb):
   # Check for the case where there is no Array, otherwise we will get an Error
   if City_Town_Suburb == None:
      return None
   # Checking each and every Array element if it exists in 'FullAddress',
   # and if a match is found, it's immediately returned.
   for sub in City_Town_Suburb:
      if sub.strip().upper() in FullAddress:
         return sub.upper()
   return None
suburb_udf = udf(suburb,StringType())
应用此自定义项

df = df.withColumn('suburb', suburb_udf(col('FullAddress'),col('City_Town_Suburb'))).drop('City_Town_Suburb')
df.show(truncate=False)
+--------+---------------------------------------------+----------+
|Postcode|FullAddress                                  |suburb    |
+--------+---------------------------------------------+----------+
|2113    |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |NORTH RYDE|
|2165    |SMART STREET FAIRFIELD 2165, NSW, Australia  |null      |
|2000    |HAY STREET HAYMARKET 2000, NSW, Australia    |HAYMARKET |
|2000    |CLARENCE STREET SYDNEY 2000, NSW, Australia  |SYDNEY    |
+--------+---------------------------------------------+----------+

您需要加入邮政编码上的数据帧吗?