Apache Spark: PySpark - using a list in the LIKE operator

apache-spark pyspark

I want to use a list in the LIKE operator in PySpark to create a column.

I have the following input:

input_df:

+------+--------------------+-------+
|    ID|           customers|country|
+------+--------------------+-------+
|161   |xyz Limited         |U.K.   |
|262   |ABC  Limited        |U.K.   |
|165   |Sons & Sons         |U.K.   |
|361   |TÜV GmbH            |Germany|
|462   |Mueller GmbH        |Germany|
|369   |Schneider AG        |Germany|
|467   |Sahm UG             |Austria|
+------+--------------------+-------+

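I want to add a column CAT_ID. If "ID" contains "16" or "26", CAT_ID takes the value 1. If "ID" contains "36" or "46", CAT_ID takes the value 2.
So I would like my output df to look like this:

Desired output_df: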

+------+--------------------+-------+-------+
|    ID|           customers|country|Cat_ID |
+------+--------------------+-------+-------+
|161   |xyz Limited         |U.K.   |1      |
|262   |ABC  Limited        |U.K.   |1      |
|165   |Sons & Sons         |U.K.   |1      |
|361   |TÜV GmbH            |Germany|2      |
|462   |Mueller GmbH        |Germany|2      |
|369   |Schneider AG        |Germany|2      |
|467   |Sahm UG             |Austria|2      |
+------+--------------------+-------+-------+

I am interested in learning how to do this with a LIKE statement and a list.

I know how to implement it without a list, and this works perfectly:

from pyspark.sql import functions as F

def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID', 
        F.when( ( (F.col('ID').like('16%')) | (F.col('ID').like('26%'))  ) , "1") \
         .when( ( (F.col('ID').like('36%')) | (F.col('ID').like('46%'))  ) , "2") \
         .otherwise('999')
    )


output_df = add_CAT_ID(input_df)

However, I would like to use lists instead, with something like this (pseudocode, not valid Python):

list1 =['16', '26']
list2 =['36', '46']


def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID', 
        F.when( ( (F.col('ID').like(list1 %))  ) , "1") \
         .when( ( (F.col('ID').like('list2 %'))  ) , "2") \
         .otherwise('999')
    )


output_df = add_CAT_ID(input_df)

Many thanks in advance.

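SQL wildcards do not support an "or" clause. There are several ways to handle this, though.

1. Regular expressions

You can use rlike with a regular expression: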

import pyspark.sql.functions as psf

list1 = ['16', '26']
list2 = ['36', '46']

df.withColumn(
    'CAT_ID',
    psf.when(psf.col('ID').rlike(r'({})\d'.format('|'.join(list1))), '1') \
        .when(psf.col('ID').rlike(r'({})\d'.format('|'.join(list2))), '2') \
        .otherwise('999')) \
    .show()

+---+------------+-------+------+
| ID|   customers|country|CAT_ID|
+---+------------+-------+------+
|161| xyz Limited|   U.K.|     1|
|262|ABC  Limited|   U.K.|     1|
|165| Sons & Sons|   U.K.|     1|
|361|    TÜV GmbH|Germany|     2|
|462|Mueller GmbH|Germany|     2|
|369|Schneider AG|Germany|     2|
|467|     Sahm UG|Austria|     2|
+---+------------+-------+------+

Here, for list1, we get the regular expression (16|26)\d, which matches 16 or 26 followed by a digit (\d is equivalent to [0-9]).
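One caveat worth checking: rlike looks for the pattern anywhere in the string, so (16|26)\d would also match an ID such as 5161, which LIKE '16%' would not. Below is a minimal sketch of an anchored variant, assuming the IDs are plain strings and the prefixes contain no regex metacharacters. The helper names prefix_pattern and categorize are hypothetical, and Python's re module stands in for Spark so that the snippet runs without a cluster; the pattern string it builds could be passed to rlike in the same way as above.

```python
import re

list1 = ['16', '26']
list2 = ['36', '46']


def prefix_pattern(prefixes):
    # '^' anchors the alternation at the start of the string,
    # which mirrors the semantics of LIKE '16%'.
    return '^({})'.format('|'.join(prefixes))


def categorize(id_value):
    # Same rule order as the when/otherwise chain: list1, then list2, then default.
    if re.search(prefix_pattern(list1), id_value):
        return '1'
    if re.search(prefix_pattern(list2), id_value):
        return '2'
    return '999'


for id_value in ['161', '262', '361', '467', '5161']:
    print(id_value, categorize(id_value))
```

With the unanchored pattern, '5161' would fall into category 1; with the anchored one it gets the default value.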

2. Dynamically building the SQL clause

If you want to stay in SQL, you can use selectExpr and chain the conditions with ' OR ':

df.selectExpr(
    '*',
    "CASE WHEN ({}) THEN '1' WHEN ({}) THEN '2' ELSE '999' END AS CAT_ID"
        .format(*[' OR '.join(["ID LIKE '{}%'".format(x) for x in l]) for l in [list1, list2]]))
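To see what string actually reaches selectExpr, it can help to build the clause separately and print it. This is only a sketch: like_any and case_expr are hypothetical helper names, and it assumes the prefixes come from trusted code, since they are pasted into the SQL text without escaping.

```python
list1 = ['16', '26']
list2 = ['36', '46']


def like_any(column, prefixes):
    # "ID LIKE '16%' OR ID LIKE '26%'"
    return ' OR '.join("{} LIKE '{}%'".format(column, p) for p in prefixes)


def case_expr(column, rules, default, alias):
    # rules is a list of (prefixes, value) pairs, checked in order.
    whens = ' '.join(
        "WHEN ({}) THEN '{}'".format(like_any(column, prefixes), value)
        for prefixes, value in rules
    )
    return "CASE {} ELSE '{}' END AS {}".format(whens, default, alias)


expr = case_expr('ID', [(list1, '1'), (list2, '2')], '999', 'CAT_ID')
print(expr)
# CASE WHEN (ID LIKE '16%' OR ID LIKE '26%') THEN '1' WHEN (ID LIKE '36%' OR ID LIKE '46%') THEN '2' ELSE '999' END AS CAT_ID
```

The resulting string would then be used as df.selectExpr('*', expr).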

3. Dynamically building a Python expression

If you do not want to write SQL, you can also use eval:

df.withColumn(
    'CAT_ID',
    psf.when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list1])), '1')
        .when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list2])), '2')
        .otherwise('999'))
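If evaluating generated source code feels uncomfortable, the same OR-chain can probably be built without eval by folding the conditions with functools.reduce. The sketch below is an assumption-laden illustration rather than part of the original answer: any_of is a hypothetical helper, and plain booleans stand in for Column objects (both support the | operator) so the snippet runs without Spark.

```python
from functools import reduce
from operator import or_


def any_of(make_condition, values):
    # OR together make_condition(v) for every v in values.
    # Note: reduce raises TypeError on an empty list, so values must be non-empty.
    return reduce(or_, (make_condition(v) for v in values))


# With PySpark (not imported here), the call would look like:
#   any_of(lambda p: psf.col('ID').like(p + '%'), list1)
# and the result could be passed straight to psf.when(...).
list1 = ['16', '26']
print(any_of(lambda p: '262'.startswith(p), list1))
print(any_of(lambda p: '999'.startswith(p), list1))
```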

From Spark 2.4 onward, you can use higher-order functions in Spark SQL.

Try the example below; the SQL solution is the same for Scala and Python.

val df = Seq(
  ("161","xyz Limited","U.K."),
  ("262","ABC  Limited","U.K."),
  ("165","Sons & Sons","U.K."),
  ("361","TÜV GmbH","Germany"),
  ("462","Mueller GmbH","Germany"),
  ("369","Schneider AG","Germany"),
  ("467","Sahm UG","Germany")
).toDF("ID","customers","country")

df.show(false)
df.createOrReplaceTempView("secil")
spark.sql(
  """ with t1 ( select id, customers, country, array('16','26') as a1, array('36','46') as a2 from secil),
     t2 (select id, customers, country,  filter(a1, x -> id like x||'%') a1f,  filter(a2, x -> id like x||'%') a2f from t1),
     t3 (select id, customers, country, a1f, a2f,
               case when size(a1f) > 0 then 1 else 0 end a1r,
               case when size(a2f) > 0 then 2 else 0 end a2r
               from t2)
     select id, customers, country, a1f, a2f, a1r, a2r, a1r+a2r as Cat_ID from t3
  """).show(false)

Result:

+---+------------+-------+
|ID |customers   |country|
+---+------------+-------+
|161|xyz Limited |U.K.   |
|262|ABC  Limited|U.K.   |
|165|Sons & Sons |U.K.   |
|361|TÜV GmbH    |Germany|
|462|Mueller GmbH|Germany|
|369|Schneider AG|Germany|
|467|Sahm UG     |Germany|
+---+------------+-------+

+---+------------+-------+----+----+---+---+------+
|id |customers   |country|a1f |a2f |a1r|a2r|Cat_ID|
+---+------------+-------+----+----+---+---+------+
|161|xyz Limited |U.K.   |[16]|[]  |1  |0  |1     |
|262|ABC  Limited|U.K.   |[26]|[]  |1  |0  |1     |
|165|Sons & Sons |U.K.   |[16]|[]  |1  |0  |1     |
|361|TÜV GmbH    |Germany|[]  |[36]|0  |2  |2     |
|462|Mueller GmbH|Germany|[]  |[46]|0  |2  |2     |
|369|Schneider AG|Germany|[]  |[36]|0  |2  |2     |
|467|Sahm UG     |Germany|[]  |[46]|0  |2  |2     |
+---+------------+-------+----+----+---+---+------+

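Thank you very much! I used the regular expression and it works very well. I have one more question, about how to use a list with the LIKE operator for an AND condition. When I change '|'.join to '&'.join, I do not get what I want:

list1 = ['6', '2']
list2 = ['6', '4']

df = df.withColumn(
    'CAT_ID',
    psf.when(psf.col('ID').rlike(r'({})\d'.format('&'.join(list1))), '1')
        .when(psf.col('ID').rlike(r'({})\d'.format('&'.join(list2))), '2')
        .otherwise('999'))

I am trying to implement a rule like this: if "ID" contains "6" and "2", CAT_ID takes the value 1. If "ID" contains "6" or "4", CAT_ID takes the value 2. For ID=262 I do not get CAT_ID=1; I get CAT_ID=999 instead. In the output df, every row has CAT_ID=999.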
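A likely reason every row ends up as 999: in a regular expression, & has no special meaning, so (6&2)\d searches for the literal text "6&2" followed by a digit, which none of the IDs contain. One way to express "contains all of these" in a single pattern is a chain of lookaheads. The following is a minimal sketch, assuming the tokens are plain digits with no regex metacharacters; all_of_pattern, any_of_pattern and cat_id are hypothetical names, and Python's re module stands in for Spark so that it runs locally. Spark's rlike uses Java regular expressions, which also support lookaheads, so the same pattern strings should carry over.

```python
import re

list1 = ['6', '2']   # rule 1: ID contains 6 AND 2
list2 = ['6', '4']   # rule 2: ID contains 6 OR 4


def all_of_pattern(tokens):
    # One lookahead per token: each token must appear somewhere in the string.
    return ''.join('(?=.*{})'.format(t) for t in tokens)


def any_of_pattern(tokens):
    # Plain alternation: at least one token must appear.
    return '|'.join(tokens)


def cat_id(id_value):
    if re.search(all_of_pattern(list1), id_value):
        return '1'
    if re.search(any_of_pattern(list2), id_value):
        return '2'
    return '999'


for id_value in ['262', '161', '369', '377']:
    print(id_value, cat_id(id_value))
```

In PySpark the two pattern strings would go into rlike inside the same when/otherwise chain as before. Alternatively, separate conditions can be combined with the & operator on Column objects instead of inside the regex.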