Apache Spark: PySpark - using a list in the LIKE operator
I want to use a list inside the like operator in PySpark to create a column. I have the following input:
input_df:
+------+--------------------+-------+
|    ID|           customers|country|
+------+--------------------+-------+
|161   |xyz Limited         |U.K.   |
|262   |ABC Limited         |U.K.   |
|165   |Sons & Sons         |U.K.   |
|361   |TÜV GmbH            |Germany|
|462   |Mueller GmbH        |Germany|
|369   |Schneider AG        |Germany|
|467   |Sahm UG             |Austria|
+------+--------------------+-------+
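I want to add a column CAT_ID. If "ID" contains '16' or '26', CAT_ID takes the value 1. If "ID" contains '36' or '46', CAT_ID takes the value 2. So I would like my output df to look like this:
desired output_df: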
+------+--------------------+-------+-------+
|    ID|           customers|country|Cat_ID |
+------+--------------------+-------+-------+
|161   |xyz Limited         |U.K.   |1      |
|262   |ABC Limited         |U.K.   |1      |
|165   |Sons & Sons         |U.K.   |1      |
|361   |TÜV GmbH            |Germany|2      |
|462   |Mueller GmbH        |Germany|2      |
|369   |Schneider AG        |Germany|2      |
|467   |Sahm UG             |Austria|2      |
+------+--------------------+-------+-------+
I am interested in learning how to do this with the LIKE statement and lists.
I know how to implement it without lists, and this works fine:
from pyspark.sql import functions as F

def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID',
        F.when(((F.col('ID').like('16%')) | (F.col('ID').like('26%'))), "1") \
         .when(((F.col('ID').like('36%')) | (F.col('ID').like('46%'))), "2") \
         .otherwise('999')
    )

output_df = add_CAT_ID(input_df)
However, I would like to use lists and have something like the following (pseudocode, this does not run):
list1 = ['16', '26']
list2 = ['36', '46']

def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID',
        F.when(((F.col('ID').like(list1 %))), "1") \
         .when(((F.col('ID').like('list2 %'))), "2") \
         .otherwise('999')
    )

output_df = add_CAT_ID(input_df)
Thanks a lot in advance,
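SQL wildcards do not support "or" clauses. There are several ways to handle it, though.
1. Regular expressions
You can use rlike with a regular expression:
import pyspark.sql.functions as psf

list1 = ['16', '26']
list2 = ['36', '46']

# Raw strings keep the backslash in \d intact
df.withColumn(
    'CAT_ID',
    psf.when(psf.col('ID').rlike(r'({})\d'.format('|'.join(list1))), '1') \
        .when(psf.col('ID').rlike(r'({})\d'.format('|'.join(list2))), '2') \
        .otherwise('999')) \
    .show()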
+---+------------+-------+------+
|ID |customers   |country|CAT_ID|
+---+------------+-------+------+
|161|xyz Limited |U.K.   |1     |
|262|ABC Limited |U.K.   |1     |
|165|Sons & Sons |U.K.   |1     |
|361|TÜV GmbH    |Germany|2     |
|462|Mueller GmbH|Germany|2     |
|369|Schneider AG|Germany|2     |
|467|Sahm UG     |Austria|2     |
+---+------------+-------+------+
Here, for list1 we get the regular expression (16|26)\d, which matches 16 or 26 followed by a digit (\d is equivalent to [0-9]).
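To see what the generated pattern accepts without starting Spark, here is a minimal sketch using Python's own re module. It is only an approximation: Spark's rlike runs on Java's regex engine, but for a pattern this simple the two should agree. The helper names build_pattern and categorize are made up for illustration. Like rlike, re.search looks for a match anywhere in the string, so the pattern is not anchored to the start the way LIKE '16%' is.

```python
import re

list1 = ['16', '26']
list2 = ['36', '46']

def build_pattern(prefixes):
    # Hypothetical helper: escape each value so regex metacharacters are
    # treated literally, then join the alternatives with '|'.
    return r'({})\d'.format('|'.join(re.escape(p) for p in prefixes))

pattern1 = build_pattern(list1)
pattern2 = build_pattern(list2)

def categorize(id_value):
    # Mirrors the when/when/otherwise chain: first matching rule wins.
    if re.search(pattern1, id_value):
        return '1'
    if re.search(pattern2, id_value):
        return '2'
    return '999'

ids = ['161', '262', '165', '361', '462', '369', '467']
categories = {i: categorize(i) for i in ids}
print(pattern1)
print(categories)
```

If prefix-only matching is what you need, prepending ^ to the pattern should restrict the match to the start of the ID. Also note that the trailing \d means an ID that is exactly '16' does not match.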
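2. Dynamically building the SQL clause
If you want to keep the SQL-like syntax, you can use selectExpr and chain the values with ' OR ':
df.selectExpr(
    '*',
    "CASE WHEN ({}) THEN '1' WHEN ({}) THEN '2' ELSE '999' END AS CAT_ID"
        .format(*[' OR '.join(["ID LIKE '{}%'".format(x) for x in l]) for l in [list1, list2]]))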
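The nested comprehension is easier to check if you build the string separately and print it before handing it to selectExpr. The sketch below does only the string building, so it runs without Spark. The helper name like_any is an assumption for illustration, not part of any Spark API.

```python
list1 = ['16', '26']
list2 = ['36', '46']

def like_any(column, prefixes):
    # Hypothetical helper: "<column> LIKE '<p>%'" for each prefix, joined with OR.
    return ' OR '.join("{} LIKE '{}%'".format(column, p) for p in prefixes)

case_expr = (
    "CASE WHEN ({}) THEN '1' WHEN ({}) THEN '2' ELSE '999' END AS CAT_ID"
    .format(like_any('ID', list1), like_any('ID', list2))
)
print(case_expr)
```

The resulting string can then be passed as df.selectExpr('*', case_expr). Because the values are interpolated straight into SQL text, this is only reasonable for lists you control yourself.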
3. Dynamically building the Python expression
If you do not want to write SQL, you can also use eval:
df.withColumn(
    'CAT_ID',
    psf.when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list1])), '1')
        .when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list2])), '2')
        .otherwise('999'))
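As an alternative to eval, the same OR chain can be folded together with functools.reduce, which avoids executing generated source text. This is a sketch rather than part of the original answer: like_any and add_cat_id are hypothetical names, and the helper only assumes that the column object has a like method and supports the | operator, as PySpark Column objects do.

```python
from functools import reduce
from operator import or_

def like_any(column, prefixes):
    # OR together column.like('<prefix>%') for every prefix.
    # Raises TypeError on an empty list, since there is nothing to combine.
    return reduce(or_, (column.like(p + '%') for p in prefixes))

def add_cat_id(df, list1, list2):
    # Imported lazily so like_any stays importable without Spark installed.
    from pyspark.sql import functions as F
    return df.withColumn(
        'CAT_ID',
        F.when(like_any(F.col('ID'), list1), '1')
         .when(like_any(F.col('ID'), list2), '2')
         .otherwise('999'))
```

Usage would then be output_df = add_cat_id(input_df, ['16', '26'], ['36', '46']).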
From Spark 2.4 onwards, you can use higher-order functions in Spark SQL.
Try the one below; the SQL solution is the same for Scala and Python.
val df = Seq(
  ("161","xyz Limited","U.K."),
  ("262","ABC Limited","U.K."),
  ("165","Sons & Sons","U.K."),
  ("361","TÜV GmbH","Germany"),
  ("462","Mueller GmbH","Germany"),
  ("369","Schneider AG","Germany"),
  ("467","Sahm UG","Germany")
).toDF("ID","customers","country")
df.show(false)
df.createOrReplaceTempView("secil")
spark.sql(
  """ with t1 ( select id, customers, country, array('16','26') as a1, array('36','46') as a2 from secil),
      t2 (select id, customers, country, filter(a1, x -> id like x||'%') a1f, filter(a2, x -> id like x||'%') a2f from t1),
      t3 (select id, customers, country, a1f, a2f,
            case when size(a1f) > 0 then 1 else 0 end a1r,
            case when size(a2f) > 0 then 2 else 0 end a2r
          from t2)
      select id, customers, country, a1f, a2f, a1r, a2r, a1r+a2r as Cat_ID from t3
  """).show(false)
Result:
+---+------------+-------+
|ID |customers   |country|
+---+------------+-------+
|161|xyz Limited |U.K.   |
|262|ABC Limited |U.K.   |
|165|Sons & Sons |U.K.   |
|361|TÜV GmbH    |Germany|
|462|Mueller GmbH|Germany|
|369|Schneider AG|Germany|
|467|Sahm UG     |Germany|
+---+------------+-------+
+---+------------+-------+----+----+---+---+------+
|id |customers   |country|a1f |a2f |a1r|a2r|Cat_ID|
+---+------------+-------+----+----+---+---+------+
|161|xyz Limited |U.K.   |[16]|[]  |1  |0  |1     |
|262|ABC Limited |U.K.   |[26]|[]  |1  |0  |1     |
|165|Sons & Sons |U.K.   |[16]|[]  |1  |0  |1     |
|361|TÜV GmbH    |Germany|[]  |[36]|0  |2  |2     |
|462|Mueller GmbH|Germany|[]  |[46]|0  |2  |2     |
|369|Schneider AG|Germany|[]  |[36]|0  |2  |2     |
|467|Sahm UG     |Germany|[]  |[46]|0  |2  |2     |
+---+------------+-------+----+----+---+---+------+
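Thank you very much! I used the regular expression and it works very well. I have one more question, about how to use a list with the like operator for an AND condition. When I change | to &, I do not get what I want:
list1 = ['6', '2']
list2 = ['6', '4']
df = df.withColumn('CAT_ID', psf.when(psf.col('ID').rlike('({})\d'.format('&'.join(list1))), '1').when(psf.col('ID').rlike('({})\d'.format('&'.join(list2))), '2').otherwise('999'))
I am trying to implement a rule like this: if "ID" contains '6' and '2', CAT_ID takes the value 1. If "ID" contains '6' or '4', CAT_ID takes the value 2. For ID=262 it does not give CAT_ID=1; instead it gives CAT_ID=999. In the output df, every row gets CAT_ID=999.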
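A likely explanation for the all-999 result: in a regular expression & is not an operator, so '&'.join(list1) produces a pattern that looks for the literal text 6&2 followed by a digit, which none of the IDs contain. One way to express "contains all of these" in a single pattern is a lookahead per value. The sketch below checks the idea with Python's re module. The helper names are hypothetical, and Spark's rlike uses Java's regex engine, which also supports lookaheads, so the same patterns should carry over, though that is untested here.

```python
import re

def contains_all(values):
    # One lookahead per value: each must occur somewhere in the string.
    return ''.join('(?=.*{})'.format(re.escape(v)) for v in values)

def contains_any(values):
    # Plain alternation: at least one value must occur.
    return '|'.join(re.escape(v) for v in values)

list1 = ['6', '2']
list2 = ['6', '4']

and_pattern = contains_all(list1)  # '(?=.*6)(?=.*2)'
or_pattern = contains_any(list2)   # '6|4'

# The pattern from the comment: '&' is matched literally, so nothing hits.
broken_pattern = r'({})\d'.format('&'.join(list1))

def categorize(id_value):
    # Rule 1 (all of list1) is checked before rule 2 (any of list2).
    if re.search(and_pattern, id_value):
        return '1'
    if re.search(or_pattern, id_value):
        return '2'
    return '999'

ids = ['161', '262', '165', '361', '462', '369', '467']
results = {i: categorize(i) for i in ids}
broken_hits = [i for i in ids if re.search(broken_pattern, i)]
print(results)
print(broken_hits)
```

In PySpark the patterns would be used as psf.col('ID').rlike(and_pattern) and psf.col('ID').rlike(or_pattern) inside the same when/otherwise chain.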