Python regex to find all strings in a PySpark DataFrame column that contain neither _ (underscore) nor : (colon)


python python-3.x regex apache-spark pyspark


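I have a column named "tags" in my DataFrame. I need to extract values based on a condition: a value should not contain _ (underscore) or : (colon).

For example:

"tags": "hai, hello, amount_10, amount_90, total:100"

Expected result:

"new_column": "hai, hello"

FYI:

I have already extracted all the amount tags: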

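import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
# For every "amount_<value>" tag in the string, keep the part after the underscore
collectAmount = udf(lambda s: list(map(lambda amount: amount.split('_')[1] if len(amount) > 0
                        else amount, re.findall(r'(amount_\w+)', s))), ArrayType(StringType()))

productsDF = productsDF.withColumn('amount_tag', collectAmount('tags'))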

You don't really need a regex:

tags = ["hai", "hello", "amount_10", "amount_90", "total:100"]

new_column = [tag for tag in tags if not any(junk in tag for junk in ["_", ":"])]
print(new_column)

If you insist on using a regex:

import re
rx = re.compile(r'^(?!.*_)(?!.*:).+$')
new_column = [tag for tag in tags if rx.match(tag)]
print(new_column)
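The question's column holds one comma-separated string per row rather than a Python list, so the filter has to split the string first. Below is a minimal sketch in plain Python; the function name keep_plain_tags is hypothetical, and wrapping it in a udf to apply it to the DataFrame is left as an assumption rather than shown.

```python
import re

# Matches a tag that contains neither "_" nor ":" (same pattern as above)
PLAIN_TAG = re.compile(r'^(?!.*_)(?!.*:).+$')

def keep_plain_tags(tags_string):
    """Return the comma-separated tags that contain neither '_' nor ':'."""
    # Split on commas and trim the space that follows each comma
    tags = [tag.strip() for tag in tags_string.split(",")]
    # Keep non-empty tags that match the pattern, then join them back
    return ", ".join(tag for tag in tags if tag and PLAIN_TAG.match(tag))

print(keep_plain_tags("hai, hello, amount_10, amount_90, total:100"))  # hai, hello
```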


You can use the regex from the answer above, but you would need to wrap it in a udf, or, as I show below, use a pyspark built-in:

from pyspark.sql import functions as F

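# "extracted" holds the first "_" or ":" found in tags, or '' if there is none
df = df.withColumn("extracted", F.regexp_extract("tags", "[_:]", 0))
# Keep only the rows whose tags contain neither character
df.filter(df["extracted"] == '').select("tags").show()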
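Note that this filter works on whole rows: a row is dropped as soon as any of its tags contains "_" or ":". The plain-Python sketch below mimics that behavior on two sample rows; the helper name first_underscore_or_colon and the sample rows are made up for illustration, and the helper only imitates the fact that regexp_extract returns an empty string when nothing matches.

```python
import re

def first_underscore_or_colon(s):
    """Mimic regexp_extract(s, '[_:]', 0): first match, or '' when there is none."""
    m = re.search(r'[_:]', s)
    return m.group(0) if m else ''

rows = ["hai, hello", "hai, hello, amount_10, amount_90, total:100"]
# Same test as the filter above: keep rows where nothing was extracted
kept = [row for row in rows if first_underscore_or_colon(row) == '']
print(kept)  # ['hai, hello']
```

The second row is removed entirely instead of being reduced to "hai, hello", so a per-tag approach is needed when the goal is to keep part of each row.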

Try this:

from pyspark.sql.functions import expr

df.withColumn('new_column',expr('''concat_ws(',',array_remove(transform(split(tags,','), x -> regexp_extract(x,'^(?!.*_)(?!.*:).+$',0)),''))''')).show(2,False)

+-------------------------------------------+----------+
|tags                                       |new_column|
+-------------------------------------------+----------+
|hai, hello, amount_10, amount_90, total:100|hai, hello|
|hai, hello, amount_10, amount_90, total:100|hai, hello|
+-------------------------------------------+----------+

Should we split the words on "," first?

What is the type of the tags column? Can you create one?

Yes, @Ankur, you are correct. The tags column is of type string, @cronoik.

Hello @Jan. It is not a plain Python list; it is a PySpark DataFrame.

It does not give the exact result, @ags29.

If you can provide a sample, I will modify my answer.

Perfect, @Shubham Jain.

Use a simpler regex:

regexp_extract(x, '^[^_:]+$', 0)
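As a quick check, assuming the simpler pattern suggested above is ^[^_:]+$, the sketch below compares it with the lookahead pattern using Python's re module on the sample tags. The tags keep their leading spaces, as they would after splitting on "," alone. This only shows agreement on these samples, not equivalence in general.

```python
import re

lookahead = re.compile(r'^(?!.*_)(?!.*:).+$')
char_class = re.compile(r'^[^_:]+$')

# Tags as they look after splitting the sample string on "," only
tags = ["hai", " hello", "amount_10", " amount_90", "total:100"]
for tag in tags:
    # Both patterns should accept or reject each sample tag alike
    assert bool(lookahead.match(tag)) == bool(char_class.match(tag))
print([tag for tag in tags if char_class.match(tag)])  # ['hai', ' hello']
```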