Apache Spark: searching for a substring across multiple columns


apache-spark pyspark


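I am trying to find a substring across all columns of a Spark DataFrame using PySpark. I already know how to search for a substring in a single column with filter and contains: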

df.filter(df.col_name.contains('substring'))

How can I extend this statement, or use a different one, to search for substring matches across multiple columns?

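You can generalize the statement to filter on all columns at once:

from pyspark.sql.functions import col, when
# Replace every value that does not contain the substring with NULL, then drop
# all rows that have a NULL, so only rows where every column matches remain.
df = df.select([when(col(c).contains('substring'), col(c)).alias(c) for c in df.columns]).na.drop()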

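Or

You can simply loop over the columns and apply the same filter to each one. Because the filters are chained, a row is kept only if every column contains the substring:

# Use a loop variable that does not shadow pyspark.sql.functions.col.
for c in df.columns:
    df = df.filter(df[c].contains("substring"))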
