Searching for a substring across multiple columns in Apache Spark


I am trying to find a substring across all of the columns of a Spark DataFrame using PySpark. I currently know how to search a single column for a substring with filter and contains:

df.filter(df.col_name.contains('substring'))


How can I extend this statement, or use a different one, to search for a substring match across multiple columns?

You can generalize the statement into a single pass over all of the columns at once:

from pyspark.sql.functions import col, when
# Replace every value that does not contain the substring with NULL,
# then drop any row containing a NULL, i.e. keep only rows where
# every column contains the substring.
df = df.select([when(col(c).contains('substring'), col(c)).alias(c) for c in df.columns]).na.drop()
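
A minimal, runnable sketch of how this behaves (the SparkSession setup, sample rows, and column names a and b are hypothetical, added only for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample data: two string columns.
df = spark.createDataFrame(
    [("a substring here", "substring too"), ("substring", "no match")],
    ["a", "b"],
)
# Non-matching values become NULL; na.drop() (default how='any') then
# removes the second row, whose column b does not contain the substring.
df.select(
    [when(col(c).contains("substring"), col(c)).alias(c) for c in df.columns]
).na.drop().show()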

Alternatively, you can simply loop over the columns and apply the same filter to each one:

# Avoid naming the loop variable `col`, which would shadow pyspark.sql.functions.col.
for c in df.columns:
    df = df.filter(df[c].contains("substring"))
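
Note that successive filters AND together, so this chained form also keeps only rows where every column contains the substring. If you instead want rows where any column matches, one common idiom (a sketch, not part of the answer above) is to OR the per-column conditions into a single predicate with functools.reduce:

from functools import reduce
from operator import or_
from pyspark.sql.functions import col

# Build one combined condition: col_1 contains OR col_2 contains OR ...
any_match = reduce(or_, [col(c).contains("substring") for c in df.columns])
df = df.filter(any_match)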
