Apache Spark: executing functions on multiple columns of a PySpark DataFrame


I have to apply certain functions to multiple columns in a PySpark DataFrame. Below is my code:

from pyspark.sql.functions import regexp_replace

finaldf = df.withColumn('phone_number', regexp_replace("phone_number", "[^0-9]", "")) \
    .withColumn('account_id', regexp_replace("account_id", "[^0-9]", "")) \
    .withColumn('credit_card_limit', regexp_replace("credit_card_limit", "[^0-9]", "")) \
    .withColumn('credit_card_number', regexp_replace("credit_card_number", "[^0-9]", "")) \
    .withColumn('full_name', regexp_replace("full_name", "[^a-zA-Z ]", "")) \
    .withColumn('transaction_code', regexp_replace("transaction_code", "[^a-zA-Z]", "")) \
    .withColumn('shop', regexp_replace("shop", "[^a-zA-Z ]", ""))

finaldf=finaldf.filter(finaldf.account_id.isNotNull())\
    .filter(finaldf.phone_number.isNotNull())\
    .filter(finaldf.credit_card_number.isNotNull())\
    .filter(finaldf.credit_card_limit.isNotNull())\
    .filter(finaldf.transaction_code.isNotNull())\
    .filter(finaldf.amount.isNotNull())
As you can see, the code is redundant and makes the program longer than it needs to be. I have also read that Spark UDFs are not very efficient.


Is there a way to optimize this code? Please let me know. Thanks a lot.

For the multiple filters, you should do this:

filter_cols = ['account_id', 'phone_number', 'credit_card_number', 'credit_card_limit', 'transaction_code', 'amount']

finaldf.filter(' and '.join([x + ' is not null' for x in filter_cols]))
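As a quick sanity check (using the same filter_cols list defined above), printing the joined string shows the single SQL predicate that filter receives:

expr = ' and '.join([x + ' is not null' for x in filter_cols])
print(expr)
# account_id is not null and phone_number is not null and credit_card_number is not null and credit_card_limit is not null and transaction_code is not null and amount is not null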

You can also put all of these conditions into a single statement:
.filter(finaldf.account_id.isNotNull() & finaldf.phone_number.isNotNull() & …
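If you prefer Column expressions over a SQL string, a minimal sketch (assuming the same filter_cols list from the answer above) folds the isNotNull() conditions into one boolean Column with functools.reduce:

from functools import reduce
from pyspark.sql import functions as F

# AND together one isNotNull() condition per column into a single Column
not_null_cond = reduce(lambda a, b: a & b, [F.col(c).isNotNull() for c in filter_cols])
finaldf = finaldf.filter(not_null_cond)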
OK, thanks! Is there also a way to do all of these common operations with a single loop? Thanks a lot for your help, I got the result. Does the ' and ' with that 'x + ' end up applying multiple filter conditions?

Yes, you can just print ' and '.join([x + ' is not null' for x in filter_cols]) to check the SQL expression it produces.

Thanks a lot!
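As for the single-loop question raised in the comments: a minimal sketch that also covers the regexp_replace step, assuming a hypothetical clean_patterns dict mapping each column to the characters to strip (the column/pattern pairs come from the original code):

from pyspark.sql.functions import regexp_replace

# Hypothetical mapping: column name -> regex of characters to remove
clean_patterns = {
    'phone_number': '[^0-9]',
    'account_id': '[^0-9]',
    'credit_card_limit': '[^0-9]',
    'credit_card_number': '[^0-9]',
    'full_name': '[^a-zA-Z ]',
    'transaction_code': '[^a-zA-Z]',
    'shop': '[^a-zA-Z ]',
}

finaldf = df
for col_name, pattern in clean_patterns.items():
    # Overwrite each column with its cleaned value
    finaldf = finaldf.withColumn(col_name, regexp_replace(col_name, pattern, ''))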