A Python framework for generating executable code from business rules in PySpark


I have a DataFrame df with sample values as shown below.

from pyspark.sql.types import DateType, LongType, StringType, StructType, StructField,BooleanType
import os
import pyspark.sql.functions as F
import datetime
from pyspark.sql import DataFrame
from pyspark.sql.types import StringType,IntegerType,ArrayType
from pyspark.sql import Row

# sc and sqlContext are assumed to be available (e.g. in a PySpark shell)
l = [('test',1,0,1,0),('prod',0,1,0,1),('local',1,0,1,0)]
rdd = sc.parallelize(l)
sdf = rdd.map(lambda x: Row(col1=x[0], col2=int(x[1]), col3=int(x[2]), col4=int(x[3]), col5=int(x[4])))
df = sqlContext.createDataFrame(sdf)

+-----+----+----+----+----+
| col1|col2|col3|col4|col5|
+-----+----+----+----+----+
| test|   1|   0|   1|   0|
| prod|   0|   1|   0|   1|
|local|   1|   0|   1|   0|
+-----+----+----+----+----+
There are also some business rules, shown below. So far these are kept as metadata in a dictionary (the rule metadata could also be stored as agg_level, agg_function, transformation, source, source columns, etc.); a rough sketch of such a dictionary follows.
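For illustration only, such a rules dictionary might look like the sketch below; the key names (source_column, transformation) are placeholders based on the description above, not something from the original post.

# hypothetical metadata layout; key names are placeholders
rules_metadata = {
    "col6": {"source_column": "col2", "transformation": "case when col2 > 0 then 1 else 0 end"},
    "col7": {"source_column": "col3", "transformation": "case when col3 > 0 then 1 else 0 end"},
    "col8": {"source_column": "col4", "transformation": "case when col4 > 0 then 1 else 0 end"},
    "col9": {"source_column": "col5", "transformation": "case when col5 > 0 then 1 else 0 end"}
}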

I want to create a function, say df_extract(), that dynamically generates executable code as shown below. It should return the following query to be executed (not a DataFrame).

When called with three features, only those three features should be present in the returned query, and so on:

df1 = df_extract(df, 'col6', 'col7', 'col8')
df1 = df.filter('col1 = "test"') \
    .withColumn('col6', F.when(F.col('col2') > 0, F.lit(1)).otherwise(F.lit(0))) \
    .withColumn('col7', F.when(F.col('col3') > 0, F.lit(1)).otherwise(F.lit(0))) \
    .withColumn('col8', F.when(F.col('col4') > 0, F.lit(1)).otherwise(F.lit(0)))
Finally, if no features are passed, all features should be present in the expression/query:

df1 = df_extract(df)
df1 = df.filter('col1 = "test"') \
    .withColumn('col6', F.when(F.col('col2') > 0, F.lit(1)).otherwise(F.lit(0))) \
    .withColumn('col7', F.when(F.col('col3') > 0, F.lit(1)).otherwise(F.lit(0))) \
    .withColumn('col8', F.when(F.col('col4') > 0, F.lit(1)).otherwise(F.lit(0))) \
    .withColumn('col9', F.when(F.col('col5') > 0, F.lit(1)).otherwise(F.lit(0)))
Is this possible at all, at least by creating SQL tables in PySpark? N such transformation rules will be associated with each DataFrame, and the function should be able to return the definition dynamically; a sketch of the plain SQL route is shown below.
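As a point of reference for the SQL-table route, one minimal sketch would register the DataFrame as a temporary view and assemble the SELECT from the rule strings; this assumes a SparkSession named spark and is not the requested df_extract() itself.

# minimal sketch of the temp-view route; 'spark' is assumed to be the active SparkSession
df.createOrReplaceTempView("df_view")
rule_exprs = [
    "case when col2 > 0 then 1 else 0 end as col6",
    "case when col3 > 0 then 1 else 0 end as col7"
]
query = "select {} from df_view where col1 = 'test'".format(", ".join(rule_exprs))
df1 = spark.sql(query)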


I tried to come up with a solution for it.

One solution I came up with is to keep the rules as CASE WHEN expressions:

journey_features = {
    "Rules": {
        "col6": "case when col2 > 0 then 1 else 0 end as col6",
        "col7": "case when col3 > 0 then 1 else 0 end as col7",
        "col8": "case when col4 > 0 then 1 else 0 end as col8",
        "col9": "case when col5 > 0 then 1 else 0 end as col9"
    },
    "filter": "col1 == 'test'"
}
An extract_feature() function is then built as below to use the rules as expressions:

def extract_feature(df, *featurenames):
    # 'df' is the name of the DataFrame variable as a string (e.g. 'df'),
    # because the function builds a query string rather than a DataFrame
    retrieved_features = ""
    for featurename in featurenames:
        if featurename in journey_features.get('Rules'):
            retrieved_features += "'" + str(journey_features.get('Rules')[featurename]) + "',"
    retrieved_features = retrieved_features.rstrip(',')
    if journey_features['filter']:
        filter_feature = ".filter({df}.".format(df=df) + str(journey_features['filter']) + ")"
    else:
        filter_feature = ""
    return "{0}{1}.selectExpr({2})".format(df, filter_feature, retrieved_features)
Then pass the DataFrame name and the features to the function:

extract_feature('df','col6','col7')
The result is:

Out[139]: "df.filter(df.col1 == 'test').selectExpr('case when col2 > 0 then 1 else 0 end as col6','case when col3 > 0 then 1 else 0 end as col7')"
This string can then be assigned to a DataFrame using the eval function:

df1 = eval(extract_feature('df','col6','col7'))
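As a variation, the eval step can be avoided by letting the function work on the DataFrame object itself and return a DataFrame; here is a rough sketch built on the same journey_features dictionary (not part of the original answer):

def extract_feature_df(df, *featurenames):
    # Pick the requested rules; with no names given, fall back to all of them
    rules = journey_features['Rules']
    selected = [rules[name] for name in featurenames if name in rules] or list(rules.values())
    out = df
    if journey_features.get('filter'):
        out = out.filter(journey_features['filter'])
    return out.selectExpr(*selected)

df1 = extract_feature_df(df, 'col6', 'col7')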

I like that F object which allows .when().otherwise() :D Where does it come from? :)

I just updated the code.