Creating and applying ml_lib pipelines with external parameters in sparklyr


r apache-spark


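I am trying to create and apply a Spark ml_pipeline object that can handle external parameters that change between runs (typically a date). According to the Spark documentation this seems to be possible: see the part about ParamMap.

I have not figured out how to do it. I was thinking of something like this: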

table.df <- data.frame("a" = c(1,2,3))
table.sdf <- sdf_copy_to(sc, table.df)

param = 5
param2 = 4

# operation declaration
table2.sdf <- table.sdf %>% 
  mutate(test = param)

# pipeline creation
pipeline_1 = ml_pipeline(sc) %>%
  ft_dplyr_transformer(table2.sdf) %>%
  ml_fit(table.sdf, list("param" = param))

# pipeline application with another value for param
table2.sdf <- pipeline_1 %>% 
  ml_transform(table.sdf, list("param" = param2))

# result

glimpse(table2.sdf %>% select(test))
# doesn't work...

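This is not really what Spark ML Pipelines are intended for. In general, all transformations needed to convert the input dataset into a format suitable for the Pipeline should be applied beforehand, and only the common components should be embedded as stages.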

With the native (Scala) API it is technically possible, in simple cases like this one, to use an empty SQLTransformer:

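import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.ml.param.ParamPair

val df = spark.range(1, 4).toDF("a")

// The transformer is created without a statement; it is supplied later as a Param
val sqlTransformer = new SQLTransformer()
val pipeline = new Pipeline().setStages(Array(sqlTransformer))

and supply the statement Param at fit time:

val model = pipeline.fit(
  df,
  ParamPair(sqlTransformer.statement, "SELECT *, 4 AS `test` FROM __THIS__")
)

model.transform(df).show

+---+----+
|  a|test|
+---+----+
|  1|   4|
|  2|   4|
|  3|   4|
+---+----+

and at transform time:

model.transform(
  df,
  ParamPair(sqlTransformer.statement, "SELECT *, 5 AS `test` FROM __THIS__")
).show

+---+----+
|  a|test|
+---+----+
|  1|   5|
|  2|   5|
|  3|   5|
+---+----+

However, as you can see, neither ml_fit nor ml_transform currently supports additional parameters: the extra arguments passed through ... are simply ignored.

Thanks for your answer. The goal is to be able to apply these transformations with external parameters that may change, such as a date.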
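
Because sparklyr does not forward extra parameters at fit or transform time, one possible workaround is to rebuild the pipeline whenever the external value changes, embedding the value in the SQL statement of an ft_sql_transformer stage. The following is only a minimal sketch under that assumption: build_statement and make_pipeline are hypothetical helper names, not part of sparklyr, and the statement mirrors the one used in the Scala example.

```r
# Hypothetical helper (not part of sparklyr): build the SQL statement for a
# given numeric parameter. The check rejects non-numeric input so that
# arbitrary strings cannot be spliced into the SQL text.
build_statement <- function(param) {
  stopifnot(is.numeric(param), length(param) == 1)
  sprintf("SELECT *, %s AS `test` FROM __THIS__", format(param))
}

# Hypothetical helper: create a fresh pipeline whose single stage is a
# SQLTransformer carrying the parameterised statement.
# sparklyr functions are namespace-qualified, so defining this function
# does not require sparklyr to be attached.
make_pipeline <- function(sc, param) {
  sparklyr::ft_sql_transformer(
    sparklyr::ml_pipeline(sc),
    statement = build_statement(param)
  )
}

# Usage sketch (assumes an open connection `sc` and `table.sdf` from the question):
# model <- sparklyr::ml_fit(make_pipeline(sc, 5), table.sdf)
# sparklyr::ml_transform(model, table.sdf)
```

The trade-off is that the parameter is fixed when the pipeline is built, so a new pipeline (and fitted model) is needed for each value, rather than one fitted model being reused with different parameters as in the Scala version.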