SparkR 2.x application with MongoDB and RStudio

Tags: r, mongodb, apache-spark, rstudio, sparkr

I am trying to develop an Apache Spark application that should run an aggregation query against a MongoDB database and write the results back. I was able to get a Java version of this working, but now I need to port it to R using RStudio.

The working Java version:-

public static void main(String args[]) {

SparkConf sparkConf = new SparkConf(true)
        .setMaster("local[*]")
        .setSparkHome(SPARK_HOME)
        .setAppName("SparklingMongoApp")
        .set("spark.ui.enabled", "false")
        .set("spark.app.id", APP)
        .set("spark.mongodb.input.uri", "mongodb://admin:password@host:27017/input_collection")
        .set("spark.mongodb.output.uri", "mongodb://admin:password@host:27017/output_collection");


JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
JavaMongoRDD<Document> javaMongoRDD = MongoSpark.load(javaSparkContext);

Dataset<Row> dataset = javaMongoRDD.toDF();

dataset.createOrReplaceTempView(TEMP_VIEW);

// a valid spark sql QUERY
Dataset<Row> computedDataSet = dataset.sqlContext().sql(QUERY);
MongoSpark.save(computedDataSet);
javaSparkContext.close();
}

The equivalent R/RStudio version I am trying to get working:-

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

##PROBLEM - Is this the correct way of setting the configuration?
sparkConfig <- list("spark.driver.memory" = "1g",
                    "spark.mongodb.input.uri" = "mongodb://username:password@localhost:27017/price_subset?authSource=admin",
                    "spark.mongodb.output.uri" = "mongodb://username:password@localhost:27017/price_subset_output?authSource=admin")

customSparkPackages <- c("org.mongodb.spark:mongo-spark1-connector_2.11:1.0.0");

##Starting Up: SparkSession
##PROBLEM-1 - Is this the correct way of initializing a Spark session?
sparkSession <- sparkR.session(appName = "MongoSparkConnectorTour",
                               master = "local[*]",
                               enableHiveSupport = FALSE,
                               sparkConfig = sparkConfig,
                               sparkPackages = customSparkPackages)


##PROBLEM-2 - This complains about being deprecated. How do I fix this?
sqlContext <- sparkRSQL.init(sparkSession)

## Save some data
charactersRdf <- data.frame(list(name=c("Bilbo Baggins", "Gandalf", "Thorin", "Balin", "Kili", "Dwalin", "Oin", "Gloin", "Fili", "Bombur"),
                                 age=c(50, 1000, 195, 178, 77, 169, 167, 158, 82, NA)))

charactersSparkdf <- createDataFrame(sqlContext, charactersRdf)
#PROBLEM-3 This throws an error - Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
#  java.lang.NoClassDefFoundError: com/mongodb/ConnectionString
write.df(charactersSparkdf, "", source = "com.mongodb.spark.sql.DefaultSource", mode = "overwrite")
Finally got it working. It turns out the examples in the MongoDB documentation were for Spark 1.6, while I was running Spark 2.0.1.

Anyway, here is what ended up working for me with RStudio:-

 ## Make sure you have the SPARK_HOME environment variable set to your Spark home directory.
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

spark <- sparkR.session(master = "local[*]",
                        appName = "mongoSparkR",
                        enableHiveSupport = FALSE,
                        sparkPackages = c("org.mongodb.spark:mongo-spark-connector_2.11:2.0.0-rc0"),
                        sparkConfig = list(
                          spark.mongodb.input.uri = "mongodb://username:password@hostname:27017/database.collection_name?authSource=admin",
                          spark.mongodb.output.uri = "mongodb://username:password@hostname:27017/database.collection_name_output?authSource=admin"))

pricing_df <- read.df(source = "com.mongodb.spark.sql.DefaultSource",x=10000)
head(pricing_df)
createOrReplaceTempView(pricing_df,"T_YOUR_TABLE")

 ## Obviously this is just a dummy SQL query; replace it with yours.
result_df <- sql("SELECT year(price) as YEAR, month(price) as MONTH , SUM(midPrice) as SUM_PRICING_DATA FROM T_YOUR_TABLE GROUP BY year(price),month(price)  ORDER BY year(price),month(price)")
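
 ## To push the result back to the collection configured in spark.mongodb.output.uri,
 ## the write.df() pattern from the question should work (sketch, untested here):
write.df(result_df, "", source = "com.mongodb.spark.sql.DefaultSource", mode = "overwrite")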


 ## stop instance when done.
sparkR.stop()
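
Regarding the deprecation warning from sparkRSQL.init() in the question (PROBLEM-2): in SparkR 2.x the SparkSession manages the SQL context itself, so the DataFrame and SQL functions can be called directly once sparkR.session() has run. A minimal sketch reusing the example data frame from the question:

 ## SparkR 2.x style: no separate sqlContext is needed after sparkR.session()
charactersRdf <- data.frame(name = c("Bilbo Baggins", "Gandalf", "Thorin"),
                            age = c(50, 1000, 195))
charactersSparkdf <- createDataFrame(charactersRdf)   # instead of createDataFrame(sqlContext, ...)
createOrReplaceTempView(charactersSparkdf, "characters")
head(sql("SELECT name, age FROM characters WHERE age > 100"))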

If you are using RStudio, you could also give their sparklyr package a try (it is integrated with the IDE in the RStudio preview release).
org.mongodb.spark_mongo-spark-connector_2.11-2.0.0-rc0.jar

org.mongodb_mongo-java-driver-3.2.2.jar
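
For the sparklyr route mentioned above, a minimal connection sketch might look like the following. I have not tried this myself: the sparklyr.defaultPackages config entry, spark_read_source() (recent sparklyr releases only), and the connector coordinates are assumptions that simply mirror the SparkR example.

library(sparklyr)

## Hedged sketch: config keys and connector coordinates mirror the SparkR example above
config <- spark_config()
config$sparklyr.defaultPackages <- c("org.mongodb.spark:mongo-spark-connector_2.11:2.0.0-rc0")
config$spark.mongodb.input.uri  <- "mongodb://username:password@hostname:27017/database.collection_name?authSource=admin"
config$spark.mongodb.output.uri <- "mongodb://username:password@hostname:27017/database.collection_name_output?authSource=admin"

sc <- spark_connect(master = "local[*]", config = config)

## Load the collection through the connector's DefaultSource and register it as a table
pricing_tbl <- spark_read_source(sc, name = "t_your_table",
                                 source = "com.mongodb.spark.sql.DefaultSource")

## sparklyr connections implement the DBI interface, so plain SQL works too
DBI::dbGetQuery(sc, "SELECT COUNT(*) AS n FROM t_your_table")

spark_disconnect(sc)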