How to convert Iterable&lt;com.datastax.driver.core.Row&gt; to a Dataset in Apache Spark?
I am using Spark 2.0 with Scala 2.11.8. I have a Cassandra result set from a select query and I want to convert it into a Spark DataFrame or Dataset. How can I do that? I have been trying this connector:
"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-RC1"
and later,
"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-M3"
The code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector._

val sparkConf = new SparkConf().
  setAppName(appName).
  set("spark.cassandra.connection.host", "10.60.50.134").
  set("spark.cassandra.auth.username", "xyz").
  set("spark.cassandra.auth.password", "abc")

val spark = SparkSession.builder().config(sparkConf).getOrCreate()

val rdd = spark.
  sparkContext.
  cassandraTable(keyspace = s"$keyspace", table = s"$table")

rdd.take(10).foreach(println)
In both cases I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.driver.core.KeyspaceMetadata.getMaterializedViews()Ljava/util/Collection;
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:281)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:305)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:304)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:683)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:682)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:304)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:325)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:322)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:122)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:322)
at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:342)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.tableDef(CassandraTableRowReaderProvider.scala:50)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.verify(CassandraTableRowReaderProvider.scala:137)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:232)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1297)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.take(RDD.scala:1292)
at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:121)
at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:122)
You seem to be using the pre-Dataset API of the Spark Cassandra Connector, even though the connector supports Datasets out of the box (it just requires a different way of loading data from a Cassandra table). My suggestion is to rewrite/upgrade your code to use the connector's Dataset support. Quoting the connector documentation (emphasis mine):

The most programmatic way to create a Dataset is to invoke a read command on the SparkSession. This builds a DataFrameReader. Specify the format as org.apache.spark.sql.cassandra. You can then use options to pass a Map[String, String] of options as described above. Finally, call load to actually get a Dataset. This code is all lazy and will not load any data until an action is called.

There is also an object that appears to provide a conversion from com.datastax.driver.core.Row to org.apache.spark.sql.cassandra.CassandraSQLRow:

fromJavaDriverRow(row: com.datastax.driver.core.Row, metaData: CassandraRowMetadata): CassandraSQLRow
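A minimal sketch of that read path, reusing the connection host from the question; the keyspace and table names here are placeholders, not values from the original post:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-read")
  .config("spark.cassandra.connection.host", "10.60.50.134")
  .getOrCreate()

// Build a DataFrameReader, point it at the Cassandra source,
// and load lazily; nothing is read until an action runs.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

df.show(10)  // the action that actually triggers the read
```

Because load is lazy, you can chain select/filter on df and the connector can push work down before any rows are fetched.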
My limited experience with the Spark Cassandra Connector suggests that, if needed, you can rely on the implicit conversions it ships:
// bring all the implicit goodies from the Spark Cassandra Connector
import com.datastax.spark.connector._
The latest version of the connector is 2.0.1-s_2.11. Could you give it a try? You can also use spark.read.format to read from Cassandra, since that is the recommended approach (both in the Spark Cassandra Connector and in Spark 2.1 itself).

Thanks a lot, the link was very useful. I converted my code to spark.read.format and it works now. I see saveToCassandra mentioned everywhere, but which API should I use if I want to append a large number of rows from a DataFrame? The Cassandra persistence happens inside .foreachRDD, so it needs to be performance-oriented, which brings up another question.
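For the DataFrame append question in the comments, the DataSource-API write path is the counterpart of spark.read.format. A sketch under the same assumptions as above (my_keyspace/my_table and the toy DataFrame are placeholders; in the question the DataFrame would come from each foreachRDD batch):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("cassandra-write")
  .config("spark.cassandra.connection.host", "10.60.50.134")
  .getOrCreate()
import spark.implicits._

// A toy DataFrame standing in for one micro-batch of rows.
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Append the rows to a Cassandra table through the DataSource API,
// the DataFrame-level alternative to the RDD-level saveToCassandra.
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .mode(SaveMode.Append)
  .save()
```

With SaveMode.Append the connector issues inserts for the new rows, so the target table must already exist with a matching schema.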