Apache Spark: how to convert Iterable<com.datastax.driver.core.Row> to a Dataset?

Tags: apache-spark, apache-spark-sql, spark-cassandra-connector

I am using Spark 2.0 and Scala 2.11.8.

I have a Cassandra result set from a select query that I want to convert to a Spark DataFrame or Dataset. How can I do that?

I have been trying this connector:

"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-RC1"
and later,

"com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0-M3"
The code:

import com.datastax.spark.connector._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf().
  setAppName(appName).
  set("spark.cassandra.connection.host", "10.60.50.134").
  set("spark.cassandra.auth.username", "xyz").
  set("spark.cassandra.auth.password", "abc")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val rdd = spark.
  sparkContext.
  cassandraTable(keyspace = s"$keyspace", table = s"$table")
rdd.take(10).foreach(println)
In both cases I get the following error:

Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.driver.core.KeyspaceMetadata.getMaterializedViews()Ljava/util/Collection;
    at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:281)
    at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:305)
    at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:304)
    at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:683)
    at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
    at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
    at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:682)
    at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:304)
    at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:325)
    at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:322)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:122)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
    at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
    at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
    at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:121)
    at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:322)
    at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:342)
    at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.tableDef(CassandraTableRowReaderProvider.scala:50)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:60)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:60)
    at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.verify(CassandraTableRowReaderProvider.scala:137)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:60)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:232)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1297)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1292)
    at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:121)
    at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:122)

You seem to be using the Spark Cassandra Connector's pre-Dataset API, even though the connector supports Datasets out of the box (loading data from a Cassandra table just works differently there).

My recommendation is to rewrite/upgrade your code to use the Spark Cassandra Connector's Dataset support.

From the documentation:

And later (emphasis mine):

The most programmatic way to create a Dataset is to invoke a read command on the SparkSession. This builds a DataFrameReader. Specify the format as org.apache.spark.sql.cassandra. You can then use options to pass a Map[String, String] of options as described above. Then call load to actually get a Dataset. This code is all lazy and does not actually load any data until an action is invoked.
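Put together, a minimal sketch of that reader pattern might look like this (the keyspace and table names ks and kv are placeholders, not from the original post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().
  appName("cassandra-read").
  config("spark.cassandra.connection.host", "10.60.50.134").
  getOrCreate()

// Build a DataFrameReader for the Cassandra source; nothing is read yet
val df = spark.read.
  format("org.apache.spark.sql.cassandra").
  options(Map("keyspace" -> "ks", "table" -> "kv")).
  load()

// The first action triggers the actual scan
df.show(10)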


There is an object that seems to offer a conversion from com.datastax.driver.core.Row to org.apache.spark.sql.cassandra.CassandraSQLRow:

fromJavaDriverRow(row: com.datastax.driver.core.Row, metaData: CassandraRowMetadata): CassandraSQLRow
My limited experience with the Spark Cassandra Connector suggests using the implicit conversions if you need them:

// bring all the implicit goodies from the Spark Cassandra Connector
import com.datastax.spark.connector._
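With those implicits in scope, one way to end up with a typed Dataset is to let the connector map rows onto a case class and then convert the RDD. This is a sketch under assumptions: the Person case class and the ks.people table with matching id/name columns are hypothetical, not from the original post.

import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

// Hypothetical row type; field names must match the Cassandra columns
case class Person(id: Int, name: String)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._  // brings in Encoders for case classes

// The connector maps each Cassandra row onto a Person instance
val rdd = spark.sparkContext.cassandraTable[Person]("ks", "people")
val ds = rdd.toDS()  // RDD[Person] -> Dataset[Person]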

The latest version of the connector is 2.0.1-s_2.11. Could you try it? You can also read from Cassandra with spark.read.format, as that is the recommended approach (both in the Spark Cassandra Connector and in Spark 2.1 itself).

Thanks a lot, the link was very useful. I converted the code to spark.read.format and it works now. I see saveToCassandra everywhere, but which API should I use to append a large number of rows from a DataFrame? The Cassandra persistence happens inside .foreachRDD, so it needs to be performance-oriented, which leads me to another question.
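On the append question, a hedged sketch using the DataFrame writer, the DataFrame-level counterpart of the RDD method saveToCassandra (the keyspace/table names ks, src, and dst are placeholders; SaveMode.Append adds rows to the existing table):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()

val df = spark.read.
  format("org.apache.spark.sql.cassandra").
  options(Map("keyspace" -> "ks", "table" -> "src")).
  load()

// Append the rows to another Cassandra table instead of overwriting it
df.write.
  format("org.apache.spark.sql.cassandra").
  options(Map("keyspace" -> "ks", "table" -> "dst")).
  mode(SaveMode.Append).
  save()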