DataStax Enterprise: Spark Cassandra batch size


I set the parameter spark.cassandra.output.batch.size.rows in my SparkConf as follows:

import org.apache.spark.SparkConf

val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "host")
        .set("spark.cassandra.auth.username", "cassandra")
        .set("spark.cassandra.auth.password", "cassandra")
        .set("spark.cassandra.output.batch.size.rows", "5120")
        .set("spark.cassandra.output.concurrent.writes", "10")
But when I run

saveToCassandra("data", "ten_days")

I keep seeing warnings like these in system.log:

 INFO [FlushWriter:7] 2014-11-20 11:11:16,498 Memtable.java (line 395) Completed flushing /var/lib/cassandra/data/system/hints/system-hints-jb-76-Data.db (5747287 bytes) for commitlog position ReplayPosition(segmentId=1416480663951, position=44882909)
 INFO [FlushWriter:7] 2014-11-20 11:11:16,499 Memtable.java (line 355) Writing Memtable-ten_days@1656582530(32979978/329799780 serialized/live bytes, 551793 ops)
 WARN [Native-Transport-Requests:761] 2014-11-20 11:11:16,499 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36825, exceeding specified threshold of 5120 by 31705.
 WARN [Native-Transport-Requests:777] 2014-11-20 11:11:16,500 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36813, exceeding specified threshold of 5120 by 31693.
 WARN [Native-Transport-Requests:822] 2014-11-20 11:11:16,501 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36823, exceeding specified threshold of 5120 by 31703.
 WARN [Native-Transport-Requests:835] 2014-11-20 11:11:16,500 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36817, exceeding specified threshold of 5120 by 31697.
 WARN [Native-Transport-Requests:781] 2014-11-20 11:11:16,501 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36817, exceeding specified threshold of 5120 by 31697.
 WARN [Native-Transport-Requests:755] 2014-11-20 11:11:16,501 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36822, exceeding specified threshold of 5120 by 31702.
I know these are only warnings, but I would like to understand why my setting is not working as expected. I can also see a lot of hints in the cluster. Could the batch size affect the number of hints in the cluster?


Thanks

You set the batch size in rows, not the batch size in bytes. That means the connector is limiting the number of rows per batch, not the amount of memory a batch takes up.

spark.cassandra.output.batch.size.rows: number of rows per batch; the default is 'auto', which means the connector adjusts the number of rows based on the amount of data in each row.

spark.cassandra.output.batch.size.bytes: maximum total size of a batch in bytes; the default is 64 kB.

More importantly, you will most likely be better off keeping the larger batch size (64 kB) and raising the warning limit in the cassandra.yaml file instead.
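
As an illustration only (this snippet is not from the original answer), a minimal sketch of capping batches by bytes rather than rows could look like the following; the host and credentials are placeholders, and note that the warning itself is governed on the Cassandra side (the batch_size_warn_threshold_in_kb setting in cassandra.yaml, if I recall correctly):

import org.apache.spark.SparkConf

// Sketch: limit each batch by its serialized size instead of its row count.
val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "host")          // placeholder host
        .set("spark.cassandra.auth.username", "cassandra")
        .set("spark.cassandra.auth.password", "cassandra")
        // cap each batch at roughly 64 kB of data
        .set("spark.cassandra.output.batch.size.bytes", "65536")
        .set("spark.cassandra.output.concurrent.writes", "10")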

Edit:
We have recently found that larger batches can cause instability with some C* configurations, so lower the value if your system becomes unstable.

Sir, I am facing a similar issue while writing 204 million records to Cassandra. It is taking 24 hours; I just fetch a dataframe from Oracle and write it into C*... Please help me, what should I do? Below is my Cassandra configuration: spark.cassandra.output.concurrent.writes=2, spark.cassandra.output.batch.size.rows=1. Can we pass C* parameters as VM arguments in Eclipse? If yes, how? A sample, please.
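
One possibility, not confirmed in this thread: since new SparkConf(true) loads any JVM system property that starts with spark., the connector settings could in principle be supplied as VM arguments in an Eclipse run configuration instead of via .set(...). A rough sketch, with example values only:

// Hypothetical VM arguments for the Eclipse run configuration (example values):
//   -Dspark.cassandra.output.batch.size.bytes=65536
//   -Dspark.cassandra.output.concurrent.writes=5

import org.apache.spark.SparkConf

// SparkConf(loadDefaults = true) picks up the spark.* system properties above,
// so no explicit .set(...) calls are needed for them here.
val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "host")   // placeholder host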