Scala Spark + Cassandra: only one record inserted
I am trying to fetch data from Hive and insert it into Cassandra using Spark. Surprisingly, I see only one record inserted into Cassandra, even though the DataFrame holds more than 4,000 records.
import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector.cql.CassandraConnector
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector._
import java.math.BigDecimal
case class sales(wk_nbr: Int,
store_nbr: Int,
sales_amt: BigDecimal)
object HiveConnector extends App {
val cassandraConfig = ConfigFactory.load("cassandra.conf")
println("cassandraConfig loaded = " + cassandraConfig)
val spark = SparkSession.builder().appName("HiveConnector")
.config("spark.sql.warehouse.dir", "file:/data/raw/historical/tables")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("mapred.input.dir.recursive","true")
.config("mapreduce.input.fileinputformat.input.dir.recursive","true")
.config("spark.cassandra.connection.host", "***********")
.config("spark.cassandra.auth.username", "*****un****")
.config("spark.cassandra.auth.password", "******pw*****")
.enableHiveSupport()
.master("yarn").getOrCreate()
import spark.implicits._
val query = "select wk_nbr,store_nbr,sum(sales_amt) as sales_amt from scan where visit_dt between '2018-05-08' and '2018-05-11' group by wm_yr_wk_nbr,store_nbr"
val resDF = spark.sql(query)
resDF.persist()
println("RESDF size = " + resDF.count()) //prints the record count
resDF.show(2) // prints 2 sample records to the log; show() returns Unit, so it is not concatenated into a string
CassandraConnector(spark.sparkContext).withSessionDo { session => // named session so it does not shadow the SparkSession
session.execute("CREATE TABLE raaspoc.sales_data (wk_nbr INT PRIMARY KEY, store_nbr INT, sales_amt DOUBLE)")
}
/*
None of the following saveToCassandra work - meaning not inserting all records but only one record
*/
resDF.map { x => sales.apply(x.get(0).asInstanceOf[Int], x.get(1).asInstanceOf[Int],x.get(2).asInstanceOf[BigDecimal])
}.rdd.saveToCassandra("raaspoc","sales_data") // Not working
resDF.rdd.saveToCassandra("raaspoc","sales_data") // Not working
resDF.write.format("org.apache.spark.sql.cassandra").options(Map("table" -> "sales_data", "keyspace" -> "raaspoc")).save() // Not working
resDF.write.cassandraFormat("sales_data","raaspoc").save() // Not working
/*
When the data frame is written to HDFS, i see all 4000+ records in the sales.csv
*/
resDF.write.format("csv").save("hdfs:/dev/test/sales.csv")
println("RESDF size after write to cassandra = " + resDF.count()) //prints 4732 (record count)
spark.close()
}
I see no errors in the logs, and spark-submit completes without any errors, but only one record is inserted. Below is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.test.raas</groupId>
<artifactId>RaasDataPipelines</artifactId>
<version>1.0-SNAPSHOT</version>
<inceptionYear>2008</inceptionYear>
<properties>
<scala.version>2.11.0</scala.version>
<spark.version>2.2.0</spark.version>
</properties>
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.7</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.11</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.0.0</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.5.0</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-mapping</artifactId>
<version>3.5.0</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-extras</artifactId>
<version>3.5.0</version>
<scope>compile</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.5</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>
shade
</goal>
</goals>
</execution>
</executions>
<configuration>
<minimizeJar>true</minimizeJar>
<shadedArtifactAttached>true</shadedArtifactAttached>
<shadedClassifierName>fat</shadedClassifierName>
<relocations>
<relocation>
<pattern>com.google</pattern>
<shadedPattern>shaded.guava</shadedPattern>
<includes>
<include>com.google.**</include>
</includes>
<excludes>
<exclude>com.google.common.base.Optional</exclude>
<exclude>com.google.common.base.Absent</exclude>
<exclude>com.google.common.base.Present</exclude>
</excludes>
</relocation>
</relocations>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</plugin>
</plugins>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
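Is it possible that your primary key (wk_nbr) has the same value on all 4000+ rows? That is the cause in roughly 99% of the cases where people report this. Cassandra writes are upserts, so rows that share a primary key overwrite one another.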
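To see why a shared key collapses the rows, here is a minimal sketch in plain Scala. It does not use Spark or Cassandra: it only models a table as a map keyed by its primary key, where a later write replaces an earlier one with the same key. The `Sale` case class, the `upsertAll` helper, the week number 201815, and the 4000 generated rows are all hypothetical and exist only for illustration.

```scala
// Illustrative model only: a Cassandra table behaves like a map keyed by its
// primary key, so writing a row whose key already exists overwrites that row.
final case class Sale(wkNbr: Int, storeNbr: Int, salesAmt: BigDecimal)

object UpsertSketch {
  // Replays the writes in order, keeping the last row seen for each key.
  def upsertAll[K](rows: Seq[Sale])(key: Sale => K): Map[K, Sale] =
    rows.foldLeft(Map.empty[K, Sale])((table, row) => table.updated(key(row), row))

  // Hypothetical data: 4000 stores, all reporting for the same week.
  val rows: Seq[Sale] =
    (1 to 4000).map(store => Sale(201815, store, BigDecimal(store)))

  // Models PRIMARY KEY (wk_nbr): every row shares one key.
  val byWeek: Map[Int, Sale] = upsertAll(rows)(_.wkNbr)

  // Models PRIMARY KEY (wk_nbr, store_nbr): each row has its own key.
  val byWeekAndStore: Map[(Int, Int), Sale] =
    upsertAll(rows)(r => (r.wkNbr, r.storeNbr))

  def main(args: Array[String]): Unit = {
    println(s"keyed by wk_nbr: ${byWeek.size} row(s)")
    println(s"keyed by (wk_nbr, store_nbr): ${byWeekAndStore.size} row(s)")
  }
}
```

If this is what is happening, the query in the question would be a likely trigger, since its visit_dt range spans only four days and may fall inside a single week. One way to check is to compare `resDF.select("wk_nbr").distinct().count()` with `resDF.count()`. Assuming each (wk_nbr, store_nbr) pair is unique in the result, declaring the table with `PRIMARY KEY (wk_nbr, store_nbr)` instead of `wk_nbr INT PRIMARY KEY` should keep every row.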