Difference in behaviour between Spark 1.6 and Spark 2.2 (Scala) when converting RDD[Row] to RDD[Tuple]

Tags: scala, apache-spark, spark-dataframe

My code runs fine in Spark 1.6, but the same code throws a NullPointerException when run on Spark 2.2. I am currently running everything locally through IntelliJ:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, concat, lit}
// toHBaseTable / toColumns / inColumnFamily come from the HBase connector in use
// (the API matches the it.nerdammer spark-hbase-connector)

val sparkConf = new SparkConf()
  .setAppName("HbaseSpark")
  .setMaster("local[*]")
  .set("spark.hbase.host", "localhost")

val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val df = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\001")
  .load("/Users/11130/small")

val df1 = df.withColumn("row_key", concat(col("C3"), lit("_"), col("C5"), lit("_"), col("C0")))
df1.registerTempTable("mytable")

val newDf = sqlContext.sql("Select row_key, C0, C1, C2, C3, C4, C5, C6, C7," +
  "C8, C9, C10, C11, C12, C13, C14, C15, C16, C17, C18, C19 from mytable")

val rdd = newDf.rdd

val finalRdd = rdd.map(row => (row(0).toString, row(1).toString, row(2).toString, row(3).toString, row(4).toString, row(5).toString, row(6).toString,
  row(7).toString, row(8).toString, row(9).toString, row(10).toString, row(11).toString, row(12).toString, row(13).toString,
  row(14).toString, row(15).toString, row(16).toString, row(17).toString, row(18).toString, row(19).toString, row(20).toString))


finalRdd.toHBaseTable("mytable")
  .toColumns("event_id", "device_id", "uidx", "session_id", "server_ts", "client_ts", "event_type", "data_set_name",
    "screen_name", "card_type", "widget_item_whom", "widget_whom", "widget_v_position", "widget_item0_h_position",
    "publisher_tag", "utm_medium", "utm_source", "utmCampaign", "referrer_url", "notificationClass")
  .inColumnFamily("mycf")
  .save()
However, the code I wrote for Spark 2.2 throws a NullPointerException when converting rdd to finalRdd:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

val spark = SparkSession
  .builder
  .appName("FunnelSpark")
  .master("local[*]")
  .config("spark.hbase.host", "localhost")
  .getOrCreate

val sc = spark.sparkContext
sc.hadoopConfiguration.set("spark.hbase.host", "localhost")

val df = spark
  .read
  .option("delimiter", "\001")
  .csv("/Users/11130/small")

val df1 = df.withColumn("row_key", concat(col("_c3"), lit("_"), col("_c5"), lit("_"), col("_c0")))
df1.createOrReplaceTempView("mytable")

val newDf = spark.sql("Select row_key, _c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7," +
  "_c8, _c9, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19 from mytable")

val rdd = newDf.rdd
val finalRdd = rdd.map(row => (row(0).toString, row(1).toString, row(2).toString, row(3).toString, row(4).toString, row(5).toString, row(6).toString,
  row(7).toString, row(8).toString, row(9).toString, row(10).toString, row(11).toString, row(12).toString, row(13).toString,
  row(14).toString, row(15).toString, row(16).toString, row(17).toString, row(18).toString, row(19).toString, row(20).toString))

println(finalRdd.first())
spark.stop()

Stacktrace:

This happens because your code is unsafe. When you call:

row(i).toString

it throws an NPE every time it encounters a null value. You should use:

row.getString(i)
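
For illustration, here is a minimal null-tolerant sketch of the same mapping (assuming the 21-column layout from the question; safeStr is just an illustrative helper name, not a library function):

import org.apache.spark.sql.Row

// getString(i) returns null for a null cell instead of throwing, and
// Option(...).getOrElse("") substitutes an empty string so the tuple
// never contains null.
def safeStr(row: Row, i: Int): String = Option(row.getString(i)).getOrElse("")

val finalRdd = rdd.map { row =>
  (safeStr(row, 0),  safeStr(row, 1),  safeStr(row, 2),  safeStr(row, 3),
   safeStr(row, 4),  safeStr(row, 5),  safeStr(row, 6),  safeStr(row, 7),
   safeStr(row, 8),  safeStr(row, 9),  safeStr(row, 10), safeStr(row, 11),
   safeStr(row, 12), safeStr(row, 13), safeStr(row, 14), safeStr(row, 15),
   safeStr(row, 16), safeStr(row, 17), safeStr(row, 18), safeStr(row, 19),
   safeStr(row, 20))
}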

Also, your 1.6 program reads from a different source than the 2.2 one: spark-csv is similar to, but not the same as, the built-in csv format. The former treats empty strings as empty strings, the latter treats them as nulls.
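
If the goal is instead to reproduce the 1.6 / spark-csv behaviour, one option is to replace the nulls before converting to an RDD, since DataFrameNaFunctions.fill(value: String) replaces nulls in string columns (a sketch; newDfNoNulls is just an illustrative name):

// Replace nulls produced by the built-in csv reader with empty strings,
// so the original row(i).toString mapping no longer hits null values.
val newDfNoNulls = newDf.na.fill("")
val rdd = newDfNoNulls.rdd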

Why are you using RDDs in Spark 2?

@cricket_007 I need to write something to HBase as part of a POC, and most of the solutions I could find are RDD-based.

OK, can you add the stacktrace?

@cricket_007 Added the stacktrace.

Can you successfully show or head the DataFrame?
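
A quick way to run the check suggested in the comments (sample size is arbitrary):

// Inspect the parsed DataFrame: with the built-in csv reader, empty fields
// show up as null, which is what later breaks row(i).toString.
newDf.show(5, truncate = false)
newDf.head(5).foreach(println)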