Apache Spark: unable to write an RDD as a sequence file using the RDD API
I am writing an RDD out as a sequence file with the following code:
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import org.junit.Test

@Test
def testSparkWordCount(): Unit = {
  val words = Array("Hello", "Hello", "World", "Hello", "Welcome", "World")
  val conf = new SparkConf().setMaster("local").setAppName("testSparkWordCount")
  val sc = new SparkContext(conf)
  val dir = "file:///" + System.currentTimeMillis()
  sc.parallelize(words)
    .map(x => (x, 1)) // RDD[(String, Int)]
    .saveAsHadoopFile(
      dir,
      classOf[Text],
      classOf[IntWritable],
      classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat[Text, IntWritable]]
    )
  sc.stop()
}
When I run it, it fails with:
Caused by: java.io.IOException: wrong key class: java.lang.String is not class org.apache.hadoop.io.Text
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1373)
at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:76)
at org.apache.spark.internal.io.SparkHadoopWriter.write(SparkHadoopWriter.scala:94)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1139)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1137)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1137)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1360)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
Should I use sc.parallelize(words).map(x => (new Text(x), new IntWritable(1))) instead of sc.parallelize(words).map(x => (x, 1))? I thought I did not need to wrap the values explicitly, since SparkContext already provides implicits that wrap the primitive types in their corresponding Writable types. So, what should I do to make this code work?

Yes, SparkContext provides those implicit conversions. But they are not applied during the save: saveAsHadoopFile only tells Hadoop which key and value classes to expect, while the RDD itself still holds plain String and Int values at runtime, which is exactly what the SequenceFile writer rejects. The conversion has to be applied in the usual Scala way:
import org.apache.spark.SparkContext._
val mapperFunction: String => (Text, IntWritable) = x => (x, 1)

... parallelize(words).map(mapperFunction).saveAsHadoopFile ...
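Put together, a complete sketch of the fix might look like the following. (This assumes the String-to-Text and Int-to-IntWritable implicits from org.apache.spark.SparkContext._ are still present in your Spark version; on releases where they have been removed, build the pair explicitly with new Text(x) and new IntWritable(1) instead.)

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.SparkContext._

// The ascribed function type makes the implicit conversions fire when the
// pair is constructed, so the RDD really contains Writable objects.
val mapperFunction: String => (Text, IntWritable) = x => (x, 1)

sc.parallelize(words)
  .map(mapperFunction)
  .saveAsHadoopFile(
    dir,
    classOf[Text],
    classOf[IntWritable],
    classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat[Text, IntWritable]]
  )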
Understood, and thanks @pashaz for the helpful answer. Alternatively, the saveAsSequenceFile method already includes the implicit conversion: .map(x => (x, 1)).saveAsSequenceFile(dir)
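As a minimal sketch of that variant: saveAsSequenceFile picks the Writable classes from the RDD's element types and converts each record itself, so no manual wrapping is needed.

// Written as a SequenceFile with Text keys and IntWritable values;
// String and Int are converted automatically on the way out.
sc.parallelize(words)
  .map(x => (x, 1)) // RDD[(String, Int)]
  .saveAsSequenceFile(dir)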