Apache Spark: how do I read a multi-line JSON file into an RDD as a single record?
Tags: apache-spark, apache-spark-sql

When I read a JSON (or XML) file with sc.textFile, every line of the file becomes a separate record, but my output should have everything in one record:
rdd = sc.textFile(path)  # path points to the JSON or XML file
rdd.collect()
[u'{', u' "glossary": {', u' "title": "example glossary",', u'\t\t"GlossDiv": {', u' "title": "S",', u'\t\t\t"GlossList": {', u' "GlossEntry": {', u' "ID": "SGML",', u'\t\t\t\t\t"SortAs": "SGML",', u'\t\t\t\t\t"GlossTerm": "Standard Generalized Markup Language",', u'\t\t\t\t\t"Acronym": "SGML",', u'\t\t\t\t\t"Abbrev": "ISO 8879:1986",', u'\t\t\t\t\t"GlossDef": {', u' "para": "A meta-markup language, used to create markup languages such as DocBook.",', u'\t\t\t\t\t\t"GlossSeeAlso": ["GML", "XML"]', u' },', u'\t\t\t\t\t"GlossSee": "markup"', u' }', u' }', u' }', u' }', u'}', u'']
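The output above shows the problem: sc.textFile turns every physical line into its own record, and hardly any of those lines is valid JSON by itself. A minimal plain-Python sketch makes this concrete without needing Spark. The `lines` list is a shortened, hypothetical stand-in for the collected RDD, and `is_valid_json` is a helper name introduced only for this illustration:

```python
import json

# Shortened, hypothetical stand-in for what rdd.collect() returns:
# one string per physical line of the file.
lines = [
    '{',
    '  "glossary": {',
    '    "title": "example glossary"',
    '  }',
    '}',
]


def is_valid_json(text):
    """Return True if text parses as a complete JSON document."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


# Parsed line by line, no fragment is a complete JSON document.
valid_lines = [line for line in lines if is_valid_json(line)]

# Joined back into one string, the same content parses without trouble.
document = json.loads("\n".join(lines))
title = document["glossary"]["title"]
```

This is why the answers below read each file as one string instead of one string per line.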
Use sc.wholeTextFiles() instead. Also take a look at sqlContext.jsonFile.
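A minimal PySpark sketch of this suggestion, assuming each input file holds exactly one JSON document. sc.wholeTextFiles returns (path, content) pairs, so a whole file arrives as a single string. The helper names `parse_file` and `load_json_files` are hypothetical, and `sc` is assumed to be an existing SparkContext:

```python
import json


def parse_file(pair):
    """Turn one (path, content) pair from wholeTextFiles into a parsed document."""
    _path, content = pair
    return json.loads(content)


def load_json_files(sc, path):
    """Return an RDD with one parsed JSON document per input file.

    sc is an existing SparkContext; path may be a file, a directory or a glob.
    """
    return sc.wholeTextFiles(path).map(parse_file)
```

Calling `load_json_files(sc, path).collect()` on a single-file input should then return a list with one dictionary rather than one string per line. Each file has to fit in memory as one string, so this suits many small or medium files better than one huge file.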
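I'd suggest using Spark SQL's JSON support and then saving the result of a toJSON call:

val input = sqlContext.jsonFile(path)
val output = input...
output.toJSON.saveAsTextFile(outputPath)

However, if Spark SQL cannot parse your JSON records, because of the multi-line issue or for some other reason, we can take an existing example (I am a co-author of its source, so of course slightly biased) and modify it to use wholeTextFiles. In Scala: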
// In newer jackson-module-scala versions ScalaObjectMapper lives in
// com.fasterxml.jackson.module.scala instead of the experimental package.
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

case class Person(name: String, lovesPandas: Boolean)

// Read the input and throw away the file names
val input = sc.wholeTextFiles(inputFile).map(_._2)
// Parse it into a specific case class. We use mapPartitions because:
// (a) ObjectMapper is not serializable, so we would either have to create a singleton
//     object wrapping ObjectMapper on the driver and send data back to the driver
//     to go through it, or let each node create its own ObjectMapper, which is
//     expensive inside a plain map.
// (b) Creating one ObjectMapper per partition with mapPartitions avoids both the
//     serialization problem and the object-creation cost.
val result = input.mapPartitions(records => {
  // mapper object created once per partition, on the executor
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  // We use flatMap to handle errors: return an empty list (None) if we
  // encounter an issue and a one-element list (Some(_)) if everything is ok.
  records.flatMap(record => {
    try {
      Some(mapper.readValue(record, classOf[Person]))
    } catch {
      case e: Exception => None
    }
  })
}, true)
// The mapper above is local to its closure, so build another one for writing
result.filter(_.lovesPandas).mapPartitions(people => {
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  people.map(mapper.writeValueAsString(_))
}).saveAsTextFile(outputFile)
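The Scala code relies on flatMap to drop records that fail to parse: None contributes nothing to the result and Some(x) contributes exactly one element. The same idea can be sketched in Python by returning an empty or a one-element list. Here `safe_parse` is a hypothetical helper name, the sample strings are made up, and the standard `json` module stands in for Jackson:

```python
import json


def safe_parse(text):
    """Return [record] if text is valid JSON, else [] so that flatMap skips it."""
    try:
        return [json.loads(text)]
    except ValueError:
        return []


# Local illustration without Spark: flattening the per-record lists
# keeps the good record and silently drops the malformed one.
raw = ['{"name": "Sparky", "lovesPandas": true}', '{"name": broken']
parsed = [record for text in raw for record in safe_parse(text)]
```

On an RDD of file contents the equivalent call would be `rdd.flatMap(safe_parse)`. Silently dropping bad input is convenient but hides data problems, so counting or logging the failures may be worth adding in real jobs.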
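If I have a multi-line file, I get the error "make sure every line of the file (or every string in the RDD) is a valid JSON object or an array of JSON objects".

Oops, sorry. I'll update the answer with a solution for multi-line files.

Sure, I'll try to do that tonight; I'm at work for the next few hours.

If the multi-line JSON file is very large and I want to read it in parallel, is that possible with Spark? (If I understand correctly, this approach has to process the JSON in a single thread.)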
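In Python:

from pyspark import SparkContext
import json
import sys


def parse_records(text):
    # A file may hold a single JSON object or an array of objects;
    # always return a list so that flatMap yields one element per record.
    doc = json.loads(text)
    return doc if isinstance(doc, list) else [doc]


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Error usage: LoadJson [sparkmaster] [inputfile] [outputfile]")
        sys.exit(-1)
    master = sys.argv[1]
    inputFile = sys.argv[2]
    outputFile = sys.argv[3]
    sc = SparkContext(master, "LoadJson")
    # wholeTextFiles yields (filename, content) pairs; keep only the content
    files = sc.wholeTextFiles(inputFile).map(lambda pair: pair[1])
    data = files.flatMap(parse_records)
    data.filter(lambda x: 'lovesPandas' in x and x['lovesPandas']).map(
        lambda x: json.dumps(x)).saveAsTextFile(outputFile)
    sc.stop()
    print("Done!")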
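For the Spark SQL route, later Spark releases (2.2 and newer, to my knowledge) give the JSON reader a `multiLine` option that lets one record span several lines, which avoids the "every line must be a valid JSON object" error reported in the comments. This is not part of the original answer, which predates that option. The function name `read_multiline_json` is hypothetical, and `spark` is assumed to be an existing SparkSession:

```python
def read_multiline_json(spark, path):
    """Read JSON files whose records span several lines into a DataFrame.

    Assumes Spark 2.2+ and an existing SparkSession passed in as spark.
    """
    return spark.read.json(path, multiLine=True)
```

A call such as `read_multiline_json(spark, path).toJSON().saveAsTextFile(output_path)` would then write one JSON string per record. As with wholeTextFiles, a file read this way is handled as a whole rather than split, so a single very large file still does not parallelize well.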