Parsing/loading a huge XML file with Spark in Python

python xml apache-spark hive

I have an XML file with the following structure:

<?xml version="1.0" encoding="utf-8"?>
<SomeRoottag>
 <row Id="47513849" PostTypeId="1" />
 <row Id="4751323" PostTypeId="4" />
 <row Id="475546" PostTypeId="1" />
 <row Id="47597" PostTypeId="2" />
</SomeRoottag>

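With my test data (10 MB) everything works, but when I load the large file (>50 GB) it fails. The Spark JVM appears to try to load the entire file at once and fails because it only has 20 GB of memory.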

What is the best way to process a file like this?

Update:

If I do the following, I get no data back:

df = (sqlContext.read.format('xml').option("rowTag", "row").load("/tmp/someXML.xml"))
df.printSchema()
df.show()

Output:

root

++
||
++
++

Don't use SomeRoottag as the rowTag. It instructs Spark to treat the entire document as a single row. Instead:

df = (sqlContext.read.format('xml')
    .option("rowTag", "row")
    .load("/tmp/xmlfile.xml"))

There is also no need to explode anymore:

df.write.format("parquet").saveAsTable("xml_table")

Edit:

Given your edit, you are hitting a known bug. There currently seems to be no progress on fixing it, so you may have to do one of the following:

Open a PR yourself to fix the problem.

Parse the file manually. If each element always sits on a single line, this is easy to do with a udf:

from pyspark.sql.functions import col, udf
from lxml import etree

@udf("struct<id: string, postTypeId: string>")
def parse(s):
    try:
        attrib = etree.fromstring(s).attrib
        return attrib.get("Id"), attrib.get("PostTypeId")
    except Exception:
        # Unparsable line: return null instead of failing the whole job.
        return None

(spark.read.text("/tmp/someXML.xml")
    .where(col("value").rlike("^\\s*<row "))
    .select(parse("value").alias("value"))
    .select("value.*")
    .show())

# +--------+----------+
# |      id|postTypeId|
# +--------+----------+
# |47513849|         1|
# | 4751323|         4|
# |  475546|         1|
# |   47597|         2|
# +--------+----------+
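
The per-line parsing logic in the udf above can be checked locally, without a Spark session, before running it against the full file. The following is only a minimal sketch under stated assumptions: it swaps lxml for the standard library's xml.etree.ElementTree, and parse_row and ROW_LINE are hypothetical names introduced here for illustration, not part of the original answer.

```python
import re
import xml.etree.ElementTree as ET

# Same filter as the rlike() call above: keep only lines that open a <row ...> element.
ROW_LINE = re.compile(r"^\s*<row ")


def parse_row(line):
    """Return (Id, PostTypeId) for a single-line <row .../> element, or None.

    parse_row is a hypothetical helper name, used for illustration only.
    """
    if not ROW_LINE.match(line):
        return None
    try:
        attrib = ET.fromstring(line.strip()).attrib
    except ET.ParseError:
        # Truncated or malformed element: skip it.
        return None
    return attrib.get("Id"), attrib.get("PostTypeId")


# A few lines shaped like the file in the question.
sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<SomeRoottag>',
    ' <row Id="47513849" PostTypeId="1" />',
    ' <row Id="4751323" PostTypeId="4" />',
    '</SomeRoottag>',
]

rows = [r for r in map(parse_row, sample) if r is not None]
print(rows)  # [('47513849', '1'), ('4751323', '4')]
```

This only works for the same reason the udf does: each row element is complete on one line, so a line-oriented reader never has to see the whole document.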
