How to create a table from every nested node of an XML using pyspark
Tags: xml, scala, pyspark, databricks

I have a nested XML structure like the following:
<parent>
  <root1 detail="something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
  </root1>
  <root2 detail="something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
  </root2>
</parent>
Expected records:
detail ID type
something id1 typeA
something id2 typeB
something id3 typeC
I have tried:
spark.read.format(file_type) \
.option("rootTag", "root1") \
.option("rowTag", "ID") \
.load(file_location)
but this only produces detail (a string) and ID (an array) as columns.
Thanks in advance.

Answer: The trick seems to be to extract `ID` and `type` via the field names (`_VALUE` and `_type`) of the structs inside the array column named `ID` that reading the XML file produces:
from pyspark.sql.functions import explode, col

dfs = []
n = 2
for i in range(1, n + 1):
    df = spark.read.format('xml') \
        .option("rowTag", "root{}".format(i)) \
        .load('file.xml')
    df = df.select([explode('ID'), '_detail']) \
        .withColumn('ID', col('col').getItem('_VALUE')) \
        .withColumn('type', col('col').getItem('_TYPE')) \
        .drop('col') \
        .withColumnRenamed('_detail', 'detail')
    dfs.append(df)
    df.show()
# +---------+---+-----+
# | detail| ID| type|
# +---------+---+-----+
# |something|id1|typeA|
# |something|id2|typeB|
# |something|id3|typeC|
# +---------+---+-----+
#
# +---------+---+-----+
# | detail| ID| type|
# +---------+---+-----+
# |something|id1|typeA|
# |something|id2|typeB|
# |something|id3|typeC|
# +---------+---+-----+
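Since the extraction logic depends only on the XML structure, the expected rows can be sanity-checked without a Spark cluster using just the standard library. This is a sketch; the inline sample below mirrors the structure of `file.xml`:

```python
from xml.etree import ElementTree

# Inline stand-in for file.xml (same structure as the sample above)
XML = """<parent>
  <root1 detail="something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
  </root1>
  <root2 detail="something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
  </root2>
</parent>"""

root = ElementTree.fromstring(XML)
tables = []
for child in root:  # each <rootN> element becomes one table
    rows = [(child.get("detail"), id_el.text, id_el.get("type"))
            for id_el in child.findall("ID")]
    tables.append(rows)

print(tables[0])
# [('something', 'id1', 'typeA'), ('something', 'id2', 'typeB'), ('something', 'id3', 'typeC')]
```

Each entry in `tables` corresponds to one of the per-root DataFrames produced by the Spark loop above.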
If you don't want to specify the number of tables manually (controlled by the variable `n` in the code above), you can run the following first:
from xml.etree import ElementTree

tree = ElementTree.parse("file.xml")
root = tree.getroot()
n = 0
for child in root:  # getchildren() is deprecated; iterate the element directly
    ElementTree.dump(child)
    n += 1
print("n = {}".format(n))
# <root1 detail="something">
# <ID type="typeA">id1</ID>
# <ID type="typeB">id2</ID>
# <ID type="typeC">id3</ID>
# </root1>
#
# <root2 detail="something">
# <ID type="typeA">id1</ID>
# <ID type="typeB">id2</ID>
# <ID type="typeC">id3</ID>
# </root2>
# n = 2
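If only the count of child tables is needed (not the dump of each one), an `Element` supports `len()` directly, so the loop can be skipped entirely. A minimal sketch, using an inline stand-in for `file.xml`:

```python
from xml.etree import ElementTree

# Inline stand-in with the same top-level structure as file.xml
root = ElementTree.fromstring("<parent><root1/><root2/></parent>")

n = len(root)  # number of direct children, i.e. root1..rootN
print("n = {}".format(n))
# n = 2
```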
Comments:

- I assume, since this question has the pyspark tag, that you're using Python, but it's also tagged scala — are you flexible about which language the answer uses?
- It throws the error "Can't extract value from col#1225: need struct type but got string".
- What do you get when you run `df = spark.read.format('xml').option("rowTag", "root1").load('file.xml')` (replacing 'file.xml' with your file name) followed by `df.printSchema()`? I get: root |-- ID: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- _VALUE: string (nullable = true) | | |-- _type: string (nullable = true) |-- _detail: string (nullable = true)
- With `.option("rowTag", "root").option("rootTag", "root")` — I needed to add rootTag, without it I get the error "root tag must exist". I get: root |-- ID: array (nullable = true) | |-- element: string (containsNull = true) |-- detail: string (nullable = true)
- Found the problem. I had to enter the Maven coordinates and install com.databricks:spark-xml; otherwise a different version was being used. Now it gives the same schema as yours.