Python: how do I iterate over a huge XML file and get the values?

python xml python-3.x parsing

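I want to iterate over the Users file of the Stack Overflow data dump. The problem is that it is huge, and it is XML. XML is a new topic for me. I have read some documentation and Stack Overflow posts, but for some reason it doesn't work.

XML format: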

<users>
  <row Id="-1" Reputation="1" 
  CreationDate="2008-07-31T00:00:00.000" 
  DisplayName="Community" 
  LastAccessDate="2008-08-26T00:16:53.810" 
  WebsiteUrl="http://meta.stackexchange.com/" 
  Location="on the server farm" AboutMe="&lt;p&gt;Hi, I'm not really a person." Views="649" UpVotes="245983" DownVotes="924377" AccountId="-1" 
  />
</users>

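What I get:

The for loop prints a bunch of hex codes, and eventually I get a memory error. Maybe that is normal, because on my second attempt it iterated over the XML very quickly:

0.13 seconds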

start <Element 'row' at 0x04CC16F0>
end <Element 'row' at 0x04CC16F0>
start <Element 'row' at 0x04CC1810>

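I hope you can help. How do I get the values out of this output? I want to save them to SQL.

All files together are 199 GB (Badges, Comments, PostLinks, PostHistory, Users, Posts, Tags and Votes). Users.xml, the file this question is about, is 2.49 GB. But I want to get all of the SO data into a database.

Yours sincerely,

HanahDevelope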

It looks like you just need to loop over the `end` events for all `row` elements and do something with the attributes:

from xml.etree.ElementTree import iterparse

for evt, elem in iterparse('data/Users.xml', events=('end',)):
    if elem.tag == 'row':
        user_fields = elem.attrib
        print(user_fields)

This outputs:

{'DisplayName': 'Community', 'Views': '649', 'DownVotes': '924377', 'LastAccessDate': '2008-08-26T00:16:53.810', 'Id': '-1', 'WebsiteUrl': 'http://meta.stackexchange.com/', 'Reputation': '1', 'Location': 'on the server farm', 'UpVotes': '245983', 'CreationDate': '2008-07-31T00:00:00.000', 'AboutMe': "<p>Hi, I'm not really a person.", 'AccountId': '-1'}
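
The question also mentions a memory error and the goal of saving the values to SQL. `iterparse` still builds the element tree as it reads, so on a multi-gigabyte file the `row` elements that have already been handled need to be released. The following is a minimal sketch of one way to do that, not part of the original answer: it takes the root element from the first `start` event, clears it after each `row`, and inserts the attributes into SQLite in batches. The helper names (`iter_rows`, `to_record`, `load_users`), the `users` table layout, the choice of columns, the batch size, and the second sample row are all assumptions made for illustration; SQLite stands in for whichever SQL database is actually used.

```python
import io
import sqlite3
from itertools import islice
from xml.etree.ElementTree import iterparse


def iter_rows(source):
    """Yield the attributes of each <row> as a dict, releasing parsed elements."""
    context = iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)  # copy the attributes before discarding
            root.clear()  # drop finished children so memory use stays bounded


def to_record(attrs):
    """Pick a (hypothetical) subset of columns and convert the numeric ones."""
    return (
        int(attrs["Id"]),
        int(attrs["Reputation"]),
        attrs.get("DisplayName"),
        attrs.get("CreationDate"),
    )


def load_users(source, conn, batch_size=10_000):
    """Insert every row from `source` into a `users` table; return the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "Id INTEGER PRIMARY KEY, Reputation INTEGER, "
        "DisplayName TEXT, CreationDate TEXT)"
    )
    rows = iter_rows(source)
    total = 0
    while True:
        batch = [to_record(r) for r in islice(rows, batch_size)]
        if not batch:
            break
        conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", batch)
        conn.commit()  # one commit per batch rather than per row
        total += len(batch)
    return total


# Tiny in-memory sample standing in for data/Users.xml (second row is made up).
sample = io.BytesIO(
    b'<users>'
    b'<row Id="-1" Reputation="1" DisplayName="Community" '
    b'CreationDate="2008-07-31T00:00:00.000"/>'
    b'<row Id="1" Reputation="5" DisplayName="Example" '
    b'CreationDate="2008-07-31T00:00:00.000"/>'
    b'</users>'
)
conn = sqlite3.connect(":memory:")
inserted = load_users(sample, conn, batch_size=1)
print(inserted)
```

For the real dump, the same function could be pointed at the file path from the answer, for example `load_users('data/Users.xml', conn)`, with `conn` connected to an on-disk database instead of `:memory:`.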

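"Huge" doesn't mean anything. 1 MB or 1 TB? Give us numbers.

All files together are 199 GB (Badges, Comments, PostLinks, PostHistory, Users, Posts, Tags and Votes). Users.xml, the file this question is about, is 2.49 GB. But I want to put all of the SO data into a database; I'm not trying to look anything up in it.

But did you leave `start` out of the events on purpose?

Every XML node fires both a `start` and an `end` event, so you really only need to handle one of them. When the `start` event fires, the node is not guaranteed to be fully parsed yet (the ElementTree documentation only guarantees the attributes at that point, not the text or child elements), so the safer approach is to handle only the `end` event.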
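
To make the difference between the two events concrete, here is a small sketch that is not from the original thread. It uses `XMLPullParser` from the standard library instead of `iterparse`, only because feeding the document in two pieces makes the timing visible. As far as the ElementTree documentation goes, an element's attributes are already defined at `start`, while its text and children are not guaranteed until `end`. The sample document and its `hello` text are made up.

```python
from xml.etree.ElementTree import XMLPullParser

parser = XMLPullParser(events=("start", "end"))

# First chunk: only the opening tags; the content of <row> has not arrived yet.
parser.feed('<users><row Id="-1">')
first = [
    (event, elem.tag, dict(elem.attrib), elem.text)
    for event, elem in parser.read_events()
]

# Second chunk: the text content and the closing tags.
parser.feed("hello</row></users>")
second = [(event, elem.tag, elem.text) for event, elem in parser.read_events()]
parser.close()

print(first)   # 'start' events: attributes present, text still None
print(second)  # 'end' events: the row element now carries its text
```

For the Users.xml rows, which carry all their data in attributes, either event would probably work, but handling `end` is the habit that stays correct once elements have text or children.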
