Python: Problem parsing XML with lxml
I have been trying to parse an XML feed into a pandas DataFrame, but I can't work out where I'm going wrong.
import pandas as pd
import requests
import lxml.objectify
path = "http://www2.cineworld.co.uk/syndication/listings.xml"
xml = lxml.objectify.parse(path)
root = xml.getroot()
The next piece of code parses out the bits I want and creates a list of show dictionaries:
shows_list = []
for r in root.cinema:
    rec = {}
    rec['name'] = r.attrib['name']
    rec['info'] = r.attrib["root"] + r.attrib['url']
    listing = r.find("listing")
    for f in listing.film:
        film = rec
        film['title'] = f.attrib['title']
        film['rating'] = f.attrib['rating']
        shows = f.find("shows")
        for s in shows['show']:
            show = rec
            show['time'] = s.attrib['time']
            show['url'] = s.attrib['url']
            #print show
            shows_list.append(rec)

df = pd.DataFrame(shows_list)
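When I run the code, the film and time fields appear to be duplicated many times across the rows. However, if I put a print statement in the code (it is commented out above), the dictionaries look just as I would expect.
What am I doing wrong? Please feel free to let me know if there is a more Pythonic way of doing the parsing.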
Edit, to clarify:
If I use the print statement to check what is happening during the loop, these are the last five rows of data:
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729365&seats=STANDARD', 'time': '2016-02-07T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729366&seats=STANDARD', 'time': '2016-02-08T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729367&seats=STANDARD', 'time': '2016-02-09T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729368&seats=STANDARD', 'time': '2016-02-10T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729369&seats=STANDARD', 'time': '2016-02-11T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'PG', 'name': 'Cineworld Stoke-on-Trent', 'title': 'Autism Friendly Screening - Goosebumps', 'url': '/booking?performance=4782937&seats=STANDARD', 'time': '2016-02-07T11:00:00'}
And this is the end of the list:
Your code only ever has one object, rec, and it keeps being updated. Try this:
from copy import copy

shows_list = []
for r in root.cinema:
    rec = {}
    rec['name'] = r.attrib['name']
    rec['info'] = r.attrib["root"] + r.attrib['url']
    listing = r.find("listing")
    for f in listing.film:
        film = copy(rec)  # New object holding the cinema-level fields
        film['title'] = f.attrib['title']
        film['rating'] = f.attrib['rating']
        shows = f.find("shows")
        for s in shows['show']:
            show = copy(film)  # New object, copied from film instead of rec
            show['time'] = s.attrib['time']
            show['url'] = s.attrib['url']
            # print(show)
            shows_list.append(show)  # Append the per-show dict, not rec

df = pd.DataFrame(shows_list)
With this structure, the data in rec is copied into each film, and the data in each film is copied into each show. Then, at the end, show is appended to shows_list.
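The same one-dict-per-show idea can also be written without copy(), by building each row as a fresh dict literal. This is only a minimal sketch: the sample XML is made up to mimic the cinema > listing > film > shows > show layout used in the question, build_rows is a hypothetical helper name, and the standard library's xml.etree.ElementTree stands in for lxml.objectify so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the feed described in the question.
SAMPLE = """
<cinemas>
  <cinema name="Cinema A" root="http://example.com" url="/cinemas/1">
    <listing>
      <film title="Film X" rating="PG">
        <shows>
          <show time="2016-02-07T11:00:00" url="/booking?performance=1"/>
          <show time="2016-02-07T14:00:00" url="/booking?performance=2"/>
        </shows>
      </film>
      <film title="Film Y" rating="TBC">
        <shows>
          <show time="2016-02-07T20:45:00" url="/booking?performance=3"/>
        </shows>
      </film>
    </listing>
  </cinema>
</cinemas>
"""


def build_rows(root):
    """Return one independent dict per <show> element."""
    rows = []
    for cinema in root.iter("cinema"):
        for film in cinema.iter("film"):
            for show in film.iter("show"):
                # A dict literal creates a brand-new object on every pass,
                # so no row can be overwritten by a later iteration.
                rows.append({
                    "name": cinema.get("name"),
                    "info": cinema.get("root") + cinema.get("url"),
                    "title": film.get("title"),
                    "rating": film.get("rating"),
                    "time": show.get("time"),
                    "url": show.get("url"),
                })
    return rows


rows = build_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row["title"], row["time"])
```

Passing rows to pd.DataFrame(rows) should then give one DataFrame row per show.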
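You may want to read up on what is happening in those lines: you are giving another name to the original dictionary rather than creating a new dictionary.
print(shows_list): maybe you have the same data several times in shows_list? Maybe the data occurs several times in the XML? Use more print calls to see what is going on. You call append inside the for loop, so you may be adding the same rec again, first with the same name but a different title, or with the same title but a different time.
Your dictionary only has one title key and one time key. Don't you intend to have multiple entries? (You are overwriting the keys every time.)
@salparadise My thinking was that using dictionaries would mean the individual times for a given film at a given cinema ended up in separate dictionaries.
This is brilliant. Thank you very much. I have had similar problems in numpy and pandas, but I thought I could simply set the new dictionary's values from the old dictionary's data. Apparently not.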
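To make the naming-versus-copying point from the comments concrete, here is a small sketch using made-up values (the names alias, clone, rows and shared are illustrative only). It shows that plain assignment adds a second name for the same dict, and why appending that one dict repeatedly gives identical rows even though a print inside the loop looks right at the time.

```python
from copy import copy

rec = {"name": "Cinema A"}

alias = rec                 # a second name for the same dict, not a new dict
alias["title"] = "Film X"   # this also changes rec

clone = copy(rec)           # a new (shallow) dict with the same keys and values
clone["title"] = "Film Y"   # rec is left untouched

print(alias is rec)         # True
print(clone is rec)         # False
print(rec["title"])         # Film X

# Appending the same dict on every pass stores several references to one
# object, so every entry ends up showing the last value written.
rows = []
for time in ["11:00", "14:00"]:
    shared = rec
    shared["time"] = time
    rows.append(shared)

print([row["time"] for row in rows])  # ['14:00', '14:00']
```

A shallow copy is enough here because every value in these dicts is a string; nested lists or dicts would need copy.deepcopy to be fully independent.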