Python: accessing grandchildren with select()
I have been struggling with this problem for quite a while. Given the following XML file:
<?xml version='1.0' encoding='UTF-8'?>
<html>
<body>
<feed xml:base="https:newrecipes.org"
xmlns="http://www.w3.org/2005/Atom"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
<id>https://recipes.com</id>
<title>Cuisine</title>
<updated>2020-08-10T08:48:56.800Z</updated>
<link href="Cuisine" rel="self" title="Cuisine"/>
<entry>
<id>https://www.cuisine.org(53198770598313985)</id>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
<title></title>
<updated>1970-01-01T00:00:00.000Z</updated>
<content type="application/xml">
<m:properties>
<d:id m:type="Edm.Int64">53198770598313985</d:id>
<d:name m:type="Edm.String">American</d:name>
</m:properties>
</content>
</entry>
<entry>
<id>https://www.cuisine.org(53198770598313986)</id>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
<title></title>
<updated>1970-01-01T00:00:00.000Z</updated>
<content type="application/xml">
<m:properties>
<d:id m:type="Edm.Int64">53198770598313986</d:id>
<d:name m:type="Edm.String">Asian</d:name>
</m:properties>
</content>
</entry>
</feed>
</body>
</html>
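This returns all the ids in the file, the ones that sit inside the parentheses of each tag's text. But I also want to access the properties inside each entry, because they contain the cuisine's id and name without requiring any parsing.
Unfortunately, I cannot get that deep with the CSS child combinator (>), and I would like to know whether there is a better way than iterating over the elements to extract the values. Something like:
cuisine_ids_unparsed = xml_soup.select("entry > content > properties > id")
to retrieve all the ids, and
cuisine_names_unparsed = xml_soup.select("entry > content > properties > name")
to retrieve all the names.
You can use the zip function to pair the two sets of tags: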
import re
from bs4 import BeautifulSoup
txt = '''<?xml version='1.0' encoding='UTF-8'?>
<html>
<body>
<feed xml:base="https:newrecipes.org"
xmlns="http://www.w3.org/2005/Atom"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
<id>https://recipes.com</id>
<title>Cuisine</title>
<updated>2020-08-10T08:48:56.800Z</updated>
<link href="Cuisine" rel="self" title="Cuisine"/>
<entry>
<id>https://www.cuisine.org(53198770598313985)</id>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
<title></title>
<updated>1970-01-01T00:00:00.000Z</updated>
<content type="application/xml">
<m:properties>
<d:id m:type="Edm.Int64">53198770598313985</d:id>
<d:name m:type="Edm.String">American</d:name>
</m:properties>
</content>
</entry>
<entry>
<id>https://www.cuisine.org(53198770598313986)</id>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
<title></title>
<updated>1970-01-01T00:00:00.000Z</updated>
<content type="application/xml">
<m:properties>
<d:id m:type="Edm.Int64">53198770598313986</d:id>
<d:name m:type="Edm.String">Asian</d:name>
</m:properties>
</content>
</entry>
</feed>
</body>
</html>'''
soup = BeautifulSoup(txt, 'xml')
for id_, name in zip(soup.select('entry > id'), soup.select('entry > content > m|properties > d|name')):
    print(re.search(r'\((.*?)\)', id_.text).group(1))
    print(name.text)
    print('-' * 80)
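zip pairs results purely by position, so the two lists can drift apart if any entry lacks a name. One way to keep each id with its name is to walk the properties elements one by one and read both children from the same parent. The sketch below does this with the standard library's xml.etree.ElementTree rather than BeautifulSoup (a swapped-in technique, not the answer's method); FEED is a trimmed copy of the question's XML, and NS and extract_cuisines are names made up for illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed from the question (XML declaration omitted).
FEED = """<html><body>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:id m:type="Edm.Int64">53198770598313985</d:id>
        <d:name m:type="Edm.String">American</d:name>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:id m:type="Edm.Int64">53198770598313986</d:id>
        <d:name m:type="Edm.String">Asian</d:name>
      </m:properties>
    </content>
  </entry>
</feed>
</body></html>"""

# Prefix-to-URI map; the prefixes are local to this script (assumed names),
# only the URIs have to match the document.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}


def extract_cuisines(xml_text):
    """Return a list of (id, name) tuples, one per m:properties element."""
    root = ET.fromstring(xml_text)
    cuisines = []
    for props in root.iterfind(".//atom:entry/atom:content/m:properties", NS):
        # Both values come from the same parent, so they cannot get misaligned.
        id_ = props.findtext("d:id", namespaces=NS)
        name = props.findtext("d:name", namespaces=NS)
        cuisines.append((id_, name))
    return cuisines


for id_, name in extract_cuisines(FEED):
    print(id_, name)
```

Reading d:id directly also avoids the regex on the entry's id URL, since the numeric id is already stored on its own in the properties element.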
This borrows a bit from @Andrej Kesely's suggestion, but instead of zip you can use a regex as follows:
import re
from bs4 import BeautifulSoup
txt = '''<?xml version='1.0' encoding='UTF-8'?>
<html>
<body>
<feed xml:base="https:newrecipes.org"
xmlns="http://www.w3.org/2005/Atom"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
<id>https://recipes.com</id>
<title>Cuisine</title>
<updated>2020-08-10T08:48:56.800Z</updated>
<link href="Cuisine" rel="self" title="Cuisine"/>
<entry>
<id>https://www.cuisine.org(53198770598313985)</id>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
<title></title>
<updated>1970-01-01T00:00:00.000Z</updated>
<content type="application/xml">
<m:properties>
<d:id m:type="Edm.Int64">53198770598313985</d:id>
<d:name m:type="Edm.String">American</d:name>
</m:properties>
</content>
</entry>
<entry>
<id>https://www.cuisine.org(53198770598313986)</id>
<category scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" term="DefaultNamespace.Cuisine"></category>
<title></title>
<updated>1970-01-01T00:00:00.000Z</updated>
<content type="application/xml">
<m:properties>
<d:id m:type="Edm.Int64">53198770598313986</d:id>
<d:name m:type="Edm.String">Asian</d:name>
</m:properties>
</content>
</entry>
</feed>
</body>
</html>'''
xml_soup = BeautifulSoup(txt, features="xml")
properties_unparsed = xml_soup.select('entry > content > m|properties')
for prop in properties_unparsed:
    # prop.text is the concatenated text of <d:id> and <d:name>.
    # The id is a run of digits and the name a run of word characters;
    # \s* tolerates any whitespace the parser keeps between the two elements,
    # and re.search skips leading whitespace that re.match would trip over.
    id_, name = re.search(r'(\d+)\s*(\w+)', prop.text).groups()
    print(id_, name)
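Because this approach parses the concatenated text of the properties element, the pattern has to cope with whatever whitespace sits between the id and the name. The following sketch uses two made-up sample strings (compact and spaced are illustrative, not captured parser output) to compare a pattern that joins the two groups directly with one that allows whitespace in between:

```python
import re

# Hypothetical samples of what prop.text might look like:
compact = "53198770598313985American"        # no whitespace between elements
spaced = "\n53198770598313985\nAmerican\n"   # whitespace kept between elements

strict = re.compile(r"(\d+)(\w+)")       # groups joined directly
tolerant = re.compile(r"(\d+)\s*(\w+)")  # optional whitespace between groups


def split_id_name(pattern, text):
    """Return (id, name) from the first match anywhere in text, or None."""
    m = pattern.search(text)
    return m.groups() if m else None


print(split_id_name(strict, compact))
print(split_id_name(strict, spaced))
print(split_id_name(tolerant, spaced))
```

With the spaced sample, the strict pattern still matches, but it backtracks and splits the id itself into two pieces, which is easy to miss. The tolerant pattern returns the same pair for both samples.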
Hi Andrej! Thanks for your reply! Which Python version are you using? Because this part: soup.select('entry > content > m|properties > d|name') returns an empty list for me :/
@Jack Make sure you are using the latest version of bs4 and the xml parser.