Parsing an HTML structure with BeautifulSoup in Python
This is the structure of the HTML document I need to parse (see Update 3):
I tried:
soup = BeautifulSoup(page, 'xml')
divText = soup.find_all('div', {'class': 'Normal-P1'})
for item in divText:
    spanTitle = soup.find_all('span', {'class': 'Normal-C2'})
    spanOptopnal = soup.find_all('span', {'class': 'Normal-C3'})
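However, this approach does not let me separate the Normal-P1 elements so that I go from C2 through C4 and then start over. The C3 between a C4 and the next C2 is not always present. In those cases, C4 is the last tag before the next C2.
I have considered putting all the divs in a list and then splitting them into sub-lists at each C2 in order to process them. I am trying to find out whether there is a more elegant solution using bs4.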
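The list-splitting idea described above can be sketched as follows. This is a minimal sketch, not the asker's code: the markup in SAMPLE_HTML is a hypothetical reconstruction based on the printed output in the question, and group_by_title is an illustrative helper name. It uses html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample reconstructed from the output shown in the question;
# the real document may be structured differently.
SAMPLE_HTML = """
<html><body>
<div class="Normal-P1"><span class="Normal-C2">Main title 1<br/></span></div>
<div class="Normal-P1"><span class="Normal-C3">Optional Subtitle<br/></span></div>
<div class="Normal-P1"><span class="Normal-C3">Second Optional Subtitle</span></div>
<div class="Normal-P1"><span class="Normal-C4">Text blurb1.</span></div>
<div class="Normal-P1"><span class="Normal-C4">Textblurb2.</span></div>
<div class="Normal-P1"><span class="Normal-C2">Main title 2<br/></span></div>
<div class="Normal-P1"><span class="Normal-C3">Subtitle 1</span></div>
<div class="Normal-P1"><span class="Normal-C4">Other text blurb 2.</span></div>
</body></html>
"""


def group_by_title(html):
    """Split the Normal-P1 divs into one sub-list per Normal-C2 title."""
    soup = BeautifulSoup(html, "html.parser")
    groups = []
    for div in soup.find_all("div", class_="Normal-P1"):
        if div.find("span", class_="Normal-C2") is not None:
            groups.append([])      # a title div starts a new group
        if groups:                 # skip any div that precedes the first title
            groups[-1].append(div)
    return groups


groups = group_by_title(SAMPLE_HTML)
for group in groups:
    print([div.get_text(strip=True) for div in group])
```

Each sub-list starts with its title div, so the optional C3 entries and the C4 blurbs can be read from the remainder of the group without caring whether any C3 is present.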
Update 1
Coming back to this after a while. I just reviewed my output using the answer below and noticed a problem.
Looking at:
titles = soup.select(".Normal-P1 .Normal-C2")
for entry in titles:
    print("entry:", entry)
    parent = entry.parent
    print("parent: ", parent)
    subtitles = [
        subtitle.text for subtitle in
        parent.select(' ~ .Normal-P1 .Normal-C3')
    ]
    print("subtitles:", subtitles)
I found that subtitles contains results from outside the parent (that is, from all titles). The output looks like this:
Main Title    Optional Subtitle 1  Optional Subtitle 2       Text Blurb
------------  -------------------  ------------------------  --------------------------------------
Main title 1  Optional Subtitle    Second Optional Subtitle  Text blurb1. Textblurb2. Text blurb 4.
Main Title 2  Subtitle 1                                     Other text blurb 2.
entry: <span class="Normal-C2">Main title 1<br/></span>
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 1<br/></span></div>
subtitles: [Optional Subtitle,Second Optional Subtitle,Subtitle 1]
entry: <span class="Normal-C2">Main title 2<br/></span>
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 2<br/></span></div>
subtitles: [Subtitle 1]
This is the output I see:
entry: <span class="Normal-C2">Main title 1<br/></span>
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 1<br/></span>
</div>
subtitle: <span class="Normal-C3">Optional Subtitle<br/></span>
subtitle: <span class="Normal-C3">Second Optional Subtitle</span>
subtitle: <span class="Normal-C3"><br/></span>
subtitle: <span class="Normal-C3">New Subtitle 1</span>
entry: <span class="Normal-C2">Main title 2<br/></span>
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 2<br/></span>
</div>
subtitle: <span class="Normal-C3">New Subtitle 1</span>
With CSS selectors, you will want to target class names with .className and siblings with ParentTag ~ .ChildClass. It is worth reading a good primer on them.
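To make the two selector forms concrete, here is a small sketch. The markup and the id "first" are hypothetical, added only so the sibling selector has a fixed starting element; they are not taken from the asker's document.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id "first" exists only for this illustration.
SAMPLE_HTML = """
<html><body>
<div class="Normal-P1" id="first"><span class="Normal-C2">Main title 1</span></div>
<div class="Normal-P1"><span class="Normal-C3">Optional Subtitle</span></div>
<div class="Normal-P1"><span class="Normal-C2">Main title 2</span></div>
<div class="Normal-P1"><span class="Normal-C3">Subtitle 1</span></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Descendant selector: every .Normal-C2 element inside a .Normal-P1 element.
titles = [tag.get_text() for tag in soup.select(".Normal-P1 .Normal-C2")]

# General sibling combinator (~): every .Normal-C3 inside any .Normal-P1 that
# comes after #first under the same parent, no matter how far away it is.
later_subtitles = [
    tag.get_text() for tag in soup.select("#first ~ .Normal-P1 .Normal-C3")
]

print(titles)
print(later_subtitles)
```

Note that the sibling combinator matches all following siblings, not only those up to the next title, which matters when several titles share one parent.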
Python file:
import bs4
import csv

entries = []
with open("example.html", "r") as page:
    soup = bs4.BeautifulSoup(page, 'lxml')

# CSS selector for elements with class Normal-C2 inside an element
# with class Normal-P1
titles = soup.select(".Normal-P1 .Normal-C2")
for entry in titles:
    entry_dict = {
        'Main Title': '',
        'Optional Subtitle 1': '',
        'Optional Subtitle 2': '',
        'Text Blurb': ''
    }
    parent = entry.parent
    entry_dict['Main Title'] = entry.text
    # CSS selector for later siblings of parent that have class Normal-P1
    # and contain an element with class Normal-C3
    subtitles = [
        subtitle.text for subtitle in
        parent.select(' ~ .Normal-P1 .Normal-C3')
    ]
    try:
        entry_dict['Optional Subtitle 1'] = subtitles[0]
        entry_dict['Optional Subtitle 2'] = subtitles[1]
    except IndexError:
        pass
    # CSS selector for later siblings of parent that have class Normal-P1
    # and contain an element with class Normal-C4
    entry_dict['Text Blurb'] = ' '.join(
        blurb.text for blurb in
        parent.select(' ~ .Normal-P1 .Normal-C4')
    )
    entries.append(entry_dict)

# newline='' lets the csv module control line endings (Python 3)
with open('out.csv', 'w', newline='') as csv_file:
    fieldnames = [
        'Main Title',
        'Optional Subtitle 1',
        'Optional Subtitle 2',
        'Text Blurb'
    ]
    writer = csv.DictWriter(
        csv_file,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_ALL,
    )
    writer.writeheader()
    for entry in entries:
        writer.writerow(entry)
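Text content of the HTML file used:
Hello, World
Main title 1
Optional Subtitle
Second Optional Subtitle
Text blurb1.
Textblurb2.
Text blurb 4.
Main Title 2
Subtitle 1
Other text blurb 1.
Other text blurb 2.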
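When I do titles = soup.select(".Normal-P1 .Normal-C2") followed by print("titles:", titles), I see no output. It is the HTML structure that causes me to see no results. Also, I had to do: pageFile = codecs.open(file, 'r'); page = pageFile.read(); soup = BeautifulSoup(page, 'xml'); titles = soup.select(".Normal-P1 .Normal-C2"); print("titles:", titles)
Make sure you are using lxml rather than xml as the second argument to bs4.BeautifulSoup()!
That was it. You have the lines below the try/except block indented, which puts them inside the except clause. Move the lines after pass back one level.
parent.select(" ~ .Normal-P1 .Normal-C3") seems to be the cause of the problem. The results are not limited to the subtitles of each title.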
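One way to keep each title's subtitles and blurbs from leaking into the previous title is to walk the following siblings and stop at the next Normal-C2. The sketch below is an alternative technique, not the answer's selector-based code: SAMPLE_HTML is a hypothetical reconstruction from the output shown earlier, collect_entries is an illustrative name, and dropping empty C3 spans (the ones holding only a br tag) is an assumption about the desired result.

```python
from bs4 import BeautifulSoup

# Hypothetical sample reconstructed from the printed output in the question.
SAMPLE_HTML = """
<html><body>
<div class="Normal-P1"><span class="Normal-C2">Main title 1<br/></span></div>
<div class="Normal-P1"><span class="Normal-C3">Optional Subtitle<br/></span></div>
<div class="Normal-P1"><span class="Normal-C3">Second Optional Subtitle</span></div>
<div class="Normal-P1"><span class="Normal-C4">Text blurb1.</span></div>
<div class="Normal-P1"><span class="Normal-C4">Textblurb2.</span></div>
<div class="Normal-P1"><span class="Normal-C3"><br/></span></div>
<div class="Normal-P1"><span class="Normal-C2">Main title 2<br/></span></div>
<div class="Normal-P1"><span class="Normal-C3">Subtitle 1</span></div>
<div class="Normal-P1"><span class="Normal-C4">Other text blurb 2.</span></div>
</body></html>
"""


def collect_entries(html):
    """Build one row per title, reading siblings only up to the next title."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for title in soup.select(".Normal-P1 .Normal-C2"):
        subtitles, blurbs = [], []
        for sibling in title.parent.find_next_siblings("div", class_="Normal-P1"):
            if sibling.find("span", class_="Normal-C2") is not None:
                break  # the next title begins here, so this entry is complete
            subtitles += [s.get_text(strip=True)
                          for s in sibling.find_all("span", class_="Normal-C3")]
            blurbs += [b.get_text(strip=True)
                       for b in sibling.find_all("span", class_="Normal-C4")]
        # Assumption: C3 spans that contain only <br/> carry no subtitle text.
        subtitles = [s for s in subtitles if s]
        entries.append({
            "Main Title": title.get_text(strip=True),
            "Optional Subtitle 1": subtitles[0] if len(subtitles) > 0 else "",
            "Optional Subtitle 2": subtitles[1] if len(subtitles) > 1 else "",
            "Text Blurb": " ".join(blurbs),
        })
    return entries


entries = collect_entries(SAMPLE_HTML)
for row in entries:
    print(row)
```

The resulting list of dictionaries has the same keys as the fieldnames used with csv.DictWriter in the answer, so it could be written out the same way.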
Code used to produce the entry/parent/subtitle output:
import codecs
import html
import bs4

# filepath is defined elsewhere in the original script
file = filepath + "test-page.html"
pageFile = codecs.open(file, 'r', encoding='utf-8')
pageRaw = pageFile.read()
# html.unescape replaces HTMLParser.HTMLParser().unescape on Python 3
page = html.unescape(pageRaw)
soup = bs4.BeautifulSoup(page, 'lxml')
titles = soup.select(".Normal-P1 .Normal-C2")
for entry in titles:
    print("entry:", entry)
    parent = entry.parent
    print("parent: ", parent)
    for subtitle in parent.select(" ~ .Normal-P1 .Normal-C3"):
        print("subtitle:", subtitle)