Python: Parsing an HTML structure with BeautifulSoup

This is the structure of the HTML document I need to parse (see Update 3):

I tried:

soup = BeautifulSoup(page,'xml')
divText = soup.find_all('div', {'class':'Normal-P1'})
for item in divText:
    spanTitle = soup.find_all('span',{'class':'Normal-C2'})
    spanOptopnal = soup.find_all('span',{'class':'Normal-C3'})
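However, this approach does not let me isolate the Normal-P1 divs so that I go from C2 to C4 and then start over. The C3 spans are optional and not always present. In those cases, C4 is the last tag before the next C2.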

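I have considered putting all the divs in a list and then splitting it into sublists at each C2 to process them. I am trying to find out whether there is a more elegant solution using bs4.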
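
The list-splitting idea described above could look roughly like the following. This is only a sketch: the sample markup is hypothetical (modelled on the class names in the question, since the real document is not shown), group_by_title is a made-up helper name, and html.parser is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the class names in the question.
SAMPLE_HTML = """
<body>
<div class="Normal-P1"><span class="Normal-C2">Main title 1</span></div>
<div class="Normal-P1"><span class="Normal-C3">Optional Subtitle</span></div>
<div class="Normal-P1"><span class="Normal-C4">Text blurb 1.</span></div>
<div class="Normal-P1"><span class="Normal-C2">Main title 2</span></div>
<div class="Normal-P1"><span class="Normal-C4">Other text blurb 1.</span></div>
</body>
"""


def group_by_title(html):
    """Split the Normal-P1 divs into one sublist per Normal-C2 title."""
    soup = BeautifulSoup(html, "html.parser")
    groups = []
    for div in soup.find_all("div", class_="Normal-P1"):
        if div.find("span", class_="Normal-C2") is not None:
            groups.append([])   # a div holding a title opens a new group
        if groups:              # skip anything that precedes the first title
            groups[-1].append(div)
    return groups


groups = group_by_title(SAMPLE_HTML)
for group in groups:
    print([div.get_text(strip=True) for div in group])
```

Each sublist starts with the div that holds a C2 title and runs up to, but not including, the next one, so a missing C3 simply yields a shorter sublist.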

Update 1

Coming back to this after a while. I just reviewed my output using the answer below and saw a problem.

Looking at

titles = soup.select(".Normal-P1 .Normal-C2")
for entry in titles:
    print "entry:", entry
    parent = entry.parent
    print "parent: ", parent
    subtitles = [
        subtitle.text for subtitle in
        parent.select(' ~ .Normal-P1 .Normal-C3')
    ]
    print "subtitles:", subtitles
I found that subtitles contains results from outside of parent (that is, from all titles). The table I am trying to produce is:

    Main Title     Optional Subtitle 1     Optional Subtitle 2        Text Blurb
    ----------     -------------------     -------------------       ------------------------
    Main title 1   Optional Subtitle       Second Optional Subtitle   Text blurb1. Textblurb2. Text blurb 4.
    Main Title 2     Subtitle 1                                         Other text blurb 2.

The output looks like this:

entry: <span class="Normal-C2">Main title 1<br/></span>
parent:  <div class="Normal-P1">
<span class="Normal-C2">Main title 1<br/></span></div>
subtitles: [Optional Subtitle,Second Optional Subtitle,Subtitle 1]


entry: <span class="Normal-C2">Main title 2<br/></span>
parent:  <div class="Normal-P1">
<span class="Normal-C2">Main title 2<br/></span></div>
subtitles: [Subtitle 1]
This is the output I see:

    entry: <span class="Normal-C2">Main title 1<br/></span>
    parent:  <div class="Normal-P1">

<span class="Normal-C2">Main title 1<br/></span>

    </div>
    subtitle: <span class="Normal-C3">Optional Subtitle<br/></span>
    subtitle: <span class="Normal-C3">Second Optional Subtitle</span>
    subtitle: <span class="Normal-C3"><br/></span>
    subtitle: <span class="Normal-C3">New Subtitle 1</span>
    entry: <span class="Normal-C2">Main title 2<br/></span>
    parent:  <div class="Normal-P1">
    <span class="Normal-C2">Main title 2<br/></span>
    </div>
    subtitle: <span class="Normal-C3">New Subtitle 1</span>

With CSS selectors, you will want to target class names via .className and siblings via ParentTag ~ .ChildClass

Here is a good primer on them.
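
To make the two selector forms concrete, here is a small sketch. The markup and the id "first" are hypothetical, added only so the sibling selector has something to start from, and html.parser is an assumption made to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id "first" exists only for this illustration.
html = """
<body>
<div class="Normal-P1" id="first"><span class="Normal-C2">Main title 1</span></div>
<div class="Normal-P1"><span class="Normal-C3">Optional Subtitle</span></div>
<div class="Normal-P1"><span class="Normal-C2">Main title 2</span></div>
<div class="Normal-P1"><span class="Normal-C3">Subtitle 1</span></div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# ".Normal-P1 .Normal-C2": any .Normal-C2 element nested inside a .Normal-P1.
titles = [tag.get_text() for tag in soup.select(".Normal-P1 .Normal-C2")]

# "#first ~ .Normal-P1": every later sibling of #first with class Normal-P1;
# the trailing ".Normal-C3" then picks the subtitle spans inside those siblings.
later_subtitles = [
    tag.get_text() for tag in soup.select("#first ~ .Normal-P1 .Normal-C3")
]

print(titles)
print(later_subtitles)
```

Note that ~ matches all later siblings under the same parent, not just those up to the next title, so later_subtitles here also picks up the subtitle that belongs to the second title.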

Python file:

import bs4
import csv

entries = []

with open("example.html", "r") as page:
    soup = bs4.BeautifulSoup(page, 'lxml')

    # CSS Selectors for items with class Normal-P1 followed by
    # Normal-C2
    titles = soup.select(".Normal-P1 .Normal-C2")

    for entry in titles:
        entry_dict = {
            'Main Title': '',
            'Optional Subtitle 1': '',
            'Optional Subtitle 2': '',
            'Text Blurb': ''
        }
        parent = entry.parent

        entry_dict['Main Title'] = entry.text

        subtitles = [
            subtitle.text for subtitle in
            parent.select(' ~ .Normal-P1 .Normal-C3')
            # CSS Selector for siblings of the same parent element that have
            # classes Normal-P1 followed by Normal-C3
        ]
        try:
            entry_dict['Optional Subtitle 1'] = subtitles[0]
            entry_dict['Optional Subtitle 2'] = subtitles[1]
        except IndexError:
            pass

        entry_dict['Text Blurb'] = ' '.join(
            blurb.text for blurb in
            parent.select(' ~ .Normal-P1 .Normal-C4')
            # CSS Selector for siblings of the same parent element that have
            # classes Normal-P1 followed by Normal-C4
        )

        entries.append(entry_dict)

    with open('out.csv', 'w') as csv_file:
        fieldnames = [
            'Main Title',
            'Optional Subtitle 1',
            'Optional Subtitle 2',
            'Text Blurb'
        ]
        writer = csv.DictWriter(
            csv_file,
            fieldnames=fieldnames,
            quoting=csv.QUOTE_ALL,
        )
        writer.writeheader()
        for entry in entries:
            writer.writerow(entry)
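HTML file used:


Hello, World
Main title 1
Optional Subtitle
Second Optional Subtitle Text blurb1.
Text blurb2.
Text blurb 4.


Main title 2
Subtitle 1 Other text blurb 1.
Other text blurb 2.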

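I am not seeing any output when I run titles = soup.select(".Normal-P1 .Normal-C2") followed by print "titles:", titles. Is the HTML structure the reason I see no results? Also, I had to do:
pageFile = codecs.open(file, 'r')
page = pageFile.read()
soup = BeautifulSoup(page, 'xml')
titles = soup.select(".Normal-P1 .Normal-C2")
print "titles:", titles

Make sure you are using lxml rather than xml as the second argument to bs4.BeautifulSoup()!

Right, and you have indented the lines below the try/except block, which puts them inside the scope of the except part. Move the lines after pass back one level.

parent.select(" ~ .Normal-P1 .Normal-C3") seems to be the cause of the problem. The results are not limited to the subtitles of each title.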
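
One possible way to keep the results limited to each title is to walk the following siblings and stop at the next title, instead of relying on the ~ selector. The sketch below is not the answer's method; it assumes every title span sits directly inside its own Normal-P1 div (as in the printed output), the sample markup is hypothetical, and collect_entries and texts are made-up helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the output shown in the question.
SAMPLE_HTML = """
<body>
<div class="Normal-P1"><span class="Normal-C2">Main title 1<br/></span></div>
<div class="Normal-P1"><span class="Normal-C3">Optional Subtitle<br/></span></div>
<div class="Normal-P1"><span class="Normal-C3">Second Optional Subtitle</span></div>
<div class="Normal-P1"><span class="Normal-C3"><br/></span></div>
<div class="Normal-P1"><span class="Normal-C4">Text blurb 1.</span></div>
<div class="Normal-P1"><span class="Normal-C2">Main title 2<br/></span></div>
<div class="Normal-P1"><span class="Normal-C3">Subtitle 1</span></div>
<div class="Normal-P1"><span class="Normal-C4">Other text blurb 1.</span></div>
</body>
"""


def texts(tag, css_class):
    """Return the non-empty text of every span with css_class inside tag."""
    found = (s.get_text(strip=True) for s in tag.find_all("span", class_=css_class))
    return [text for text in found if text]


def collect_entries(html):
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for title in soup.select(".Normal-P1 .Normal-C2"):
        entry = {"Main Title": title.get_text(strip=True),
                 "subtitles": [], "blurbs": []}
        # Walk the divs after this title's div, in document order.
        for sibling in title.parent.find_next_siblings("div", class_="Normal-P1"):
            if sibling.find("span", class_="Normal-C2") is not None:
                break  # the next title starts here, so this entry is complete
            entry["subtitles"] += texts(sibling, "Normal-C3")
            entry["blurbs"] += texts(sibling, "Normal-C4")
        entries.append(entry)
    return entries


entries = collect_entries(SAMPLE_HTML)
for entry in entries:
    print(entry)
```

Because the inner loop breaks at the first sibling that contains a C2 span, the subtitles and blurbs of later titles never leak into an earlier entry, and empty spans such as the one holding only a br tag are dropped.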
# Python 2 code that produced the entry / parent / subtitle output shown earlier
import codecs
import HTMLParser  # Python 2 module name

import bs4

file = filepath + "test-page.html"  # filepath is defined elsewhere
parser = HTMLParser.HTMLParser()
pageFile = codecs.open(file, 'r', encoding='utf-8')
pageRaw = pageFile.read()
page = parser.unescape(pageRaw)

soup = bs4.BeautifulSoup(page, 'lxml')
titles = soup.select(".Normal-P1 .Normal-C2")

for entry in titles:
    print "entry:", entry
    parent = entry.parent
    print "parent: ", parent

    for subtitle in parent.select(" ~ .Normal-P1 .Normal-C3"):
        print "subtitle:", subtitle