Python: Parsing multiple paragraphs in a single loop iteration with BeautifulSoup

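I'm parsing the comment section of a blog. Unfortunately, the structure is quite irregular.

I'm facing two situations:

The first comment is split across multiple paragraphs:

 <p>My first paragraph.<br />But this a second line</p>
 <p>And this is a third line</p>

How can I print the first two paragraphs in the same loop iteration?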

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
text = [''.join(s.find_all(string=True)) for s in soup.find_all('p')]

# Collect the text of every <p>, then join the first two and the rest separately.
text = [''.join(s.find_all(string=True)) for s in soup.find_all('p')]
print(", ".join(text[:2]))
print(" ".join(text[2:]))

First comment and first line, First comment and second line
Second comment

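When you call soup.find(id="firstDiv").find_all("p"), it builds a list like the one below, so iterating over the three elements of that list gives you three loop iterations: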

[<p>First comment and first line</p>, <p>First comment and second line</p>, <p>Second comment</p>]
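
To see this concretely, here is a minimal sketch using a hypothetical comment block. The HTML below is an assumption modeled on the output shown in this thread, not the asker's actual page. It illustrates that find_all("p") returns a flat list in which the <div> separators are no longer visible.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two comments, each followed by a <div> separator.
html_doc = """
<div id="firstDiv">
<p>First comment and first line</p>
<p>First comment and second line</p>
<div class="date">Date 1</div>
<p>Second comment</p>
<div class="date">Date 2</div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
paragraphs = soup.find(id="firstDiv").find_all("p")

# One loop iteration per <p>; the grouping implied by the <div>s is lost.
for p in paragraphs:
    print(p.get_text())
```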

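What you're trying to do here isn't really a job for soup, because you're dealing with flat data whose structure isn't reflected in the HTML. So you want to let soup take you as far as it can, and then switch to plain iteration.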

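The simplest way to get the p and div children of the parent div is to get all of its children. We only want the HTML nodes, not the strings between them, so we can call find_all with no arguments. (Strictly speaking, find_all() with no arguments returns every descendant tag, not only direct children; pass recursive=False if nested tags should be ignored.) Like this: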

def chunkify(parent):
    """Yield groups of <p> nodes separated by <div> siblings."""
    chunk = []
    for element in parent.find_all():
        if element.name == 'p':
            # Collect paragraphs until the next <div> separator.
            chunk.append(element)
        elif element.name == 'div':
            # A <div> closes the current group.
            yield chunk
            chunk = []
    # Emit any paragraphs left after the last <div>.
    if chunk:
        yield chunk

for paras in chunkify(soup.find(id="firstDiv")):
    print("Print comment: " + '\n'.join(p.get_text() for p in paras))
    print("End of loop")

That's what you wanted, right?
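
For reference, here is a self-contained sketch of the same idea that runs under Python 3. The sample HTML and the name comment_strings are assumptions made for illustration. It also passes recursive=False so that only direct children of the parent are scanned, and it skips empty groups, so treat it as a small variation on the function above rather than the answer's exact code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each comment's paragraphs are followed by a <div>.
html_doc = """
<div id="firstDiv">
<p>First comment and first line</p>
<p>First comment and second line</p>
<div class="date"><b>Date 1</b></div>
<p>Second comment</p>
<div class="date"><b>Date 2</b></div>
</div>
"""

def chunkify(parent):
    """Yield lists of <p> children, split at each <div> child."""
    chunk = []
    # recursive=False: look only at direct children, not deeper tags.
    for element in parent.find_all(recursive=False):
        if element.name == "p":
            chunk.append(element)
        elif element.name == "div":
            if chunk:  # skip empty groups, e.g. two <div>s in a row
                yield chunk
            chunk = []
    if chunk:
        yield chunk

soup = BeautifulSoup(html_doc, "html.parser")

# One string per comment, with its paragraphs joined by newlines.
comment_strings = [
    "\n".join(p.get_text() for p in paras)
    for paras in chunkify(soup.find(id="firstDiv"))
]
print(comment_strings)
```

Each entry of comment_strings then holds one whole comment, which is what the question asks for when it wants the first comment stored in a single string.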

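You can write this function more concisely and, if you understand itertools, I think more readably. But I wanted to write it first in a way that makes more sense to a novice, even if it's clunkier. Here is a shorter version: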

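from itertools import groupby

def chunkify(parent):
    """Yield groups of <p> nodes separated by <div> siblings."""
    # Consecutive non-<div> elements form one run (key True); <div>s form the other (key False).
    grouped = groupby(parent.find_all(), lambda element: element.name != 'div')
    # Keep only the non-<div> runs.
    groups = (g for k, g in grouped if k)
    # Within each run, keep just the <p> nodes.
    return ([node for node in g if node.name == 'p'] for g in groups)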
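
To make the groupby step easier to follow, here is a small sketch on plain strings standing in for tag names (the list names is made up for illustration). groupby collects consecutive items that share the same key, so keying on "is not a div" alternates between runs of non-div items (key True) and runs of divs (key False).

```python
from itertools import groupby

# Stand-ins for element.name values, in document order.
names = ["p", "p", "div", "p", "div"]

# Materialize each run so it can be printed and reused.
runs = [(key, list(group)) for key, group in groupby(names, lambda n: n != "div")]
print(runs)
# [(True, ['p', 'p']), (False, ['div']), (True, ['p']), (False, ['div'])]

# Keeping only the runs whose key is True drops the separators.
chunks = [items for key, items in runs if key]
print(chunks)
# [['p', 'p'], ['p']]
```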

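It's not clear what you're asking. Also, you should link to [your previous question], which is almost certainly relevant, and explain how what you're asking now is different. What are you asking here that wasn't answered there?

@abarnert This is a different question. I need to store the first comment, which is split across two separate paragraphs, in a single string. I'd like to know how I can do that. I think I could use the fact that, in every case, it is clear where a comment ends, even when it spans multiple paragraphs, but I'm not sure how to implement that...

I'm not 100% sure, but I don't think that's what he's asking for. I think he wants to get the first two paragraphs in one iteration and then the third in another, since they're separated by a div element, and so on.

@abarnert, soup.find(id="firstDiv").find_all("p") is a list, so it's not that find fails to find everything on the first iteration; the OP is just looping over the elements of the list. Your new answer is closer to what I had in mind, though I may be wrong...

That's what he's looking for, but it doesn't actually work, because the second group of p nodes aren't children of secondDiv; they're just children of firstDiv that happen to come after secondDiv.

@abarnert, I understand, but I don't see how the OP would get there without using findAll and join.

@CptNemo, so you want the first two paragraphs together and then the last one separately, is that right?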

Output of the chunkify loop:

Print comment: First comment and first line
First comment and second line
End of loop
Print comment: Second comment
End of loop

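The thread also shows this variant of the grouping line, which relies on an isplit helper that is not part of itertools and is not defined here:

groups = isplit(parent.find_all(), lambda element: element.name != 'div')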
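
Since isplit is not part of itertools, here is a minimal sketch of what such a helper might look like. It assumes that isplit(iterable, predicate) yields runs of consecutive items satisfying the predicate and drops the items that do not; that behavior is a guess based on how it is called above, not something the thread defines.

```python
from itertools import groupby

def isplit(iterable, predicate):
    """Yield lists of consecutive items for which predicate is true.

    Items for which predicate is false act as separators and are dropped.
    (Hypothetical helper; not a standard-library function.)
    """
    # bool() normalizes the key so that all truthy results group together.
    for keep, group in groupby(iterable, lambda item: bool(predicate(item))):
        if keep:
            yield list(group)

# Plain strings stand in for tag names here.
print(list(isplit(["p", "p", "div", "p", "div"], lambda n: n != "div")))
# [['p', 'p'], ['p']]
```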