How to select context words/characters around an <a> tag using BeautifulSoup?

python html


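I am using BeautifulSoup to process HTML from a web crawl. The HTML is run through a filter that "simplifies" it, stripping and replacing tags so that the document contains only <html>, <body>, <div>, and <a> tags plus visible text.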

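I currently have a function that extracts the URLs and anchor text from these pages. In addition, for each link I would like to extract the N "context words" that precede and follow the <a> tag. For example, given the following document: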

<html><body>
<div>This is <a href="www.example.com">a test</a>
<div>There was a big fluffy dog outside the <a href="www.petfood.com">pet food store</a> with such a sad face.<div>
</div>
</body></html>

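The first link (www.example.com) has only two words before it reaches the beginning of the document, so those two words are returned, along with the 6 words following the <a> tag, for a total of N=8. Note also that the returned words cross the boundary of the <div> that contains the <a> tag.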

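The second link (www.petfood.com) has N/2 = 4 words before it and 4 words after it, so those are returned as context. That is, where possible, the N words are split evenly between the text preceding and following the <a> tag.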

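I know how to do this when the text is inside the same <div> as the link, but I cannot figure out how to do it across <div> boundaries like this. Essentially, for the purpose of extracting "context words", I want to treat the document as a single block of visible text with links, ignoring the containing divs.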

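How can I extract the text surrounding <a> tags like this using BeautifulSoup? For simplicity, I would even be happy with an answer that just returns N characters of visible text before/after the tag (I can handle the tokenizing/splitting myself).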
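For the simpler character-based variant mentioned in the question, a minimal sketch might look like the following. It is not taken from the thread: the function name char_context and its return shape are hypothetical, and it assumes the simplified HTML described above (no scripts or comments left in the document). It relies on BeautifulSoup's find_all_previous and find_all_next, which walk the whole document rather than one <div>.

```python
from bs4 import BeautifulSoup


def char_context(html, n):
    """Return (href, before, after) per link, up to n characters per side.

    Hypothetical helper: name and return shape are illustrative only.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a"):
        # Text nodes inside the link itself must not count as context.
        own = {id(s) for s in link.find_all(string=True)}
        # find_all_previous returns the nearest node first, so reverse it
        # to restore document order before joining.
        before = "".join(reversed(link.find_all_previous(string=True)))
        after = "".join(
            s for s in link.find_all_next(string=True) if id(s) not in own
        )
        # Collapse runs of whitespace (including newlines) to single spaces.
        before = " ".join(before.split())
        after = " ".join(after.split())
        # Guard n == 0, because before[-0:] would return the whole string.
        results.append((link.get("href"), before[-n:] if n else "", after[:n]))
    return results
```

Because the text nodes are collected across the whole tree, the enclosing <div> elements have no effect on what counts as "before" or "after".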

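Here is a function that takes the full HTML source and N as input and, for each occurrence of an <a> tag, returns a tuple of the link URL and the list of N context words.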


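Edit: Note that one flaw of this implementation is that it uses the link text as the separator between the text before and the text after the link. This can be a problem if the link text also appears somewhere in the HTML document before the link itself, for example:
<div>This test is <a href="www.example.com">test</a>

A simple workaround is to wrap the link text in special characters (for example [[[[ and ]]]]) so that it becomes unique. This turns "This test is test" into "This test is [[[[test]]]]".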
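In your first example (www.example.com), the output contains the link text "a test", but in the second example (www.petfood.com) the link text is not in the output. Do you want the link text to be included?

Thanks @glhr. That was a typo. The link text should not be included. I am already extracting the anchor text and only need the context words.

Wow, this looks great. I can't test it in depth right now because I have to drive someone to the airport, but I will do so when I get back later and mark it as accepted if it works (as I expect it will from looking at it). Thanks!
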
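from bs4 import BeautifulSoup

def getContext(html,n):
    output = []
    soup = BeautifulSoup(html, 'html.parser')
    # Visible text of the whole document, with newlines flattened to spaces
    text = soup.text.replace('\n',' ')
    # Target number of words on each side of a link
    n_side = n // 2

    for i in soup.find_all("a"):
        # Split the document text on the link text
        # (assumes the link text occurs only once in the document)
        parts = text.split(i.text)

        context_before = parts[0]
        words_before = list(filter(bool,context_before.split(" ")))

        context_after = parts[1]
        words_after = list(filter(bool,context_after.split(" ")))

        if(len(words_after) >= n_side):
            # Enough words after the link: take up to n_side before it,
            # then fill the rest of the budget from the words after it
            words_before = words_before[-n_side:]
            words_after = words_after[:(n-len(words_before))]
        else:
            # Too few words after the link: take what is there,
            # then fill the rest of the budget from the words before it
            words_after = words_after[:n_side]
            words_before = words_before[-(n-len(words_after)):]

        output.append((i["href"], words_before + words_after))
    return output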

html = '''<html><body>
<div>This is <a href="www.example.com">a test</a> 
<div>There was a big fluffy dog outside the <a href="www.petfood.com">pet food store</a> with such a sad face.<div>
</div>
</body></html>'''

print(*getContext(html,8))

('www.example.com', ['This', 'is', 'There', 'was', 'a', 'big', 'fluffy', 'dog'])
('www.petfood.com', ['fluffy', 'dog', 'outside', 'the', 'with', 'such', 'a', 'sad'])
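
The repeated-link-text flaw noted in the edit can also be avoided without splitting on the link text at all. The sketch below is an alternative technique, not the answer's method: it walks the parsed tree once, records the word index at which each link starts and ends, and slices the word list around those positions. The name get_context_by_position is hypothetical, and the code assumes the simplified HTML from the question (only text inside the document is visible text, with no comments or scripts).

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def get_context_by_position(html, n):
    """Hypothetical position-based variant of getContext."""
    soup = BeautifulSoup(html, "html.parser")
    words = []  # every visible word, in document order
    spans = []  # (href, start, end): the link's own words are words[start:end]
    for el in soup.descendants:
        if isinstance(el, Tag) and el.name == "a":
            # Count the link's own words with the same per-node splitting
            # that is used to build the word list below.
            own = sum(
                len(s.split())
                for s in el.descendants
                if isinstance(s, NavigableString)
            )
            spans.append((el.get("href"), len(words), len(words) + own))
        elif isinstance(el, NavigableString):
            words.extend(el.split())

    n_side = n // 2
    output = []
    for href, start, end in spans:
        before, after = words[:start], words[end:]
        if len(after) >= n_side:
            # Guard n_side == 0, because before[-0:] would keep everything.
            before = before[-n_side:] if n_side else []
            after = after[: n - len(before)]
        else:
            after = after[:n_side]
            before = before[-(n - len(after)):]
        output.append((href, before + after))
    return output
```

On the question's example document with N=8 this should produce the same two tuples as shown above, and for the repeated-text example it returns ['This', 'test', 'is'] as the context, since the link is located by position rather than by its text.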

<div>This test is <a href="www.example.com">test</a>

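# Workaround: wrap each link's text in marker characters so that it is unique in the document text
def getContext(html,n):
    output = []
    soup = BeautifulSoup(html, 'html.parser')
    for i in soup.find_all("a"):
        # Assumes the link contains a single text node, so i.string is not None
        i.string.replace_with(f"[[[[{i.text}]]]]")
        # rest of code here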
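
For reference, here is one way the marker workaround might be fleshed out into a complete function. This is a sketch under assumptions rather than the answer's actual code: the name get_context_marked is hypothetical, each link is assumed to contain only plain text, and the marker is assumed not to occur literally elsewhere in the document. It marks one link at a time and restores the original text afterwards, so that markers do not leak into the context words of other links.

```python
from bs4 import BeautifulSoup


def get_context_marked(html, n):
    """Hypothetical complete version of the [[[[...]]]] marker workaround."""
    soup = BeautifulSoup(html, "html.parser")
    n_side = n // 2
    output = []
    for link in soup.find_all("a"):
        original = link.get_text()
        marker = f"[[[[{original}]]]]"
        # Temporarily replace the link's contents with the marked text.
        link.string = marker
        text = soup.get_text().replace("\n", " ")
        # Restore the link so later iterations see unmarked text.
        link.string = original

        # partition splits on the first occurrence of the marker only.
        before, _, after = text.partition(marker)
        words_before = before.split()
        words_after = after.split()

        if len(words_after) >= n_side:
            # Guard n_side == 0, because a [-0:] slice would keep everything.
            words_before = words_before[-n_side:] if n_side else []
            words_after = words_after[: n - len(words_before)]
        else:
            words_after = words_after[:n_side]
            words_before = words_before[-(n - len(words_after)):]

        output.append((link.get("href"), words_before + words_after))
    return output
```

Restoring the text discards any nested tags inside the link, which is acceptable for the simplified HTML described in the question but would need more care for arbitrary pages.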