Python: extracting a line of text with BeautifulSoup

python regex

I have two numbers (NUM1; NUM2) that I am trying to extract from web pages that all share the same format:

<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    NUM1 and NUM2 are always followed by the same text across webpages
  </div>
</div>

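Outside of a function, when the actual source of the text is specified (e.g. nums_regex.search(text)), this code works on its own. However, I am modifying someone else's code, and I have never really used classes or functions myself before. Here is a sample of their code: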

@property
def title(self):
    tag = self.soup.find('span', class_='summary')
    title = unicode(tag.string)
    return title.strip()

As you might have guessed, my code does not work. I get this error:

nums_match = nums_regex.search(self)
TypeError: expected string or buffer

It looks like I am not passing in the raw text correctly, but how do I fix it?

You can use the same regular expression pattern to find the element by its text with BeautifulSoup, and then extract the numbers you need:

import re

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())

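Note that since you are trying to match part of the text rather than anything related to the HTML structure, I think it would be fine to simply apply the regular expression to the whole document.

Complete working sample code follows.

Using the BeautifulSoup regex "by text" search: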

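import re

from bs4 import BeautifulSoup

data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())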

Regex-only search:

import re

data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
print(pattern.findall(data))  # prints [('10', '20')]
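
Since findall returns the captured groups as strings, a small helper can convert them to integers and cope with pages where the sentence is missing. The following is only a minimal sketch under that assumption; the function name extract_nums is hypothetical and not part of the original code.

```python
import re

PATTERN = re.compile(
    r"(\d+) and (\d+) are always followed by the same text across webpages"
)


def extract_nums(html):
    """Return every (NUM1, NUM2) pair found in the raw HTML as integers.

    `extract_nums` is a hypothetical helper name used for illustration.
    """
    # findall yields one tuple of strings per match; an empty list if none.
    return [(int(a), int(b)) for a, b in PATTERN.findall(html)]


data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

print(extract_nums(data))           # [(10, 20)]
print(extract_nums("<div></div>"))  # []
```

Returning an empty list for a page without the sentence lets the caller decide how to handle the missing case, rather than failing inside the helper.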

Try nums_regex.search(self.soup.text)
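
The TypeError in the question arises because search() was given self (the object) instead of a string. Modelled on the title property shown in the question, a property that searches the document text might look like the sketch below. The class name Page, its constructor, and the property name nums are assumptions made for illustration; the real class being modified is not shown in the question.

```python
import re

from bs4 import BeautifulSoup

nums_regex = re.compile(
    r"(\d+) and (\d+) are always followed by the same text across webpages"
)


class Page:
    """Hypothetical stand-in for the class that is being modified."""

    def __init__(self, html):
        self.soup = BeautifulSoup(html, "html.parser")

    @property
    def nums(self):
        # Pass the document text to the regex, not `self`:
        # search() expects a string, which is what caused the TypeError.
        nums_match = nums_regex.search(self.soup.text)
        if nums_match is None:
            return None  # the sentence was not found on this page
        return tuple(int(n) for n in nums_match.groups())


html = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

page = Page(html)
print(page.nums)  # (10, 20)
```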

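The BeautifulSoup code works great on its own. I added self. to soup.find_all to integrate it with the rest of the code, but that only produces "()" as output, even though there should be numbers there.

@Matt Well, it works for the input you provided. Can you share the complete HTML you are parsing and the code you have so far?

Thanks, that works great! I am not sure what I did wrong yesterday, but your BeautifulSoup code works today when I add self. to soup. Thanks!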
