.find() in BeautifulSoup in Python

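I am trying to search an HTML document using BeautifulSoup. Is there a way to search the document for a table that contains specific keyword strings? For example, if I have the table: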

<table>
  <tr>
    <td> 'abc' </td>
    <td> 'other data' </td>
  </tr>
  <tr>
    <td> 'def' </td>
    <td> 'other data' </td>
  </tr>
  <tr>
    <td> '123' </td>
    <td> 'other data' </td>
  </tr>
  <tr>
    <td> '456' </td>
    <td> 'other data' </td>
  </tr>
</table>

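Edit: Yes, the table needs to contain all of these strings. The HTML file also contains other tables that may contain any one of these strings, but only one table contains all 3 of them. And yes, I will only look for the strings inside <td> tags (perhaps "text" rather than "strings"; I'm not sure of the difference).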

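One solution is to pass a custom function to soup.find(). The function must accept exactly one argument (a bs4.element.Tag object), and its only job is to return True or False: True if the table meets the conditions you set, False otherwise.

If you want to test for different strings, you can start with a function that takes two arguments and then use functools.partial() to boil it down to one argument: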

from functools import partial

def _table_contains_strs(tag, strings):
    """Return True if `tag` is a <table> whose <td> cells cover all `strings`."""
    if tag.name != 'table':
        return False
    tds = tag.find_all('td')
    if not tds:
        return False
    test = {s: False for s in strings}
    for td in tds:  # separate name so the `tag` argument is not shadowed
        for s in strings:
            # `s in td` compares `s` with the cell's direct children, so it
            # matches only when a child string equals `s` exactly.
            if s in td:
                test[s] = True
        if all(test.values()):
            # You can return early (without full iteration)
            # if all strings already matched.
            return True
    return False

def _make_single_arg_function(func, *args, **kwargs):
    # Bind the extra arguments to `func` itself rather than a hard-coded name.
    return partial(func, *args, **kwargs)

table_contains_strs = _make_single_arg_function(
    _table_contains_strs,
    strings=('abc', 'def', '456')
)
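
To isolate the mechanism, here is a minimal sketch of the single-argument contract on its own. The names has_tag_name and is_table are hypothetical and used only for illustration; the sketch assumes the html.parser backend.

```python
from functools import partial

from bs4 import BeautifulSoup


def has_tag_name(tag, name):
    """Hypothetical two-argument predicate: True if `tag` is a <name> element."""
    return tag.name == name


# partial() fixes `name`, leaving a callable that takes only the Tag,
# which is the shape soup.find() expects from a filter function.
is_table = partial(has_tag_name, name="table")

soup = BeautifulSoup(
    "<div><p>intro</p><table><tr><td>abc</td></tr></table></div>",
    "html.parser",
)

# find() calls is_table once per Tag and returns the first one it accepts.
first_table = soup.find(is_table)
```

Here first_table should be the <table> element, because the <div> and <p> tags are rejected by the predicate before the table is reached.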

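Note: I can't say this will scale very well, since it uses nested for loops to check every string against every <td> tag. But hopefully it gets the job done. Simply using .text may not be a good idea, because that would also capture other nested tags, which you said you don't want. Another way to optimize would be to change find_all() …

Comments:

Please post sample HTML as part of a minimal, complete, and verifiable example.

Thanks for posting the HTML example. A few things would clarify your question: does the table need to contain all of these strings, or any of them? Can the strings appear only inside <td> tags, or anywhere in the table?

Yes, the table needs all of these strings. Yes, I will only look for the strings inside <td> tags (perhaps "text" rather than "strings"; I'm not sure of the difference).

Thanks! This example works, but as you said, it does not scale well to the amount of data I'm trying to process. It did give me some ideas to try.

Can you share the actual web page URL?

@Buckbacott It's a loop that reads all the annual reports in the sec.gov EDGAR database. The problem is that the documents are not formatted consistently (different words and phrases, but also different HTML formatting). I'm trying to find a specific table by looking for tables that contain specific keywords.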

from bs4 import BeautifulSoup

# Add some text to other tags to make sure you're
# finding only in <td>s
html = """\
<table>
    <th>field1</th>
    <th>field2</th>
  <tr>
    <td>abc</td>
    <td>other data</td>
  </tr>
  <tr>
    <td>def</td>
    <td>other data</td>
  </tr>
  <tr>
    <td>123</td>
    <td>other data</td>
  </tr>
  <tr>
    <td>456</td>
    <td>other data</td>
  </tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')

soup.find(table_contains_strs)
# Should return the table above
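
On the scaling concern raised in the comments, one possible variation is to collect each table's cell strings into a set once and then do a single subset test, instead of checking every string against every cell. This is only a sketch under assumptions: the name table_has_all and the sample HTML are hypothetical, matching is on whitespace-stripped direct text of each <td> (text inside nested tags is ignored), and no measurement of the actual speedup on EDGAR filings has been made.

```python
from bs4 import BeautifulSoup, NavigableString


def table_has_all(tag, required):
    """Hypothetical helper: True if `tag` is a <table> whose <td> cells,
    taken together, contain every string in the set `required`."""
    if tag.name != "table":
        return False
    cell_strings = set()
    for td in tag.find_all("td"):
        for child in td.children:
            # Keep only text sitting directly in the cell, not in nested tags.
            if isinstance(child, NavigableString):
                cell_strings.add(child.strip())
    # One set comparison replaces the per-string, per-cell inner loop.
    return required <= cell_strings


html = """
<table id="partial"><tr><td>abc</td><td>def</td></tr></table>
<table id="full">
  <tr><td> abc </td><td>def</td></tr>
  <tr><td>456</td><td><b>abc</b></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
required = frozenset({"abc", "def", "456"})
match = soup.find(lambda tag: table_has_all(tag, required))
```

In this sample only the second table holds all three strings, so match should be the table with id "full". Unlike the exact comparison in the answer's code, this version strips surrounding whitespace before comparing, which is a deliberate change in behavior.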