Parsing an HTML table with Python BeautifulSoup

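I am trying to use BeautifulSoup to parse an HTML table that I uploaded, so that I get its three columns (0 to 735, 0.50 to 1.0, and 0.5 to 0.0) as lists. To explain why: I want the integers 0-735 to be the keys and the decimal numbers to be the values.

After reading many other posts, I came up with the following, which is nowhere near producing the lists I want. All it does is display the text in the table, as shown.

I am new to Python and BeautifulSoup, so please go easy on me! Thanks.

HTML parsers like BeautifulSoup assume that what you want is an object model that mirrors the structure of the input HTML. But sometimes (as in this case) that model gets in the way more than it helps. Pyparsing includes some HTML parsing features that are more robust than raw regular expressions but otherwise work in a similar way: you define the HTML fragments you are interested in and ignore the rest. Here is a parser that reads the HTML source you posted: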

from pyparsing import makeHTMLTags,withAttribute,Suppress,Regex,Group

""" looking for this recurring pattern:
          <td valign="top" bgcolor="#FFFFCC">00-03</td>
          <td valign="top">.50</td>
          <td valign="top">.50</td>

    and want a dict with keys 0, 1, 2, and 3 all with values (.50,.50)
"""

td,tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
td,tdend,keytd = map(Suppress,(td,tdend,keytd))

realnum = Regex(r'1?\.\d+').setParseAction(lambda t:float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t:int(t[0]))
DASH = Suppress('-')

# build up an expression matching the HTML bits above
entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend + 
                    Group(2*(td + realnum + tdend))("vals"))
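The listing above stops before the `lookup` dictionary used in the next step is built. Here is a minimal sketch of that step, assuming each match has already been reduced to a `(start, end, vals)` triple. `build_lookup` and the sample entries are hypothetical and not part of the original answer. With pyparsing, such triples would presumably come from iterating over `entryExpr.searchString(sourceHtml)` and reading `entry.start`, `entry.end` and `entry.vals`.

```python
def build_lookup(entries):
    """Expand (start, end, vals) entries into a dict keyed by every integer in start..end."""
    lookup = {}
    for start, end, vals in entries:
        # The range in the key cell is inclusive, hence end + 1.
        for key in range(start, end + 1):
            lookup[key] = tuple(vals)
    return lookup


# Hypothetical parsed entries; the first mirrors the "00-03 / .50 / .50" pattern.
entries = [(0, 3, [0.50, 0.50]), (4, 10, [0.51, 0.49])]
lookup = build_lookup(entries)
print(lookup[0], lookup[3], lookup[7])
```

Every integer in a range shares the same value tuple, which is what lets a plain `lookup[n]` work for any `n` covered by the table.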

Now try some lookups:

# print out some test values
for test in (0,20,100,700):
    print(test, lookup[test])

Prints:

0 (0.5, 0.5)
20 (0.53, 0.47)
100 (0.64, 0.36)
700 (0.99, 0.01)

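I think the answer above is better than what I am offering, but I have a BeautifulSoup answer that can get you started. It is a bit hackish, but I figured I would offer it anyway.

With BeautifulSoup, you can find all tags that have a particular attribute with soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}) (assuming you already have a soup object set up).

That will find all of your keys. The trick is to associate them with the values you want, which appear immediately afterwards and come in pairs (if that ever changes, by the way, this solution will not work).

So you can try the following to access what comes after the key entry and put it into your dictionary:

for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling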

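The problem is that the "next sibling" is actually a '\n', so you have to do the following to capture the next value (the first value you want):

for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling.next_sibling.string

If you want the next two values, you have to double this up: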

for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
  your_dictionary[node.string] = [node.next_sibling.next_sibling.string, node.next_sibling.next_sibling.next_sibling.next_sibling.string]

Disclaimer: that last line looks ugly to me.

I used BeautifulSoup 3, but it will probably work under 4 as well.
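One way to avoid the chained `next_sibling` calls, sketched here under the assumption that BeautifulSoup 4 (`bs4`) is installed: `find_next_siblings` returns only tag siblings, so the '\n' strings between cells never show up. The sample HTML and the name `values_by_key` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the row pattern from the question.
SAMPLE_HTML = """
<table>
  <tr>
    <td valign="top" bgcolor="#FFFFCC">00-03</td>
    <td valign="top">.50</td>
    <td valign="top">.50</td>
  </tr>
  <tr>
    <td valign="top" bgcolor="#FFFFCC">04-10</td>
    <td valign="top">.51</td>
    <td valign="top">.49</td>
  </tr>
</table>
"""


def values_by_key(html):
    """Map each key cell's text to the floats in the two <td> cells after it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for node in soup.find_all("td", attrs={"bgcolor": "#FFFFCC"}):
        # Only <td> tags are returned, so whitespace text nodes are skipped.
        cells = node.find_next_siblings("td", limit=2)
        result[node.get_text(strip=True)] = [
            float(cell.get_text(strip=True)) for cell in cells
        ]
    return result


your_dictionary = values_by_key(SAMPLE_HTML)
print(your_dictionary)
```

Like the loop above, this still assumes that exactly two value cells follow every key cell.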

# Import System libraries
import re

# Import Custom libraries
from BeautifulSoup import BeautifulSoup

# This may be different between BeautifulSoup 3 and BeautifulSoup 4
with open("fide.html") as file_h:
    # Read the file into the BeautifulSoup class
    soup = BeautifulSoup(file_h.read())

tr_location = lambda x: x.name == u"tr" # Row location
key_location = lambda x: x.name == u"td" and bool(set([(u"bgcolor", u"#FFFFCC")]) & set(x.attrs)) # Integer key location
td_location = lambda x: x.name == u"td" and not dict(x.attrs).has_key(u"bgcolor") # Float value location

str_key_dict = {}
num_key_dict = {}
for tr in soup.findAll(tr_location): # Loop through all found rows
    for key in tr.findAll(key_location): # Loop through all found Integer key tds
        key_list = []
        key_str = key.text.strip()
        for td in key.findNextSiblings(td_location)[:2]: # Loop through the next 2 neighbouring Float values
            key_list.append(td.text)
        key_list = map(float, key_list) # Convert the text values to floats

        # String based dictionary section
        str_key_dict[key_str] = key_list

        # Number based dictionary section
        num_range = map(int, re.split(r"\s*-\s*", key_str)) # Extract a value range to perform interpolation
        if(len(num_range) == 2):
            num_key_dict.update([(x, key_list) for x in range(num_range[0], num_range[1] + 1)])
        else:
            num_key_dict.update([(num_range[0], key_list)])

for x in num_key_dict.items():
    print x
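The listing above is Python 2 code written against BeautifulSoup 3 (`print x`, `has_key`), so it will not run unchanged on Python 3. As a rough sketch of the same row-to-dictionary logic, the following uses the standard library's `html.parser` instead of BeautifulSoup. That is a swapped-in technique, not this answer's method, and `KeyCellParser`, `build_num_key_dict` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical sample mirroring the row pattern shown earlier in the thread.
SAMPLE_HTML = """
<table>
  <tr>
    <td valign="top" bgcolor="#FFFFCC">00-03</td>
    <td valign="top">.50</td>
    <td valign="top">.50</td>
  </tr>
  <tr>
    <td valign="top" bgcolor="#FFFFCC">04-10</td>
    <td valign="top">.51</td>
    <td valign="top">.49</td>
  </tr>
</table>
"""


class KeyCellParser(HTMLParser):
    """Collect each bgcolor="#FFFFCC" key cell plus the two value cells after it."""

    def __init__(self):
        super().__init__()
        self.groups = []      # each entry: [key_text, value_text, value_text]
        self._in_td = False
        self._is_key = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._is_key = (dict(attrs).get("bgcolor") or "").upper() == "#FFFFCC"
            self._chunks = []

    def handle_data(self, data):
        if self._in_td:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self._in_td:
            return
        self._in_td = False
        text = "".join(self._chunks).strip()
        if self._is_key:
            self.groups.append([text])
        elif self.groups and len(self.groups[-1]) < 3:
            self.groups[-1].append(text)


def build_num_key_dict(html):
    """Return a dict mapping every integer in each key range to its float values."""
    parser = KeyCellParser()
    parser.feed(html)
    parser.close()
    num_key_dict = {}
    for key_text, *value_texts in parser.groups:
        bounds = [int(part) for part in key_text.split("-")]
        values = [float(text) for text in value_texts]
        # A single-number key has one bound, so bounds[0] == bounds[-1].
        for number in range(bounds[0], bounds[-1] + 1):
            num_key_dict[number] = values
    return num_key_dict


num_key_dict = build_num_key_dict(SAMPLE_HTML)
print(num_key_dict[0], num_key_dict[7])
```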

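Please upload a picture of how you want the data to look in the end.

+1 for a chess-related question.

It displays the text in the table because that is what the code does. Why not put each field into a dictionary, with the integer as the key and the list of decimals as the value?

Thanks a lot! I am just going through it to make sure I understand, and then I will mark it as my accepted answer.

When I run the code, I get a KeyError: 0 on the line that tries to print some test values, which suggests that the key I am looking up is not in the dictionary?

If you get an error, start working backwards. Look at the contents of lookup. If lookup is empty, then entryExpr probably did not match anything. If entryExpr did not match anything, then the text you are parsing probably does not match the sample you posted.

lookup is empty. For sourcehtml, I entered "fide.html", which is the file in the same directory as my module, and the same file I posted to pastie.

"fide.html" is just a file name. You have to pass the actual file contents to scanString, not the file name. Try setting sourceHtml to: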

open("fide.html").read()
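To make the difference concrete, here is a small sketch showing that a file name and the file's contents are different strings. A throwaway temporary file stands in for fide.html, and `load_html` is a hypothetical helper name.

```python
import os
import tempfile


def load_html(path):
    """Return the text content of the file at path."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# A throwaway file stands in for the real fide.html.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "fide.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write('<td valign="top" bgcolor="#FFFFCC">00-03</td>')
    sourceHtml = load_html(path)

# The contents contain markup to match against; the bare file name does not.
print("<td" in sourceHtml)
print("<td" in "fide.html")
```

A parser handed the string "fide.html" has no tags to match, which is why lookup stayed empty.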

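This is what I meant about the model that BS generates sometimes being more of a hindrance than a help. Another weak spot in BS is that it has to parse the whole HTML, which makes it fragile in the face of odd HTML. A pyparsing-based scraper can look for just the specific constructs it needs in the HTML (even if they are not valid HTML; pyparsing cannot tell) and skip the rest. Of course, using a regular expression is similar in approach, but it forces you to spell out every possible variation in whitespace, upper/lower case, and HTML attributes. makeHTMLTags and pyparsing's whitespace skipping handle all of that.

I agree. That is why I think your answer is better, although to be honest it is a bit over my head. I was half hoping someone would point out something I had overlooked so that the "next_sibling.next_sibling" chain would not be needed (that is actually a trick I learned from reading the BeautifulSoup documentation). I was considering a regex or whitespace test to make sure we captured what we wanted, but that does not get rid of the ugly line. I also thought about a generator object that would look cleaner, but I did not think it through.

Well, it is not just that the chained next_sibling is ugly; it is also sensitive to whitespace. If the cells were all on one line, with no intervening newlines or spaces, you would need fewer next_sibling calls to iterate over the DOM that BS generates. If you wrote a simple generator called next_element that kept calling next_sibling until it found an element node, you could use next_element(node).string, next_element(next_element(node)).string. That may not be much less ugly, but it would certainly be more robust in the face of unpredictable whitespace. Ah, I see there is a node.findNext('td') call in the BS API, so you could write (node.findNext('td').string, node.findNext('td').findNext('td').string) to get a 2-tuple of the contents of the next two td tags.

Yes! That is actually useful for something I am working on. Thanks for helping me think it through.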
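The two ideas from the comments can be sketched roughly as follows, assuming BeautifulSoup 4 (`bs4`), where `findNext` is spelled `find_next`. The helper is written as a plain function rather than a generator, and the two HTML strings are hypothetical samples that differ only in whitespace.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def next_element(node):
    """Follow next_sibling until a Tag is found, skipping whitespace strings."""
    sibling = node.next_sibling
    while sibling is not None and not isinstance(sibling, Tag):
        sibling = sibling.next_sibling
    return sibling


def pair_via_helper(html):
    """Return the two values after the key cell using next_element."""
    key = BeautifulSoup(html, "html.parser").find("td", attrs={"bgcolor": "#FFFFCC"})
    first = next_element(key)
    return (first.string, next_element(first).string)


def pair_via_find_next(html):
    """Return the two values after the key cell using find_next('td')."""
    key = BeautifulSoup(html, "html.parser").find("td", attrs={"bgcolor": "#FFFFCC"})
    first = key.find_next("td")
    return (first.string, first.find_next("td").string)


# Hypothetical samples: same cells, with and without whitespace between them.
SPACED = """<tr>
  <td bgcolor="#FFFFCC">00-03</td>
  <td>.50</td>
  <td>.50</td>
</tr>"""
COMPACT = '<tr><td bgcolor="#FFFFCC">00-03</td><td>.50</td><td>.50</td></tr>'

for html in (SPACED, COMPACT):
    print(pair_via_helper(html), pair_via_find_next(html))
```

Both approaches return the same pair for either layout, whereas a fixed number of `next_sibling` calls would only suit one of them.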