Python 3.x: how to avoid parsing nested table data (python-3.x, html-table, beautifulsoup)


I have a nested table structure. I use the code below to parse the data:

for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        dataset = td.get_text()

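The problem is that when there are nested tables, as in my case where cells contain tables of their own, those inner tables are parsed again after the initial pass, because I use find_all("tr") and find_all("td"). How can I avoid parsing a nested table a second time, given that it has already been parsed?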

Input:

But what I get is:

That is, the inner table is parsed again.
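The duplication described above can be reproduced with a small sketch. The sample HTML is an assumption borrowed from the answers further down, the slices from the original loop are left out, and `html.parser` is used so that no extra `tbody` element is inserted.

```python
from bs4 import BeautifulSoup

# Assumed sample input: the third row holds a cell with a nested table.
html = """
<table>
  <tr><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td></tr>
  <tr><td>5 <table><tr><td>11</td><td>22</td></tr></table> 6</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# find_all() searches all descendants, so the nested row is found as well.
all_rows = table.find_all("tr")
direct_rows = table.find_all("tr", recursive=False)
print(len(all_rows), len(direct_rows))

# Same loop shape as in the question: the nested cells are visited once
# through the outer row and once more through the nested row.
visits = [
    td.get_text(strip=True)
    for row in table.find_all("tr")
    for td in row.find_all("td")
]
print(visits.count("11"))
```

The nested cell "11" is visited twice because both the outer row and the nested row reach it.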

Specs: beautifulsoup4==4.6.3

The order of the data must be preserved, and the content can be anything, including any alphanumeric characters.

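You can check whether a td tag contains another table. If it does, skip that td; otherwise treat it as a regular td.

for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        if td.find('table'):  # check whether the td has a nested table
            continue
        dataset = td.get_text()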
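A runnable sketch of the skip idea follows. It is a variation, not the answer's exact code: the sample HTML is assumed, the header and footer slices are dropped, and rows and cells are looked up with `recursive=False` so that the rows of the nested table are not visited on their own. It relies on `html.parser`, which does not insert a `tbody` element.

```python
from bs4 import BeautifulSoup

# Assumed sample input with one nested table in the last row.
html = """
<table>
  <tr><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td></tr>
  <tr><td>5 <table><tr><td>11</td><td>22</td></tr></table> 6</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

values = []
# recursive=False restricts the search to direct children only.
for row in table.find_all("tr", recursive=False):
    for td in row.find_all("td", recursive=False):
        if td.find("table"):  # the cell wraps a nested table: skip it
            continue
        values.append(td.get_text(strip=True))

print(values)
```

Note the trade-off: skipping the whole cell also drops the text around the nested table (5 and 6 here), so this only fits cases where such cells carry no data of their own.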

For your example, with bs4 4.7.1, I use :has together with :not to exclude rows that have a table child from the loop:

from bs4 import BeautifulSoup as bs

html = '''
<table>
    <tr>
        <td>1</td>
        <td>2</td>
    </tr>
    <tr>
        <td>3</td>
        <td>4</td>
    </tr>
    <tr>
        <td>
            <table>
                <tr>
                    <td>11</td>
                    <td>22</td>
                </tr>
            </table>
        </td>
    </tr>
</table>'''

soup = bs(html, 'lxml')
for tr in soup.select('tr:not(:has(table))'):
    print([td.text for td in tr.select('td')])
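The `:has()` selector depends on the selector engine that ships with bs4 4.7 and later. On 4.6.3, as used in the question, roughly the same row filter can be sketched with plain `find()` calls. This is an assumed equivalent rather than part of the answer, and it uses `html.parser` to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td></tr>
  <tr><td><table><tr><td>11</td><td>22</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only rows that have no table anywhere below them. The nested row
# itself qualifies, so its cells are reported exactly once.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
    if tr.find("table") is None
]

print(rows)
```

As with the selector version, a row whose cell mixes its own text with a nested table is dropped as a whole, so that text is lost.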

Using a combination of bs4 and re, you may be able to achieve what you want.

I am using bs4 4.6.3.

from bs4 import BeautifulSoup as bs
import re

html = '''
<table>
<tr>
   <td>1</td><td>2</td>
</tr>
<tr>
   <td>3</td><td>4</td>
</tr>
<tr>
  <td>5 
    <table><tr><td>11</td><td>22</td></tr></table>
      6
  </td>
</tr>
</table>'''

soup = bs(html, 'lxml')

ans = []

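for x in soup.findAll('td'):
    if x.findAll('td'):
        # The cell contains a nested table: cut the table markup out of the
        # cell's HTML and keep the numbers from the text around it.
        for y in re.split(r'<table>.*</table>', str(x)):
            ans += re.findall(r'\d+', y)
    else:
        # Plain cell, including the cells of the nested table itself.
        ans.append(x.text)
print(ans)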

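For each td, we test whether it contains a nested td. If it does, we split the cell's HTML on the nested table, keep everything outside it, and match each number with a regular expression.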
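The regex version only matches digits and collects the outer cell's text before the nested cells. If document order and arbitrary text both matter, one alternative sketch is to read the text nodes directly with `stripped_strings`, which yields every non-empty string in document order. This is an assumed alternative, not the answer's method, and it uses `html.parser`.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
   <td>1</td><td>2</td>
</tr>
<tr>
   <td>3</td><td>4</td>
</tr>
<tr>
  <td>5
    <table><tr><td>11</td><td>22</td></tr></table>
      6
  </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer_table = soup.find("table")

# stripped_strings walks all text nodes in document order, strips
# whitespace, and skips strings that are empty after stripping.
ans = list(outer_table.stripped_strings)
print(ans)
```

Each text node becomes its own item, so a cell with inline markup such as `<td>a <b>b</b></td>` would produce two entries rather than one.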

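Note that the regex approach only works for two levels of depth, though it could be adapted to any depth.

I tried the findChildren method and somehow managed to generate the output. I am not sure whether this will help you in other scenarios.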

from bs4 import BeautifulSoup
data='''<table>
<tr>
   <td>1</td><td>2</td>
</tr>
<tr>
   <td>3</td><td>4</td>
</tr>
<tr>
  <td>5 
    <table><tr><td>11</td><td>22</td></tr></table>
      6
  </td>
</tr>
</table>'''


soup=BeautifulSoup(data,'html.parser')

for child in soup.find('table').findChildren("tr" , recursive=False):
  tdlist = []
  if child.find('table'):
     for td in child.findChildren("td", recursive=False):
         print(td.next_element.strip())
         for td1 in td.findChildren("table", recursive=False):
             for child1 in td1.findChildren("tr", recursive=False):
                 for child2 in child1.findChildren("td", recursive=False):
                     tdlist.append(child2.text)
                 print(' '.join(tdlist))
                 print(child2.next_element.next_element.strip())
  else:

     for td in child.findChildren("td" , recursive=False):
         tdlist.append(td.text)
     print(' '.join(tdlist))

Edited to add an explanation.

Step 1:

When findChildren is used on the table, it first returns 3 records.

for child in soup.find('table').findChildren("tr", recursive=False):
    print(child)

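Output:

Step 3:

As in step 1, use findChildren to get the td tags.

Once you have a td, follow step 1 again to get its children.

Step 4:

next_element returns the first text of the tag, so in this case it returns the value 5.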
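The behaviour of `next_element` can be checked in isolation with a small sketch. The one-row HTML is an assumed, shortened version of the sample, and `next_sibling` is shown only as an alternative way to reach the trailing text.

```python
from bs4 import BeautifulSoup

html = (
    "<table><tr><td>5 "
    "<table><tr><td>11</td><td>22</td></tr></table>"
    " 6</td></tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
outer_td = soup.find("td")  # first td in document order: the outer cell

# next_element is the next node in parse order, here the cell's first text.
first_text = outer_td.next_element.strip()

# From the last nested td, the first hop lands on its own text ("22") and
# the second hop lands on the text that follows the nested table.
last_inner_td = outer_td.find_all("td")[-1]
trailing_text = last_inner_td.next_element.next_element.strip()

# The same trailing text is also the next sibling of the nested table.
sibling_text = outer_td.find("table").next_sibling.strip()

print(first_text, trailing_text, sibling_text)
```

Chaining `next_element` depends on the exact shape of the markup, so the sibling lookup may be the more robust choice when the nested cells contain further tags.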

Step 5:

As you can see, I am just repeating step 1 recursively. And yes, I use child2.next_element.next_element again to get the value 6 that follows the nested table.
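Since the steps repeat the same direct-children lookup at each level, they can be folded into one recursive function that handles any nesting depth and keeps document order. The function name `collect_cells` is hypothetical, and the sketch assumes `html.parser` and rows that are direct children of their table (no `tbody` wrapper).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def collect_cells(table):
    """Return cell texts of `table` in document order, descending into
    nested tables exactly once (hypothetical helper)."""
    out = []
    for tr in table.find_all("tr", recursive=False):
        for td in tr.find_all("td", recursive=False):
            for node in td.children:
                if isinstance(node, NavigableString):
                    text = node.strip()
                    if text:
                        out.append(text)
                elif isinstance(node, Tag) and node.name == "table":
                    # Nested table: repeat the same procedure one level down.
                    out.extend(collect_cells(node))
                elif isinstance(node, Tag):
                    # Any other inline tag: take its text as a single item.
                    text = node.get_text(strip=True)
                    if text:
                        out.append(text)
    return out


data = """<table>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
<tr><td>5 <table><tr><td>11</td><td>22</td></tr></table> 6</td></tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")
result = collect_cells(soup.find("table"))
print(result)
```

Because the text nodes are taken as they are, the content is not limited to numbers.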

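Comments:

Is there a URL you can share, and can you indicate the expected output?

If your td is a direct child of the row, you can pass recursive=False to the find_all method, for example: row.find_all("td", recursive=False)

@Maaz I tried that, but it still parses the nested table.

It would be better if you could give a concrete input example and the result you want.

@InfectedDrake, I have added sample input and expected output.

But it skips any text outside the nested table. I am using beautifulsoup4==4.6.3.

I suggest upgrading, given that 4.7.1 is more reliable and more capable.

Could you explain the code, so that I can understand it and adapt it to my requirements?

@Nagaraju are you happy with the explanation, or do you need further help?

['1', '2', '3', '4', '5', '6', '11', '22'] The output is fine, but the order is off! This is what is expected: ['1', '2', '3', '4', '5', '11', '22', '6']

The content can be anything, not just numbers.

Fair point, but you should add those specifications to your question. I tried to give a minimal example, so it cannot cover every aspect.

You are right! It is now part of the specification in the question.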


Output of step 1:

<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5 
        <table><tr><td>11</td><td>22</td></tr></table>
          6
      </td>
</tr>

Code snippets referenced in the later steps:

if child.find('table'):

for td in child.findChildren("td", recursive=False):
    print(td.next_element.strip())

for td in child.findChildren("td", recursive=False):
    print(td.next_element.strip())
    for td1 in td.findChildren("table", recursive=False):
        for child1 in td1.findChildren("tr", recursive=False):
            for child2 in child1.findChildren("td", recursive=False):
                tdlist.append(child2.text)
            print(' '.join(tdlist))
            print(child2.next_element.next_element.strip())