Python BeautifulSoup: extract text between elements


I am trying to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

However, I get all the text between all the nested tags, plus the comment.

Can anyone help me get just "THIS IS MY TEXT" out of this?

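Use .children instead, and keep only the children that are NavigableString objects but not Comment objects.

Yes, this is a bit of a dance.

Output: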

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children 
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
... 




      THIS IS MY TEXT
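
For readers on Python 3 with the bs4 package, the same filter might be sketched roughly as follows. The html.parser backend, the trimmed copy of the question's HTML and the helper name direct_text are assumptions made for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the HTML from the question (assumption: html.parser backend).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""


def direct_text(tag):
    """Join the strings that are direct children of tag, skipping comments."""
    return "".join(
        str(child)
        for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ).strip()


soup = BeautifulSoup(HTML, "html.parser")
for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    print(direct_text(hit))
```

Comment is itself a subclass of NavigableString, which is why the second isinstance check is needed.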

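You can use .contents and pick the entry you need by index.

See the BeautifulSoup documentation to learn more about how to navigate the parse tree. The parse tree contains Tag objects and NavigableString objects (the target here is a NavigableString, because it is plain text). An example: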

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

For example, a tag with several child nodes gives:

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
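
Applied to the HTML from the question, the .contents approach could look something like the sketch below. It assumes Python 3, the bs4 package and the html.parser backend; the index of the target depends on the exact markup and on the parser, so listing the children first is a sensible check.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question.
HTML = """<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")
td = soup.find("td", class_="MYCLASS")

# List every direct child with its index to see where the target sits.
for index, child in enumerate(td.contents):
    print(index, type(child).__name__, repr(child))

# With this markup and parser the bare text is the child at index 6.
target = td.contents[6].strip()
print(target)
```

A hard-coded index is brittle: adding or removing a single tag or comment before the target shifts it.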

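The BeautifulSoup documentation provides an example of using the extract method to remove objects from a document. In the following example the goal is to remove all comments from the document:

Removing elements

Once you have a reference to an element, you can rip it out of the tree with the extract method. This code removes all the comments from a document: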

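from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                    <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>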
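
The documentation example above targets the older BeautifulSoup 3 API. A rough bs4 equivalent, assuming Python 3 and the html.parser backend, might be:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3",
    "html.parser",
)

# find_all returns a list, so extracting while looping over it is safe.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

# A second search confirms that no Comment nodes are left in the tree.
remaining = soup.find_all(string=lambda text: isinstance(text, Comment))
print(len(comments), len(remaining))
print(soup)
```

A plain for loop is used instead of the list comprehension because extract() is called only for its side effect.
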
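Short answer: soup.findAll('p')[0].next.next

Real answer: you need an invariant reference point from which you can reach your target.

You mentioned in your comment on Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element, then work out how to make BeautifulSoup navigate the parse tree along that invariant path.

For example, in the HTML provided in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') finds paragraph elements, soup.findAll('p')[0] is the first paragraph element.

In this case you could use soup.find('p'), but soup.findAll('p')[n] is more general, since your actual scenario might need the fifth paragraph or something similar.

The next attribute is the next parsed element in the tree, including children. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next returns the target in the HTML provided.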
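
As a minimal sketch of that walk in bs4, where the attribute is spelled next_element, the two hops from the first paragraph could be written like this. Python 3, the html.parser backend and the trimmed HTML are assumptions.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question.
HTML = """<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")

first_p = soup.find_all("p")[0]   # the invariant reference point
inside = first_p.next_element     # the string inside the <p> tag
target = inside.next_element      # the string that follows the closing </p>

print(repr(inside))
print(target.strip())
```

If the first paragraph were empty, the first hop would already land on the text after it, which is why the answer stresses that the paragraph is not empty.
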
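Using your own soup object:
soup.p.next_sibling.strip()

You grab the <p> directly with soup.p *(this relies on it being the first <p> in the parse tree).
Then use next_sibling on the Tag object that soup.p returns, because the desired text is nested at the same level of the parse tree as the <p>.
.strip() is just a Python str method that removes leading and trailing whitespace.
*Otherwise, find the element with whatever filter you choose.
In the interpreter this looks something like: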
In [4]: soup.p
Out[4]: <p>something</p>

In [5]: type(soup.p)
Out[5]: bs4.element.Tag

In [6]: soup.p.next_sibling
Out[6]: u'\n      THIS IS MY TEXT\n      '

In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString

In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'

In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode
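
When the text is not guaranteed to be the very next sibling, one possible variation is to walk next_siblings until a non-blank string turns up. This is only a sketch: the helper name text_after is hypothetical, and Python 3 with bs4 and html.parser is assumed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the HTML from the question.
HTML = """<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>"""


def text_after(tag):
    """Return the first non-blank, non-comment string sibling after tag, or None."""
    for sibling in tag.next_siblings:
        if isinstance(sibling, Comment):
            continue
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
    return None


soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find("td", class_="MYCLASS")
print(text_after(cell.find("p")))
```

Searching inside the matched cell rather than from soup.p avoids the dependence on the paragraph being the first one in the whole document.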

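This will print: THIS IS MY TEXT

Try this..

It returns u'\na comment\nText\nsomething\nTHIS IS MY TEXT\nsomething else\n' or u'a commentTextsomethingTHIS IS MY TEXTsomething else', which is more text than was asked for.

@CristianCiupitu: Of course, you are right, I was not paying attention here. Updated.

This is the only solution that does not depend on the order of the text or on its position relative to specific other elements. Instead it extracts all text from the specified tag/element while ignoring the text (or other content) of child tags/elements. Thanks!

This is awkward, but it works and it solved my problem (I am not the OP, but I had a similar need).

Thanks, but the text is not always in the same place. Will it work anyway?

Alas, no. Perhaps use one of the other answers.

What does the number 6 mean?

@User: Since .contents returns a list, we take the 7th element (i.e. index 6) of that list, which is the text.

hit.string is None and hit.contents[0] is u'\n', so please provide an answer that works for the example in the question.

So here you can play with contents and fetch the content at the index you need.

Could the answer add more explanatory text about how it answers the question?

I was looking for this too, to get the string of a post that I want to use elsewhere. I found a simple way: if the soup is disposable, you can use soup.html.unwrap() and soup.body.unwrap() to remove those tags, so that print(soup) gives everything except those tags.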
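
The unwrap() suggestion from the last comment can be tried on a small document. This is a sketch under the assumption of Python 3, bs4 and the html.parser backend; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, parsed with the html.parser backend.
soup = BeautifulSoup(
    "<html><body><p>kept</p>THIS IS MY TEXT</body></html>",
    "html.parser",
)

# unwrap() replaces a tag with its own contents, so only the wrapper goes away.
soup.html.unwrap()
soup.body.unwrap()

print(soup)
```

Note that this modifies the tree in place, which is why the comment limits it to a soup you no longer need in its original form.
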
Code snippets from the answers above:

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                    <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>

soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
  hit = hit.text.strip()
  print hit
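
As the comments point out, .text collects the text of nested tags as well. A hedged sketch of the difference, assuming Python 3, bs4 and the html.parser backend, with find_all(string=True, recursive=False) used to restrict the search to the tag's own strings:

```python
from bs4 import BeautifulSoup, Comment

# Trimmed copy of the HTML from the question.
HTML = """<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")
hit = soup.find(attrs={"class": "MYCLASS"})

# .text / get_text() walks the whole subtree, so nested tag text comes along.
everything = hit.get_text()

# string=True with recursive=False restricts the search to direct child strings.
own_strings = [
    s for s in hit.find_all(string=True, recursive=False)
    if not isinstance(s, Comment)
]
own_text = "".join(own_strings).strip()

print(repr(everything))
print(own_text)
```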