Python: How to get all the text before a <br> tag using BeautifulSoup4

python html scrapy


I am scraping some data for my application. My problem is that I need the text that comes before each <br> tag. Here is the HTML code:

<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>



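I would like the output to look like this:

This is a first sentence.
This is a second sentence.
This is a third sentence.

Is it possible to do this?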

htmlText = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""
from bs4 import BeautifulSoup
# these two steps are to put everything into one line. may not be necessary for you
htmlText = htmlText.replace("\n", " ")
while "  " in htmlText:
    htmlText = htmlText.replace("  ", " ")

# import into bs4
soup = BeautifulSoup(htmlText, "lxml")

# using https://stackoverflow.com/a/34640357/5702157
for br in soup.find_all("br"):
    br.replace_with("\n")

parsedText = soup.get_text()
while "\n " in parsedText:
    parsedText = parsedText.replace("\n ", "\n") # remove spaces at the start of new lines
print(parsedText.strip())
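The flattening step in the code above is needed because the source HTML already contains newlines, so "\n" alone cannot tell a real <br> from ordinary formatting. A variation, sketched here under assumptions, swaps each <br> for a sentinel character and splits on it instead. The names `BR_MARK` and `lines_split_on_br` are hypothetical, and `html.parser` is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html_text = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""

# Hypothetical sentinel: a private-use character assumed not to occur in the page text.
BR_MARK = "\ue000"


def lines_split_on_br(markup):
    """Return the text between <br> tags as a list of whitespace-normalised lines."""
    soup = BeautifulSoup(markup, "html.parser")
    # Replace every <br> with the sentinel so it survives get_text().
    for br in soup.find_all("br"):
        br.replace_with(BR_MARK)
    chunks = soup.get_text().split(BR_MARK)
    # Collapse runs of whitespace inside each chunk and drop empty chunks.
    lines = [" ".join(chunk.split()) for chunk in chunks]
    return [line for line in lines if line]


lines = lines_split_on_br(html_text)
print("\n".join(lines))
```

This avoids the trailing spaces that the newline-based version leaves at the end of each line, at the cost of relying on the sentinel never appearing in real content.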


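This is certainly possible. I am going to answer in a slightly more general way, because I doubt you only want to process this one piece of HTML.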

First, get a pointer to the td element:

td = soup.find('td')

Now, notice that you can get a list of this element's children:

>>> td_kids = list(td.children)
>>> td_kids
['\n    This\n    ', <a class="tip info" href="blablablablabla">is a first</a>, '\n    sentence.\n    ', <br/>, '\n    This\n    ', <a class="tip info" href="blablablablabla">is a second</a>, '\n    sentence.\n    ', <br/>, 'This\n    ', <a class="tip info" href="blablablablabla">is a third</a>, '\n    sentence.\n    ', <br/>, '\n']
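The list mixes plain strings with tags, and the <br> tags are the ones that matter for splitting. As a rough sketch (the helper name `describe` is hypothetical, and `html.parser` is an assumption made to keep the example dependency-free), each child can be classified like this:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_text = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""

soup = BeautifulSoup(html_text, "html.parser")
td = soup.find("td")


def describe(node):
    """Label a child of the cell as a <br>, some other tag, or a plain string."""
    if isinstance(node, Tag):
        return "br" if node.name == "br" else "tag"
    return "string"


kinds = [describe(child) for child in td.children]
print(kinds)
```

Each "br" entry marks the end of one sentence's worth of nodes.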

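For each item in the list, check whether it is a <br> tag; those mark where one sub-list ends and the next begins.

Then you can iterate through each sub-list, replacing the tags by converting them to soup and getting the children of those. Eventually you will have several sub-lists that contain only what BeautifulSoup calls "navigable strings", which you can manipulate as usual.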

Join the elements together, and then I suggest using re.sub to get rid of the extra whitespace, like so:

# joined_text stands for the string made by joining one sub-list (needs `import re`)
result = re.sub(r'\s{2,}', ' ', joined_text).strip()
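Putting the steps of this answer together might look roughly like the following. This is only a sketch under assumptions: the function name `text_before_each_br` is hypothetical, `html.parser` is chosen to avoid extra dependencies, and nested tags are flattened with `get_text()` rather than by re-parsing them.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

html_text = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""


def text_before_each_br(cell):
    """Split the cell's children at <br> tags and return one clean string per group."""
    groups, parts = [], []
    for child in cell.children:
        if isinstance(child, Tag) and child.name == "br":
            # A <br> closes the current group.
            groups.append(parts)
            parts = []
        elif isinstance(child, Tag):
            # Any other tag contributes its text content.
            parts.append(child.get_text())
        else:
            # Plain navigable string.
            parts.append(str(child))
    if parts:
        # Keep whatever follows the last <br>; it is filtered out below if blank.
        groups.append(parts)
    cleaned = [re.sub(r"\s+", " ", "".join(group)).strip() for group in groups]
    return [sentence for sentence in cleaned if sentence]


soup = BeautifulSoup(html_text, "html.parser")
sentences = text_before_each_br(soup.find("td"))
print("\n".join(sentences))
```

Replacing whitespace runs with a single space (rather than an empty string) keeps adjacent words from being glued together when a string node is followed directly by a tag.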

Try this. It should give you the desired output. Just consider the content variable used in the code below as the holder of the HTML elements you pasted above.

from bs4 import BeautifulSoup

soup = BeautifulSoup(content,"lxml")
items = ','.join([''.join([item.previous_sibling,item.text,item.next_sibling]) for item in soup.select(".tip.info")])
data = ' '.join(items.split()).replace(",","\n")
print(data)

Output:

This is a first sentence. 
This is a second sentence. 
This is a third sentence.
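One thing to watch with the snippet above: it uses "," as a temporary separator, so a comma inside a sentence would also become a line break. A possible variation, sketched here with a hypothetical comma added to the second link's text, collects one string per link and joins them with newlines at the end. It also guards against siblings that are not plain strings, which is an assumption about other markup rather than something this HTML needs.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, except for the hypothetical ", longer" in the second link.
content = """<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second, longer</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>"""

soup = BeautifulSoup(content, "html.parser")

lines = []
for item in soup.select(".tip.info"):
    before = item.previous_sibling
    after = item.next_sibling
    # Siblings may be None or tags in other documents; only plain strings are kept here.
    pieces = [
        before if isinstance(before, str) else "",
        item.get_text(),
        after if isinstance(after, str) else "",
    ]
    # Collapse all whitespace runs so each sentence ends up on one line.
    lines.append(" ".join("".join(pieces).split()))

data = "\n".join(lines)
print(data)
```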

You can do this easily with basic string manipulation on the element's text, like so:

from bs4 import BeautifulSoup

data = '''
<tr>
  <td>
    This
    <a class="tip info" href="blablablablabla">is a first</a>
    sentence.
    <br>
    This
    <a class="tip info" href="blablablablabla">is a second</a>
    sentence.
    <br>This
    <a class="tip info" href="blablablablabla">is a third</a>
    sentence.
    <br>
  </td>
</tr>
'''

soup = BeautifulSoup(data, 'html.parser')
for i in soup.find_all('td'):
    print(' '.join(i.text.split()).replace('. ', '.\n'))

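@新手程序员 Yes, I know, but web scraping depends heavily on the format of the content (in this case the OP needs complete sentences, hence the period). In any case, the OP can easily adapt this to the actual content. The most important thing in this answer is i.text, because many programmers tend to forget or overlook it, or do not even know it exists!

Have you tried the solutions below? People are trying to solve your problem, but you do not even bother to reply, @user4937980!!

Sorry, I only woke up a few hours ago. In the end I used SIM's approach, and it works like a boss. All the solutions below are excellent. By the way, web scraping is really hard to learn :'(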

Output of the find_all('td') loop above:

This is a first sentence.
This is a second sentence.
This is a third sentence.