
Python 3 BeautifulSoup returns concatenated strings


I am trying to extract a list of actors from this HTML. So far I locate the relevant cell like this:

import re

actors_anchor = soup.find('a', href=re.compile('Actor&p'))
parent_tag = actors_anchor.parent
next_td_tag = parent_tag.findNext('td')

next_td_tag

<font size="2">Wes Bentley<br><a href="/people/chart/
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert        
Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font>

I need to put these names into a list, with each name as a separate item, like this: ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']

Any suggestions would be greatly appreciated.

Find all the a elements inside the td you located:

[a.get_text() for a in next_td_tag.find_all('a')]

But this does not include the "Wes Bentley" text, which is left dangling without an a element.
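
To make the limitation concrete, here is a minimal sketch run against a trimmed copy of the markup from the question. The shortened href values and the choice of html.parser are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (hrefs shortened for illustration).
html = (
    '<td><font size="2">Wes Bentley<br>'
    '<a href="?id=brycedallashoward.htm">Bryce Dallas Howard</a><br>'
    '<a href="?id=karlurban.htm">Karl Urban</a></font></td>'
)

next_td_tag = BeautifulSoup(html, "html.parser").find("td")

# Only names wrapped in <a> are returned; the bare text node is skipped.
linked_names = [a.get_text() for a in next_td_tag.find_all("a")]
print(linked_names)
```

The first name never appears in the result because it is a plain text node, not the content of an a element.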

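We can take a different approach and locate all the text nodes inside the td. You may need to clean them up and remove the "empty" items. The result is: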

['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']
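
The text-node approach described above could be sketched roughly as follows. The html string is a shortened stand-in for the question's markup, and collapsing inner whitespace with split() and join() is an addition of this sketch, not something the answer prescribes. It assumes bs4 4.4.0 or later for the string=True argument (older releases use text=True).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup; note the line break
# inside "Robert ... Redford", as in the original page.
html = (
    '<td><font size="2">Wes Bentley<br>'
    '<a href="?id=brycedallashoward.htm">Bryce Dallas Howard</a><br>'
    '<a href="?id=robertredford.htm">Robert        \nRedford</a><br>'
    '<a href="?id=karlurban.htm">Karl Urban</a></font></td>'
)

next_td_tag = BeautifulSoup(html, "html.parser").find("td")

# string=True matches every text node, linked or not.
# split() + join() collapses any run of whitespace into one space.
texts = [" ".join(node.split()) for node in next_td_tag.find_all(string=True)]

# Drop items that were whitespace only.
texts = [t for t in texts if t]
print(texts)
```

Normalizing whitespace this way keeps "Robert Redford" as one clean item even when the source wraps the name across lines.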

You can use stripped_strings to get all the strings as a list:

html = '''<td><font size="2">Wes Bentley<br><a href="/people/chart/
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font></td>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

next_td_tag = soup.find('td')

print(list(next_td_tag.stripped_strings))


stripped_strings is a generator, so you can use it in a for loop or pass it to list() to get all the elements.
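
As a small illustration of the for-loop form, the sketch below iterates over stripped_strings and collects the names. The html string is a shortened stand-in for the question's markup. The whitespace collapsing step is an assumption of this sketch: stripped_strings trims only the ends of each string, so a line break inside a name would otherwise survive.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup, including the wrapped name.
html = (
    '<td><font size="2">Wes Bentley<br>'
    '<a href="?id=brycedallashoward.htm">Bryce Dallas Howard</a><br>'
    '<a href="?id=robertredford.htm">Robert        \nRedford</a><br>'
    '<a href="?id=karlurban.htm">Karl Urban</a></font></td>'
)

next_td_tag = BeautifulSoup(html, "html.parser").find("td")

names = []
for raw in next_td_tag.stripped_strings:
    # Leading and trailing whitespace is already gone; collapse the rest.
    names.append(" ".join(raw.split()))

print(names)
```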

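Can't you use find_all('a',…) in a for loop instead of parent and findNext? Please elaborate.

Thanks for the formatting edit. This is my first post. The problem is that not all of the actor names are wrapped in an a tag. Many of the names in the HTML appear between br tags. When I use that approach, it does not let me get "Wes Bentley".

@ChaceMcguyer Yes, that is covered in the answer, please check.

That solved my problem. It is simple, I see it now. Thanks for your help.

Ah, right, that fits this problem very well.

@alecxe There is no whitespace at the head or tail of the strings here. Also, the HTML in the answer was modified and the "\n" was removed.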
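
# Text-node approach: collect every text node under the td,
# trim it and turn embedded newlines into spaces
texts = [text.strip().replace("\n", " ") for text in next_td_tag.find_all(text=True)]
# Remove the items that are empty after trimming
texts = [text for text in texts if text]
print(texts)
# ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']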
import bs4

html = '''<font size="2">Wes Bentley<br><a href="/people/chart/
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert        
Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font>'''

soup = bs4.BeautifulSoup(html, 'lxml')

text = soup.get_text(separator='|')  # join the strings with a separator
# 'Wes Bentley|Bryce Dallas Howard|Robert        \nRedford|Karl Urban'
split_text = text.replace('        \n', ' ').split('|')  # then split the string on the separator
# ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']

# do it in one line
list_text = soup.get_text(separator='|').replace('        \n', ' ').split('|')

# or iterate over the strings directly
[i.replace('        \n', ' ') for i in soup.strings]