Python: unable to get <span></span> text (Python, BeautifulSoup)

I can't get the span text inside "table". Thanks.

from bs4 import BeautifulSoup
import urllib2

url1 = "url"

content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1,"lxml")
table = soup.findAll("div", {"class" : "iw_component","id":"c1417094965154"})
rows = table.find_all('span',recursive=False)
for row in rows:
    print(row.text)

You appear to be using Python 2.x. Here is a Python 3.x solution, since I don't have a Python 2.x environment at the moment:

from bs4 import BeautifulSoup
import urllib.request as urllib


url1 = "<URL>"

# Read the HTML page
content1 = urllib.urlopen(url1).read()
soup = BeautifulSoup(content1, "lxml")

# Find the div (there is only one, so you do not need findAll) -> this is your problem
div = soup.find("div", class_="iw_component", id="c1417094965154")
# Now you retrieve all the span within this div
rows = div.find_all("span")

# You can do what you want with it !
line = ""
for row in rows:
    row_str = row.get_text()
    row_str = row_str.replace('\t', '')
    line += row_str + ", "
print(line)

table = soup.findAll("div", {"class" : "iw_component","id":"c1417094965154"})

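In the line above, findAll() returns a list of matching elements, not a single element. The next line therefore raises an error, because find_all() has to be called on one element rather than on the list.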
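
To make the failure concrete, here is a minimal sketch of the same mistake on a tiny inline document. The HTML string is a made-up stand-in for the real page, and "html.parser" is used only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page (the actual markup is not shown in the question)
html_doc = '<div class="iw_component" id="c1417094965154"><p><span>Race 1</span></p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

# find_all (alias: findAll) always returns a list-like ResultSet, even for one match
matches = soup.find_all("div", {"class": "iw_component", "id": "c1417094965154"})
print(type(matches).__name__)

# Calling find_all on the list itself fails with an AttributeError
try:
    matches.find_all("span")
except AttributeError as exc:
    print("AttributeError:", exc)

# Indexing into the list gives a single element, which does support find_all
first_div = matches[0]
span_texts = [span.get_text() for span in first_div.find_all("span")]
print(span_texts)
```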

If you only need one table, try the following. Replace

rows = table.find_all('span', recursive=False)

with

rows = table[0].find_all('span')

If you expect more than one table on the page, run a for loop over table and put the remaining statements inside that loop.
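
The loop described above might look roughly like this. It is only a sketch: the two-div HTML string is invented for illustration, and the divs are matched by class alone because an id should be unique on a page.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two matching divs; the tab characters mimic the real page's padding
html_doc = """
<div class="iw_component"><span>\tfirst</span><span>second\t</span></div>
<div class="iw_component"><span>third</span></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

texts_per_div = []
# Iterate over every matching div and collect the span texts of each one
for div in soup.find_all("div", class_="iw_component"):
    texts = [span.get_text().replace("\t", "") for span in div.find_all("span")]
    texts_per_div.append(texts)
    print(texts)
```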

In addition, for cleaner output you can strip the tab characters, as in the following code:

row = row.get_text()
row = row.replace('\t', '')
print(row)

Your final working code is:

from bs4 import BeautifulSoup
import urllib2

url1 = "url"

content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1,"lxml")
table = soup.findAll("div", {"class" : "iw_component","id":"c1417094965154"})
rows = table[0].find_all('span')
for row in rows:
    row_str = row.get_text()
    row_str = row_str.replace('\t', '')
    print(row_str)

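As for the recursive=False argument: when it is set to False, find_all() only searches the direct children, and in your case the direct children contain no matching spans, so nothing is returned.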

If you only want Beautiful Soup to consider direct children, you can pass in recursive=False.
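
A small sketch of the difference, assuming a made-up div (id "box" is hypothetical) that has one span as a direct child and one nested inside a paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one span directly under the div, one nested inside <p>
html_doc = '<div id="box"><span>direct</span><p><span>nested</span></p></div>'
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find("div", id="box")

# recursive=False looks only at the div's immediate children
direct_only = [s.get_text() for s in div.find_all("span", recursive=False)]

# The default (recursive=True) searches all descendants
all_spans = [s.get_text() for s in div.find_all("span")]

print(direct_only)
print(all_spans)
```

If the spans on the real page are wrapped in other tags, as the empty result in the question suggests, the recursive=False call finds none of them.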

Here is another approach that uses lxml instead of BeautifulSoup:

import requests
from lxml import html

req = requests.get("<URL>")
raw_html = html.fromstring(req.text)
spans = raw_html.xpath('//div[@id="c1417094965154"]//span/text()')
print("".join([x.replace("\t", "").replace("\r\n","").strip() for x in spans]))

Output: Kranji Mile Day simulcast race, Kranji Racecourse, SINClass 3 Handicap - 1200M TURF Saturday, 26 May 2018 Race 1, 5:15 PM

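As you can see, the output needs some formatting. spans is a list containing the text of every span, so you can apply whatever processing you need.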
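
One possible way to tidy such a list is sketched below. The sample strings are hypothetical raw values, not the actual xpath result, and joining with ", " is just one formatting choice.

```python
# Hypothetical raw values, similar in shape to what the xpath query might return
spans = ["\r\n\t\tKranji Mile Day ", "\t", "Race 1, 5:15 PM\r\n"]

# Collapse every run of whitespace (tabs, newlines, spaces) into a single space
cleaned = [" ".join(text.split()) for text in spans]

# Drop entries that contained nothing but whitespace
cleaned = [text for text in cleaned if text]

result = ", ".join(cleaned)
print(result)
```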