Scraping data from multiple web pages with Python and Beautiful Soup, using a .txt file of URLs
I have a .txt file containing full URLs to a number of pages, each of which contains a table I want to scrape data from. My code works for a single URL, but when I try to add a loop and read the URLs in from the .txt file, I get the following error:
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?
Here is my code:
from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()
    for url in urls:
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        containers = page_soup.findAll("tr", {"class":"data"})
        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()
            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()
            print(name)
            print(delegate)
f.close()
I have checked my .txt file and all the entries look fine. They begin with HTTP: and end with .html, and there are no apostrophes or quotation marks around them. Am I coding the for loop incorrectly?
When I print the URLs, I get the following:
??http://www.thegreenpapers.com/PCC/AL-D.html
http://www.thegreenpapers.com/PCC/AL-R.html
http://www.thegreenpapers.com/PCC/AK-D.html
And so on for 100 lines. Only the first line has the question marks.
My .txt file contains those URLs with only the state and party abbreviations changed.

You can't use 'f.read()' to read the whole file into a string and then iterate over that string: iterating over a string yields individual characters, not lines. To fix the problem, see the changes below. I also removed your last line: when you use a 'with' statement, the file is closed automatically once the code block finishes. I added a helper function (for Python 2) that shows whether the url string's type is 'str' or 'unicode'.
from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        containers = page_soup.findAll("tr", {"class":"data"})
        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()
            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()
            print(name)
            print(delegate)
Running the code with a text file containing the URLs listed above produces the following output:
http://www.thegreenpapers.com/PCC/AL-D.html
ordinary string
Gore, Al
54. 84%
Uncommitted
10. 16%
LaRouche, Lyndon
http://www.thegreenpapers.com/PCC/AL-R.html
ordinary string
Bush, George W.
44. 100%
Keyes, Alan
Uncommitted
http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13. 68%
Uncommitted
6. 32%
Bradley, Bill
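To make the underlying bug concrete, here is a minimal sketch (the sample file name and contents are illustrative) contrasting iteration over f.read() with iteration over the file object itself:

```python
import io

# Write a small sample file (hypothetical name and contents, matching the
# one-URL-per-line layout described in the question).
with io.open('urls_sample.txt', 'w') as f:
    f.write(u"http://www.thegreenpapers.com/PCC/AL-D.html\n"
            u"http://www.thegreenpapers.com/PCC/AL-R.html\n")

# Iterating over f.read() walks the string one character at a time.
with io.open('urls_sample.txt', 'r') as f:
    first_items = [item for item in f.read()][:4]
# first_items holds single characters ('h', 't', 't', 'p'), not whole URLs.

# Iterating over the file object itself yields one line per URL.
with io.open('urls_sample.txt', 'r') as f:
    urls = [line.strip() for line in f]
# urls holds the two complete URLs.
```

This is why the original loop handed urlopen() a one-character "URL" such as '?' or 'h' and raised "unknown url type".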
The approach you tried can be fixed by tweaking two different lines in your code. Try this:
with open('urls.txt', 'r') as f:
    urls = f.readlines() #make sure this line is properly indented.

for url in urls:
    uClient = urlopen(url.strip())
These are all good suggestions, but unfortunately I still get the same error.
Please add a print for url and show the output.
No, it only shows up when I print.
It sounds like you have a file that was not encoded with utf-8; you may have Unicode characters in it. See my updated code above.
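The stray "??" on the first URL is consistent with a UTF-8 byte-order mark at the start of the file (an assumption; the file name below is illustrative). A minimal sketch showing how the 'utf-8-sig' codec strips a BOM that plain utf-8 decoding leaves behind:

```python
import io

# Recreate the symptom: a file whose first bytes are a UTF-8 BOM.
with io.open('urls_bom.txt', 'wb') as f:
    f.write(b'\xef\xbb\xbfhttp://www.thegreenpapers.com/PCC/AL-D.html\n')

# Decoding as plain utf-8 leaves the BOM (U+FEFF) glued to the first URL...
with io.open('urls_bom.txt', 'r', encoding='utf-8') as f:
    first_plain = f.readline().strip()

# ...while 'utf-8-sig' strips the BOM, so the first URL comes out clean.
with io.open('urls_bom.txt', 'r', encoding='utf-8-sig') as f:
    first_sig = f.readline().strip()
```

If the file really does start with a BOM, opening it with encoding='utf-8-sig' (or re-saving it without a BOM) should make the first urlopen() call behave like the rest.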