Scraping data from multiple web pages using a .txt file of URLs with Python and Beautiful Soup


I have a .txt file containing full URLs to multiple pages, each of which contains a table I want to scrape data from. My code works for a single URL, but when I add a loop and read the URLs in from the .txt file, I get the following error:

raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?
Here is my code:

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()
for url in urls:

    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")

    containers = page_soup.findAll("tr", {"class":"data"})


    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()

        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()

        print(name)
        print(delegate)

f.close()
I've checked my .txt file and all the entries look fine. They start with http: and end with .html, with no apostrophes or quotes around them. I must be coding the for loop incorrectly.

Using

I get the following:

??http://www.thegreenpapers.com/PCC/AL-D.html

http://www.thegreenpapers.com/PCC/AL-R.html

http://www.thegreenpapers.com/PCC/AK-D.html
and so on for 100 lines. Only the first line has the question marks.
My .txt file contains those URLs, with only the state and party abbreviations changed.
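A minimal sketch of one likely explanation for the leading `??`: a file saved as "UTF-8 with BOM" starts with the byte-order mark EF BB BF, which many terminals render as stray question marks and which `urlopen` rejects as part of the URL scheme. The URL below is from the question; the decoding behavior shown is standard Python.

```python
import codecs

# A file saved as "UTF-8 with BOM" begins with the bytes EF BB BF.
raw = codecs.BOM_UTF8 + b"http://www.thegreenpapers.com/PCC/AL-D.html"

# Plain utf-8 decoding keeps the BOM as an invisible '\ufeff' prefix,
# which urlopen would then report as an "unknown url type".
print(repr(raw.decode("utf-8")))

# The 'utf-8-sig' codec strips the BOM.
print(repr(raw.decode("utf-8-sig")))
```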

You can't read the whole file into a string with "f.read()" and then iterate over that string. See the changes below to fix this. I also removed your last line: when you use a "with" statement, it closes the file automatically once the block completes.
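A quick illustration of why the original loop fails (the URLs here are placeholders, not from the question's file): iterating over the string returned by `f.read()` yields single characters, so `urlopen` is called on "h", "t", "t", "p", and so on.

```python
# f.read() returns the file contents as one big string.
text = "http://example.com/a.html\nhttp://example.com/b.html\n"

# Iterating a string yields characters, not lines.
print([c for c in text][:4])

# splitlines() (or iterating the file object itself) yields whole lines.
print(text.splitlines())
```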

The function below (for Python 2) shows whether the url string's type is 'str' or 'unicode':

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()

        page_soup = soup(page_html, "html.parser")

        containers = page_soup.findAll("tr", {"class":"data"})

        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()

            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()

            print(name)
            print(delegate)
Running the code with a text file containing the URLs listed above produces this output:

http://www.thegreenpapers.com/PCC/AL-D.html

ordinary string
Gore, Al
54.   84%
Uncommitted
10.   16%
LaRouche, Lyndon

http://www.thegreenpapers.com/PCC/AL-R.html

ordinary string
Bush, George W.
44.  100%
Keyes, Alan

Uncommitted

http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13.   68%
Uncommitted
6.   32%
Bradley, Bill


The approach you attempted can be fixed by tweaking just two lines of your code.

Try this:

with open('urls.txt', 'r') as f:
    urls = f.readlines()   #make sure this line is properly indented.
for url in urls:
    uClient = urlopen(url.strip())
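The `strip()` matters because `readlines()` keeps the trailing newline on every line, and `urlopen` will choke on a URL with an embedded `'\n'`. A small sketch using a URL from the question:

```python
# readlines() leaves '\n' at the end of each line.
line = "http://www.thegreenpapers.com/PCC/AL-D.html\n"
print(repr(line))

# strip() removes the surrounding whitespace, leaving a clean URL.
print(repr(line.strip()))
```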


These are all great suggestions, but unfortunately I still receive the same error.
Please add a print for the url and show the output.
No, it only shows up when I print.
It sounds like you have a file that was not encoded as UTF-8. You may have Unicode characters in it. See my updated code to find out.
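Putting the two fixes together, here is a self-contained sketch of the suggested remedy: open the file with the `utf-8-sig` codec so any BOM is stripped, and `strip()` each line to drop the newline. The sample file is created on the fly to mimic the problem file described above; `io.open` works in both Python 2 and 3.

```python
import codecs
import io
import os
import tempfile

# Create a sample urls.txt with a UTF-8 BOM, mimicking the problem file.
path = os.path.join(tempfile.mkdtemp(), "urls.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"http://www.thegreenpapers.com/PCC/AL-D.html\n")

# 'utf-8-sig' strips the BOM; strip() removes the trailing newline.
with io.open(path, encoding="utf-8-sig") as f:
    urls = [line.strip() for line in f]

print(urls)
```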