Problem printing all items in a list in Python

I'm trying to learn how to do web scraping, but the output isn't coming out in the format I want. Here is the problem I'm running into:

import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i=0
while i<len(pagelist):
    url = "http://www.boostmobile.com/stores/?" +pagelist[i]+"&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>' 
    pattern = re.compile(regex)
    storeName = re.findall(pattern,htmltext)
    print "Store Name=", storeName[i]
    i+=1
It prints out every store name listed on each page, but instead of the format above, the output looks like this: "boost mobile store by wireless depot", "boost mobile store by kob wireless", "marietta check cashing services", ... and so on for about 120 more entries.

So how do I get it into the desired format, "Store Name = ...", instead of "name", "name", ...?

storeName is a list, and you need to loop over it. Right now you are indexing into it once per page, using the page counter as the index, which is probably not what you intended.
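
To see the difference concretely, here is a minimal sketch with a made-up HTML snippet (the store names are placeholders): re.findall returns a list of every match, so indexing it with the page counter prints only one element per page.

import re

# made-up page content standing in for a downloaded page
htmltext = '<h2 style="float:left;">Store A</h2><h2 style="float:left;">Store B</h2>'

storeName = re.findall('<h2 style="float:left;">(.+?)</h2>', htmltext)
print storeName     # ['Store A', 'Store B'] -- findall returns the whole list
print storeName[0]  # Store A -- indexing picks out a single element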

Here is a corrected version of the code with the loop added:

import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]  # note: defined but never used below

i = 0
while i < len(pagelist):
    url = "http://www.boostmobile.com/stores/?" + pagelist[i] + "&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>'
    pattern = re.compile(regex)
    # findall returns a list of every match on the page
    storeName = re.findall(pattern, htmltext)
    # loop over the whole list instead of indexing it with the page counter
    for sn in storeName:
        print "Store Name=", sn
    i += 1
Rather than regular expressions, I would use a specialized tool for this: an HTML parser.

Here is a solution using BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

base_url = "http://www.boostmobile.com/stores/?page={page}&zipcode={zipcode}"
num_pages = 10
zipcode = 30008

for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url))

    print "Page Number: %s" % page
    # the store names live inside a table with the "results" class
    results = soup.find('table', class_="results")
    for h2 in results.find_all('h2'):
        print h2.text

It prints:

Page Number: 1
Boost Mobile Store by Wireless Depot
Boost Mobile Store by KOB Wireless
Marietta Check Cashing Services
...
Page Number: 2
Target
Wal-Mart
...
As you can see, first we find a table tag with the "results" class - that is where the store names actually live. Then, within that table, we find all of the h2 tags. This is more robust than relying on the tag's style attribute.


You can also take advantage of SoupStrainer. It will improve performance, since it makes BeautifulSoup parse only the part of the document you specify:

from bs4 import SoupStrainer  # needed in addition to the imports above

# parse only the results table; the rest of the page is skipped entirely
required_part = SoupStrainer('table', class_="results")
for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url), parse_only=required_part)

    print "Page Number: %s" % page
    for h2 in soup.find_all('h2'):
        print h2.text
Here we are saying: "parse only the table tag with the results class, and give me all of the h2 tags found inside it."

Additionally, if you want to improve performance further, you can install lxml and tell BeautifulSoup to use it as the underlying parser:

soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=required_part)
Hope that helps.

Comments:

"That's great, thank you! I'm still having some trouble, though. I'm trying to get it to run through several zip codes. Is it as simple as making url = base_url.format(page=page, zipcode=(variable))?"

"@SamK yup, and you would probably need to use a nested loop. Let me know if you need any help."

"Yes, I'm definitely not nesting it correctly. Would you mind showing me how?"
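
The answer doesn't show the nested loop itself, but here is a minimal sketch of what it could look like, reusing the names from the solution above; the zipcodes list is illustrative, taken from the ziplocations values in the question:

import urllib2
from bs4 import BeautifulSoup

base_url = "http://www.boostmobile.com/stores/?page={page}&zipcode={zipcode}"
num_pages = 10
zipcodes = [30008, 30009]  # hypothetical list of zip codes to cover

for zipcode in zipcodes:  # outer loop: one pass per zip code
    for page in xrange(1, num_pages + 1):  # inner loop: the pages within it
        url = base_url.format(page=page, zipcode=zipcode)
        soup = BeautifulSoup(urllib2.urlopen(url))

        print "Zipcode: %s, Page Number: %s" % (zipcode, page)
        results = soup.find('table', class_="results")
        for h2 in results.find_all('h2'):
            print h2.text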