Problem printing all items of a list in Python
Tags: python, html, regex, web-scraping, html-parsing

I'm trying to learn how to do web scraping, but the output isn't coming out in the format I want. Here is the problem I'm having:
import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i = 0
while i < len(pagelist):
    url = "http://www.boostmobile.com/stores/?" + pagelist[i] + "&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>'
    pattern = re.compile(regex)
    storeName = re.findall(pattern, htmltext)
    print "Store Name=", storeName[i]
    i += 1
It prints every store name listed on each page, but instead of the format above it comes out like this:
"Boost Mobile Store by Wireless Depot", "Boost Mobile Store by KOB Wireless", "Marietta Check Cashing Services", ... and so on for roughly 120 more entries.
So how do I get it into the desired format, with "Store Name = ..." before each entry, instead of "name", "name", ...?
storeName is a list, and you need to loop over it. Currently you are indexing into it once per page using the page counter, which is probably not what you intended.
Here is the corrected version of your code with the loop added:
import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i = 0
while i < len(pagelist):
    url = "http://www.boostmobile.com/stores/?" + pagelist[i] + "&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>'
    pattern = re.compile(regex)
    storeName = re.findall(pattern, htmltext)
    for sn in storeName:
        print "Store Name=", sn
    i += 1
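To make the bug concrete, here is a small Python 3 sketch (with made-up store names) contrasting what indexing with the page counter does against looping over the whole list:

```python
# Suppose page 2 (i = 1) returned these matches:
storeName = ["Boost Mobile Store by Wireless Depot",
             "Boost Mobile Store by KOB Wireless",
             "Marietta Check Cashing Services"]

i = 1  # the page counter from the while loop

# Indexing with the page counter emits only the i-th match of the page:
one_line = "Store Name= " + storeName[i]

# Looping emits every match found on the page:
all_lines = ["Store Name= " + sn for sn in storeName]
```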
I would use a specialized tool for this job: an HTML parser. Here is the solution using BeautifulSoup (the complete listing appears at the bottom of this post). It prints:
Page Number: 1
Boost Mobile Store by Wireless Depot
Boost Mobile Store by KOB Wireless
Marietta Check Cashing Services
...
Page Number: 2
Target
Wal-Mart
...
As you can see, first we find a table tag with the class "results": that is where the store names actually live. Then, inside that table, we find all of the h2 tags. This is more robust than relying on the style attribute of a tag.
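To illustrate why the parser approach is more robust, here is a self-contained sketch. It uses the standard-library html.parser instead of BeautifulSoup so it runs with no dependencies, and the HTML snippet is made up to match the page shape the answer describes:

```python
import re
from html.parser import HTMLParser

# Hypothetical snippet: a "results" table whose store names sit in h2 tags.
# Note the second h2 writes its style attribute slightly differently.
html_doc = """
<table class="results">
  <tr><td><h2 style="float:left;">Boost Mobile Store by Wireless Depot</h2></td></tr>
  <tr><td><h2 style="float: left;">Boost Mobile Store by KOB Wireless</h2></td></tr>
</table>
"""

# The question's regex is anchored to the exact attribute text, so the
# second store (extra space after the colon) is silently missed.
regex_hits = re.findall(r'<h2 style="float:left;">(.+?)</h2>', html_doc)

# A parser keys off the tag structure instead of its styling and finds both.
class H2Collector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inside_h2 = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.inside_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.inside_h2 = False

    def handle_data(self, data):
        if self.inside_h2:
            self.names.append(data.strip())

collector = H2Collector()
collector.feed(html_doc)
```

The regex finds one store while the parser finds both, which is exactly the kind of silent data loss that styling-based matching invites.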
You can also make use of SoupStrainer. It will improve performance, because only the part of the document you specify gets parsed:
from bs4 import BeautifulSoup, SoupStrainer

required_part = SoupStrainer('table', class_="results")
for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url), parse_only=required_part)

    print "Page Number: %s" % page

    for h2 in soup.find_all('h2'):
        print h2.text
Here we are saying: "parse only the table tag with the class results, and give me all of the h2 tags found inside it."
Additionally, if you want to improve performance, you can let lxml do the parsing; the exact call is shown in the last line of this post.
Here is the complete solution referenced above:

import urllib2
from bs4 import BeautifulSoup

base_url = "http://www.boostmobile.com/stores/?page={page}&zipcode={zipcode}"
num_pages = 10
zipcode = 30008

for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url))

    print "Page Number: %s" % page

    results = soup.find('table', class_="results")
    for h2 in results.find_all('h2'):
        print h2.text

Hope that helps.
Great, thanks! I'm still having some trouble, though. I'm trying to make it go through several zipcodes. Is it as simple as creating url = base_url.format(page=page, zipcode=(variable))?
@SamK: yup, and you may need to use nested loops. Let me know if you need help.
Yes, I'm definitely not nesting them correctly. Would you mind showing me how?
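The nested loop the comments ask about can be sketched like this. Python 3 is used here, the zipcode list is made up, and only the URL construction is shown, since fetching and parsing each URL would proceed exactly as in the answer:

```python
# Outer loop walks the zipcodes; inner loop walks every page per zipcode.
base_url = "http://www.boostmobile.com/stores/?page={page}&zipcode={zipcode}"
num_pages = 10
zipcodes = [30008, 30009]  # hypothetical list of zipcodes to cover

urls = []
for zipcode in zipcodes:
    for page in range(1, num_pages + 1):
        urls.append(base_url.format(page=page, zipcode=zipcode))
# Each URL would then be opened and parsed just like in the answer above.
```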
To have lxml do the parsing, pass it as the second argument:

soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=required_part)