Scraping multiple pages with beautifulsoup4 using Python 3.6.3



I'm trying to loop through multiple pages, but my code isn't extracting anything. I'm new at this, so please bear with me. I made a container so that I can target each listing. I also created a variable pointing at the anchor tag you press to get to the next page. Any help would be greatly appreciated. Thanks.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for page in range(0, 25):
    file = "breakfeast_chicago.csv"
    f = open(file, "w")
    Headers = "business_name, business_address, business_city, business_region, business_phone_number\n"
    f.write(Headers)

    my_url = 'https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}'.format(page)

    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # grabs each listing
    containers = page_soup.findAll("div", {"class": "result"})

    new = page_soup.findAll("a", {"class": "next ajax-page"})

    for i in new:
        try:
            for container in containers:
                b_name = i.find("container.h2.span.text").get_text()
                b_addr = i.find("container.p.span.text").get_text()

                city_container = container.findAll("span", {"class": "locality"})
                b_city = i.find("city_container[0].text ").get_text()

                region_container = container.findAll("span", {"itemprop": "postalCode"})
                b_reg = i.find("region_container[0].text").get_text()

                phone_container = container.findAll("div", {"itemprop": "telephone"})
                b_phone = i.find("phone_container[0].text").get_text()

                print(b_name, b_addr, b_city, b_reg, b_phone)
                f.write(b_name + "," + b_addr + "," + b_city.replace(",", "|") + "," + b_reg + "," + b_phone + "\n")
        except AttributeError:
            pass
    f.close()

If you're using BS4, try find_all instead of findAll.
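In current BS4 releases, find_all is the canonical spelling of findAll. A minimal sketch of the difference between find and find_all, using invented markup (the div/span structure here is an assumption for illustration, not the real Yellow Pages layout):

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup, invented for illustration only
html = """
<div class="result"><h2><span>Ada's Diner</span></h2></div>
<div class="result"><h2><span>Lou's Grill</span></h2></div>
"""
page_soup = BeautifulSoup(html, "html.parser")

# find() returns only the FIRST match (or None); find_all() returns a list
first = page_soup.find("div", {"class": "result"})
all_results = page_soup.find_all("div", {"class": "result"})

print(first.h2.span.get_text())  # name from the first listing only
print(len(all_results))          # number of matching containers
```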

Try putting import pdb; pdb.set_trace() inside the loop and debugging what is actually being selected in the for loop.
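Even without the debugger, printing what each call returns exposes the problem quickly: find() expects a tag name, so passing a dotted path string such as "container.h2.span.text" matches no tag and returns None. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

html = '<div class="result"><h2><span>Some Store</span></h2></div>'
page_soup = BeautifulSoup(html, "html.parser")
container = page_soup.find("div", {"class": "result"})

# find() takes a tag NAME; no tag is called "container.h2.span.text",
# so this returns None (and calling .get_text() on it raises AttributeError)
print(container.find("container.h2.span.text"))

# Navigate the parse tree attribute-style instead
print(container.h2.span.get_text())
```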

Also, some content may be hidden from the raw HTML if it is loaded via javascript.


Every anchor tag or HREF you "click" is just another network request. If you intend to follow links, consider slowing down the rate of your requests so you don't get blocked.
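A minimal sketch of pacing requests between pages. The fetcher is injectable so the pacing logic runs without a network; in real use you would pass something like lambda u: requests.get(u).text. The function name and delay value are assumptions for illustration:

```python
import time

def fetch_all(urls, delay=1.5, fetch=None):
    """Fetch each URL in turn, sleeping between requests to stay polite.

    `fetch` is injectable so the pacing logic can be exercised without
    a network; in real use pass e.g. `lambda u: requests.get(u).text`.
    """
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the very first request
            time.sleep(delay)
        results.append(fetch(url))
    return results

# Example with a stand-in fetcher (no network involved):
pages = fetch_all(["page1", "page2"], delay=0.1,
                  fetch=lambda u: "<html>{}</html>".format(u))
print(pages)
```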

You can try something like the script below. It will traverse the different pages through pagination and collect the name and phone number from each container.

import requests
from bs4 import BeautifulSoup

my_url = "https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}"
for link in [my_url.format(page) for page in range(1, 5)]:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")

    # each ".info" block is one business listing
    for item in soup.select(".info"):
        try:
            name = item.select(".business-name [itemprop='name']")[0].text
        except Exception:
            name = ""
        try:
            phone = item.select("[itemprop='telephone']")[0].text
        except Exception:
            phone = ""

        print(name, phone)
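To produce the CSV the question was after, the stdlib csv module handles commas and quoting more safely than manual string concatenation. A sketch with made-up rows; note the file is opened once, before any loop, so later pages don't overwrite earlier ones the way the original open(file, "w") inside the page loop did:

```python
import csv

# Made-up rows standing in for scraped listings
rows = [
    ("Ada's Diner", "12 W Oak St", "Chicago", "60610", "(312) 555-0100"),
    ("Lou's Grill", "98 N Elm Ave", "Chicago", "60614", "(312) 555-0199"),
]

# Open ONCE before looping over pages, so each page appends rows
# instead of truncating the file on every iteration.
with open("breakfeast_chicago.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["business_name", "business_address", "business_city",
                     "business_region", "business_phone_number"])
    writer.writerows(rows)
```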

Okay, thanks. As far as I know, bs4 only works with html and xml. Are there other libraries that would be more useful in cases where content is hidden?

@luisreyes If you need extra content that is loaded via javascript, then you can try selenium, headless chrome, or even phantomjs.

Thanks! I tried it and it didn't work at first, but I changed "lxml" to "html.parser" and it worked. Is there a reason "lxml" didn't work?

Strange!! I can't find any reason; the two normally work interchangeably. By the way, be sure to accept it as an answer. Thanks.