Scraping multiple pages with beautifulsoup4 using Python 3.6.3



I'm trying to loop through multiple pages, but my code isn't extracting anything. I'm new at this, so please bear with me. I made a container so that I can target each listing. I also created a variable pointing at the anchor tag you press to get to the next page. Any help would be greatly appreciated. Thanks.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for page in range(0, 25):
    file = "breakfeast_chicago.csv"
    f = open(file, "w")
    Headers = "business_name, business_address, business_city, business_region, business_phone_number\n"
    f.write(Headers)

    my_url = 'https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}'.format(page)

    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # grabs each listing
    containers = page_soup.findAll("div", {"class": "result"})

    new = page_soup.findAll("a", {"class": "next ajax-page"})

    for i in new:
        try:
            for container in containers:
                b_name = i.find("container.h2.span.text").get_text()
                b_addr = i.find("container.p.span.text").get_text()

                city_container = container.findAll("span", {"class": "locality"})
                b_city = i.find("city_container[0].text ").get_text()

                region_container = container.findAll("span", {"itemprop": "postalCode"})
                b_reg = i.find("region_container[0].text").get_text()

                phone_container = container.findAll("div", {"itemprop": "telephone"})
                b_phone = i.find("phone_container[0].text").get_text()

                print(b_name, b_addr, b_city, b_reg, b_phone)
                f.write(b_name + "," + b_addr + "," + b_city.replace(",", "|") + "," + b_reg + "," + b_phone + "\n")
        except AttributeError:
            pass
    f.close()

If you're using BS4, try find_all instead of findAll.
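In current BS4 releases, find_all is the canonical spelling of findAll. A minimal sketch of the difference between find and find_all, using invented markup (the div/span structure here is an assumption for illustration, not the real Yellow Pages layout):

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup, invented for illustration only
html = """
<div class="result"><h2><span>Ada's Diner</span></h2></div>
<div class="result"><h2><span>Lou's Grill</span></h2></div>
"""
page_soup = BeautifulSoup(html, "html.parser")

# find() returns only the FIRST match (or None); find_all() returns a list
first = page_soup.find("div", {"class": "result"})
all_results = page_soup.find_all("div", {"class": "result"})

print(first.h2.span.get_text())  # name from the first listing only
print(len(all_results))          # number of matching containers
```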

Try putting import pdb; pdb.set_trace() inside the loop and debugging what is actually being selected in the for loop.
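Even without the debugger, printing what each call returns exposes the problem quickly: find() expects a tag name, so passing a dotted path string such as "container.h2.span.text" matches no tag and returns None. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

html = '<div class="result"><h2><span>Some Store</span></h2></div>'
page_soup = BeautifulSoup(html, "html.parser")
container = page_soup.find("div", {"class": "result"})

# find() takes a tag NAME; no tag is called "container.h2.span.text",
# so this returns None (and calling .get_text() on it raises AttributeError)
print(container.find("container.h2.span.text"))

# Navigate the parse tree attribute-style instead
print(container.h2.span.get_text())
```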

Also, some content may be hidden from the raw HTML if it is loaded via javascript.


Every anchor tag or HREF you "click" is just another network request. If you intend to follow links, consider slowing down the rate of your requests so you don't get blocked.
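A minimal sketch of pacing requests between pages. The fetcher is injectable so the pacing logic runs without a network; in real use you would pass something like lambda u: requests.get(u).text. The function name and delay value are assumptions for illustration:

```python
import time

def fetch_all(urls, delay=1.5, fetch=None):
    """Fetch each URL in turn, sleeping between requests to stay polite.

    `fetch` is injectable so the pacing logic can be exercised without
    a network; in real use pass e.g. `lambda u: requests.get(u).text`.
    """
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the very first request
            time.sleep(delay)
        results.append(fetch(url))
    return results

# Example with a stand-in fetcher (no network involved):
pages = fetch_all(["page1", "page2"], delay=0.1,
                  fetch=lambda u: "<html>{}</html>".format(u))
print(pages)
```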

You can try something like the script below. It will traverse the different pages through pagination and collect the name and phone number from each container.

import requests
from bs4 import BeautifulSoup

my_url = "https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}"
for link in [my_url.format(page) for page in range(1, 5)]:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")

    # each ".info" block is one business listing
    for item in soup.select(".info"):
        try:
            name = item.select(".business-name [itemprop='name']")[0].text
        except Exception:
            name = ""
        try:
            phone = item.select("[itemprop='telephone']")[0].text
        except Exception:
            phone = ""

        print(name, phone)
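To produce the CSV the question was after, the stdlib csv module handles commas and quoting more safely than manual string concatenation. A sketch with made-up rows; note the file is opened once, before any loop, so later pages don't overwrite earlier ones the way the original open(file, "w") inside the page loop did:

```python
import csv

# Made-up rows standing in for scraped listings
rows = [
    ("Ada's Diner", "12 W Oak St", "Chicago", "60610", "(312) 555-0100"),
    ("Lou's Grill", "98 N Elm Ave", "Chicago", "60614", "(312) 555-0199"),
]

# Open ONCE before looping over pages, so each page appends rows
# instead of truncating the file on every iteration.
with open("breakfeast_chicago.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["business_name", "business_address", "business_city",
                     "business_region", "business_phone_number"])
    writer.writerows(rows)
```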

Okay, thanks. As far as I know, bs4 only works with html and xml. Are there other libraries that would be more useful in cases where content is hidden?

@luisreyes If you need extra content that is loaded via javascript, then you can try selenium, headless chrome, or even phantomjs.

Thanks! I tried it and it didn't work at first, but I changed "lxml" to "html.parser" and it worked. Is there a reason "lxml" didn't work?

Strange!! I can't find any reason; the two normally work interchangeably. By the way, be sure to accept it as an answer. Thanks.