Scraping paginated results using Python BeautifulSoup 3

python, beautifulsoup, pagination, export-to-csv

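I was able to write code that finds the first and last page, but I can only extract the data from page 1 into the CSV. I need to extract the data from all 10 pages into the CSV. Where am I going wrong in my code?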

Import the installed modules

import requests
from bs4 import BeautifulSoup
import csv

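To fetch the data from the web page, we will use the requests get() method

page = requests.get("https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore")

Step to check the HTTP response status code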

print(page.status_code)

Now that I have collected the data from the web page, let's see what we got

print(page.text)

The data above can be viewed in a readable format using BeautifulSoup's prettify() method. For this, we create a bs4 object and call prettify on it

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])

Find all the divs that contain company information

product_name_list = soup.findAll("div",{"class":"CompanyInfo"})

Extract the page numbers of the first and last page

paging = soup.find("div",{"class":"pg-full-width me-pagination"}).find("ul",{"class":"pagination"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text
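
To illustrate what the two indices above pick out, here is a minimal sketch run against a made-up pagination fragment. The HTML is only an assumption about the general shape of the markup (a "Prev" link, numbered links, then a "Next" link), not a copy of the real page, and `page_bounds` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumed shape only, not copied from the real site.
SAMPLE_HTML = """
<div class="pg-full-width me-pagination">
  <ul class="pagination">
    <li><a href="#">Prev</a></li>
    <li><a href="?page=1">1</a></li>
    <li><a href="?page=2">2</a></li>
    <li><a href="?page=3">3</a></li>
    <li><a href="#">Next</a></li>
  </ul>
</div>
"""

def page_bounds(html):
    """Return (first, last) page numbers read from the pagination links."""
    soup = BeautifulSoup(html, "html.parser")
    links = (
        soup.find("div", {"class": "pg-full-width me-pagination"})
        .find("ul", {"class": "pagination"})
        .find_all("a")
    )
    # links[0] is assumed to be "Prev" and links[-1] "Next", so the
    # numbered links occupy positions 1 through -2.
    return int(links[1].text), int(links[-2].text)

first, last = page_bounds(SAMPLE_HTML)
print(first, last)
```

If the real pagination has no "Prev"/"Next" links, or abbreviates long page lists, these indices would point at the wrong elements, so it is worth printing the link texts once to check.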

Now loop over those elements

for element in product_name_list:

Take one "div",{"class":"CompanyInfo"} block and find/store the name, address, and phone

    name = element.find('h2').text
    address = element.find('address').text.strip()
    phone = element.find("ul",{"class":"submenu"}).text.strip()

Write the name, address, and phone to the CSV

    writer.writerow([name, address, phone])

The loop then moves on to the next "div",{"class":"CompanyInfo"} tag and repeats

outfile.close()
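
The explicit open()/close() pair above works, but if an exception is raised while scraping, close() is never reached. A minimal sketch of the same CSV writing with a with block follows. `write_rows`, the demo path, and the sample row are hypothetical and used only for illustration.

```python
import csv
import os
import tempfile

def write_rows(path, rows):
    """Write a header plus (name, address, phone) rows to a CSV file."""
    # newline="" lets the csv module control line endings itself;
    # the with block closes the file even if writing fails midway.
    with open(path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["Name", "Address", "Phone"])
        writer.writerows(rows)

# Made-up sample data written to a temporary location, only to show the call.
demo_path = os.path.join(tempfile.gettempdir(), "gymlookup_demo.csv")
write_rows(demo_path, [("Example Gym", "1 Example Road, Lahore", "000-0000000")])
```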

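You should use the page query parameter, e.g. page=2

Sample code for 10 pages:

url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page={}"
for page_num in range(1, 11):  # range() excludes the stop value, so this covers pages 1-10
    page = requests.get(url.format(page_num))
    # further processing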
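
Note that range(start, stop) excludes stop, so covering pages 1 through 10 needs a stop value of 11. The sketch below builds the page URLs without fetching anything. It uses urllib.parse.urlencode from the standard library rather than str.format, and `BASE_URL` and `page_urls` are names chosen here for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.lookup.pk/dynamic/search.aspx"

def page_urls(last_page):
    """Return the search URL for every page from 1 to last_page inclusive."""
    urls = []
    for page_num in range(1, last_page + 1):
        # urlencode keeps the dict's insertion order and escapes values.
        query = urlencode(
            {"searchtype": "kl", "k": "gym", "l": "lahore", "page": page_num}
        )
        urls.append(f"{BASE_URL}?{query}")
    return urls

urls = page_urls(10)
print(len(urls))
print(urls[0])
print(urls[-1])
```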

You just need one more loop. You now need to iterate over each page URL; see below:

import requests
from bs4 import BeautifulSoup
import csv

root_url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
html = requests.get(root_url)
soup = BeautifulSoup(html.text, 'html.parser')

paging = soup.find("div",{"class":"pg-full-width me-pagination"}).find("ul",{"class":"pagination"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text


outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])


pages = list(range(1,int(last_page)+1))
for page in pages:
    url = 'https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page=%s' %(page)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')

    #print(soup.prettify())
    print ('Processing page: %s' %(page))

    product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
    for element in product_name_list:
        name = element.find('h2').text
        address = element.find('address').text.strip()
        phone = element.find("ul",{"class":"submenu"}).text.strip()

        writer.writerow([name, address, phone])

outfile.close()
print ('Done')
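
The inner loop above assumes every CompanyInfo block contains an h2, an address, and a ul with class submenu. If one is missing, element.find(...) returns None and .text raises AttributeError. Below is a minimal sketch of a more defensive extraction step, run on made-up HTML rather than the live site. `extract_rows`, `text_or_blank`, and the sample markup are assumptions for illustration, not part of the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second block deliberately lacks a phone list.
SAMPLE_HTML = """
<div class="CompanyInfo">
  <h2>Example Gym</h2>
  <address> 1 Example Road, Lahore </address>
  <ul class="submenu"><li>000-0000000</li></ul>
</div>
<div class="CompanyInfo">
  <h2>Another Gym</h2>
  <address>2 Example Road, Lahore</address>
</div>
"""

def text_or_blank(tag):
    """Return the stripped text of a tag, or '' when find() returned None."""
    return tag.text.strip() if tag is not None else ""

def extract_rows(html):
    """Return one [name, address, phone] list per CompanyInfo block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for element in soup.find_all("div", {"class": "CompanyInfo"}):
        rows.append([
            text_or_blank(element.find("h2")),
            text_or_blank(element.find("address")),
            text_or_blank(element.find("ul", {"class": "submenu"})),
        ])
    return rows

for row in extract_rows(SAMPLE_HTML):
    print(row)
```

Keeping the parsing in a function that takes an HTML string also makes it possible to check the selectors against a saved page without hitting the site on every run.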

Make sure to put your code in a single block (as in my solution). You can still add the bold parts, but just add them as # comments in the code.

I'm not sure what you are trying to do with your code. As it stands, it is hard to read. Did you intend all the headings to be code comments? It would probably be easier to read if you just re-copied your code into the question and formatted the whole thing as a single code block (select all of the code and click the {} icon above the text area). Doing so will preserve the indentation, which is crucial in a Python program.

@Makyen I will format the code in a single block.

Wow, it really works. I really like the page-processing message. Once again, I was missing some loops. Thanks.

I think inserting those print() statements in the loop is good practice; it is a way to test and debug where things go wrong. Once it works, removing those statements is simple. But as I said, it is nice to see where the script is in the process, to check where/if it hangs.
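
As a sketch of the progress-reporting idea discussed above: the standard logging module is one alternative to print(), since the messages can be silenced by raising the log level instead of deleting lines. This swaps print() for logging and is not what the answer's code does. `process_pages` and the placeholder per-page work are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("scraper")

def process_pages(last_page):
    """Pretend to process each page, logging progress along the way."""
    processed = []
    for page in range(1, last_page + 1):
        log.info("Processing page: %s", page)
        processed.append(page)  # stand-in for the real fetch-and-parse step
    log.info("Done")
    return processed

pages = process_pages(3)
```

Setting the level to logging.WARNING would hide these messages once the script is known to work.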