
For loop: unable to iterate over multiple pages when web scraping


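I am trying to scrape https://www.maybank.co.id/others/locate-us?Keyword=&LocType=branch&LocSubType=all to get the branch name and address of every bank branch. There are 44 pages I need to scrape, and the URL does not change from page to page. I cannot get my loop to iterate over these pages.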

for page_no in range(1,45):

    payload='page='+str(page_no)+'&PageSize=9&id=%7B5066AC98-FE40-407A-B4FE-03C814BED5F5%7D&keyword=&LocType=branch&LocSubType=all'
    response = requests.post(url, data=payload)
    page = requests.post(url,data=payload)
    print('Page',page_no)
    for i in soup.find_all('div',class_="col-md-4 col-sm-6 col-xs-12 property-item"):
        Branch=i.find_all('h3') if i.find_all('h3') else ''
        Address=i.find_all('p') if i.find_all('p') else '' 
    for j in Address:
        j = re.sub(r'<(.*?)>', '', str(j))
        j = j.strip()
        Address_list.append(j)
    for k in Branch:
        k=re.sub(r'<(.*?)>', '', str(k))
        Branch_list.append(k)


Could anyone suggest how to do this?

You should use the API to get what you need.

Try this:

from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup


api_url = "https://www.maybank.co.id/api/sitecore/MapsLocation/MapsLocationListPaging?"

payload = {
    "page": "44",
    "id": "{5066AC98-FE40-407A-B4FE-03C814BED5F5}",
    "keyword": "",
    "LocType": "branch",
    "LocSubType": "all",
}

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}


for page_size in range(1, 45):
    # As written, "page" stays at 44 and only "PageSize" changes per request.
    payload["PageSize"] = page_size
    # Pass headers by keyword: the second positional argument of requests.get is params.
    html = requests.get(f"{api_url}{urlencode(payload)}", headers=headers).text
    # find() returns only the first matching item in each response.
    soup = BeautifulSoup(html, "html.parser").find("div", {"class": "col-md-4 col-sm-6 col-xs-12 property-item"})
    branch_data = [
        soup.find("h3").get_text(strip=True),
        [p.get_text(strip=True) for p in soup.find_all("p")],
        soup.find("a")["href"],
    ]
    print(branch_data)
Output:

['KC MANADO', ['Jl. Kawasan Mega Mas Jl. Pierre Tendean Boulevard Blok I C1 No. 24,25,26 dan Blok I C2 No. 27,28,29 Manado', 'Closed until 03.30 PM0431 - 860543'], '/others/locate-us/locate-us-detail?id=337&loctype=Branch&locsubtype=']
['KC SUNSET ROAD, DPS', ['Jl. Sunset Road No 811, Kuta  - Badung, Bali', 'Closed until 03.30 PM0361 - 3003811'], '/others/locate-us/locate-us-detail?id=294&loctype=Branch&locsubtype=']
['KCP BSB CITY', ['Ruko Taman Niaga Bukit Semarang Baru (BSB) Blok E No. 3A, Semarang', 'Closed until 03.30 PM(024) 76670611'], '/others/locate-us/locate-us-detail?id=217&loctype=Branch&locsubtype=']
['KCP GRAHA IRAMA', ['Jl. HR Rasuna Said Kav. 1-2 Ground Floor Blok B Jakarta Selatan', 'Closed until 03.30 PM021-5261330-4'], '/others/locate-us/locate-us-detail?id=111&loctype=Branch&locsubtype=']
['KCP KLP. GADING BULEVARD II', ['Jl. Raya Boulevard I-3 no. 4, Jakarta', 'Closed until 03.30 PM021 - 4515253'], '/others/locate-us/locate-us-detail?id=199&loctype=Branch&locsubtype=']
['KCP PALM SPRING BATAM CENTER', ['Komplek Palm Spring BTC Blok D1 No. 10, Batam Centre', 'Closed until 03.30 PM0778 - 6053070'], '/others/locate-us/locate-us-detail?id=26&loctype=Branch&locsubtype=']
and so on...
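
The question's payload varies page and keeps PageSize=9, so a variant in that style may be closer to what was asked. It requests each of the 44 pages in turn and reads every item on the page, not just the first. This is a minimal sketch, not a verified implementation. It assumes the endpoint from the answer accepts the question's page and PageSize parameters over GET, and that has not been checked against the live site. The names page_urls and scrape_all are hypothetical. Passing " " as the separator to get_text keeps text from sibling elements apart, which may avoid run-together strings such as 'Closed until 03.30 PM0431 - 860543'.

```python
from urllib.parse import urlencode

API_URL = "https://www.maybank.co.id/api/sitecore/MapsLocation/MapsLocationListPaging"
ITEM_CLASS = "col-md-4 col-sm-6 col-xs-12 property-item"


def page_urls(first=1, last=44, page_size=9):
    """Build one request URL per result page; only `page` changes."""
    base = {
        "PageSize": page_size,  # assumed from the question's payload
        "id": "{5066AC98-FE40-407A-B4FE-03C814BED5F5}",
        "keyword": "",
        "LocType": "branch",
        "LocSubType": "all",
    }
    return [
        f"{API_URL}?{urlencode({'page': n, **base})}"
        for n in range(first, last + 1)
    ]


def scrape_all():
    """Fetch every page and collect (branch name, address lines) pairs."""
    # Third-party imports live here so page_urls() works without them installed.
    import requests
    from bs4 import BeautifulSoup

    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
    }
    branches = []
    for url in page_urls():
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        # Parse the fresh response on every iteration.
        soup = BeautifulSoup(response.text, "html.parser")
        for item in soup.find_all("div", class_=ITEM_CLASS):
            name = item.find("h3")
            lines = [p.get_text(" ", strip=True) for p in item.find_all("p")]
            branches.append((name.get_text(strip=True) if name else "", lines))
    return branches
```

Calling scrape_all() would return a list of (name, address lines) tuples. Note that the loop in the question never rebuilds soup from the new response, so it re-reads the same parsed page on every iteration; re-parsing inside the loop, as above, is the likely fix.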


Hi, can you tell me how the API URL was obtained?

Go to Developer Tools -> Network -> XHR. That is where you can find the request URL.
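
Once a request URL has been copied from the Network tab, it helps to split it into the endpoint and its query parameters so they can be reused in a script. The sketch below does that with the standard library only. The copied URL is an assumption: it is assembled from the endpoint in the answer and the payload in the question, not captured from the live site. The helper name split_request_url is hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumed example of a URL copied from Developer Tools -> Network -> XHR.
copied = (
    "https://www.maybank.co.id/api/sitecore/MapsLocation/MapsLocationListPaging"
    "?page=1&PageSize=9&id=%7B5066AC98-FE40-407A-B4FE-03C814BED5F5%7D"
    "&keyword=&LocType=branch&LocSubType=all"
)


def split_request_url(url):
    """Return (endpoint, params) for a copied request URL."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # keep_blank_values=True preserves empty parameters such as keyword=
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    return endpoint, params


endpoint, params = split_request_url(copied)
print(endpoint)
print(params)
```

The resulting params dictionary is percent-decoded, so the id appears with its braces, and it can be passed to urlencode or to the params argument of requests.get with only the page value changed per request.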