Python dynamic web scraping

Tags: python, selenium, web-scraping, beautifulsoup

I am trying to scrape this page (""). When I select a state and a city, an address is displayed, and I have to write the state, city and address to a csv/excel file. I got this far and am now stuck.

Here is my code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait

chrome_path = r"C:\Users\IBM_ADMIN\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.arohan.in/branch-locator.php")

# Pick the state from the first drop-down
select = Select(driver.find_element_by_name('state'))
select.select_by_visible_text('Bihar')

# Wait until the branch drop-down has been populated, then pick the city
drop = Select(driver.find_element_by_name('branch'))
city_option = WebDriverWait(driver, 5).until(
    lambda x: x.find_element_by_xpath("//select[@id='city1']/option[text()='Gaya']"))
city_option.click()
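
After the click, the displayed address still has to be read before it can be written out. A minimal, hedged sketch of that step, assuming (based on the answers below, not verified against the live page) that the address is rendered inside a ul element with class address_area:

# Sketch only: wait for the address block to appear and read its text.
# The "ul.address_area li" selector is taken from the answers below.
address_items = WebDriverWait(driver, 5).until(
    lambda x: x.find_elements_by_css_selector("ul.address_area li"))
print([li.text for li in address_items])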

Is Selenium required? It looks like you can get what you need using the URL:

Get a list of state/branch combinations, then follow a Beautiful Soup tutorial to pull the information from each page.
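
The link above did not survive extraction. Judging from the payload used in the code below, the page appears to accept the state and branch as query parameters, for example (an inferred URL, not the original link):

http://www.arohan.in/branch-locator.php?state=Bihar&branch=Gaya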

In a slightly more organized way:

import requests
from bs4 import BeautifulSoup

link = "http://www.arohan.in/branch-locator.php?"


def get_links(session, url, payload):
    # Request the page for one state/branch pair and print the text
    # of every <p> inside the address block
    session.headers["User-Agent"] = "Mozilla/5.0"
    res = session.get(url, params=payload)
    soup = BeautifulSoup(res.text, "lxml")
    item = [item.text for item in soup.select(".address_area p")]
    print(item)


if __name__ == '__main__':
    for st, br in zip(['Bihar', 'West Bengal'], ['Gaya', 'Kolkata']):
        payload = {
            'state': st,
            'branch': br
        }
        with requests.Session() as session:
            get_links(session, link, payload)
Output:

['Branch', 'House no -10/12, Ward-18, Holding No-12, Swarajpuri Road, Near Bank of Baroda, Gaya Pin 823001(Bihar)', 'N/A', 'N/A']
['Head Office', 'PTI Building, 4th Floor, DP Block, DP-9, Salt Lake City Calcutta, 700091', '+91 33 40156000', 'contact@arohan.in']
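
Since the end goal is a csv file, here is a minimal sketch extending the same requests approach to write rows instead of printing them. The output filename, the column layout, and treating items[1] as the address are my assumptions based on the sample output above, not part of the original answer:

import csv

import requests
from bs4 import BeautifulSoup

link = "http://www.arohan.in/branch-locator.php?"

# Hedged sketch: write State/Branch/Address rows to a CSV file.
# ".address_area p" comes from the answer above; items[1] as the
# address matches the sample output, but is an assumption.
with open('addresses.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['State', 'Branch', 'Address'])
    with requests.Session() as session:
        session.headers["User-Agent"] = "Mozilla/5.0"
        for st, br in zip(['Bihar', 'West Bengal'], ['Gaya', 'Kolkata']):
            res = session.get(link, params={'state': st, 'branch': br})
            soup = BeautifulSoup(res.text, "lxml")
            items = [p.text for p in soup.select(".address_area p")]
            writer.writerow([st, br, items[1] if len(items) > 1 else ''])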

A better approach is to avoid using Selenium. Selenium is useful when you need JavaScript processing to render the HTML; in your case that is not needed, as the required information is already contained in the HTML.

What is needed is to first request the page containing all the states. Then, for each state, request the list of branches. Then, for each state/branch combination, a URL request can be made to get the HTML containing the address. This happens to be contained in the second <li> entry of the <ul class="address_area"> element.

    from bs4 import BeautifulSoup
    import requests
    import csv
    import time
    
    # Get a list of available states
    r = requests.get('http://www.arohan.in/branch-locator.php')
    soup = BeautifulSoup(r.text, 'html.parser')
    state_select = soup.find('select', id='state1')
    states = [option.text for option in state_select.find_all('option')[1:]]
    
    # Open an output CSV file
    with open('branch addresses.csv', 'w', newline='', encoding='utf-8') as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(['State', 'Branch', 'Address'])
    
        # For each state determine the available branches
        for state in states:
            r_branches = requests.post('http://www.arohan.in/Ajax/ajax_branch.php', data={'ajax_state':state})
            soup = BeautifulSoup(r_branches.text, 'html.parser')
    
            # For each branch, request the page containing the address
            for option in soup.find_all('option')[1:]:
                time.sleep(0.5)     # Reduce server loading
                branch = option.text
                print("{}, {}".format(state, branch))
                r_branch = requests.get('http://www.arohan.in/branch-locator.php', params={'state':state, 'branch':branch})
                soup_branch = BeautifulSoup(r_branch.text, 'html.parser')
                ul = soup_branch.find('ul', class_='address_area')
    
                if ul:
                    address = ul.find_all('li')[1].get_text(strip=True)
                    row = [state, branch, address]
                    csv_output.writerow(row)
                else:
                    print(soup_branch.title)
    
Giving you an output CSV file starting:

    State,Branch,Address
    West Bengal,Kolkata,"PTI Building, 4th Floor, DP Block, DP-9, Salt Lake City Calcutta, 700091"
    West Bengal,Maheshtala,"… Bus Stop, Opp Lane Kismat Nungi Road, Maheshtala, Kolkata - 700140 (W.B)"
    West Bengal,Shymbazar,"1st Floor, Ward …, 6 F.b.T. Road, Kolkata - 700002"
    
You should use time.sleep(0.5) to slow the script down and avoid putting excessive load on the server.


Note: [1:] is used because the first item in the drop-down list is not a branch or a state, but a "Select Branch" entry.
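
To make the note concrete, a tiny hedged illustration; the HTML snippet is invented for illustration only, and just the "Select Branch" placeholder behaviour comes from the answer:

    from bs4 import BeautifulSoup

    # Hypothetical drop-down HTML, for illustration only
    html = """<select id="state1">
    <option>Select Branch</option>
    <option>Gaya</option>
    <option>Kolkata</option>
    </select>"""

    soup = BeautifulSoup(html, 'html.parser')
    options = soup.find_all('option')[1:]  # skip the "Select Branch" placeholder
    print([o.text for o in options])       # ['Gaya', 'Kolkata']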

So it is like getting a list for each option and then using BeautifulSoup to extract the result you are referring to; correct me if I am wrong. This works, thank you. You saved me a lot of trouble; I appreciate it.