Parsing HTML in Python with Beautiful Soup and Selenium

python html selenium

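I want to practice scraping with BeautifulSoup and Selenium in Python on a real example (Airbnb). Specifically, my goal is to get the IDs of all listings (homes) in Los Angeles. My strategy is to open a Chrome browser, go to the Airbnb page where I have already manually searched for homes in Los Angeles, and start from there. I decided to use Selenium for this step. After that, I want to parse the HTML from the page source and find the listing IDs shown on the current page. Then, basically, I just want to iterate over all the pages. Here is my code: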

from urllib import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver

option=webdriver.ChromeOptions()
option.add_argument("--incognito")

driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",chrome_options=option)

first_url="https://www.airbnb.com/s/Los-Angeles--CA--United-States/select_homes?refinement_paths%5B%5D=%2Fselect_homes&place_id=ChIJE9on3F3HwoAR9AhGJW_fL-I&children=0&guests=1&query=Los%20Angeles%2C%20CA%2C%20United%20States&click_referer=t%3ASEE_ALL%7Csid%3Afcf33cf1-61b8-41d5-bef1-fbc5d0570810%7Cst%3AHOME_GROUPING_SELECT_HOMES&superhost=false&title_type=SELECT_GROUPING&allow_override%5B%5D=&s_tag=tm-X8bVo"
n=3

for i in range(1,n+1):
    if (i==1):
        driver.get(first_url)
        print first_url
        #HTML parse using BS
        html =driver.page_source
        soup=BeautifulSoup(html,"html.parser")
        listings=soup.findAll("div",{"class":"_f21qs6"})

        #print out all the listing_ids within a current page
        for i in range(len(listings)):
            only_id= listings[i]['id']
            print(only_id[8:])

    after_first_url=first_url+"&section_offset=%d" % i
    print after_first_url
    driver.get(after_first_url)
    #HTML parse using BS
    html =driver.page_source
    soup=BeautifulSoup(html,"html.parser")
    listings=soup.findAll("div",{"class":"_f21qs6"})

    #print out all the listing_ids within a current page
    for i in range(len(listings)):
        only_id= listings[i]['id']
        print(only_id[8:])

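If you find any inefficient code, please bear with me, as I am a beginner. I wrote this code by reading and watching several sources. Anyway, I think the code is correct, but the problem is that I get different results every time I run it. It does loop over the pages, but sometimes it only gives results for some of them. For example, it loads page 1 but prints no output for it, loads page 2 and prints results, but prints nothing for page 3. It is so random that it gives results for some pages but not for others. On top of that, sometimes it goes through pages 1, 2, 3, ... in order, but sometimes it loads page 1, jumps to the last page (17), and then goes back to page 2. I guess my code is not perfect, since its output is unstable. Has anyone had a similar experience, or can anyone help me figure out what the problem is? Thanks.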

Try the following approach.

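Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You can then load the page source into BeautifulSoup as follows: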

In [8]: from bs4 import BeautifulSoup

In [9]: from selenium import webdriver

In [10]: driver = webdriver.Firefox()

In [11]: driver.get('http://news.ycombinator.com')

In [12]: html = driver.page_source

In [13]: soup = BeautifulSoup(html, 'html.parser')

In [14]: for tag in soup.find_all('title'):
   ....:     print(tag.text)
   ....:
Hacker News
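
Applied to the loop in the question, the same idea might look like the sketch below. It is not a definitive fix, and it rests on a few assumptions: the `_f21qs6` class, the `&section_offset=` parameter, and the 8-character id prefix are all taken from the question's code and may no longer match Airbnb's markup, and `page_urls`, `extract_listing_ids`, `scrape`, and `run_with_chrome` are hypothetical helper names. Two things differ from the original. First, the inner loop in the question reuses `i`, so once the listings have been printed `i` holds the index of the last listing rather than the page number; that would explain a jump to `section_offset=17` on a page that shows 18 listings. Second, the question reads `page_source` right after `driver.get()`; if the listings are rendered by JavaScript (an assumption), they may not be in the DOM yet, so the sketch waits for them explicitly.

```python
from bs4 import BeautifulSoup

LISTING_CLASS = "_f21qs6"  # class name copied from the question; may be outdated
ID_PREFIX_LEN = 8          # the question slices ids with [8:]; kept as-is


def page_urls(first_url, n):
    """Return first_url followed by section_offset=1..n, as in the question."""
    urls = [first_url]
    for offset in range(1, n + 1):
        urls.append(first_url + "&section_offset=%d" % offset)
    return urls


def extract_listing_ids(html):
    """Parse one page of HTML and return the listing ids found on it."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    for div in soup.find_all("div", class_=LISTING_CLASS):
        raw_id = div.get("id")
        if raw_id:  # skip matching divs that carry no id attribute
            ids.append(raw_id[ID_PREFIX_LEN:])
    return ids


def scrape(driver, urls, wait_for_listings=None):
    """Load each URL with the given driver and collect ids per URL.

    The page loop and the listing loop live in separate functions,
    so they cannot overwrite each other's loop variable.
    """
    results = {}
    for url in urls:
        driver.get(url)
        if wait_for_listings is not None:
            wait_for_listings(driver)  # block until the listings are present
        results[url] = extract_listing_ids(driver.page_source)
    return results


def run_with_chrome(first_url, n, timeout=10):
    """Run the scrape in a real Chrome session.

    Selenium is imported here so the helpers above stay usable
    without a browser installed.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def wait_for_listings(drv):
        # Explicit wait: returns once at least one listing div is in the DOM,
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(drv, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, LISTING_CLASS))
        )

    options = webdriver.ChromeOptions()
    options.add_argument("--incognito")
    driver = webdriver.Chrome(options=options)
    try:
        return scrape(driver, page_urls(first_url, n), wait_for_listings)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `run_with_chrome(first_url, 3)` with the `first_url` from the question would return a dict that maps each URL to the list of ids found on that page. A `TimeoutException` from the wait means no element with that class appeared in time, which is a more useful signal than a silently empty page.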

Why use selenium here rather than requests? Is it because selenium returns the content generated by JavaScript as part of the HTML? Can anyone clarify this?