How can I get the career path job titles from this JavaScript page using Python?


python selenium web-scraping


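How can I use Python to get the career path job titles from this JavaScript page?

This is my code snippet; the soup it returns does not contain any of the text data I need.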

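import requests
from bs4 import BeautifulSoup
import json
import re
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# get BeautifulSoup object
def get_soup(url):
    """
    This function returns the BeautifulSoup object.
    Parameters:
        url: the link to get the soup object for
    Returns:
        soup: BeautifulSoup object
    """
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    return soup

# get selenium driver object
def get_selenium_driver():
    """
    This function returns the selenium driver object.
    Parameters:
        None
    Returns:
        driver: selenium driver object
    """
    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(executable_path=r"geckodriver", firefox_options=options)
    return driver

# get soup obj using selenium
def get_soup_using_selenium(url):
    """
    Given the url of a page, this function returns the soup object.
    Parameters:
        url: the link to get the soup object for
    Returns:
        soup: soup object
    """
    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(executable_path=r"geckodriver", firefox_options=options)
    driver.get(url)
    driver.implicitly_wait(3)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    driver.close()
    return soup

title = "PHP%2bDeveloper"
location = "San%2bDiego%2c%2bCalifornia%2c%2bUnited%2bStates"
years_of_experience = "0"
sort_by_filter = "Most%2bLikely%2bTransition"

url = "https://www.dice.com/career-paths?title={}&location={}&experience={}&sortBy={}".format(title, location, years_of_experience, sort_by_filter)
career_paths_page_soup = get_soup(url)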
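As an aside on the query string in the question: the values are percent-encoded by hand, which is easy to get wrong. A minimal sketch using the standard library's `urlencode` is shown here. The parameter values are illustrative, and the assumption that the site expects literal `+` separators is inferred only from the `%2b` sequences in the question; `build_career_paths_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Illustrative values. The "+" separators mirror the hand-encoded %2b in the
# question's query string (an assumption about what the site expects).
params = {
    "title": "PHP+Developer",
    "location": "San+Diego,+California,+United+States",
    "experience": "0",
    "sortBy": "Most+Likely+Transition",
}


def build_career_paths_url(params):
    """Return the career-paths URL with every value percent-encoded."""
    return "https://www.dice.com/career-paths?" + urlencode(params)


url = build_career_paths_url(params)
print(url)
```

`urlencode` emits uppercase hex digits (`%2B`, `%2C`), which are equivalent to the lowercase forms in a URL.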

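As other users mentioned in the comments, requests will not work for you here. With Selenium, however, you can scrape the page: use WebDriverWait to make sure the page content has finished loading, then read element.text to get the text.

The following snippet prints the career path strings on the left side of the page: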

from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# navigate to the page; get_selenium_driver() and url come from the question's code
driver = get_selenium_driver()
driver.get(url)

# wait for loading indicator to be hidden
WebDriverWait(driver, 10).until(EC.invisibility_of_element((By.XPATH, "//*[contains(text(), 'Loading data')]")))

# wait for content to load
career_path_elements = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='abcd']/ul/li")))

# print out career paths
for element in career_path_elements:
    # the title attribute usually contains the career path text
    title = element.get_attribute("title")

    if title:
        print(title)
    else:
        # sometimes the career path is in a span below this element
        span_element = element.find_element(By.XPATH, "span[not(contains(@class, 'currentJobHead'))]")
        print(span_element.text)

This prints the following:

PHP Developer
Drupal Developer
Web Developer
Full Stack Developer
Back-End Developer
Full Stack PHP Developer
IT Director
Software Development Manager

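There are a few interesting points here. The main one is the JavaScript loading on this page: when the page is first opened, a "Loading data..." indicator appears. We have to wait on EC.invisibility_of_element for that indicator to make sure it has disappeared before we try to locate any page content.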
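To make the waiting step more concrete: WebDriverWait polls, meaning it re-evaluates the condition at a fixed interval until the condition returns a truthy value or the timeout passes. The sketch below is a simplified stand-in in plain Python, not Selenium's actual implementation; `wait_until`, `content_loaded` and `checks` are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose content appears on the third check.
checks = {"count": 0}


def content_loaded():
    checks["count"] += 1
    return ["PHP Developer"] if checks["count"] >= 3 else []


print(wait_until(content_loaded, timeout=2.0, poll_interval=0.01))
```

The same pattern explains why the answer waits twice: once for the loading indicator to go away, and once for the list items to be present.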

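After that, we call WebDriverWait again, this time on the career path elements on the right side of the page. This WebDriverWait call returns a list of elements, which is stored in career_path_elements. We can loop over this list to print the career path for each item.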

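Each career path element holds the career path text in its title attribute, so we call element.get_attribute("title") to read it. There is one special case, the "current job" item, where the career path text sits in a span one level down. When title is empty, we handle this by running an XPath lookup on the element to locate that span. This ensures we print every career path item on the page.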
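The title-or-span fallback can also be pulled into a small helper. This is only a sketch: `career_path_text` and `FakeElement` are hypothetical names, and `FakeElement` is a stand-in so the logic can be tried without a browser. With a real driver you would pass the elements returned by WebDriverWait instead.

```python
def career_path_text(element):
    """Return the title attribute, or fall back to the nested span's text."""
    title = element.get_attribute("title")
    if title:
        return title
    # "xpath" is the value of Selenium's By.XPATH constant
    span = element.find_element(
        "xpath", "span[not(contains(@class, 'currentJobHead'))]"
    )
    return span.text


class FakeElement:
    """Hypothetical stand-in for a WebElement, only for trying the helper."""

    def __init__(self, title=None, text="", child=None):
        self._title = title
        self.text = text
        self._child = child

    def get_attribute(self, name):
        return self._title if name == "title" else None

    def find_element(self, by, value):
        return self._child


items = [
    FakeElement(title="Web Developer"),
    FakeElement(child=FakeElement(text="PHP Developer")),
]
print([career_path_text(item) for item in items])
```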

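Post your code. What research have you done so far? Please remember that we are not here to do the work for you; we help when you get stuck. So you should at least study and try, and if that fails, we will help.

Thanks everyone, and sorry about that! Please check the code snippet!

The page is rendered by JavaScript, so requests will not help you in this case. However, since you have already written the Selenium code, you can call the function with career_paths_page_soup = get_soup_using_selenium(url). Please also mention the values you expect to get back from the page.

Thank you very much, this is exactly what I needed!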