Getting a URL from HTML source with Python and Selenium

python html python-3.x selenium pdf

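In this question, someone helped me with the dropdown menu in the table. However, I would like to get the URL from the source code, which is:

and store it in a list instead of clicking it, as the script currently does. The link in the code above is /consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D. However, I need to prepend http://digesto.asamblea.gob.ni to every retrieved link to complete it.

How can I achieve this?

This is my current script, and this is my website:

Disclaimer: when using the script, you need to type the captcha manually, without pressing Enter, for the script to proceed.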

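Relative links that start with / are resolved against the site root, which is http://digesto.asamblea.gob.ni in your case; links that do not start with / are resolved relative to the current page. In the loop that scrapes the links, change the code to: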

from urllib.parse import urljoin
from selenium.webdriver.common.by import By

# `driver` and `rows` come from the existing script
list_of_links = []    # will hold the scraped links
base_url = driver.current_url   # relative hrefs are resolved against this page
for row in rows:
    button = row.find_element(By.CSS_SELECTOR, 'button')
    button.click()   # open the row's dropdown so the PDF link can be located
    href = row.find_element(By.CSS_SELECTOR, 'li a[onclick*=pdf]').get_attribute("href")
    # urljoin covers every case: "/path" resolves against the site root,
    # "path" against the current page, and an already absolute URL is kept as-is
    list_of_links.append(urljoin(base_url, href))

    # at this point the dropdown is still open and would cover the next row's button,
    # so click the button again to close the menu
    button.click()

print(list_of_links)
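
The resolution rule described in the answer can be checked without a browser using the standard library's urllib.parse.urljoin. The sketch below is only an illustration: page_url and the page-relative href are hypothetical values that do not come from the question, and resolve is a helper name introduced here. In the real script the base would be driver.current_url.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration; in the script this
# would be driver.current_url.
page_url = "http://digesto.asamblea.gob.ni/consultas/coleccion/"


def resolve(href, base=page_url):
    """Return the absolute URL for href, resolved against base."""
    return urljoin(base, href)


# Root-relative link from the question: resolved against the site root.
root_relative = resolve("/consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D")

# Page-relative link (hypothetical): resolved against the current page's directory.
page_relative = resolve("detalle.php?id=1")

# Already absolute link: returned unchanged.
absolute = resolve("http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd")

print(root_relative)
print(page_relative)
print(absolute)
```

Because urljoin leaves absolute URLs untouched, it is safe to apply to every scraped href without first checking whether it starts with /.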

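For the future, I would suggest adding only the relevant part of the code. The full listing did help me reproduce the problem by hand, but its length may put many people off reading it at all. I say this from personal experience: if a question does not grab me right away and there are two pages of code to skim, I usually do not bother reading the whole question.

Todor, thank you for the recommendation and the answer. About the first point, it was not intentional; I thought including everything was the best way to get the question answered. Next time I will add less code. I tried your code and got the error message selenium.common.exceptions.ElementClickInterceptedException: Message: Element is not clickable at point (1199.658317565918, 270.18333435058594) because another element obscures it.

I have figured out the problem: once the dropdown with the links is expanded, it stays open, so in the next loop iteration Selenium cannot click the button on the next row because the dropdown is in the way. The fix I came up with is to click the button again, which should hide the dropdown (I cannot test it, as I am typing on my phone). If that does not work, find another position or locator to click so that the dropdown is hidden at the end of each cycle.

Thank you, Todor, that solved my problem.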
