How can I scrape pages that don't return their source code using Python?

python selenium web-scraping

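I am trying to scrape the "ASX code" of announcements published by companies on the Australian Securities Exchange from the following website:

So far, I have tried using BeautifulSoup with the following code: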

import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
parser = BeautifulSoup(response.content, 'html.parser')
print(parser)

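However, when I print this, the output is not the same as what I see when I open the page manually and view its source. After some googling and searching Stack Overflow, I believe this is because JavaScript running on the page hides the HTML.

I don't know how to get around this, though. Any help would be greatly appreciated.

Thanks in advance.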

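Try this. All you need to do is make the scraper wait a while until the page has loaded, because, as you may have noticed, the content is loaded dynamically. Once it runs, you will get the left-hand header column of the table on that page.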

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
time.sleep(8)  # give the page's JavaScript time to render the table

soup = BeautifulSoup(driver.page_source, "lxml")
# print the text of every element with class "row"
for item in soup.select('.row'):
    print(item.text)
driver.quit()

Partial results:

RLC
RNE
PFM
PDF
HXG
NCZ
NCZ

By the way, I wrote and ran this code with Python 3.5, so recent versions of Python have no trouble with the Selenium bindings.
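The partial output above contains a repeated code (NCZ appears twice), presumably because one company posted more than one announcement. If only the distinct codes are wanted, a small post-processing step can help. The following is a minimal sketch and not part of the original answer: the function name `unique_codes` is hypothetical, and it assumes each item's text is a single code with optional surrounding whitespace.

```python
def unique_codes(texts):
    """Return the distinct, non-empty codes in first-seen order."""
    seen = set()
    codes = []
    for text in texts:
        code = text.strip()  # drop whitespace around the code
        if code and code not in seen:
            seen.add(code)
            codes.append(code)
    return codes


# Sample values copied from the partial output above.
sample = ["RLC", "RNE", "PFM", "PDF", "HXG", "NCZ", "NCZ"]
print(unique_codes(sample))
```

In the answer's script, this could be applied as `unique_codes(item.text for item in soup.select('.row'))`.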

You tagged selenium, so have you tried it?

I'm completely unsure where to start with Selenium. I found an example that clicks buttons and then gives the source code, but I don't need to click buttons, I just need the source code. I'll keep searching, though. Thanks for the link, @cricket_007.

The site is generated dynamically, so either find and use their API to request the data you need, or use a browser emulator. I can't think of another solution.

@ElvirMuslic Is a browser emulator a viable option? Will Selenium work? I have written this Selenium code:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
tickers = driver.find_elements_by_class_name("row")
print(tickers)

However, I'm pretty sure Selenium only works on Python 2, and I only have Python 3.

It definitely supports Python 3.

Thank you very much. This is beautiful. In the end I actually wrote code very similar to this, except that I used re instead of bs4. I really appreciate it. Do you know how I would speed up the process (not sleep) in Selenium if I wanted to do this on a large scale? Thanks again!

There is a wait function. For example, you can find the element by XPath or some other locator:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0

ff = webdriver.Firefox()
ff.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(ff, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    ff.quit()

@ElvirMuslic Thank you. That's very helpful.

@JamesWard I'm glad you found this handy. The official documentation covers explicit waits. You can also use implicit waits (loosely similar to sleep(5), except that the lookup returns as soon as the element is found). The documentation has a variety of examples written to help you understand the library and start using it right away.
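The difference between a fixed `time.sleep(8)` and an explicit wait comes down to polling: an explicit wait re-checks a condition at short intervals and returns as soon as it holds, so it takes the full timeout only in the worst case. The sketch below illustrates that idea using the standard library alone. `wait_until`, its parameters, and the `rows_loaded` stand-in are hypothetical names for illustration, not Selenium's API or its actual implementation.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for a page whose content only appears on the third check.
attempts = {"count": 0}


def rows_loaded():
    attempts["count"] += 1
    return ["RLC", "RNE"] if attempts["count"] >= 3 else []


rows = wait_until(rows_loaded, timeout=5.0, poll_interval=0.1)
print(rows, "after", attempts["count"], "checks")
```

Here the call returns after roughly two poll intervals instead of blocking for the whole timeout. With Selenium itself, the same role is played by `WebDriverWait(driver, 10).until(...)`, as in the comment above.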