Python 3.x: web scraping a table created by a JavaScript function

Tags: python-3.x, selenium, selenium-webdriver, automation, geckodriver

I am trying to scrape the table of research studies at the link below. The table's content is created dynamically with JavaScript. I tried using Selenium, but I intermittently get a StaleElementReferenceException. Please help me with this. I want to retrieve all the rows in the table and store them in a local database. Here is what I tried with Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist='
driver = webdriver.Firefox()
# driver.implicitly_wait(30)
driver.get(url)
data = []
# Walk every row of the results table and collect the distinct cell texts.
for tr in driver.find_elements(By.XPATH, '//table[@id="theDataTable"]//tbody//tr'):
    tds = tr.find_elements(By.TAG_NAME, 'td')
    if tds:
        for td in tds:
            print(td.text)
            if td.text not in data:
                data.append(td.text)
driver.quit()
print('*********************************************************************')
print(data)
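I will then extract the values from the 'data' variable and store them in a database.
I am new to Selenium and web scraping. I also want to click each link in the "Study Title" column and extract the data for each study from the page it opens.
I would like suggestions for avoiding or handling the stale element exception, or for alternatives to Selenium WebDriver.
Thanks in advance.

I tried the following code, and all the data was stored correctly. Could you try it?

Code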
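import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get("https://clinicaltrials.gov/ct2/results?cond=COVID&term=&cntry=&state=&city=&dist=")
time.sleep(5)
array = []
flag = True
next_counter = 0
time.sleep(4)
# Show 100 results per page instead of 10 to reduce the number of page loads.
select = Select(driver.find_element(By.NAME, 'theDataTable_length'))
select.select_by_value('100')
time.sleep(5)
while flag:
    if next_counter == 13:
        # All 13 pages have been read, so end the loop.
        print("Stopped")
        flag = False
    else:
        # Look the table body up again on every page; references from the
        # previous page go stale once the table is redrawn.
        item = driver.find_elements(By.TAG_NAME, "tbody")[1]
        rows = item.find_elements(By.TAG_NAME, 'tr')
        for row in rows:
            cells = row.find_elements(By.TAG_NAME, 'td')
            for i in range(7):
                array.append(cells[i].text)
                print(cells[i].text)
        time.sleep(5)
        next_button = driver.find_element(By.ID, 'theDataTable_next')
        next_button.click()
        next_counter = next_counter + 1
        time.sleep(7)
driver.quit()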
Output
1
Not yet recruiting
NEW
Indirect Endovenous Systemic Ozone for New Coronavirus Disease (COVID19) in Non-intubated Patients
COVID
Other: Systemic indirect endovenous ozone therapy
SEOT
Valencia, Spain
2
Recruiting
NEW
Prediction of Clinical Course in COVID19 Patients
COVID 19
Other: CT-Scan
Chu Saint-Etienne
Saint-Étienne, France
3
Not yet recruiting
NEW
Risks of COVID19 in the Pregnant Population
COVID19
Other: Biospecimen collection
Mayo Clinic in Rochester
Rochester, Minnesota, United States
4
Recruiting
NEW
Saved From COVID-19
COVID
Drug: Chloroquine
Drug: Placebo oral tablet
Columbia University Irving Medical Center/NYP
New York, New York, United States
5
Recruiting
NEW
Efficacy of Convalescent Plasma Therapy in Severely Sick COVID-19 Patients
COVID
Drug: Convalescent Plasma Transfusion
Other: Supportive Care
Drug: Random Donor Plasma
Maulana Azad medical College
New Delhi, Delhi, India
Institute of Liver and Biliary Sciences
New Delhi, Delhi, India
6
Not yet recruiting
NEW
A Real-life Experience on Treatment of Patients With COVID 19
COVID
Drug: Chloroquine
Drug: Favipiravir
Drug: Nitazoxanide
(and 3 more...)
Tanta university hospital
Tanta, Egypt
7
Recruiting
International COVID19 Clinical Evaluation Registry,
COVID 19
Combination Product: Observational (registry)
Hospital Lclinico San Carlos
Madrid, Spain
8
Completed
NEW
AiM COVID for Covid 19 Tracking and Prediction
COVID 19
Other: No Intervention
Aarogyam (UK)
Leicester, United Kingdom
9
Recruiting
NEW
Establishing a COVID-19 Prospective Cohort for Identification of Secondary HLH
COVID
Department of nephrology, Klinikum rechts der Isar
München, Bavaria, Germany
10
Recruiting
NEW
Max Ivermectin- COVID 19 Study Versus Standard of Care Treatment for COVID 19 Cases. A Pilot Study
COVID
Drug: Ivermectin
Max Super Speciality hospital, Saket (A unit of Devki Devi Foundation)
New Delhi, Delhi, India
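My code performs the following logical steps:
- First, to save time retrieving the data, I select the option to show 100 results per page instead of 10.
- Second, I read all the results on the page (100), and when I am done I click the next-page button. I then use a sleep command to pause for several seconds. There are better ways to do this, but I wanted to give you something quick; you should replace the sleeps with an explicit "wait until element is visible" check.
- After clicking the next-page button, I save the results (100) again.
- The loop is meant to run until flag becomes False, which should happen once next_counter shows that every page has been read. The number 13 is 1300 (results) divided by 100 (the maximum number of results per page): 1300/100 = 13, so there are 13 pages.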
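The "wait until element is visible" idea mentioned in the steps above could be sketched with Selenium's explicit waits in place of the fixed sleeps. This is only a sketch under assumptions: `page_count` and `scrape_all_pages` are hypothetical helper names, the ids `theDataTable` and `theDataTable_next` are taken from the code in this thread, the page size is assumed to have been set to 100 already, and the table is assumed to replace its row elements when it draws the next page.

```python
import math


def page_count(total_results, page_size):
    """Number of pages needed to show total_results at page_size per page."""
    return math.ceil(total_results / page_size)


def scrape_all_pages(driver, total_results, page_size=100, timeout=15):
    """Collect the cell texts of every table row, one list per row.

    Hypothetical helper: `driver` is an already opened Selenium WebDriver
    that is showing the results table.
    """
    # Imported here so that page_count() works without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    rows_locator = (By.CSS_SELECTOR, "#theDataTable tbody tr")
    collected = []
    pages = page_count(total_results, page_size)

    for page in range(pages):
        # Block until the rows exist instead of sleeping a fixed time.
        wait.until(EC.presence_of_all_elements_located(rows_locator))
        first_row = driver.find_element(*rows_locator)

        # Look the rows up again on every page, never reuse old references.
        for row in driver.find_elements(*rows_locator):
            cells = row.find_elements(By.TAG_NAME, "td")
            collected.append([cell.text for cell in cells])

        if page < pages - 1:
            driver.find_element(By.ID, "theDataTable_next").click()
            # Assumption: the redraw detaches the old rows, so the old
            # first row turning stale signals that the new page is ready.
            wait.until(EC.staleness_of(first_row))

    return collected
```

Computing the page count from the total avoids hard-coding 13, and waiting for the old first row to go stale is one way to know the next page has been drawn before reading it.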
Editing and transferring the data is something you can handle yourself; it requires no knowledge of Selenium or web automation. It is 100% a Python matter.
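As a rough illustration of that pure-Python step, the collected cell texts could be grouped into rows and written to a local SQLite database. This is a minimal sketch, assuming the texts were appended in row order with seven cells per row, as in the loop above. The helper names `chunk` and `save_rows`, the table name `studies`, and the generic column names `col_0` to `col_6` are all hypothetical.

```python
import sqlite3


def chunk(cells, width):
    """Split a flat list of cell texts into tuples of `width` cells each."""
    if len(cells) % width != 0:
        raise ValueError("number of cells is not a multiple of the row width")
    return [tuple(cells[i:i + width]) for i in range(0, len(cells), width)]


def save_rows(connection, rows, width=7):
    """Create the (hypothetical) studies table if needed and insert rows."""
    columns = ", ".join(f"col_{i} TEXT" for i in range(width))
    placeholders = ", ".join("?" for _ in range(width))
    # Using the connection as a context manager commits on success
    # and rolls back if an insert fails.
    with connection:
        connection.execute(f"CREATE TABLE IF NOT EXISTS studies ({columns})")
        connection.executemany(
            f"INSERT INTO studies VALUES ({placeholders})", rows
        )


if __name__ == "__main__":
    # Stand-in data; in practice this would be the scraped list of texts.
    flat = [f"cell {n}" for n in range(14)]
    conn = sqlite3.connect(":memory:")  # use a file path for a real database
    save_rows(conn, chunk(flat, 7))
    print(conn.execute("SELECT COUNT(*) FROM studies").fetchone()[0])
    conn.close()
```

Replacing the generic column names with meaningful ones is left to whoever knows the table's actual header.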
Thanks @dpapadopoulos for the answer! I tried it, but I still get the StaleElementReferenceException.

@AkshayPhadnis A StaleElementReferenceException occurs when a web element changes in the DOM and the original reference to that element is lost. So please retry the code; I have updated it.

Can you check my answer? My code runs and collects all the data (all 1300 studies).
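One common way to cope with the lost reference described above is to repeat the lookup and the read together when the exception is raised. The following is a generic sketch, not code from this thread: `retry_on` is a hypothetical helper, and it is written without a Selenium import so that it can be tried on its own.

```python
import time


def retry_on(exceptions, action, attempts=3, delay=0.5):
    """Call action() and retry it when one of `exceptions` is raised.

    `exceptions` is an exception class or a tuple of classes. The last
    failure is re-raised once `attempts` calls have all failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            # Give the page a moment to finish redrawing before retrying.
            time.sleep(delay)
```

With Selenium, `exceptions` would be `StaleElementReferenceException` from `selenium.common.exceptions`, and `action` should look the rows up again from `driver` on every call, because retrying with the old element references would fail in the same way.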