Python 3.x: extracting URL links from a web page
I am new to web scraping. I am trying to use Python to extract data for the keywords "acute myeloid leukemia", "chronic myeloid leukemia", and "acute lymphocytic leukemia", and to pull the following information: clinical trial number, trial status, full trial title, sponsor name, country, medical condition under investigation, and the network of investigators participating in the trial.
I tried to collect the URL from each link so that I could then visit each page and extract the information, but I am not getting the right links. I want the full URL of each trial page, but instead I get relative paths like these:
'/ctr-search/trial/2014-000526-37/DE',
'/ctr-search/trial/2006-001777-19/NL',
'/ctr-search/trial/2006-001777-19/BE',
'/ctr-search/trial/2007-000273-35/IT',
'/ctr-search/trial/2011-005934-20/FR',
'/ctr-search/trial/2006-004950-25/GB',
'/ctr-search/trial/2009-017347-33/DE',
'/ctr-search/trial/2012-000334-19/IT',
'/ctr-search/trial/2012-001594-93/FR',
'/ctr-search/trial/2012-001594-93/results',
'/ctr-search/trial/2007-003103-12/DE',
'/ctr-search/trial/2006-004517-17/FR',
'/ctr-search/trial/2013-003421-28/DE',
'/ctr-search/trial/2008-002986-30/FR',
'/ctr-search/trial/2008-002986-30/results',
'/ctr-search/trial/2013-000238-37/NL',
'/ctr-search/trial/2010-018418-53/FR',
'/ctr-search/trial/2010-018418-53/NL',
'/ctr-search/trial/2010-018418-53/HU',
'/ctr-search/trial/2010-018418-53/DE',
'/ctr-search/trial/2010-018418-53/results',
'/ctr-search/trial/2006-006852-37/DE',
'/ctr-search/trial/2006-006852-37/ES',
'/ctr-search/trial/2006-006852-37/AT',
'/ctr-search/trial/2006-006852-37/CZ',
'/ctr-search/trial/2006-006852-37/NL',
'/ctr-search/trial/2006-006852-37/SK',
'/ctr-search/trial/2006-006852-37/HU',
'/ctr-search/trial/2006-006852-37/BE',
'/ctr-search/trial/2006-006852-37/IT',
'/ctr-search/trial/2006-006852-37/FR',
'/ctr-search/trial/2006-006852-37/GB',
'/ctr-search/trial/2008-000664-16/IT',
'/ctr-search/trial/2005-005321-63/IT',
'/ctr-search/trial/2005-005321-63/results',
'/ctr-search/trial/2011-005023-40/GB',
'/ctr-search/trial/2010-022446-24/DE',
'/ctr-search/trial/2010-019710-24/IT',
Code I tried:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia&page=1')
soup = BeautifulSoup(page.text, 'html.parser')
#links = [a['href'] for a in soup.find_all('a', href=True) if a.text]
#links_with_text = []
#for a in soup.find_all('a', href=True):
#    if a.text:
#        links_with_text.append(a['href'])
links = [a['href'] for a in soup.find_all('a', href=True)]
Output:
'/help.html',
'/ctr-search/search',
'/joiningtrial.html',
'/contacts.html',
'/about.html',
'/about.html',
'/whatsNew.html',
'/dataquality.html',
'/doc/Sponsor_Contact_Information_EUCTR.pdf',
'/natauthorities.html',
'/links.html',
'/about.html',
'/doc/How_to_Search_EU_CTR.pdf#zoom=100,0,0',
'javascript:void(0)',
'javascript:void(0)',
'javascript:void(0)',
'javascript:void();',
'#tabs-1',
'#tabs-2',
'&page=2',
'&page=3',
'&page=4',
'&page=5',
'&page=6',
'&page=7',
'&page=8',
'&page=9',
'&page=2',
'&page=19',
'/ctr-search/trial/2014-000526-37/DE',
'/ctr-search/trial/2006-001777-19/NL',
'/ctr-search/trial/2006-001777-19/BE',
'/ctr-search/trial/2007-000273-35/IT',
'/ctr-search/trial/2011-005934-20/FR',
'/ctr-search/trial/2006-004950-25/GB',
'/ctr-search/trial/2009-017347-33/DE',
'/ctr-search/trial/2012-000334-19/IT',
'/ctr-search/trial/2012-001594-93/FR',
'/ctr-search/trial/2012-001594-93/results',
'/ctr-search/trial/2007-003103-12/DE',
'/ctr-search/trial/2006-004517-17/FR',
'/ctr-search/trial/2013-003421-28/DE',
'/ctr-search/trial/2008-002986-30/FR',
'/ctr-search/trial/2008-002986-30/results',
'/ctr-search/trial/2013-000238-37/NL',
'/ctr-search/trial/2010-018418-53/FR',
'/ctr-search/trial/2010-018418-53/NL',
'/ctr-search/trial/2010-018418-53/HU',
'/ctr-search/trial/2010-018418-53/DE',
'/ctr-search/trial/2010-018418-53/results',
'/ctr-search/trial/2006-006852-37/DE',
'/ctr-search/trial/2006-006852-37/ES',
'/ctr-search/trial/2006-006852-37/AT',
'/ctr-search/trial/2006-006852-37/CZ',
'/ctr-search/trial/2006-006852-37/NL',
'/ctr-search/trial/2006-006852-37/SK',
'/ctr-search/trial/2006-006852-37/HU',
'/ctr-search/trial/2006-006852-37/BE',
'/ctr-search/trial/2006-006852-37/IT',
'/ctr-search/trial/2006-006852-37/FR',
'/ctr-search/trial/2006-006852-37/GB',
'/ctr-search/trial/2008-000664-16/IT',
'/ctr-search/trial/2005-005321-63/IT',
'/ctr-search/trial/2005-005321-63/results',
'/ctr-search/trial/2011-005023-40/GB',
'/ctr-search/trial/2010-022446-24/DE',
'/ctr-search/trial/2010-019710-24/IT',
'javascript:void(0)',
'&page=2',
'&page=3',
'&page=4',
'&page=5',
'&page=6',
'&page=7',
'&page=8',
'&page=9',
'&page=2',
'&page=19',
'https://servicedesk.ema.europa.eu',
'/disclaimer.html',
'http://www.ema.europa.eu',
'http://www.hma.eu'
As I said, you can do this by prepending the required part of the URL to each result. Try the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia&page=1')
soup = BeautifulSoup(page.text, 'html.parser')
links = ["https://www.clinicaltrialsregister.eu" + a['href'] for a in soup.find_all('a', href=True)]
The script below walks through all pages of the search results and tries to find the relevant information. It is necessary to prepend the full base URL, https://www.clinicaltrialsregister.eu, not just use the relative path.
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.clinicaltrialsregister.eu/ctr-search/search?query=acute+myeloid+leukemia'
url = base_url + '&page=1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

page = 1
while True:
    print('Page no.{}'.format(page))
    print('-' * 160)
    print()

    # each search result is rendered as one <table class="result">
    for table in soup.select('table.result'):
        print('EudraCT Number: ', end='')
        for span in table.select('td:contains("EudraCT Number:")'):
            print(span.get_text(strip=True).split(':')[1])

        print('Full Title: ', end='')
        for td in table.select('td:contains("Full Title:")'):
            print(td.get_text(strip=True).split(':')[1])

        print('Sponsor Name: ', end='')
        for td in table.select('td:contains("Sponsor Name:")'):
            print(td.get_text(strip=True).split(':')[1])

        print('Trial protocol: ', end='')
        for a in table.select('td:contains("Trial protocol:") a'):
            print(a.get_text(strip=True), end=' ')
        print()

        print('Medical condition: ', end='')
        for td in table.select('td:contains("Medical condition:")'):
            print(td.get_text(strip=True).split(':')[1])

        print('-' * 160)

    # follow the "Next»" pagination link until it disappears
    next_page = soup.select_one('a:contains("Next»")')
    if next_page:
        soup = BeautifulSoup(requests.get(base_url + next_page['href']).text, 'lxml')
        page += 1
    else:
        break
Prints:
Page no.1
----------------------------------------------------------------------------------------------------------------------------------------------------------------
EudraCT Number: 2014-000526-37
Full Title: An Investigator-Initiated Study To Evaluate Ara-C and Idarubicin in Combination with the Selective Inhibitor Of Nuclear Export (SINE)
Selinexor (KPT-330) in Patients with Relapsed Or Refractory A...
Sponsor Name: GSO Global Clinical Research B.V.
Trial protocol: DE
Medical condition: Patients with relapsed/refractory Acute Myeloid Leukemia (AML)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
EudraCT Number: 2006-001777-19
Full Title: A Phase II multicenter study to assess the tolerability and efficacy of the addition of Bevacizumab to standard induction therapy in AML and
high risk MDS above 60 years.
Sponsor Name: HOVON foundation
Trial protocol: NL BE
Medical condition: Acute myeloid leukaemia (AML), AML FAB M0-M2 or M4-M7;
diagnosis with refractory anemia with excess of blasts (RAEB) or refractory anemia with excess of blasts in transformation (RAEB-T) with an IP...
----------------------------------------------------------------------------------------------------------------------------------------------------------------
EudraCT Number: 2007-000273-35
Full Title: A Phase II, Open-Label, Multi-centre, 2-part study to assess the Safety, Tolerability, and Efficacy of Tipifarnib Plus Bortezomib in the Treatment of Newly Diagnosed Acute Myeloid Leukemia AML ...
Sponsor Name: AZIENDA OSPEDALIERA DI BOLOGNA POLICLINICO S. ORSOLA M. MALPIGHI
Trial protocol: IT
Medical condition: Acute Myeloid Leukemia
----------------------------------------------------------------------------------------------------------------------------------------------------------------
...and so on.
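The loop above prints the text fields but not the per-trial URLs the question asked for. Those country links inside each result table can be collected in the same pass with an attribute selector, which also sidesteps `:contains` entirely. A minimal offline sketch; the HTML fragment is made up to mirror the register's result markup, which may differ:

```python
from bs4 import BeautifulSoup

BASE = 'https://www.clinicaltrialsregister.eu'

# made-up fragment shaped like one search-result table with its country links
html = '''
<table class="result">
  <tr><td>EudraCT Number: 2006-001777-19</td></tr>
  <tr><td>Trial protocol:
      <a href="/ctr-search/trial/2006-001777-19/NL">NL</a>
      <a href="/ctr-search/trial/2006-001777-19/BE">BE</a>
  </td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

for table in soup.select('table.result'):
    # match only anchors whose href starts with the trial path
    urls = [BASE + a['href'] for a in table.select('a[href^="/ctr-search/trial/"]')]
    print(urls)
```

In the real script, the same `table.select(...)` call inside the `for table in soup.select('table.result')` loop would yield each trial's country-specific URLs alongside the printed fields.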
Comments:

"Just add the required URL onto the results while you parse." — I did that. I am using the parsing approach, but I cannot work out how to collect the correct link for each ID.

Thanks for the code, but I got this error when running it: NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.

@DipankarNeogi make sure you are running an up-to-date version of bs4. Mine is beautifulsoup4==4.8.0.

Now I get this error: AttributeError: 'LXMLTreeBuilder' object has no attribute 'initialize_soup'.

@DipankarNeogi do you have the lxml module installed? Mine is lxml==4.3.4. Alternatively, you can replace lxml with the html.parser parser.

I get the same error with html.parser: 'HTMLParserTreeBuilder' object has no attribute 'initialize_soup'.
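For what it's worth, newer soupsieve releases (2.1+) deprecate the bare `:contains` selector used above in favour of the `:-soup-contains` spelling, which avoids the NotImplementedError on setups where the pseudo-class is missing. A minimal offline sketch; the HTML fragment is made up to mirror one result table:

```python
from bs4 import BeautifulSoup

# made-up fragment shaped like one search-result table
html = '''
<table class="result">
  <tr><td>EudraCT Number: 2014-000526-37</td></tr>
  <tr><td>Sponsor Name: Example Sponsor B.V.</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# :-soup-contains is the non-deprecated spelling in soupsieve >= 2.1
numbers = [
    td.get_text(strip=True).split(':')[1].strip()
    for td in soup.select('td:-soup-contains("EudraCT Number:")')
]
print(numbers)
```

If the selector still fails, upgrading both beautifulsoup4 and soupsieve together is usually the fix, since bs4 delegates all CSS selector support to soupsieve.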