Python: selective parsing
I want to parse data from a drug website. The parsing needs to be selective. This is the code I am using:
import requests
from bs4 import BeautifulSoup


def get_details(url):
    print('details:', url)
    # get subpage
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    # get data on subpage
    dts = soup.findAll('dt')
    dds = soup.findAll('dd')
    # display details
    for dt, dd in zip(dts, dds):
        print(dt.text)
        print(dd.text)
        print('---')
    print('---------------------------')


def drug_data():
    url = 'https://www.drugbank.ca/drugs/'
    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")
        # get links to subpages
        links = soup.select('strong a')
        for link in links:
            # execute function to get subpage
            get_details('https://www.drugbank.ca' + link['href'])
        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break


drug_data()
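This works well. But how can I parse more deeply and selectively? For example, for this drug: when I parse the patents with my code, all of the patent information is concatenated into a single paragraph, as one sub-table.
Ideally, I would parse the patents but extract only the patent number, the approval date, and the country represented by the flag, each in its own column.
Any help?
Here is a screenshot of the patents: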
If you are looking for the accession number and the groups, you can do the following:
def get_details(url):
    print('Details:', url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # each value is the node that directly follows its <dt> label
    accession_dt = soup.find('dt', text='Accession Number')
    accession_number = accession_dt.nextSibling.string
    groups_dt = soup.find('dt', text='Groups')
    groups = groups_dt.nextSibling.string
    print('Accession number: ' + accession_number)
    print('Groups: ' + groups)
For the URL you provided, the output looks like this:
>>> get_details('https://www.drugbank.ca/drugs/DB01614')
Details: https://www.drugbank.ca/drugs/DB01614
Accession number: DB01614
Groups: Approved, Vet Approved
To generalize this, you can define a function that returns the text for the key passed as an argument:
def get_value(soup, key):
    key_dt = soup.find('dt', text=key)
    return key_dt.nextSibling.string
To use this function, you can do the following:
def get_details(url):
    print('Details:', url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    accession_number = get_value(soup, 'Accession Number')
    groups = get_value(soup, 'Groups')
    print('Accession number: ' + accession_number)
    print('Groups: ' + groups)
Its output is the same as above.
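One caveat with get_value as written: it raises an AttributeError when a page has no matching dt, and nextSibling can return a whitespace text node if the markup puts a line break between the dt and its dd. The following is a minimal sketch of a more defensive variant, not the answer's original code. The `default` parameter and the sample markup are assumptions made for illustration, and `string=` is the newer spelling of the `text=` argument used above.

```python
from bs4 import BeautifulSoup


def get_value(soup, key, default=None):
    """Return the text of the <dd> that follows the <dt> labelled `key`."""
    key_dt = soup.find('dt', string=key)
    if key_dt is None:
        return default
    # find_next_sibling skips whitespace text nodes between <dt> and <dd>
    value_dd = key_dt.find_next_sibling('dd')
    if value_dd is None:
        return default
    return value_dd.get_text(strip=True)


# Hypothetical markup, for illustration only (not copied from the site)
sample_html = """
<dl>
  <dt>Accession Number</dt>
  <dd>DB01614</dd>
  <dt>Groups</dt>
  <dd>Approved, Vet Approved</dd>
</dl>
"""
soup = BeautifulSoup(sample_html, 'html.parser')
print(get_value(soup, 'Accession Number'))
print(get_value(soup, 'Patents', default='Not Available'))
```

Returning a default instead of raising lets a crawl over many drug pages continue when one page lacks a field.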
Edit: answer to the question
This gives you exactly what you asked for:
def get_details(url):
    print('Details:', url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # the patents table sits in the node right after the 'Patents' label
    patents = soup.find('dt', text='Patents').nextSibling
    if patents.string == 'Not Available':
        print('Patent: Not Available')
    else:
        for i, row in enumerate(patents.find('tbody').findAll('tr')):
            print('\nPatent entry %d:' % (i+1))
            patent_number = row.find('a').text
            # third cell of the row holds the approval date
            patent_approved = row.findAll('td')[2].text
            # the flag image's alt text names the country
            patent_country = row.find('img')['alt']
            print('Patent number: ' + patent_number)
            print('Approved: ' + patent_approved)
            print('Country: ' + patent_country)
For this drug, the output is:
Details: https://www.drugbank.ca/drugs/DB00639
Patent entry 1:
Patent number: US5266329
Approved: 1993-11-30
Country: Us
Patent entry 2:
Patent number: US5993856
Approved: 1997-11-17
Country: Us
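Since the goal is to have the patent number, approval date, and country in separate columns, it may help to return the rows as data instead of printing them. Below is a rough sketch along those lines. The function name `parse_patents`, the dictionary keys, and the sample markup are all assumptions made for illustration; the markup only mirrors the structure the code above relies on (a link in the row, the approval date in the third cell, a flag image with alt text) and is not copied from the site.

```python
from bs4 import BeautifulSoup


def parse_patents(soup):
    """Return one dict per patent row; an empty list if none are listed."""
    patents_dt = soup.find('dt', string='Patents')
    if patents_dt is None:
        return []
    patents_dd = patents_dt.find_next_sibling('dd')
    if patents_dd is None:
        return []
    tbody = patents_dd.find('tbody')
    if tbody is None:
        # covers the 'Not Available' case, where there is no table
        return []
    rows = []
    for tr in tbody.find_all('tr'):
        cells = tr.find_all('td')
        link = tr.find('a')
        flag = tr.find('img')
        if link is None or len(cells) < 3:
            continue  # skip rows that do not match the expected layout
        rows.append({
            'patent_number': link.get_text(strip=True),
            'approved': cells[2].get_text(strip=True),
            'country': flag.get('alt', '') if flag is not None else '',
        })
    return rows


# Hypothetical markup, for illustration only; '-' is a placeholder cell
sample_html = """
<dl>
  <dt>Patents</dt>
  <dd>
    <table>
      <tbody>
        <tr>
          <td><a href="#">US5266329</a></td><td>-</td>
          <td>1993-11-30</td><td><img alt="Us" src="#"></td>
        </tr>
        <tr>
          <td><a href="#">US5993856</a></td><td>-</td>
          <td>1997-11-17</td><td><img alt="Us" src="#"></td>
        </tr>
      </tbody>
    </table>
  </dd>
</dl>
"""
patent_rows = parse_patents(BeautifulSoup(sample_html, 'html.parser'))
for patent in patent_rows:
    print(patent)
```

A list of dictionaries like this can be passed directly to csv.DictWriter or to pandas.DataFrame.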
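Does the patent number mean the accession number? Does "approved" mean the groups? I don't see a flag anywhere.
It is at the bottom of the page.
I see it now; the link you provided has no patents.
After the comments, I can see this is not what you wanted. I will leave it here anyway, since it is similar to what you want. I have edited the answer and added the solution at the bottom.
Thank you very much. I am now running the whole code and trying to add patent/approved/country columns to the final CSV output.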
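For the CSV step mentioned in the last comment, one possible approach is the standard-library csv module, writing one line per patent with the accession number repeated in the first column. This is only a sketch: the function name `patents_to_csv`, the column names, and the list-of-dictionaries input shape are assumptions, and the sample values are the ones from the output shown above.

```python
import csv
import io

# Column order of the output file (names are assumed for this sketch)
FIELDNAMES = ['accession_number', 'patent_number', 'approved', 'country']


def patents_to_csv(accession_number, patents, out):
    """Write a header plus one CSV line per patent to the file-like `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for patent in patents:
        row = {'accession_number': accession_number}
        row.update(patent)
        writer.writerow(row)


# Values taken from the example output for DB00639
patents = [
    {'patent_number': 'US5266329', 'approved': '1993-11-30', 'country': 'Us'},
    {'patent_number': 'US5993856', 'approved': '1997-11-17', 'country': 'Us'},
]
buffer = io.StringIO()
patents_to_csv('DB00639', patents, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, 'w', newline='') as `out`. When crawling many drugs, the header would need to be written only once.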