Python: Getting data from specific fields on ClinicalTrials.Gov

python web-scraping beautifulsoup

I wrote a function that takes an NCTID (a ClinicalTrials.Gov ID) and fetches data from ClinicalTrials.Gov:

import requests
from bs4 import BeautifulSoup

def clinicalTrialsGov(nctid):
    # Fetch the study record as XML and collect every tag named in subset
    data = BeautifulSoup(requests.get("https://clinicaltrials.gov/ct2/show/" + nctid + "?displayxml=true").text, "xml")
    subset = ['intervention_type', 'study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'official_title', 'condition', 'minimum_age', 'maximum_age', 'gender', 'healthy_volunteers', 'phase', 'primary_outcome', 'secondary_outcome', 'number_of_arms']
    tag_matches = data.find_all(subset)

Then I do the following:

tag_dict = dict((str('ct' + tag_matches[i].name.capitalize()), tag_matches[i].text) for i in range(0, len(tag_matches)))
for key in tag_dict:
    print(key + ': ' + tag_dict[key])

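This converts the data into a dictionary. However, if a field occurs more than once (for example, when a study has several intervention types), only one of the values is kept. How can I adjust this code so that fields with multiple values are listed as a comma-separated list?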

Current output:

ctOfficial_title: Aerosolized Beta-Agonist Isomers in Asthma
ctPhase: Phase 4
ctStudy_type: Interventional
ctAllocation: Non-Randomized
ctIntervention_model: Crossover Assignment
ctPrimary_purpose: Treatment
ctMasking: None (Open Label)
ctPrimary_outcome: 
Change in Maximum Forced Expiratory Volume at One Second (FEV1)
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment

ctSecondary_outcome: 
Change in Dyspnea Response as Measured by the University of California, San Diego (UCSD) Dyspnea Scale
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment

ctNumber_of_arms: 5
ctEnrollment: 10
ctCondition: Asthma
ctIntervention_type: Drug
ctGender: All
ctMinimum_age: 18 Years
ctMaximum_age: N/A
ctHealthy_volunteers: No

Desired output:

ctOfficial_title: Aerosolized Beta-Agonist Isomers in Asthma
ctPhase: Phase 4
ctStudy_type: Interventional
ctAllocation: Non-Randomized
ctIntervention_model: Crossover Assignment
ctPrimary_purpose: Treatment
ctMasking: None (Open Label)
ctPrimary_outcome: 
Change in Maximum Forced Expiratory Volume at One Second (FEV1)
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment

ctSecondary_outcome: 
Change in Dyspnea Response as Measured by the University of California, San Diego (UCSD) Dyspnea Scale
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment

ctNumber_of_arms: 5
ctEnrollment: 10
ctCondition: Asthma
ctIntervention_type: Drug, Drug, Other, Device, Device, Drug
ctGender: All
ctMinimum_age: 18 Years
ctMaximum_age: N/A
ctHealthy_volunteers: No

How can I adjust the code so that it scrapes all of the intervention types?

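You are seeing only the last tag value because every earlier value is overwritten by the next one with the same key. You need to check whether the key already exists in the dictionary and, if it does, append to it instead of assigning. Something like this: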

tag_dict = {}
for tag in tag_matches:
    key = 'ct' + tag.name.capitalize()
    if key in tag_dict:
        # Key already seen: append this value instead of overwriting it
        tag_dict[key] += ', ' + tag.text
    else:
        tag_dict[key] = tag.text

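Your code fails because it overwrites the previous value stored under a given dictionary key. Instead, you need to append to the existing entry.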
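To make the overwriting concrete, here is a minimal sketch that needs no network access. The `pairs` list is a made-up stand-in for `tag_matches` (tag name plus text), not data taken from the real record.

```python
# Hypothetical (tag name, text) pairs standing in for tag_matches
pairs = [
    ("intervention_type", "Drug"),
    ("intervention_type", "Other"),
    ("intervention_type", "Device"),
    ("condition", "Asthma"),
]

# Building a dict directly keeps only the last value seen for each key
overwritten = dict(("ct" + name.capitalize(), text) for name, text in pairs)
print(overwritten)

# Appending to an existing entry keeps every value
combined = {}
for name, text in pairs:
    key = "ct" + name.capitalize()
    if key in combined:
        combined[key] += ", " + text
    else:
        combined[key] = text
print(combined)
```

The first dictionary ends up with a single intervention type, while the second holds all three joined by commas.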

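You can use Python's defaultdict, which automatically creates a list for each new key. If a field has multiple entries, each one is appended to that key's list. When printing, you can then join the list back together with whatever separator you need: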

from collections import defaultdict

import requests
from bs4 import BeautifulSoup

def clinicalTrialsGov(nctid):
    data = defaultdict(list)
    soup = BeautifulSoup(requests.get("https://clinicaltrials.gov/ct2/show/" + nctid + "?displayxml=true").text, "xml")
    subset = ['intervention_type', 'study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'official_title', 'condition', 'minimum_age', 'maximum_age', 'gender', 'healthy_volunteers', 'phase', 'primary_outcome', 'secondary_outcome', 'number_of_arms']

    for tag in soup.find_all(subset):
        data['ct{}'.format(tag.name.capitalize())].append(tag.get_text(strip=True))

    for key in data:
        print('{}: {}'.format(key, ', '.join(data[key])))

clinicalTrialsGov('NCT02170532')
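If you would rather get a dictionary back than print inside the function, the grouping step can be separated from the download. The following is only a sketch under stated assumptions: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so that it runs offline, and `SAMPLE_XML` and `collect_fields` are hypothetical names invented for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up XML fragment for illustration only; real records have many more fields
SAMPLE_XML = """
<clinical_study>
  <condition>Asthma</condition>
  <intervention><intervention_type>Drug</intervention_type></intervention>
  <intervention><intervention_type>Device</intervention_type></intervention>
  <phase>Phase 4</phase>
</clinical_study>
"""

def collect_fields(xml_text, subset):
    """Group the text of every element named in subset under a 'ct'-prefixed key."""
    wanted = set(subset)
    grouped = defaultdict(list)
    # iter() walks the tree in document order, including nested elements
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag in wanted:
            grouped["ct" + elem.tag.capitalize()].append((elem.text or "").strip())
    # Join multiple values into one comma-separated string per key
    return {key: ", ".join(values) for key, values in grouped.items()}

fields = collect_fields(SAMPLE_XML, ["intervention_type", "condition", "phase"])
for key, value in fields.items():
    print(f"{key}: {value}")
```

Returning the dictionary keeps the printing format a decision for the caller, and the parsing can be tried on a saved XML string without hitting the site.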

This displays the following:

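ctOfficial_title: Aerosolized Beta-Agonist Isomers in Asthma
ctPhase: Phase 4
ctStudy_type: Interventional
ctAllocation: Non-Randomized
ctIntervention_model: Crossover Assignment
ctPrimary_purpose: Treatment
ctMasking: None (Open Label)
ctPrimary_outcome: Change in Maximum Forced Expiratory Volume at One Second (FEV1)Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment
ctSecondary_outcome: Change in 8 Hour Area Under the Curve FEV10 to 8 hours post dose, Change in Heart RateBaseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment, Change in Tremor Assessment Measured by a ScaleBaseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatmentTremor assessment will be made on outstretched hands (0, 1+ = fine tremor, barely perceptible, 2+ = obvious tremor), Change in Dyspnea Response as Measured by the University of California, San Diego (UCSD) Dyspnea ScaleBaseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment
ctNumber_of_arms: 5
ctEnrollment: 10
ctCondition: Asthma
ctIntervention_type: Drug, Drug, Other, Device, Device, Drug
ctGender: All
ctMinimum_age: 18 Years
ctMaximum_age: N/A
ctHealthy_volunteers: No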
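Note that in fields with child elements, such as the outcome fields, the child texts run together (for example "FEV1" followed directly by "0 to 8 hours"), because the stripped pieces are joined with no separator. With BeautifulSoup you could pass a separator as the first argument, for instance tag.get_text('; ', strip=True). The sketch below shows the same idea offline with the standard library instead of BeautifulSoup; the XML fragment and the `flatten` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like an outcome field with two child elements
outcome = ET.fromstring(
    "<primary_outcome>"
    "<measure>Change in FEV1</measure>"
    "<time_frame>0 to 8 hours</time_frame>"
    "</primary_outcome>"
)

def flatten(elem, sep="; "):
    """Join the non-empty text pieces found under elem with sep."""
    pieces = (piece.strip() for piece in elem.itertext())
    return sep.join(piece for piece in pieces if piece)

run_together = flatten(outcome, sep="")  # pieces concatenated directly
separated = flatten(outcome)             # pieces joined with "; "
print(run_together)
print(separated)
```

Picking a separator other than ", " for nested pieces also keeps them distinguishable from the commas that join repeated fields.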

Could you add your expected output? For the ID you gave, I see 6 intervention type fields. Tested using bs 4.6.0.
@bla See the edit.
@MartinEvans See the edit.