Extracting text between tags that contain a given word, using Python
Tags: python, xml, nlp


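I have extracted some text from an XML document, and I am trying to pull out the text of the tags that contain certain words.

For example:

search('adverse')
should return the text of every tag that contains the word "adverse".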

Out: 
  [
    "<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>"
  ]
What tool should I use for this? Regular expressions? BS4? Any suggestions would be appreciated.


Sample text:

 </highlight>
 </excerpt>
 <component>
 <section id="ID40">
 <id root="fbc21d1a-2fb2-47b1-ac53-f84ed1428bb4"></id>
 <title>6.1 Clinical Trials Experience</title>
 <text>
 <paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>
 <list id="ID42" listtype="unordered" stylecode="Disc">
 <item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>


You can either hard-code this with a regular expression, or use something like …

With a regular expression, for example:

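import re

your_text = "(...)"

def search(instr):
    # Match one element whose own text contains the word. [^<]* keeps the
    # match inside a single tag pair, and re.escape guards special characters.
    pattern = r"<(\w+)[^>]*>[^<]*{}[^<]*</\1>".format(re.escape(instr))
    return [m.group(0) for m in re.finditer(pattern, your_text)]

print(search("safety"))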
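For the parser route, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than BS4. It assumes the text is well-formed XML with a single root element; the sample text in the question is a fragment that begins with stray closing tags, so it would need to be wrapped or trimmed first. The SAMPLE string below is a shortened, hypothetical stand-in, not the original document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the sample text (shortened).
SAMPLE = """<section id="ID40">
  <title>6.1 Clinical Trials Experience</title>
  <text>
    <paragraph id="ID41">The clinical efficacy and safety of the combination have been evaluated.</paragraph>
    <list id="ID42">
      <item>The most common adverse reactions were impotence and dizziness.</item>
    </list>
  </text>
</section>"""


def search(word, xml_text=SAMPLE):
    """Return the serialized elements whose own text contains `word`."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        # elem.text is the text directly inside the tag, before any child element
        if elem.text and word in elem.text:
            elem.tail = None  # drop the whitespace that follows the closing tag
            hits.append(ET.tostring(elem, encoding="unicode"))
    return hits


print(search("adverse"))
```

Unlike the regular expression, this checks only each element's direct text, so a container such as the list element is not returned just because one of its children matches. The match is case-sensitive; lowercasing both sides would be one way to relax that.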