Extracting sentences from an XML paragraph with Python


I want to process XML paragraphs sentence by sentence, but the sentences are not marked up in the source. My input looks like this:

<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/"> 
Recently, a first step in this direction has been taken
in the form of the framework called &#8220;dynamical fingerprints&#8221;,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>


I would like my input to look more like this:

<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">
<s>Recently, a first step in this direction has been taken
in the form of the framework called &#8220;dynamical fingerprints&#8221;,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> </s><s>Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></s></p>


That way I could extract each complete sentence, for example:

<s xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">Recently, a first step in this direction has been taken
in the form of the framework called &#8220;dynamical fingerprints&#8221;,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> </s>

<s xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></s>
My test code is:

from lxml import etree

if __name__=="__main__":

  xml1 = '''<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/"> 
Recently, a first step in this direction has been taken
in the form of the framework called &#8220;dynamical fingerprints&#8221;,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>
'''


  print(xml1)

  root = etree.XML(xml1)
  sentences_info = []
  for sentence in root:
    # I want to do more fun stuff here with the result
    sentence_text = sentence.text
    ref_ids = []
    for reference in sentence:  # iterate over the child elements directly
        if 'rid' in reference.attrib:
            ref_id = reference.attrib['rid']
            ref_ids.append(ref_id)
    sent_par = {'reference_ids': ref_ids, 'text': sentence_text}
    sentences_info.append(sent_par)
    print(sent_par)

Converting the BeautifulSoup object to a string and then cleaning it up with a regular expression works well. For example:

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('yourlink.com'), 'lxml')

paragraphs = str(soup.findAll('p'))  # turn the list of <p> tags into a string

# split on the literal citation markup; this creates a list of sentences
sentences = paragraphs.split('<sup><xref ref-type="bibr" rid="ref56">56</xref></sup>')

clean = []
for e in sentences:
    e = re.sub(r'(<.*?>)', '', e)  # get rid of the remaining tags
    clean.append(e)

As far as I know, there is no built-in way to handle sentences in XML; it needs its own ad hoc solution.
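As a rough illustration of what such an ad hoc solution might look like with lxml, here is a minimal sketch that regroups the direct content of a paragraph into `<s>` elements. Everything in it is an assumption for illustration, not part of the answer: the name `wrap_sentences`, the private-use placeholder character, and above all the naive boundary rule (sentence punctuation, optional inline elements such as `<sup>` citations, whitespace, then an uppercase letter). It treats child elements as atomic, so a sentence break inside a child element is not detected, and abbreviations followed by a capitalized word would be split incorrectly.

```python
import copy
import re

from lxml import etree

PLACEHOLDER = "\uE000"  # assumed not to occur in the text; stands in for one child element

# Naive boundary (an assumption): ., ! or ?, then any inline elements,
# then whitespace, then an uppercase letter starting the next sentence.
BOUNDARY = re.compile("[.!?]" + PLACEHOLDER + r"*\s+(?=[A-Z])")


def wrap_sentences(p):
    """Return a new copy of p whose content is regrouped into <s> children."""
    children = list(p)
    # Flatten text and tails into one string, one placeholder per child element.
    flat = (p.text or "") + "".join(PLACEHOLDER + (c.tail or "") for c in children)

    start = len(flat) - len(flat.lstrip())  # keep leading whitespace outside <s>
    bounds = [start] + [m.end() for m in BOUNDARY.finditer(flat)] + [len(flat)]

    ns = etree.QName(p).namespace
    s_tag = "{%s}s" % ns if ns else "s"

    new_p = etree.Element(p.tag, attrib=dict(p.attrib), nsmap=p.nsmap)
    new_p.text = flat[:start] or None
    remaining = iter(children)
    for begin, end in zip(bounds, bounds[1:]):
        chunk = flat[begin:end]
        if not chunk:
            continue
        parts = chunk.split(PLACEHOLDER)
        s = etree.SubElement(new_p, s_tag)
        s.text = parts[0] or None
        # Each placeholder maps back to the next original child, in order.
        for tail in parts[1:]:
            child = copy.deepcopy(next(remaining))
            child.tail = tail or None
            s.append(child)
    return new_p


sample = (
    '<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">First claim.'
    '<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Second claim.'
    '<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>'
)
wrapped = wrap_sentences(etree.XML(sample))
ns = etree.QName(wrapped).namespace
for s in wrapped:
    rids = [x.get("rid") for x in s.iter("{%s}xref" % ns)]
    print(rids, etree.tostring(s, encoding="unicode"))
```

Flattening to a single string keeps the boundary rule in one regular expression, at the cost of ignoring any sentence structure inside child elements. A real sentence tokenizer could replace `BOUNDARY` if the naive rule proves too crude.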


This happens because the parsed XML still carries its namespace. Essentially, every element you get back from the parser looks like this:

<Element {https://jats.nlm.nih.gov/ns/archiving/1.0/}p at 0x108219048>
So parse the XML and then remove the namespace:

tree = etree.fromstring(xml1)
remove_namespace(tree) # remove namespace
tree.findall('sup') # output as [<Element sup at 0x1081d73c8>, <Element sup at 0x1081d7648>]
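The definition of `remove_namespace` is not shown in this answer. One possible version, offered only as a sketch and not as the answerer's actual helper, rewrites each element tag to its local name in place. It leaves namespaced attributes untouched.

```python
from lxml import etree


def remove_namespace(tree):
    """Hypothetical helper: strip the namespace from every element tag, in place."""
    for elem in tree.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = etree.QName(elem).localname
    etree.cleanup_namespaces(tree)  # drop xmlns declarations that are no longer used


sample = (
    '<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">First claim.'
    '<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Second claim.'
    '<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>'
)
tree = etree.fromstring(sample)
remove_namespace(tree)
print(tree.tag, len(tree.findall("sup")))
```

If you would rather keep the namespace, the fully qualified form `tree.findall('{https://jats.nlm.nih.gov/ns/archiving/1.0/}sup')` finds the same elements without modifying the tree.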

This is a fix specific to my example. I need something very general. I will ask another question along similar lines.
You probably won't get a "very general" way to pull sentences out of data like this. The XML modules have no tool for it, so you will have to build a custom solution.
OK, I will build a solution and hopefully the community can help me clean it up. Thanks for your help!
No problem. If you specify what the solution needs to generalize over, I'd be happy to take another stab at it.