Python: search for text on the line below a known string?

python regex xml web-scraping beautifulsoup

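I have written a script using the Python module BeautifulSoup to fetch XML from a web page. The page describes a project that uses genomic data, and I want to extract all the PUBMED IDs (unique ID numbers for the publications arising from the project). Each PUBMED ID is an 8-digit number.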

I have tried two different ways of extracting the PUBMED IDs, but both have problems. First, I pull the complete XML with the following code:

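# Python 2; these two imports are assumed (the original snippet omits them)
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://www.ebi.ac.uk/ena/data/view/PRJEB2357&display=xml'
project_page = urlopen(url)
soup = BeautifulSoup(project_page, "html.parser")
print soup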

The output of this command looks something like this:

<db>PUBMED</db>
<id>25101644</id>
</xref_link>
</project_link>
<project_link>
<xref_link>
<db>PUBMED</db>
<id>24509479</id>

When I print only the text of the soup (soup.text), the output looks like this:

PUBMED
25101644




PUBMED
24509479


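At this point I had a few ideas. First, the Python re module (regex in earlier versions of Python) could be used to search for the expression, but every re function I know of needs at least part of the pattern as input, so I did not think that was an option. Second, I tried this: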


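# Python 2; the imports are assumed (the original snippet omits them)
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://www.ebi.ac.uk/ena/data/view/PRJEB2357&display=xml'
project_page = urlopen(url)
soup2 = BeautifulSoup(project_page, "html.parser")
text = soup2.text
text = text.replace('\n', ' ').replace(' ', '')  # removes all spaces and line breaks
PMID = re.findall('PUBMED........', text, flags=0)  # "PUBMED" plus any 8 characters
print PMID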

This gives the following output:

[u'PUBMED25101644', u'PUBMED24509479']

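So in theory this could be converted to a string from which I cut out the relevant 8 digits, but that gets very crude. I want to run this script many times over the web pages of several thousand projects, and the number of PUBMED IDs varies from project to project, so this approach does not lend itself to automation.

What I want is a way to search for every instance of the word "PUBMED", either in the raw soup or in the text, and extract only the PUBMED ID on the following line. Does anyone have suggestions on how to do this?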

Find all occurrences of PUBMED and get:
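One way to read that step, as a minimal sketch rather than the answerer's own code: match every db tag whose text is PUBMED and take the id tag that follows it. The sample markup and the helper name pubmed_ids are assumptions made for illustration; html.parser is used here only so the sketch runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the fragment in the question.
SAMPLE = """
<project_link><xref_link><db>PUBMED</db><id>25101644</id></xref_link></project_link>
<project_link><xref_link><db>OTHER</db><id>12345678</id></xref_link></project_link>
<project_link><xref_link><db>PUBMED</db><id>24509479</id></xref_link></project_link>
"""


def pubmed_ids(markup):
    """Return the text of the <id> tag following each <db>PUBMED</db> tag."""
    soup = BeautifulSoup(markup, "html.parser")
    ids = []
    # string="PUBMED" keeps only <db> tags whose sole content is that text.
    for db in soup.find_all("db", string="PUBMED"):
        id_tag = db.find_next_sibling("id")
        if id_tag is not None:  # skip entries that have no <id> sibling
            ids.append(id_tag.get_text(strip=True))
    return ids


print(pubmed_ids(SAMPLE))
```

Because the loop collects however many matches exist, the same function should work for projects with any number of PUBMED entries.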

Or, make a:

Note that you should use the xml parser rather than html.parser:

soup = BeautifulSoup(project_page, "xml")
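The "xml" parser in BeautifulSoup relies on lxml being installed. To illustrate the same idea of treating the page as XML rather than HTML without extra dependencies, here is a hedged sketch that swaps in the standard library's xml.etree.ElementTree instead. The sample document, its root element, and the lowercase tag names are assumptions. An XML parser is case-sensitive, whereas html.parser lowercases tag names, so the names must match the case used in the real document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modelled on the fragment in the question.
SAMPLE_XML = """<root>
  <project_link><xref_link><db>PUBMED</db><id>25101644</id></xref_link></project_link>
  <project_link><xref_link><db>OTHER</db><id>12345678</id></xref_link></project_link>
  <project_link><xref_link><db>PUBMED</db><id>24509479</id></xref_link></project_link>
</root>"""


def pubmed_ids(xml_text):
    """Collect <id> values from every <xref_link> whose <db> is PUBMED."""
    root = ET.fromstring(xml_text)
    ids = []
    for link in root.iter("xref_link"):
        # findtext returns the default when the child element is missing.
        if link.findtext("db", default="").strip() == "PUBMED":
            id_text = link.findtext("id")
            if id_text:
                ids.append(id_text.strip())
    return ids


print(pubmed_ids(SAMPLE_XML))
```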

Demo:

You can find each db element and then take the sibling that follows it:

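from bs4 import BeautifulSoup

# Sample fragment copied from the question (stray closing tags included)
data = '''<db>PUBMED</db>
<id>25101644</id>
</xref_link>
</project_link>
<project_link>
<xref_link>
<db>PUBMED</db>
<id>24509479</id>'''

soup = BeautifulSoup(data, "html.parser")

# For every <db> tag, print its text and the text of the <id> tag that follows it
for x in soup.find_all('db'):
    print(x.text, x.find_next_sibling('id').text)

This prints:

PUBMED 25101644
PUBMED 24509479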

You can use a lookbehind directly in the regular expression. If the text is:

print text
PUBMED
25101644




PUBMED
24509479

then using a lookbehind for "PUBMED" plus the newline returns just the IDs:

>>> re.findall('(?<=PUBMED\n).+',text)
['25101644', '24509479']
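A lookbehind in Python's re module must have a fixed width, so it only matches when exactly one newline separates PUBMED from the number. As a small sketch of a variant (not taken from the answer), a capture group with \s* tolerates any amount of whitespace, including none, which also covers the whitespace-stripped text from the question. The sample strings are assumptions modelled on the output shown above.

```python
import re

# Assumed samples: the text as printed, and the same text with whitespace removed.
text = "PUBMED\n25101644\n\n\n\n\nPUBMED\n24509479\n"
squashed = "PUBMED25101644PUBMED24509479"

# Lookbehind as in the answer, narrowed from ".+" to digits only.
lookbehind_ids = re.findall(r"(?<=PUBMED\n)\d+", text)

# Capture-group variant: findall returns only the parenthesised 8 digits.
pattern = re.compile(r"PUBMED\s*(\d{8})")
capture_ids = pattern.findall(text)
squashed_ids = pattern.findall(squashed)

print(lookbehind_ids)
print(capture_ids)
print(squashed_ids)
```

With the capture group there is no need to slice the digits out of strings such as 'PUBMED25101644' afterwards.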

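>>> re.findall("(?...

Thank you, this works very well except for one last thing: even when I include the [\d]+ option, my output looks like this: [u'25101644', u'24509479'].

That is because text is a unicode string. You can try one more line to get a list of integers: integerResults = [int(item) for item in re.findall("(?...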
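The integer conversion suggested in the comments can be sketched as follows. The regex in the comment is truncated in the source, so the pattern below is an assumption based on the lookbehind shown earlier, and the sample text is likewise illustrative. The u prefix in the Python 2 output only marks the items as unicode strings; int() accepts them either way.

```python
import re

# Assumed sample; the u prefix is accepted by both Python 2 and Python 3.
text = u"PUBMED\n25101644\n\n\n\n\nPUBMED\n24509479\n"

# Assumed pattern (the one in the comment is cut off in the source).
matches = re.findall(r"(?<=PUBMED\n)\d+", text)

# Convert each matched digit string to an int.
integer_results = [int(item) for item in matches]

print(integer_results)
```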