Python 3.x BeautifulSoup: conditionally extracting href text


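Is there a way to conditionally grab hrefs using a regular expression? For example, below I only want the link text (TUBB1 and TUBB2) of the two anchors whose href matches: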

href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:*"
and just the link text of the anchor with this href:

href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
My end goal is to create a list like [("TUBB1", "TUBB2"), "P04690"]

Here is the HTML block I want to extract the text from:

<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
<a href="/pdb/protein/P04690" target="_blank">P04690</a>

I don't think it's pretty, but I guess this does the job:

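# assumes `i` is the BeautifulSoup object (or tag) that contains the links
z = i.find_all('a', href=True)  # href=True skips anchors without an href

for j in z:
    if "_gene_name" in j['href']:
        print(j.text)
    if "/pdb/protein" in j['href']:
        print(j.text)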
Output:

TUBB1
TUBB2
P04690
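
The loop above prints the matches rather than building the list asked for in the question. A minimal sketch of collecting them into that shape with the same substring checks might look like the following; the names `genes`, `accessions` and `result` are only illustrative, and the HTML is a shortened copy of the block from the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question
html = '''<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
<a href="/pdb/protein/P04690" target="_blank">P04690</a>'''

soup = BeautifulSoup(html, 'html.parser')

genes = []       # link texts of the gene-name anchors
accessions = []  # link texts of the /pdb/protein anchors

# href=True skips any <a> tag that has no href attribute
for a in soup.find_all('a', href=True):
    href = a['href']
    if '_gene_name' in href:
        genes.append(a.get_text())
    elif '/pdb/protein' in href:
        accessions.append(a.get_text())

# Group the gene names into a tuple and keep the first accession, if any
result = [tuple(genes), accessions[0] if accessions else None]
print(result)
```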

Based on the comments, here is a possible solution for selecting the needed elements:

from bs4 import BeautifulSoup

html = '''<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a class="querySearchLink" href="/search?q=rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession:P04690 AND rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name:UniProt">P04690</a>
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
<a href="/pdb/protein/P04690" target="_blank">P04690</a>'''

soup = BeautifulSoup(html, 'html.parser')

# select all text from elements where href begins with "/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"
part_1 = tuple(s.text for s in soup.select('[href^="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:"]'))

# select text from first element where href begins with "http://www.uniprot.org/uniprot/"
part_2 = soup.select_one('[href^="http://www.uniprot.org/uniprot/"]').text

# combine parts and print them:
print([part_1, part_2])

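Output:

[('TUBB1', 'TUBB2'), 'P04690']

If I understand correctly, you want to select all elements whose href starts with "/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:" and one element whose href starts with "http://www.uniprot.org/uniprot/"?

Exactly! I think I just managed to do it with a string search on the hrefs. I'm curious whether there is a regular expression that would catch them. Maybe a string search with `in` would break or take a long time on more expensive tasks.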
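
On the regex question: BeautifulSoup accepts a compiled regular expression as an attribute filter in `find_all` and `find`, so the same selection can be sketched with `re`. The two patterns below are one possible choice rather than the only way to write them, and the names `gene_href` and `uniprot_href` are made up for this example.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question
html = '''<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.ncbi_scientific_name:Chlamydomonas reinhardtii">Chlamydomonas reinhardtii</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB1">TUBB1</a>
<a class="querySearchLink" href="/search?q=rcsb_entity_source_organism.rcsb_gene_name.value:TUBB2">TUBB2</a>
<a href="http://www.uniprot.org/uniprot/P04690" target="_blank">P04690</a>
<a href="/pdb/protein/P04690" target="_blank">P04690</a>'''

soup = BeautifulSoup(html, 'html.parser')

# Matches any href containing the gene-name query field
gene_href = re.compile(r'rcsb_gene_name\.value:')
# Matches hrefs that start with the UniProt entry URL
uniprot_href = re.compile(r'^https?://www\.uniprot\.org/uniprot/')

# A compiled pattern passed as href= is searched against each href value
genes = tuple(a.get_text() for a in soup.find_all('a', href=gene_href))

# find() returns None when nothing matches, so guard before reading the text
uniprot_link = soup.find('a', href=uniprot_href)
accession = uniprot_link.get_text() if uniprot_link is not None else None

result = [genes, accession]
print(result)  # [('TUBB1', 'TUBB2'), 'P04690']
```

For fixed prefixes like these, the CSS `^=` selector or a plain `in` check is usually simpler; a regex mainly helps when the href pattern varies.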