How to extract information from HTML code in Python
This code is a single line of an .html file, extracted from that file by its unique identifier "|Rv0153c|". The line itself appears as the sentence string in the answer's code.
I think what you need is an HTML parser.
You can use Python and its regular expression library:
from bs4 import BeautifulSoup
import re
sentence = '<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv0153c|ptbB<br />MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE<br />VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND<br />AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL<br />DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA<br />AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG<br /></small><TR><td><b><big>Blastp: <a href="http://tuberculist.epfl.ch/blast_output/Rv0153c.fasta.out"> Pre-computed results</a></big></b><TR><td><b><big>TransMembrane prediction using Hidden Markov Models: <a href="http://tuberculist.epfl.ch/tmhmm/Rv0153c.html"> TMHMM</a></big></b><base target="_blank"/><TR><td><b><big>Genomic sequence</big></b><br /><br /><form action="dnaseq.php" method="get">'
# Remove every HTML tag; print() is a function in Python 3.
print(re.sub(r'<[^>]*>', '', sentence))
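The substitution above removes every tag, so the <br /> breaks disappear and the text of the later table cells (Blastp, TMHMM and so on) stays in the result. Here is a minimal sketch of one way to keep only the sequence record with its line breaks. It assumes the record always sits inside the first <small> element. The name extract_fasta and the shortened sample string are made up for illustration.

```python
import re

def extract_fasta(html_line):
    """Return the text of the first <small> block, one line per <br />."""
    # Assumption: the record sits inside the first <small>...</small> pair.
    match = re.search(r'<small[^>]*>(.*?)</small>', html_line, flags=re.S | re.I)
    if match is None:
        return ''
    block = match.group(1)
    # Turn every <br>, <br/> or <br /> into a newline before stripping tags.
    block = re.sub(r'<br\s*/?>', '\n', block, flags=re.I)
    # Drop any remaining tags, then trim whitespace around each line.
    block = re.sub(r'<[^>]*>', '', block)
    lines = [line.strip() for line in block.splitlines()]
    return '\n'.join(line for line in lines if line)

# Shortened, hypothetical sample in the same shape as the line in the question.
sample = ('<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv0153c|ptbB<br />'
          'MAVRELPGAW<br />VARRGPGRVP<br /></small>'
          '<TR><td><b><big>Blastp: <a href="#"> Pre-computed results</a></big></b>')
print(extract_fasta(sample))
```

Replacing the breaks before stripping the remaining tags is what preserves the one-line-per-row layout. A regex like this only suits markup as regular as this single line; for arbitrary HTML a real parser is the safer choice.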
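Hmm. You should probably look into regular expressions. Python has a re module that handles this for you.
You aren't actually using BeautifulSoup here; you're only using a regular expression.
Thanks for the valuable suggestion. The HTML parser suits my work very well; just three lines of code do exactly what I want.
Glad I could help.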
>M. tuberculosis H37Rv|Rv0153c|ptbB
MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE
VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND
AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL
DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA
AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG
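
A record laid out like the one above, with a header line followed by one sequence row per line, can also be pulled out with a real HTML parser instead of a regular expression. This is only a sketch using the standard library's html.parser. It assumes the record is the text inside the <small> element. The names SmallTextParser and fasta_from_html and the shortened sample are hypothetical.

```python
from html.parser import HTMLParser

class SmallTextParser(HTMLParser):
    """Collect the text inside <small>, treating <br> as a line break."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a <small> element
        self.parts = []   # collected text fragments and newlines

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <br /> also passes through here.
        if tag == 'small':
            self.depth += 1
        elif tag == 'br' and self.depth:
            self.parts.append('\n')

    def handle_endtag(self, tag):
        if tag == 'small' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def fasta_from_html(html_line):
    parser = SmallTextParser()
    parser.feed(html_line)
    parser.close()
    text = ''.join(parser.parts)
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)

# Shortened, hypothetical sample in the same shape as the line in the question.
sample = ('<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv0153c|ptbB<br />'
          'MAVRELPGAW<br />VARRGPGRVP<br /></small>'
          '<TR><td><b><big>Blastp: <a href="#"> Pre-computed results</a></big></b>')
print(fasta_from_html(sample))
```

The parser ignores everything outside <small>, so the link text from the later table cells never reaches the result.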