Python: How to extract information from HTML code


This code is a single line of an .html file, extracted from the HTML file by its unique identifier "|Rv0153c|":


I think what you need is an HTML parser.
You can use Python and the regular expression library, re:

from bs4 import BeautifulSoup
import re
sentence = '<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv0153c|ptbB<br />MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE<br />VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND<br />AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL<br />DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA<br />AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG<br /></small><TR><td><b><big>Blastp: <a href="http://tuberculist.epfl.ch/blast_output/Rv0153c.fasta.out"> Pre-computed results</a></big></b><TR><td><b><big>TransMembrane prediction using Hidden Markov Models: <a href="http://tuberculist.epfl.ch/tmhmm/Rv0153c.html"> TMHMM</a></big></b><base target="_blank"/><TR><td><b><big>Genomic sequence</big></b><br /><br /><form action="dnaseq.php" method="get">'   
# Remove every HTML tag and keep only the text between tags (Python 3 print call).
print(re.sub(r'<[^>]*>', '', sentence))
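
Stripping every tag this way also removes the <br /> separators, so the sequence lines run together, and the text of the trailing links (Blastp: ...) stays in the result. A minimal sketch of one way around that, assuming the line breaks matter: turn each <br> into a newline first, then remove the remaining tags. The function name strip_tags_keep_breaks and the shortened sample string are illustrative assumptions, not part of the original answer.

```python
import re


def strip_tags_keep_breaks(html):
    """Replace <br> tags with newlines, then drop all other tags."""
    # <br>, <br/> and <br /> all become a line break.
    text = re.sub(r'<br\s*/?>', '\n', html, flags=re.IGNORECASE)
    # Anything else that looks like a tag is removed.
    return re.sub(r'<[^>]*>', '', text)


# Shortened, hypothetical sample in the same shape as the line from the question.
sample = ('<TR><TD><small style=font-family:courier> '
          '>M. tuberculosis H37Rv|Rv0153c|ptbB<br />MAVRELPGAW<br />VARRGPG<br />'
          '</small><TR><td><b><big>Blastp: <a href="x"> Pre-computed results</a></big></b>')

for line in strip_tags_keep_breaks(sample).splitlines():
    print(line.strip())
```

The link text after the sequence is still printed, so this only helps when that trailing text is acceptable or filtered out afterwards.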

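You should probably look into regular expressions. Python has an re module that can handle this for you.
You are not actually using BeautifulSoup here; you are only using a regular expression.
Thank you for the valuable suggestion. The HTML parser suits my task very well: just three lines of code do exactly what I need.
Glad I could help.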
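
The answer's snippet imports BeautifulSoup but never calls it, so the result comes from the regular expression alone. For comparison, here is a minimal sketch of a parser-based version that collects only the text inside the first <small> element. It swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the class name SmallTextExtractor, the function extract_fasta, and the shortened sample are illustrative assumptions.

```python
from html.parser import HTMLParser


class SmallTextExtractor(HTMLParser):
    """Collect the text chunks found inside the first <small> element."""

    def __init__(self):
        super().__init__()
        self.in_small = False   # True while the parser is inside <small>
        self.done = False       # True once the first <small> has closed
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "small" and not self.done:
            self.in_small = True

    def handle_endtag(self, tag):
        if tag == "small" and self.in_small:
            self.in_small = False
            self.done = True

    def handle_data(self, data):
        # Each run of text between two tags (here: between <br /> tags)
        # arrives as one chunk; keep the non-empty ones.
        text = data.strip()
        if self.in_small and text:
            self.lines.append(text)


def extract_fasta(html):
    parser = SmallTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


# Shortened, hypothetical sample in the same shape as the line from the question.
sample = ('<TR><TD><small style=font-family:courier> '
          '>M. tuberculosis H37Rv|Rv0153c|ptbB<br />MAVRELPGAW<br />VARRGPG<br />'
          '</small><TR><td><b><big>Blastp: <a href="x"> Pre-computed results</a></big></b>')

print(extract_fasta(sample))
```

Because collection stops at </small>, the link text that follows the sequence is left out, and each <br />-separated chunk lands on its own line.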
>M. tuberculosis H37Rv|Rv0153c|ptbB
MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE
VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND
AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL
DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA
AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG