Capturing the data between specific tags in Python
I am fetching the content of a URL in Python, and I want to capture everything between <h1> and </h1>.
What I tried:
myString='''<h1>kgkgjgjgkjgkjgkj</h1>
<h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
dsfgdfgg
<h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
dfgdffdgf
<h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
dfgdfgdg
<h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
'''
if '<h1>' in myString:
    startString='<h1>'
    endString='</h1>'
    print myString[myString.find(startString)+len(startString):myString.find(endString)]
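I have multiple <h1> tags, but this only captures the data between the first pair.
How can I capture the data between all of the <h1> tags?
You can use BeautifulSoup for this. It is not included in the standard library, so you need to install it manually. You can install it easily with pip: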
pip install beautifulsoup4
Using the BeautifulSoup parser:
>>> from bs4 import BeautifulSoup
>>> myString='''<h1>kgkgjgjgkjgkjgkj</h1>
<h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
dsfgdfgg
<h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
dfgdffdgf
<h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
dfgdfgdg
<h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
'''
>>> soup = BeautifulSoup(myString)
>>> h1 = soup.select('h1')
>>> for i in h1:
...     print i.text
...
kgkgjgjgkjgkjgkj
kdfgggggggggggggggggggkgjgjgkjgkjgkj
kgkgjgjgkdfgdfgdgdfjgkjgkj
kgkgjgjsdssssssssssssssssssssgkjgkjgkj
kgkgjgjgkjgkjgkgggggggggggggggggggj
>>>
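If installing a third-party package is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is a different technique from the BeautifulSoup answer above, not a drop-in equivalent. The class name `H1Collector` and the shortened sample string are made up for illustration, and the code assumes Python 3.

```python
from html.parser import HTMLParser  # Python 3 standard library


class H1Collector(HTMLParser):
    """Hypothetical helper: collect the text inside every <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False   # True while we are between <h1> and </h1>
        self.headings = []   # one string per <h1> element

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Text outside <h1>...</h1> is ignored.
        if self.in_h1:
            self.headings[-1] += data


# Shortened stand-in for the myString sample in the question.
my_string = """<h1>first heading</h1>
some text
<h1>second heading</h1>
more text
<h1>third heading</h1>
"""

collector = H1Collector()
collector.feed(my_string)
collector.close()
print(collector.headings)
```

For anything beyond simple cases like this one, BeautifulSoup is likely to be less work.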
A working example with BeautifulSoup:
>>> import bs4
>>> myString='''<h1>kgkgjgjgkjgkjgkj</h1>
... <h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
... dsfgdfgg
... <h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
... dfgdffdgf
... <h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
... dfgdfgdg
... <h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
... '''
>>> soup = bs4.BeautifulSoup(myString)
>>> soup.find("h1").text
u'kgkgjgjgkjgkjgkj'
>>> soup.find_all("h1")
[<h1>kgkgjgjgkjgkjgkj</h1>, <h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>, <h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>, <h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>, <h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>]
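As the session above shows, `find` returns only the first match while `find_all` returns every match. A minimal sketch of turning the `find_all` result into a plain list of strings follows. It assumes `beautifulsoup4` is installed, uses a shortened sample string invented for illustration, and names the parser explicitly (`"html.parser"`, which ships with Python).

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Shortened stand-in for the myString sample in the question.
my_string = """<h1>first heading</h1>
some text
<h1>second heading</h1>
more text
<h1>third heading</h1>
"""

# Naming the parser explicitly keeps behaviour the same across machines.
soup = BeautifulSoup(my_string, "html.parser")

# find_all returns every matching tag; get_text() gives the text inside each.
headings = [tag.get_text() for tag in soup.find_all("h1")]
print(headings)
```

Passing `"lxml"` instead of `"html.parser"` should also work if the `lxml` package is installed.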
A simple list comprehension solution:
print [s.split('</h1>')[0] for s in myString.split('<h1>')[1:]]
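For comparison, the attempt in the question returns only the first heading because each `find` call searches from the beginning of the string. `str.find` accepts an optional start offset, so a loop can continue from where the previous match ended. The sketch below illustrates that idea. The function name `between_all` and the sample string are made up for illustration, and the approach ignores tag attributes and nesting, which an HTML parser would handle.

```python
def between_all(text, start, end):
    """Return every substring of text found between the start and end markers."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)      # search from the current offset
        if begin == -1:
            break                          # no further opening marker
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break                          # unmatched opening marker
        results.append(text[begin:finish])
        pos = finish + len(end)            # resume after this closing marker
    return results


my_string = "<h1>first</h1> a <h1>second</h1> b <h1>third</h1>"
print(between_all(my_string, "<h1>", "</h1>"))
```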
I would go with BeautifulSoup. Here is my attempt:
from bs4 import BeautifulSoup
import requests
url = 'http://accessibility.psu.edu/headingshtml/'
respons = requests.get(url).content
soup = BeautifulSoup(respons,'lxml')
h1tags = soup.find_all('h1')
for singleTag in h1tags:
    print singleTag.text
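This prints (in this case the page has only one h1 tag):
Heading Tags (H1, H2, H3, P) in HTML
Please share sample data that contains those multiple h1 tags.
If you intend to extract HTML content, you should use an HTML parser, which is much easier.
After being shown and told that you should use an HTML parser, why on earth are you still trying to implement it yourself with string searching?!
If you want to parse HTML, use an HTML parser: the clue is in the name.
Error: `/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. To get rid of this warning, change this: BeautifulSoup([your markup]) to this: BeautifulSoup([your markup], "lxml") markup_type=markup_type))`
@MortezaLSC So did you read the message and/or do what it suggests?
Ah! It doesn't return all the h1 tags, only one of them :(
@SIslam Both methods are shown: find returns only the first occurrence of the element, while find_all returns all matching elements.
Thanks. Based on what the OP needs, I assumed find was not required. Anyway, thanks.
Thanks. How can I get rid of the error: /usr/local/lib/python2.7/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. To get rid of this warning, change this: BeautifulSoup([your markup]) to this: BeautifulSoup([your markup], "lxml") markup_type=markup_type))
@MortezaLSC That is not an error. Did you consider reading it? Try `soup = BeautifulSoup(myString, "html.parser")`
@AvinashRaj `soup.select` is a clever alternative to `find_all`, hehe!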