Capturing data between specific tags in Python


I am fetching the content of a URL in Python, and I want to capture everything between <h1> and </h1>.

What I tried:

myString='''<h1>kgkgjgjgkjgkjgkj</h1>
<h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
dsfgdfgg
<h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
dfgdffdgf
<h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
dfgdfgdg
<h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
'''
if '<h1>' in myString:
    startString='<h1>'
    endString='</h1>'
    print myString[myString.find(startString)+len(startString):myString.find(endString)]
I have multiple h1 tags, but this captures only the data between the first pair of h1 tags.

How can I capture the data between all of the h1 tags?

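You can use BeautifulSoup for this. It is not included in the standard library, so you need to install it yourself, which is easy to do with pip: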

pip install beautifulsoup4

Use the BeautifulSoup parser:

>>> from bs4 import BeautifulSoup
>>> myString='''<h1>kgkgjgjgkjgkjgkj</h1>
<h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
dsfgdfgg
<h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
dfgdffdgf
<h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
dfgdfgdg
<h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
'''
>>> soup = BeautifulSoup(myString)
>>> h1 = soup.select('h1')
>>> for i in h1:
    print i.text


kgkgjgjgkjgkjgkj
kdfgggggggggggggggggggkgjgjgkjgkjgkj
kgkgjgjgkdfgdfgdgdfjgkjgkj
kgkgjgjsdssssssssssssssssssssgkjgkjgkj
kgkgjgjgkjgkjgkgggggggggggggggggggj
>>> 

A working example with BeautifulSoup:

>>> import bs4
>>> myString='''<h1>kgkgjgjgkjgkjgkj</h1>
... <h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
... dsfgdfgg
... <h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
... dfgdffdgf
... <h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
... dfgdfgdg
... <h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
... '''
>>> soup = bs4.BeautifulSoup(myString)
>>> soup.find("h1").text
u'kgkgjgjgkjgkjgkj'
>>> soup.find_all("h1")
[<h1>kgkgjgjgkjgkjgkj</h1>, <h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>, <h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>, <h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>, <h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>]
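
If installing a third-party package is not an option, roughly the same result can be had from the standard library's html.parser module. The following is only a sketch under that assumption, written for Python 3: the class name H1Collector and the shortened sample markup are made up for illustration and are not part of any library or of the original question.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):  # hypothetical helper name, not a library class
    """Collect the text content of every <h1> element."""

    def __init__(self):
        super().__init__()
        self.headings = []       # finished heading texts, in document order
        self._inside_h1 = False  # True while we are between <h1> and </h1>
        self._parts = []         # text fragments of the heading being read

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == 'h1':
            self._inside_h1 = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == 'h1' and self._inside_h1:
            self.headings.append(''.join(self._parts))
            self._inside_h1 = False

    def handle_data(self, data):
        # Text outside an <h1> element is skipped.
        if self._inside_h1:
            self._parts.append(data)


# Illustrative sample data (assumed, shorter than the question's string).
my_string = '''<h1>first heading</h1>
<h1 class="title">second <em>heading</em></h1>
some other text
<h1>third heading</h1>
'''

collector = H1Collector()
collector.feed(my_string)
collector.close()
print(collector.headings)
# → ['first heading', 'second heading', 'third heading']
```

Like find_all, this visits every h1 element rather than stopping at the first one, and it copes with attributes and nested inline tags, which plain string searching does not.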
A simple list comprehension solution:

print [s.split('</h1>')[0] for s in myString.split('<h1>')[1:]]
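
For readers on Python 3, the one-liner above might look like the sketch below, shown next to a regular-expression variant that does the same job. The regex form is an alternative I am adding, not something from the original answer, and the sample string is shortened, made-up data.

```python
import re

# Illustrative sample data (assumed, shorter than the question's string).
my_string = '''<h1>first heading</h1>
<h1>second heading</h1>
some other text
<h1>third heading</h1>
'''

# Python 3 form of the split-based approach: cut the string at every
# opening tag, then keep what precedes the closing tag in each piece.
by_split = [s.split('</h1>')[0] for s in my_string.split('<h1>')[1:]]

# Regular-expression variant: the non-greedy (.*?) stops at the nearest
# </h1>, and re.DOTALL lets a heading span more than one line.
by_regex = re.findall(r'<h1>(.*?)</h1>', my_string, flags=re.DOTALL)

print(by_split)  # → ['first heading', 'second heading', 'third heading']
print(by_regex)  # → ['first heading', 'second heading', 'third heading']
```

Both variants only match the literal text <h1>, so a tag with attributes or different letter case (for example <h1 class="title"> or <H1>) would be missed. That is one reason an HTML parser is usually the safer choice for real pages.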

I would go with BeautifulSoup. Here is my attempt:

from bs4 import BeautifulSoup
import requests

url = 'http://accessibility.psu.edu/headingshtml/'

respons = requests.get(url).content

soup = BeautifulSoup(respons,'lxml')

h1tags = soup.find_all('h1')

for singleTag in h1tags:
    print singleTag.text
This prints (in this case there is only one h1 tag):

Heading Tags (H1, H2, H3, P) in HTML


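Please share sample data that contains those multiple h1 tags.

If you are going to extract HTML content, you should use an HTML parser, which is much easier.

After being shown and told that you should use an HTML parser, why on earth are you trying to implement it yourself with string searching?! If you want to parse HTML, use an HTML parser; the clue is in the name.

Error: `/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. To get rid of this warning, change this: BeautifulSoup([your markup]) to this: BeautifulSoup([your markup], "lxml") markup_type=markup_type))`

@MortezaLSC So did you read the message and/or do what it suggests?

Ah! It doesn't return all the h1 tags, only one of them :(

@SIslam Both methods are shown: `find` returns only the first occurrence of the element, while `find_all` returns every element that matches.

Thanks. Based on what the OP needs, I assumed `find` was not required. Thanks anyway.

Thanks. How can I get rid of the error: /usr/local/lib/python2.7/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified. This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. To get rid of this warning, change this: BeautifulSoup([your markup]) to this: BeautifulSoup([your markup], "lxml") markup_type=markup_type))

@MortezaLSC That is not an error. Did you consider reading it? Try `soup = BeautifulSoup(myString, "html.parser")`.

@AvinashRaj `soup.select` is a smart alternative to `find_all`, hehe!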