How to iterate over a list and extract the text between quotes in Python 2.7.10


I'm trying to iterate over a long list (let's call it url_list) in which each item looks like this:

<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>
<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>
<a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>

and so on. I'd like to loop over the list, keep only the text between the first two quotes, and discard the rest, i.e.:

https://www.example.com/5th-february-2018/,
https://www.example.com/4th-february-2018/,
https://www.example.com/3rd-february-2018/,
https://www.example.com/2nd-february-2018/,


So essentially I'm trying to end up with a clean list of URLs. I haven't had much luck iterating over the list and splitting on the quotes; is there a better way? Is there a way to discard everything after the itemprop= string?
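One way to act on that last idea, sketched here under the assumption that every entry has the same shape as the examples above (this is not part of the original post): split each entry on " itemprop=", keep the part before it, and then take the text between the first pair of double quotes.

entry = '<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>'
# drop everything from " itemprop=" onwards, then grab the quoted href value
before_itemprop = entry.split(' itemprop=')[0]
url = before_itemprop.split('"')[1]
print url  # https://www.example.com/5th-february-2018/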

Have you tried using the split function to split on ", and then taking the second entry from the resulting list?

urls = []
for url_entry in url_list:
    # the href value sits between the first and second double quote
    url = url_entry.split('"')[1]
    urls.append(url)
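The same idea fits in a single list comprehension; this is just a more compact equivalent of the loop above:

# equivalent to the loop: keep the text between the first two double quotes of each entry
urls = [url_entry.split('"')[1] for url_entry in url_list]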
Using a regular expression:

import re

url_list = ['<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>', '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>']
for i in url_list:
    # capture from http(s):// up to (but not including) the closing quote,
    # so the trailing slash is preserved
    print re.search(r'(?P<url>https?://[^"\s]+)', i).group("url")
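If the markup is available as one big string rather than a list of entries, re.findall can pull every href value in a single pass. This is a sketch assuming the href values are always double-quoted:

import re

html_text = '<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>'
# findall returns each capture group match, i.e. every quoted href value
print re.findall(r'href="([^"]+)"', html_text)
# ['https://www.example.com/5th-february-2018/']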
That sounds a bit like the wrong way to go about it, though.

This becomes much easier if you parse the HTML with BeautifulSoup (or are already using it):

from bs4 import BeautifulSoup

html_text = '''<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>
<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>
<a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>
<a href="https://www.example.com/2nd-february-2018/" itemprop="url">2nd February 2018</a>'''

# pass an explicit parser to avoid bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(html_text, "html.parser")
# each <a> tag's href attribute is the URL we want
urls = [x['href'] for x in soup.find_all("a")]
for url in urls:
    print(url)
# https://www.example.com/5th-february-2018/
# https://www.example.com/4th-february-2018/
# https://www.example.com/3rd-february-2018/
# https://www.example.com/2nd-february-2018/
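If only the anchors marked with itemprop="url" should be kept, find_all can also filter on that attribute directly. A small sketch of that variant, reusing html_text and the import from the block above:

# keyword arguments to find_all match tag attributes, so only itemprop="url" anchors are kept
soup = BeautifulSoup(html_text, "html.parser")
urls = [a['href'] for a in soup.find_all("a", itemprop="url")]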