How to iterate over a list and extract the text between quotes in Python 2.7.10


I'm trying to iterate over a long list (let's call it url_list) in which each item looks like this:

<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>
<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>
<a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>

and so on. I'd like to loop over the list, keep only the text between the first two quotes, and discard the rest, i.e.:

https://www.example.com/5th-february-2018/,
https://www.example.com/4th-february-2018/,
https://www.example.com/3rd-february-2018/,
https://www.example.com/2nd-february-2018/,


So essentially I'm trying to end up with a clean list of URLs. I haven't had much luck iterating over the list and splitting on the quotes; is there a better way? Is there a way to discard everything after the itemprop= string?
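One way to act on that last idea, sketched here under the assumption that every entry has the same shape as the examples above (this is not part of the original post): split each entry on " itemprop=", keep the part before it, and then take the text between the first pair of double quotes.

entry = '<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>'
# drop everything from " itemprop=" onwards, then grab the quoted href value
before_itemprop = entry.split(' itemprop=')[0]
url = before_itemprop.split('"')[1]
print url  # https://www.example.com/5th-february-2018/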

Have you tried using the split function to split on ", and then taking the second entry from the resulting list?

urls = []
for url_entry in url_list:
    # the href value sits between the first and second double quote
    url = url_entry.split('"')[1]
    urls.append(url)
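The same idea fits in a single list comprehension; this is just a more compact equivalent of the loop above:

# equivalent to the loop: keep the text between the first two double quotes of each entry
urls = [url_entry.split('"')[1] for url_entry in url_list]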
Using a regular expression:

import re

url_list = ['<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>', '<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>']
for i in url_list:
    # capture from http(s):// up to (but not including) the closing quote,
    # so the trailing slash is preserved
    print re.search(r'(?P<url>https?://[^"\s]+)', i).group("url")
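If the markup is available as one big string rather than a list of entries, re.findall can pull every href value in a single pass. This is a sketch assuming the href values are always double-quoted:

import re

html_text = '<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>'
# findall returns each capture group match, i.e. every quoted href value
print re.findall(r'href="([^"]+)"', html_text)
# ['https://www.example.com/5th-february-2018/']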
That sounds a bit like the wrong way to go about it, though.

This becomes much easier if you parse the HTML with BeautifulSoup (or are already using it):

from bs4 import BeautifulSoup

html_text = '''<a href="https://www.example.com/5th-february-2018/" itemprop="url">5th February 2018</a>
<a href="https://www.example.com/4th-february-2018/" itemprop="url">4th February 2018</a>
<a href="https://www.example.com/3rd-february-2018/" itemprop="url">3rd February 2018</a>
<a href="https://www.example.com/2nd-february-2018/" itemprop="url">2nd February 2018</a>'''

# pass an explicit parser to avoid bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(html_text, "html.parser")
# each <a> tag's href attribute is the URL we want
urls = [x['href'] for x in soup.find_all("a")]
for url in urls:
    print(url)
# https://www.example.com/5th-february-2018/
# https://www.example.com/4th-february-2018/
# https://www.example.com/3rd-february-2018/
# https://www.example.com/2nd-february-2018/
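If only the anchors marked with itemprop="url" should be kept, find_all can also filter on that attribute directly. A small sketch of that variant, reusing html_text and the import from the block above:

# keyword arguments to find_all match tag attributes, so only itemprop="url" anchors are kept
soup = BeautifulSoup(html_text, "html.parser")
urls = [a['href'] for a in soup.find_all("a", itemprop="url")]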