Python 3.x: Removing unwanted HTML from href tags in Python


python-3.x web-scraping


I would like to be able to scrape a list of links. Because of the way the HTML is structured, I cannot do this directly with BeautifulSoup.

start_list = soup.find_all(href=re.compile('id='))

print(start_list)

[<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>,
 <a href="/movies/?id=actionjackson.htm"><b>Action Jackson</b></a>]

The idea was to loop over a list of unwanted fragments and remove every occurrence of them from start_list, so that only the links remain.

start_list = soup.find_all(href=re.compile('id='))

href_list = [i['href'] for i in start_list]

href is an attribute of the tag. find_all returns a bunch of tags, so just iterate over them and use tag['href'] to access the attribute.
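As a minimal end-to-end sketch of the approach above: the HTML string below is an assumption modeled on the output shown in the question (plus one hypothetical link without "id=" to show the filter working), not the real page.

```python
import re

from bs4 import BeautifulSoup

# Sample markup modeled on the question's output. The "/about.htm" link is a
# hypothetical extra, added only to show that the filter skips it.
html = """
<a href="/movies/?id=actofvalor.htm"><b>Act of Valor</b></a>
<a href="/movies/?id=actionjackson.htm"><b>Action Jackson</b></a>
<a href="/about.htm">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the tags whose href attribute contains "id=".
start_list = soup.find_all(href=re.compile("id="))

# Read the attribute value from each tag instead of stripping markup by hand.
href_list = [tag["href"] for tag in start_list]

print(href_list)
# ['/movies/?id=actofvalor.htm', '/movies/?id=actionjackson.htm']
```

Because the attribute value is read directly, there is no surrounding HTML left to remove afterwards.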

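To understand why [] is used, you should know that a tag's attributes are stored in a dictionary. From the Beautiful Soup documentation: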

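A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary: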

tag['class']
# u'boldest'

You can access that dictionary directly as .attrs:

tag.attrs
# {u'class': u'boldest'}
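The dictionary-style access can be tried on a link tag as well. This is a small sketch; the two-tag snippet is made up for illustration, and the second tag deliberately has no href to show what happens when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <a> with an href, one without.
html = '<a href="/movies/?id=actofvalor.htm">Act of Valor</a><a>no link</a>'
soup = BeautifulSoup(html, "html.parser")
with_href, without_href = soup.find_all("a")

# Subscript access looks the attribute up in the tag's attribute dictionary.
print(with_href["href"])   # /movies/?id=actofvalor.htm
print(with_href.attrs)     # {'href': '/movies/?id=actofvalor.htm'}

# Like a dict, a missing key raises KeyError with [] but returns None with .get().
print(without_href.get("href"))  # None
try:
    without_href["href"]
except KeyError:
    print("no href on this tag")
```

When some tags may lack the attribute, .get() (or filtering with find_all(href=True)) avoids the KeyError.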

The list comprehension is simple. In this case, it is equivalent to the following for loop:

href_list = []
for i in start_list:
    href_list.append(i['href'])
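To see that the two forms give the same result, here is a small sketch that uses plain dictionaries as stand-ins for tags. That substitution is an assumption made to keep the example free of third-party imports; tag['href'] on a BeautifulSoup tag is subscripted the same way.

```python
# Plain dicts stand in for the tags returned by find_all (illustrative only).
start_list = [
    {"href": "/movies/?id=actofvalor.htm"},
    {"href": "/movies/?id=actionjackson.htm"},
]

# General shape: [<expression> for <name> in <iterable>]
# Here the expression is i['href'], evaluated once for each item i.
comprehension_result = [i["href"] for i in start_list]

# The same thing written out as an explicit loop.
loop_result = []
for i in start_list:
    loop_result.append(i["href"])

print(comprehension_result == loop_result)  # True
```

The brackets in i['href'] are ordinary subscript syntax applied to each item; the outer brackets are what make the whole expression a list.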

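Post your desired output.

This is exactly what I needed. Could you explain the list comprehension to me? Specifically, why is the first part, ['href'], in brackets?

@Chace Mcguyer Please accept this answer to close the question.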
