Python: removing odd indentation when getting descriptions with Beautiful Soup
I have a bs4 program in which I collect descriptions for links. It first checks whether the page has a meta description tag and, if not, takes the description from a <p> tag.

The code is as follows:

from bs4 import BeautifulSoup
import requests


def find_title(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    with open('descrip.txt', 'a', encoding='utf-8') as f:
        description = (
            soup.find('meta', attrs={'name': 'og:description'})
            or soup.find('meta', attrs={'property': 'description'})
            or soup.find('meta', attrs={'name': 'description'})
        )
        if description:
            desc = description["content"]
        else:
            desc = soup.find_all('p')[0].getText()
            lengths = len(desc)
            index = 0
            while lengths == 1:
                index = index + 1
                desc = soup.find_all('p')[index].getText()
                lengths = len(desc)
            if lengths > 300:
                desc = soup.find_all('p')[index].getText()[0:300]
            elif lengths < 300:
                desc = soup.find_all('p')[index].getText()[0:lengths]
        print(desc)
        f.write(desc + '\n')


find_title('https://en.wikipedia.org/wiki/Portal:The_arts')
find_title('https://en.wikipedia.org/wiki/Portal:Biography')
find_title('https://en.wikipedia.org/wiki/Portal:Geography')
find_title('https://en.wikipedia.org/wiki/November_15')
find_title('https://en.wikipedia.org/wiki/November_16')
find_title('https://en.wikipedia.org/wiki/Wikipedia:Selected_anniversaries/November')
find_title('https://lists.wikimedia.org/mailman/listinfo/daily-article-l')
find_title('https://en.wikipedia.org/wiki/List_of_days_of_the_year')
find_title('https://en.wikipedia.org/wiki/File:Proclama%C3%A7%C3%A3o_da_Rep%C3%BAblica_by_Benedito_Calixto_1893.jpg')
find_title('https://en.wikipedia.org/wiki/First_Brazilian_Republic')
find_title('https://en.wikipedia.org/wiki/Empire_of_Brazil')
find_title('https://en.wikipedia.org/wiki/Pedro_II_of_Brazil')
find_title('https://en.wikipedia.org/wiki/Benedito_Calixto')
find_title('https://en.wikipedia.org/wiki/Rio_de_Janeiro')
find_title('https://en.wikipedia.org/wiki/Deodoro_da_Fonseca')
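
Is there a way to fix the odd indentation in the descriptions?

Pass strip=True to getText() (note: it is an alias of get_text()) and use a single space as the separator. For example:

get_text(strip=True, separator=' ')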
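
To see what the two arguments do, here is a minimal sketch on a small HTML fragment. The fragment and the helper name `clean_text` are made up for illustration and are not part of the original program; only `get_text(strip=True, separator=' ')` comes from the answer. The extra `" ".join(text.split())` step is an optional addition that collapses whitespace left inside a single text node, which `strip=True` alone does not touch.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for illustration; not taken from any real page.
html = """
<p>
    The <b>arts</b> are a
    <a href="#">wide range</a>
    of human practices.
</p>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Default behaviour: newlines and indentation from the markup are kept.
raw = p.get_text()

# strip=True trims each text fragment and drops empty ones;
# separator=" " joins the remaining fragments with a single space.
clean = p.get_text(strip=True, separator=" ")


def clean_text(tag, limit=300):
    """Hypothetical helper: cleaned text of `tag`, cut to `limit` characters."""
    text = tag.get_text(strip=True, separator=" ")
    # Collapse any whitespace runs that remain inside a single text node.
    return " ".join(text.split())[:limit]


print(repr(raw))
print(repr(clean))  # 'The arts are a wide range of human practices.'
print(clean_text(p, limit=8))
```

In the program above, this would mean replacing each `getText()` call with `get_text(strip=True, separator=' ')`. One thing to be aware of: the separator is inserted between every pair of text fragments, so punctuation that directly follows an inline tag may end up preceded by a space.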