Python: extracting the number of links from a website
I need to extract the number of links on a website (any site, just as an example). I tried using urlparse to extract the URL information, and then BeautifulSoup:
domain_name = urlparse(url).netloc
soup = BeautifulSoup(requests.get(url).content, "html.parser")
For each website, I need to save all the links found on it in a list. I want something like this:
URL Links
https://stackoverflow.com/questions/ask ['link1','link2','link3',...]
https://anotherwebsite.com/sport ['link1','link2','link3','link4']
https://last_example.es []
Can you explain how to get a result like this?
Let's try:
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    def get_all_links(url):
        # of course one needs to deal with the case when `requests` fails
        # but that's outside the scope here
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        # one entry per <a> tag; tags without an href yield ''
        return [a.attrs.get('href', '') for a in soup.find_all('a')]

    # sample data
    df = pd.DataFrame({'URL': ['https://stackoverflow.com/questions/ask']})
    df['Links'] = df['URL'].apply(get_all_links)
Output:
URL Links
0 https://stackoverflow.com/questions/ask [#, https://stackoverflow.com, /company, #, /t...
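The answer leaves the failing-request case out of scope. As a minimal sketch of one way to cover it, the variant below returns an empty list when a page cannot be fetched, which matches the `[]` row in the desired table. The names `links_from_html` and `get_all_links_safe` and the 10-second timeout are assumptions made for illustration, not part of the original answer.

```python
import requests
from bs4 import BeautifulSoup


def links_from_html(html):
    """Return the href of every <a> tag that actually has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href attribute
    return [a["href"] for a in soup.find_all("a", href=True)]


def get_all_links_safe(url, timeout=10):
    """Hypothetical variant of get_all_links: [] if the fetch fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.RequestException:
        return []
    return links_from_html(response.content)
```

It could be applied the same way as before, for example `df['Links'] = df['URL'].apply(get_all_links_safe)`. Whether a failed request should map to `[]` or to a missing value is a design choice; an empty list is indistinguishable from a page that has no links.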
Your question is too broad.
Food for thought: if you want to remove duplicates, change the return to: list(set(a.attrs.get('href', '') for a in soup.find_all('a'))), or, if you want to restrict what is returned to only a specific URL or part of one, you can do: list(set(a.get('href') for a in soup.find_all('a', href=re.compile('stackoverflow.com')))). The second form requires import re.
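The de-duplication and filtering ideas in the comment above can also be done after the hrefs have been collected, using only the standard library. The sketch below is one possible approach, not the commenter's: it resolves relative links against the page URL, keeps those whose domain matches `urlparse(url).netloc` (as in the question), and removes duplicates while preserving order. The function name `internal_links` is hypothetical.

```python
from urllib.parse import urljoin, urlparse


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url, keep same-domain links, drop duplicates."""
    domain_name = urlparse(page_url).netloc
    # skip empty hrefs, then turn relative links into absolute ones
    resolved = (urljoin(page_url, href) for href in hrefs if href)
    same_domain = (link for link in resolved
                   if urlparse(link).netloc == domain_name)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(same_domain))
```

Unlike `set`, this keeps the links in the order they appear on the page. Comparing `netloc` exactly means subdomains such as `meta.stackoverflow.com` are treated as a different site; that is an assumption of this sketch.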