Python: extract the number of links from a website

Tags: python, pandas, beautifulsoup

I need to extract the number of links from a website, for example (just as an example)

I have tried using urlparse to extract the URL information, and then BeautifulSoup:

domain_name = urlparse(url).netloc
soup = BeautifulSoup(requests.get(url).content, "html.parser")
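
For reference, urlparse only pulls the domain part out of the URL; a quick check using one of the URLs from the expected output below, just to illustrate what netloc holds:

from urllib.parse import urlparse

# one of the example URLs from the expected output, only to show what netloc contains
url = "https://stackoverflow.com/questions/ask"
print(urlparse(url).netloc)  # prints: stackoverflow.com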
I need to save all the links of each website in a list. I want something like this:

URL                                            Links
    https://stackoverflow.com/questions/ask    ['link1','link2','link3',...]
    https://anotherwebsite.com/sport           ['link1','link2','link3','link4']
    https://last_example.es                    []
Could you explain how I can get a result like this?

Let's try:

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_all_links(url):
    # of course one needs to deal with the case when `requests` fails,
    # but that's outside the scope here
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # collect the href of every <a> tag (empty string if the tag has no href)
    return [a.attrs.get('href', '') for a in soup.find_all('a')]

# sample data
df = pd.DataFrame({'URL': ['https://stackoverflow.com/questions/ask']})

df['Links'] = df['URL'].apply(get_all_links)
Output:

                                       URL                                              Links
0  https://stackoverflow.com/questions/ask  [#, https://stackoverflow.com, /company, #, /t...
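
Since the question is about the number of links, one possible follow-up (my own addition, not part of the answer above; Num_Links is just an illustrative column name) is to count the extracted links per URL:

# sketch: derive a link count from the 'Links' column built above
df['Num_Links'] = df['Links'].apply(len)
print(df[['URL', 'Num_Links']])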

Your question is too general. Food for thought: if you want to remove duplicates, change the return expression to list(set(a.attrs.get('href', '') for a in soup.find_all('a'))), or, if you want to restrict the results to a specific URL or section, you can do: list(set(a.get('href') for a in soup.find_all('a', href=re.compile('stackoverflow.com'))))
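
A minimal self-contained sketch of those two suggestions (the function names get_unique_links and get_filtered_links are just illustrative, not from the comment):

import re

import requests
from bs4 import BeautifulSoup


def get_unique_links(url):
    # de-duplicated hrefs (note: set() does not preserve order)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return list(set(a.attrs.get('href', '') for a in soup.find_all('a')))


def get_filtered_links(url, pattern='stackoverflow.com'):
    # keep only hrefs matching the pattern, de-duplicated
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return list(set(a.get('href') for a in soup.find_all('a', href=re.compile(pattern))))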