Python: counting specific links extracted from a web page
I am writing a Python script using BeautifulSoup. I need to crawl a website and count its unique links, ignoring any link that starts with "#". As an example, if the following links exist on a web page, for this case there would be only two unique links (after stripping the link information that follows the main domain).

Note: this is my first time using Python and web scraping. Thanks in advance for any help. Here is what I have tried so far:
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
count = 0
for link in soup.find_all('a'):
    print(link.get('href'))
    count += 1
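For the counting requirement in the question, a minimal sketch could filter out the "#" links and deduplicate the rest with a set. The HTML snippet below is invented for illustration, standing in for a fetched page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
html = """
<a href="#section1">anchor</a>
<a href="/about">About</a>
<a href="/about">About again</a>
<a href="/contact">Contact</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep hrefs that exist and do not start with "#"; a set deduplicates them
unique_links = {
    a["href"]
    for a in soup.find_all("a", href=True)
    if not a["href"].startswith("#")
}

print(len(unique_links))  # 2 (/about and /contact)
```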
There is a module called urllib.parse from which you can get the netloc of a URL. There is also a great newer HTTP library, requests-html, which can help you get all the links in the source:
from requests_html import HTMLSession
from collections import Counter
from urllib.parse import urlparse

session = HTMLSession()
r = session.get("the link you want to crawl")
unique_netlocs = Counter(urlparse(link).netloc for link in r.html.absolute_links)
for link in unique_netlocs:
    print(link, unique_netlocs[link])
You could also do it like this:
from bs4 import BeautifulSoup
from collections import Counter
import requests

soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)").text, "html.parser")

foundUrls = Counter([link["href"] for link in soup.find_all("a", href=lambda href: href and not href.startswith("#"))])
foundUrls = foundUrls.most_common()

for item in foundUrls:
    print("%s: %d" % (item[0], item[1]))
The soup.find_all line checks that each a tag has an href set and that it does not start with the # character. The Counter method counts the occurrences of each list item and most_common orders them by value. The for loop just prints the results.
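As a tiny illustration of the Counter/most_common step described above (the href values here are invented):

```python
from collections import Counter

# Hypothetical hrefs collected from a page
hrefs = ["/wiki/HTML", "/wiki/Python", "/wiki/HTML"]

# most_common() returns (item, count) pairs sorted by count, descending
counts = Counter(hrefs).most_common()
for href, n in counts:
    print("%s: %d" % (href, n))  # /wiki/HTML: 2, then /wiki/Python: 1
```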
My approach would be to use Beautiful Soup to find all the links, then determine which link redirects to which location:
import requests
import tldextract
from bs4 import BeautifulSoup

def get_count_url(url):  # get the number of links having the same domain and suffix
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    count = 0
    urls = {}  # dictionary for the domains
    # input_domain = url.split('//')[1].split('/')[0]
    # library to extract the exact domain (e.g. blog.bbc.com and bbc.com have the same domain)
    input_domain = tldextract.extract(url).domain + "." + tldextract.extract(url).suffix
    for link in soup.find_all('a'):
        word = link.get('href')
        if word:
            # same-website or same-domain calls
            if "#" in word or word[0] == "/":  # div call or same-domain call
                if input_domain not in urls:
                    urls[input_domain] = 1  # first encounter with the domain
                else:
                    urls[input_domain] += 1  # multiple encounters
            elif "javascript" in word:
                # javascript function calls (for sites that use modern JS frameworks to display information)
                if "JavascriptRenderingFunctionCall" not in urls:
                    urls["JavascriptRenderingFunctionCall"] = 1
                else:
                    urls["JavascriptRenderingFunctionCall"] += 1
            else:
                # main_domain = word.split('//')[1].split('/')[0]
                main_domain = tldextract.extract(word).domain + "." + tldextract.extract(word).suffix
                if main_domain.split('.')[0] == 'www':
                    main_domain = main_domain.replace("www.", "")  # removing the www
                if main_domain not in urls:  # maintaining the dictionary
                    urls[main_domain] = 1
                else:
                    urls[main_domain] += 1
            count += 1
    for key, value in urls.items():  # printing the dictionary for better readability
        print(key, value)
    return count
tldextract finds the proper domain name, and soup.find_all('a') finds the a tags. The if statements check for a same-domain redirect, a javascript redirect, or a redirect to another domain.
This works, but I only need to display the domain and the count for that domain. For example, if the links on the web page are www.foo.com/bar, www.foo.com/bash and www.foo.com/bar, the output would be: www.foo.com 3
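To get per-domain counts as described in this follow-up, one option (a sketch, with made-up example URLs) is to group the links by their urlparse netloc with a Counter:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical list of links found on a page
links = [
    "http://www.foo.com/bar",
    "http://www.foo.com/bash",
    "http://www.foo.com/bar",
]

# Count how many links fall under each domain (netloc)
domain_counts = Counter(urlparse(link).netloc for link in links)

for domain, count in domain_counts.items():
    print(domain, count)  # www.foo.com 3
```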