Python: Extracting a list of URLs from a URL using BeautifulSoup

python web-scraping


I want to extract information about website similarity from this link:

I am looking at class='site' and trying to extract the information from

<a href="/siteinfo/ebay.com" class="truncation">ebay.com</a>

I tried

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
print([item.get_text(strip=True) for item in soup.select("span.site")])

But this does not seem to be enough to get the information, apparently because of some wrong argument in the code.

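Your CSS selector is a good start, but it is too narrow. The CSS selectors you should use are:

Websites: #card_mini_audience .site>a
Scores: #card_mini_audience .overlap>.truncation

These selectors narrow the focus to the div that holds the table, then use the class names to extract the information you need.

I have attached some sample code below that solves your problem. I am only printing the results to the screen, but that can easily be changed to do whatever you want with the values.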

from bs4 import BeautifulSoup
import requests

#Getting the website and processing it
url = "https://www.alexa.com/siteinfo/amazon.com"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

#Using CSS Selectors to grab content
websites = soup.select("#card_mini_audience .site>a")   #Selects the websites in the table
scores = soup.select("#card_mini_audience .overlap>.truncation")    #Selects the corresponding scores

#Goes through the list and extracts just the text
websites = [website.text.strip() for website in websites]
scores = [float(score.text.strip()) for score in scores]    #Converts the scores to floats

#Ordinary print to screen. You can change this to add to a dataframe or whatever else you want for your project
for pair in zip(websites, scores):
    print(pair)

The output looks like this:

('ebay.com', 70.1)
('pinterest.com', 54.7)
('wikipedia.org', 51.3)
('facebook.com', 50.4)
('reddit.com', 49.6)
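
To see why the question's `span.site` selector comes back empty while the two selectors above match, they can be tried offline on a small hand-written fragment. This is only a sketch: the markup in `SAMPLE_HTML` is an assumption pieced together from the selectors and the `<a>` tag quoted in the question, not a copy of the real page, and `SAMPLE_HTML`, `BASE_URL`, and `extract_pairs` are hypothetical names. It also reads each link's `href` and resolves it with `urljoin`, in case the URLs themselves are wanted rather than just the link text.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: an assumed shape for the table, not copied from the real page.
SAMPLE_HTML = """
<div id="card_mini_audience">
  <div class="site"><a href="/siteinfo/ebay.com" class="truncation">ebay.com</a></div>
  <div class="overlap"><span class="truncation">70.1</span></div>
  <div class="site"><a href="/siteinfo/pinterest.com" class="truncation">pinterest.com</a></div>
  <div class="overlap"><span class="truncation">54.7</span></div>
</div>
"""

BASE_URL = "https://www.alexa.com"  # host taken from the answer's example URL


def extract_pairs(html, base_url=BASE_URL):
    """Return (site name, absolute link, score) tuples from the assumed markup."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("#card_mini_audience .site>a")
    scores = soup.select("#card_mini_audience .overlap>.truncation")
    return [
        (
            link.get_text(strip=True),
            urljoin(base_url, link["href"]),  # turn the relative href into a full URL
            float(score.get_text(strip=True)),
        )
        for link, score in zip(links, scores)
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# The question's selector needs a <span> carrying class "site";
# in this fragment that class sits on a <div>, so nothing matches.
print(soup.select("span.site"))

for row in extract_pairs(SAMPLE_HTML):
    print(row)
```

If the real page nests the elements differently, only the selector strings should need adjusting. The `zip` pairing relies on both selectors returning the same number of elements in the same order.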


It seems you need span.truncation, a.truncation, or div.site

Thank you for your comment, OneCricketeer. I can see the spans for the overlap score and the site in the inspect tool in Google Chrome. I don't see the tags you mentioned

The page uses JavaScript to add elements - but BeautifulSoup and requests can't run JavaScript - you may need to control a real web browser, which can run JavaScript

That is not the case, @furas. While the page does use JS for some functionality, the table the OP is referring to loads normally and can be detected without the need for a headless browser

a.truncation is the element you showed in the question. The scores look like 38.0, so span.truncation. As for the site class, that is only on div elements
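
One way to settle the disagreement in the comments about whether the table is served in the HTML or added later by JavaScript is to count how many elements each selector matches in the body that `requests` receives. Below is a minimal sketch, assuming the selectors from the answer. `count_matches` is a hypothetical helper, and both HTML strings are made-up stand-ins for a response body rather than real page content.

```python
from bs4 import BeautifulSoup


def count_matches(html, selectors):
    """Map each CSS selector to the number of elements it matches in html."""
    soup = BeautifulSoup(html, "html.parser")
    return {selector: len(soup.select(selector)) for selector in selectors}


SELECTORS = [
    "#card_mini_audience .site>a",
    "#card_mini_audience .overlap>.truncation",
]

# Made-up body where the table is already present in the HTML.
static_html = (
    '<div id="card_mini_audience">'
    '<div class="site"><a class="truncation">ebay.com</a></div>'
    '<div class="overlap"><span class="truncation">70.1</span></div>'
    "</div>"
)

# Made-up body where the container is empty and a script would fill it in the browser.
scripted_html = '<div id="card_mini_audience"></div><script>/* rows added later */</script>'

static_counts = count_matches(static_html, SELECTORS)
scripted_counts = count_matches(scripted_html, SELECTORS)
print(static_counts)
print(scripted_counts)
```

Running the same helper on `requests.get(url).text` gives a quick signal: non-zero counts suggest the elements are in the served HTML and no browser automation is needed, while all zeros point toward JavaScript rendering or a blocked request.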