Getting website traffic from semrush with Beautiful Soup in Python


I'm trying to scrape website traffic numbers from semrush.com.

My current code using BeautifulSoup is:

from bs4 import BeautifulSoup
import urllib.request  # "import urllib" alone does not reliably load the request submodule

req = urllib.request.Request('https://www.semrush.com/info/burton.com', headers={'User-Agent': 'Magic Browser'})
response = urllib.request.urlopen(req)
raw_data = response.read()
response.close()

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(raw_data, 'html.parser')

I've been trying

data = soup.find_all("a", {"href": "/info/burton.com+(by+organic)"})

or

data = soup.find_all("span", {"class": "sem report counter"})

with no luck.

I can see the number I want on the web page. Is there a way to pull this information out? I don't see it in the HTML that I download.
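A quick way to tell whether the number is really absent from the downloaded HTML, rather than just missed by the selector, is to look for the link directly in the raw markup. Below is a minimal sketch using only the standard library. The two HTML strings are made-up stand-ins for "what urllib downloads" and "what the browser shows after scripts run", not actual semrush responses, and `LinkCollector` / `find_links` are hypothetical helper names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and visible text of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []            # list of (href, text) tuples
        self._current_href = None  # href of the <a> we are currently inside
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def find_links(raw_html, href):
    """Return the text of every <a> whose href equals the given value."""
    parser = LinkCollector()
    parser.feed(raw_html)
    return [text for link, text in parser.links if link == href]


TARGET = "/info/burton.com+(by+organic)"

# Hypothetical examples: a page whose content is filled in later by JavaScript,
# and the same page after a browser has rendered it.
static_html = '<html><body><div id="app"></div><script src="/app.js"></script></body></html>'
rendered_html = '<html><body><a href="/info/burton.com+(by+organic)"> 356K </a></body></html>'

print(find_links(static_html, TARGET))    # no matching link in the raw markup
print(find_links(rendered_html, TARGET))  # the link text is present once rendered
```

If the first kind of result (an empty list) is what comes back for the real download, no BeautifulSoup selector will find the value, because it is simply not in the markup that was fetched.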

I went the extra mile and put together a working example of how you can scrape that page with selenium. Install selenium and try it out:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = 'https://www.semrush.com/info/burton.com'  # your url
options = Options()  # set up options
options.add_argument('--headless')  # run Chrome without a visible window

# note: the chromedriver path depends on where chromedriver is located on your machine
# (Selenium 4 style; Selenium 3 used executable_path= and chrome_options= instead)
driver = webdriver.Chrome(service=Service('/opt/ChromeDriver/chromedriver'), options=options)

driver.get(url)  # load the page
driver.implicitly_wait(1)  # wait up to 1 second for elements to appear when locating them
elements = driver.find_elements(By.XPATH, '//a[@href="/info/burton.com+(by+organic)"]')  # grab the links you wanted

for e in elements:
    print(e.get_attribute('text').strip())  # print text fields

driver.quit()  # close the driver when you're done

The output I see in my terminal:

356K
6.5K
59.3K
$usd305K
Organic keywords
Organic
Top Organic Keywords
View full report
Organic Position Distribution
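The scraped values come back as display strings such as 356K, mixed in with plain labels. If you want numbers you can compute with, something along these lines may help. It is only a sketch: `parse_abbreviated` is a hypothetical helper, and it assumes K, M and B mean thousand, million and billion, which the page itself does not confirm.

```python
import re
from decimal import Decimal

# Assumed meaning of the suffixes; adjust if the site uses them differently.
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

# A number (optionally with decimals) and an optional suffix at the end of the string.
_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([KMB])?$", re.IGNORECASE)


def parse_abbreviated(text):
    """Turn strings like '6.5K' into an int; return None for non-numeric labels."""
    match = _PATTERN.search(text.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    multiplier = _MULTIPLIERS[suffix.upper()] if suffix else 1
    # Decimal avoids float rounding surprises; int() drops any fractional remainder.
    return int(Decimal(number) * multiplier)


# Strings copied from the terminal output above.
for raw in ["356K", "6.5K", "59.3K", "$usd305K", "Organic keywords"]:
    print(raw, "->", parse_abbreviated(raw))
```

Any currency prefix is simply skipped, and labels without a trailing number come back as `None`, so they are easy to filter out.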

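I might be wrong, but I'm pretty sure you can't scrape this page without a tool that can handle dynamic content loaded via JavaScript code. You'll need to use something like selenium in headless mode.

Good to know, thanks @Dascienz

Wow, this is above and beyond. Thank you 1000 times. One last question: if I run this against the site many times, will it cause any problems?

The only issue is that you should be polite with your web scraping. Identify your scraper by passing a dummy User-Agent, and throttle your download speed by adding time delays between requests. I can't guarantee you won't get banned if you run an extensive job, though. If you haven't already, I'd look into the scrapy library for a more robust crawler.

I assume I'd add headers={'User-Agent': dummy} to driver.get. What should I put for the dummy part? I'm new to web scraping. Also, what is a polite rate limit? Should I wait a minute? 5 to 10 minutes between each call to the site?

Actually, I misspoke. You don't need to pass headers, because selenium uses your browser (Chrome in this case) to make the requests. Still, make sure your request rate isn't intrusive. Happy scraping!

Thanks so much @Dascienz! I appreciate it.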
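To make the rate-limiting advice concrete, here is a minimal sketch of pausing for a random interval between page loads. The thread never settles on a specific delay, so the 5 to 10 second default below is purely an assumption, and `fetch_politely` is a hypothetical helper name. The `fetch` argument stands in for whatever does the real work, for example a function that calls `driver.get(url)` and reads the elements.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=5.0, max_delay=10.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    min_delay and max_delay are in seconds; the defaults are an assumed
    starting point, not a limit published by any site.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, a random pause before each later one.
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results


# Demo with a fake fetch function and a fake sleep, so nothing is downloaded
# and the example finishes instantly. The URLs are placeholders.
pauses = []
demo = fetch_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: len(url),
    sleep=pauses.append,
)
print(demo)
print(len(pauses), "pauses requested")
```

Randomizing the interval keeps the requests from arriving on a rigid schedule. In real use you would leave `sleep` at its default so the pauses actually happen.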