Python: Scraping Google Search with BeautifulSoup


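I want to scrape multiple pages of Google search results. So far I can only just manage to scrape the first page. How can I scrape several pages?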

from bs4 import BeautifulSoup
import requests
import urllib.request
import re
from collections import Counter

def search(query):
    url = "http://www.google.com/search?q="+query

    text = []
    final_text = []

    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text,"html.parser")

    for desc in soup.find_all("span",{"class":"st"}):
        text.append(desc.text)

    for title in soup.find_all("h3",attrs={"class":"r"}):
        text.append(title.text)

    for string in text:
        string  = re.sub("[^A-Za-z ]","",string)
        final_text.append(string)

    count_text = ' '.join(final_text)
    res = Counter(count_text.split())

    keyword_Count = dict(sorted(res.items(), key=lambda x: (-x[1], x[0])))

    for x,y in keyword_Count.items():
        print(x ," : ",y)


search("girl")

As the comments above say, you need the URL of the next page, and you need to put your code in a loop.

def search(query):
    url = "https://www.google.com/search?hl=en&q=" + query
    while url:
        text = []
        ....
        ....
        for x,y in keyword_Count.items():
            print(x ," : ",y)

        # get next page url
        url = soup.find('a', id='pnnext')
        if url:
            url = 'https://www.google.com' + url['href']
        else:
            print('no next page, loop ended')
            break

For soup.find('a', id='pnnext') to work, you may need to set a User-Agent header on the request.
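As a minimal sketch of what setting a User-Agent can look like with requests, the snippet below prepares (but does not send) a search request with a browser-like header. The helper name `build_search_request` and the choice of the `hl=en` parameter are illustrative assumptions, and the User-Agent value is only an example string; Google may still return different markup or block automated traffic.

```python
import requests

# Browser-like User-Agent. Without one, Google may return markup
# that lacks the a#pnnext link (assumption: a common browser UA is enough).
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
    )
}


def build_search_request(query):
    """Prepare a GET request for a Google search without sending it."""
    request = requests.Request(
        "GET",
        "https://www.google.com/search",
        params={"hl": "en", "q": query},  # requests URL-encodes the values
        headers=HEADERS,
    )
    return request.prepare()


prepared = build_search_request("coca cola")
print(prepared.url)
print(prepared.headers["User-Agent"])

# To actually fetch the page (network access required):
# with requests.Session() as session:
#     html = session.send(prepared, timeout=10).text
```

Passing the query through `params` instead of concatenating it into the URL also takes care of spaces and other special characters.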

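Another option is to request each results page directly through the start parameter, which is the zero-based offset of the first result (10 results per page):

url = "http://www.google.com/search?q=" + query + "&start=" + str((page - 1) * 10)

The code below instead does the actual pagination by following the "Next" button link: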

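from bs4 import BeautifulSoup
import requests, urllib.parse
import lxml  # parser backend used by BeautifulSoup below

def print_extracted_data_from_url(url):
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text

    soup = BeautifulSoup(response, 'lxml')

    # The CSS class names below are the ones Google used when this answer was written
    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')
    print()

    for container in soup.find_all('div', class_='tF2Cxc'):
        head_text = container.find('h3', class_='LC20lb DKV0Md').text
        head_sum = container.find('div', class_='IsZvec').text
        head_link = container.a['href']
        print(head_text)
        print(head_sum)
        print(head_link)
        print()

    # The "Next" link, or None on the last page
    return soup.select_one('a#pnnext')

def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=coca cola')

    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com', next_page_node['href'])
        next_page_node = print_extracted_data_from_url(next_page_url)

scrape()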

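Partial output:

Results via beautifulsoup

Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=coca cola

The Coca-Cola Company: Refresh the World. Make a Difference
We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way. Careers · Contact Us · Jobs at Coca-Cola · Our Company
https://www.coca-colacompany.com/home

Coca-Cola
2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
https://www.coca-cola.com/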
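
If you would rather compute page URLs from the start offset than follow the "Next" link, the idea can be sketched with the standard library alone. The function name `page_url` and the constant `RESULTS_PER_PAGE` are illustrative assumptions; the value 10 matches the `(page - 1) * 10` arithmetic in the one-line URL snippet earlier in this thread.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: Google's default page size


def page_url(query, page):
    """Return the search URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # start is the zero-based offset of the first result on the page
    params = {"q": query, "start": (page - 1) * RESULTS_PER_PAGE}
    return "https://www.google.com/search?" + urlencode(params)


for page in range(1, 4):
    print(page_url("coca cola", page))
```

Unlike plain string concatenation, `urlencode` escapes spaces and other special characters in the query.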

Alternatively, you can use SerpApi to do this. It's a paid API with a free trial of 5,000 searches.

Code to integrate:

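import os
from serpapi import GoogleSearch

def scrape():
    params = {
        "engine": "google",
        "q": "coca cola",
        "api_key": os.getenv("API_KEY"),
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(f"Current page: {results['serpapi_pagination']['current']}")

    for result in results["organic_results"]:
        print(f"Title: {result['title']}\nLink: {result['link']}\n")

    # Keep requesting pages while the response advertises a next page
    while 'next' in results['serpapi_pagination']:
        search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
        results = search.get_dict()

        print(f"Current page: {results['serpapi_pagination']['current']}")

        for result in results["organic_results"]:
            print(f"Title: {result['title']}\nLink: {result['link']}\n")

scrape()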

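Partial output:

Results from SerpApi

Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=coca cola

The Coca-Cola Company: Refresh the World. Make a Difference
We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way. Careers · Contact Us · Jobs at Coca-Cola · Our Company
https://www.coca-colacompany.com/home

Coca-Cola
2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
https://www.coca-cola.com/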

Disclaimer: I work for SerpApi.

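Scrape the link to the next page, call requests.get(href_for_next_page), then rinse and repeat.

I suggest you read the book "Web Scraping with Python". I'm sure you can find a PDF somewhere online, but I would buy it as well. Page 68 has good information on this topic. You should, however, put your code in a loop and limit how many times it runs; otherwise it will run endlessly and put a heavy load on the server's resources.

@Kamikaze_goldfish With Google you have to set a limit, not because you would crash the server, but because if you send requests endlessly Google will simply block your IP for hours.

Yes, I should have clarified that. Most websites will blacklist you.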
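
The advice in these comments (follow the next-page link, but cap the number of requests and pause between them) can be sketched as a bounded loop. Everything named here is an illustrative assumption: `crawl`, its `fetch_page` callback, the default of 5 pages and the 2-second delay are not taken from the answers above, and the demo runs against hard-coded fake pages on example.com so that it works offline.

```python
import time
from urllib.parse import urljoin


def crawl(start_url, fetch_page, max_pages=5, delay_seconds=2.0):
    """Follow next-page links, making at most max_pages requests.

    fetch_page(url) must return (results, next_href), where next_href
    is the href of the "Next" link, or None on the last page.
    """
    all_results = []
    url = start_url
    for page_number in range(max_pages):
        results, next_href = fetch_page(url)
        all_results.extend(results)
        if next_href is None:
            break  # no further pages
        url = urljoin(url, next_href)  # resolve a relative href
        if page_number < max_pages - 1:
            time.sleep(delay_seconds)  # pause before the next request
    return all_results


# Offline stand-in for a real fetcher (hypothetical data).
FAKE_PAGES = {
    "https://example.com/search?q=x": (["a", "b"], "/search?q=x&start=10"),
    "https://example.com/search?q=x&start=10": (["c"], "/search?q=x&start=20"),
    "https://example.com/search?q=x&start=20": (["d"], None),
}

limited = crawl("https://example.com/search?q=x", FAKE_PAGES.get,
                max_pages=2, delay_seconds=0)
print(limited)
```

A real `fetch_page` would download the URL with requests, parse it with BeautifulSoup, and return the extracted results together with the href of the `a#pnnext` element, or None when that element is missing.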