
Using Beautiful Soup to extract data/links from a Google search


Good evening, friends.

I am trying to ask Google a question and pull all the relevant links from the resulting search query (for example, if I search "site:wikipedia.com Thomas Jefferson", it gives me wiki.com/jeff, wiki.com/tom, etc.).

Here is my code:

from bs4 import BeautifulSoup
from urllib2 import urlopen

query = 'Thomas Jefferson'

query.replace (" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes

soup = BeautifulSoup(urlopen("https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

for item in soup.find_all('h3', attrs={'class' : 'r'}):
    print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results
The goal here is to set the query variable, have Python submit the query to Google, and then have BeautifulSoup grab all of the "green" links, if you will.

I only want to pull the green links, in their entirety. Strangely, Google's source code is "hidden" (a symptom of their search architecture), so BeautifulSoup can't just pull an href out of an h3 tag. I can see the h3 hrefs in Inspect Element, but not in View Source.

My question: how do I pull the top five most relevant green links from Google via BeautifulSoup if I can't access their source code, only Inspect Element?

PS: to give an idea of what I'm trying to achieve, I've found two fairly close Stack Overflow questions similar to mine:


This won't work with a hash search (`#q=site:wikipedia.com`, like you have) because that loads the data via AJAX rather than serving you the full, parseable HTML with the results - you should use this instead:

soup = BeautifulSoup(urlopen("https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query), "html.parser")

For reference, I disabled JavaScript and performed a Google search to get this URL structure.


I got a different URL than Rob M. when I tried searching with JavaScript disabled -

https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw
To make this work for any query, you should first make sure there are no spaces in your query (that's why you're getting a 400: Bad Request). You can do that with `quote_plus()`:

This will encode all spaces as plus signs, creating a valid URL.
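In Python 2 that helper is `urllib.quote_plus`; the Python 3 equivalent lives in `urllib.parse`. A quick sketch:

```python
from urllib.parse import quote_plus  # Python 2: urllib.quote_plus

# Spaces become '+', and other unsafe characters are percent-encoded.
query = quote_plus("site:wikipedia.com Thomas Jefferson")
print(query)  # site%3Awikipedia.com+Thomas+Jefferson
```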

However, this won't work with urllib - you'll get a 403: Forbidden. I got it working by using the `requests` module, like this:

import requests
import urllib
from bs4 import BeautifulSoup

query = 'Thomas Jefferson'
query = urllib.quote_plus(query)  # Python 2; on Python 3 use urllib.parse.quote_plus

r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results
Printing the links gives:

print links
#  [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
#   u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
#   u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
#   u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
#   u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
#   u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
#   u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']
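The trailing "#Limiter code to only pull top 5 results" comment above is never actually implemented; a list slice is enough. A sketch with a few of the URLs from the output (trimmed of their tracking parameters for readability):

```python
links = [
    "http://en.wikipedia.com/wiki/Thomas_Jefferson",
    "http://en.wikipedia.com/wiki/Sally_Hemings",
    "http://en.wikipedia.com/wiki/Monticello",
    "http://en.wikipedia.com/wiki/Thomas_Jefferson_University",
    "http://en.wikipedia.com/wiki/Jane_Randolph_Jefferson",
    "http://en.wikipedia.com/wiki/Isaac_Jefferson",
]

# Slicing never raises, even when the list has fewer than five items.
top_five = links[:5]
print(len(top_five))  # 5
```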


If you're getting empty results, you may need to specify a `user-agent`. That could be one of the reasons. I also simplified the code a little and removed the `query` variable.

Code and tests:
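The code block for this answer did not survive extraction. As a rough sketch of the described approach (the header value and CSS class names here are assumptions - Google's markup changes often), with a canned HTML snippet standing in for the live page so it runs offline:

```python
from bs4 import BeautifulSoup

# A browser-like User-Agent; without one, Google may serve an empty or
# degraded page, which is one reason results come back blank.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# The live fetch would look something like this (commented out to stay offline):
# import requests
# html = requests.get("https://www.google.com/search",
#                     params={"q": "site:wikipedia.com thomas edison"},
#                     headers=headers, timeout=10).text

# Stand-in for Google's result markup; the real class names will differ.
html = """
<div class="result"><a href="https://en.wikipedia.com/wiki/Edison_screw"><h3>Edison screw</h3></a></div>
<div class="result"><a href="https://en.wikipedia.com/wiki/Phonograph_cylinder"><h3>Phonograph cylinder</h3></a></div>
"""

soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.select(".result a")]
print(links)
```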

Output:

https://en.wikipedia.com/wiki/Edison,_New_Jersey
https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
https://www.wikipedia.com/wiki/Thomas_E._Murray
https://en.wikipedia.com/wiki/Incandescent_light_bulb
https://en.wikipedia.com/wiki/Phonograph_cylinder
https://en.wikipedia.com/wiki/Emile_Berliner
https://wikipedia.com/wiki/Consolidated_Edison
https://www.wikipedia.com/wiki/hello
https://www.wikipedia.com/wiki/Tom%20Alston
https://en.wikipedia.com/wiki/Edison_screw

Alternatively, you can use a Google search API from SerpApi:

Part of the JSON output:

{
 "position": 1,
 "title": "Thomas Edison - Wikipedia",
 "link": "https://en.wikipedia.org/wiki/Thomas_Edison",
 "displayed_link": "en.wikipedia.org › wiki › Thomas_Edison",
 "snippet": "Thomas Alva Edison (February 11, 1847 – October 18, 1931) was an American inventor and businessman who has been described as America's greatest ..."
}
Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "site:wikipedia.com thomas edison",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Link: {result['link']}")
Output:

Link: https://en.wikipedia.com/wiki/Edison,_New_Jersey
Link: https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
Link: https://www.wikipedia.com/wiki/Thomas_E._Murray
Link: https://en.wikipedia.com/wiki/Incandescent_light_bulb
Link: https://en.wikipedia.com/wiki/Phonograph_cylinder
Link: https://en.wikipedia.com/wiki/Emile_Berliner
Link: https://wikipedia.com/wiki/Consolidated_Edison
Link: https://www.wikipedia.com/wiki/hello
Link: https://www.wikipedia.com/wiki/Tom%20Alston
Link: https://en.wikipedia.com/wiki/Edison_screw
Disclaimer: I work for SerpApi.




When I do this, I get an `HTTPError: HTTP Error 400: Bad Request`.
Thank you so much for the excellent response! But I have to ask: if I wanted to assign each link to a new variable, how would I do that? x = wiki.com/tom, y = wiki.com/jeff, and so on. Thanks again for replying!

Rather than printing them, you could keep them all in a list, or store them in a dictionary.

I'm fairly new to Python and not quite sure how to do that, let alone integrate it with the BeautifulSoup output format. Would you mind pointing me in the right direction? I've accepted your answer :)

I've updated my answer to include a `links` list and shown the output - it's just a list of URLs as strings, which you can then pass to a new `requests.get()` call or do whatever else you want with.

Thanks again! Everything works great - I wish there were a Stack Overflow version of "Reddit Gold" I could give you :D Take care, and thanks again!
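As the reply suggests, rather than binding each URL to its own variable (x, y, ...), keep them in a list or a dict. A sketch with two of the URLs from the answer, keyed by the last path segment so you can look them up by article name:

```python
links = [
    "http://en.wikipedia.com/wiki/Thomas_Jefferson",
    "http://en.wikipedia.com/wiki/Sally_Hemings",
]

first = links[0]  # access by position

# Build a dict keyed by each URL's final path segment.
by_title = {url.rsplit("/", 1)[-1]: url for url in links}
print(by_title["Sally_Hemings"])  # http://en.wikipedia.com/wiki/Sally_Hemings
```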