Scraping Google Images with Selenium in Python

python json selenium web-scraping

I have been trying to scrape Google Images using the following code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys 
import os
import time
import requests
import re
import urllib2
import re
from threading import Thread
import json
#Assuming I have a folder named Pictures1, the images are downloaded there. 
def threaded_func(url,i):
     raw_img = urllib2.urlopen(url).read()
     cntr = len([i for i in os.listdir("Pictures1") if image_type in i]) + 1
     f = open("Pictures1/" + image_type + "_"+ str(total), 'wb')
     f.write(raw_img)
     f.close()
driver = webdriver.Firefox()
driver.get("https://images.google.com/")
elem = driver.find_element_by_xpath('/html/body/div/div[3]/div[3]/form/div[2]/div[2]/div[1]/div[1]/div[3]/div/div/div[2]/div/input[1]')
elem.clear()
elem.send_keys("parrot")
elem.send_keys(Keys.RETURN)
image_type = "parrot_defG"
images=[]
total=0
time.sleep(10)
for a in driver.find_elements_by_class_name('rg_meta'):
     link =json.loads(a.text)["ou"]
     thread = Thread(target = threaded_func, args = (link,total))
     thread.start()
     thread.join()
     total+=1

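I used Selenium to open Google's image results page and noticed that each div has the class "rg_meta", followed by JSON code.

I tried to access it using .text. The "ou" key of the JSON contains the source URL of the image I am trying to download. I am trying to get all such divs with the class "rg_meta" and download the images. But it shows the error "No JSON object could be decoded", and I don't know what to do.

Edit: this is what I am talking about: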

    <div class="rg_meta">{"cl":3,"id":"FqCGaup9noXlMM:","isu":"kids.britannica.com","itg":false,"ity":"jpg","oh":600,"ou":"http://media.web.britannica.com/eb-media/89/89689-004-4C85E0F0.jpg","ow":380,"pt":"grain weevil -- Kids Encyclopedia | Children\u0026#39;s Homework Help ...","rid":"EusB0pk_sLg7vM","ru":"http://kids.britannica.com/comptons/art-143712/grain-or-granary-weevil","s":"grain weevil","sc":1,"st":"Kids Britannica","th":282,"tu":"https://encrypted-tbn2.gstatic.com/images?q\u003dtbn:ANd9GcQPbgXbRVzOicvPfBRtAkLOpJwy_wDQEC6a2q0BuTsUx-s0-h4b","tw":179}</div>
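
For illustration, the payload of such a div can be parsed with the standard json module. This is only a minimal sketch: the string below is a trimmed copy of the payload above (a few keys kept, the rest dropped for brevity), and the variable names are made up for the example. It also shows that an empty string fails to parse, which is one plausible way to end up with the "No JSON object could be decoded" message under Python 2.

```python
import json

# Trimmed copy of the rg_meta payload shown above (only a few keys kept).
payload = (
    '{"id":"FqCGaup9noXlMM:","ity":"jpg","oh":600,'
    '"ou":"http://media.web.britannica.com/eb-media/89/89689-004-4C85E0F0.jpg",'
    '"ow":380}'
)

meta = json.loads(payload)
link = meta["ou"]  # "ou" holds the URL of the original image
print(link)

# An empty string is not valid JSON, so json.loads raises ValueError.
# (Python 2 words this as "No JSON object could be decoded".)
try:
    json.loads("")
    empty_parsed = True
except ValueError:
    empty_parsed = False
print(empty_parsed)
```

If printing the text taken from the page shows a blank line, the parsing step itself is probably fine and the problem lies in how the text is read from the element.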

Replacing:

driver.find_elements_by_class_name('rg_meta')

with

driver.find_element_by_xpath('//div[@class="rg_meta"]/text()')

and

a.text

with

a

will resolve your issue.

Resulting code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys 
import os
import time
import requests
import re
import urllib2
import re
from threading import Thread
import json
#Assuming I have a folder named Pictures1, the images are downloaded there. 
def threaded_func(url,i):
     raw_img = urllib2.urlopen(url).read()
     cntr = len([i for i in os.listdir("Pictures1") if image_type in i]) + 1
     f = open("Pictures1/" + image_type + "_"+ str(total), 'wb')
     f.write(raw_img)
     f.close()
driver = webdriver.Firefox()
driver.get("https://images.google.com/")
elem = driver.find_element_by_xpath('/html/body/div/div[3]/div[3]/form/div[2]/div[2]/div[1]/div[1]/div[3]/div/div/div[2]/div/input[1]')
elem.clear()
elem.send_keys("parrot")
elem.send_keys(Keys.RETURN)
image_type = "parrot_defG"
images=[]
total=0
time.sleep(10)
for a in driver.find_element_by_xpath('//div[@class="rg_meta"]/text()'):
     link =json.loads(a)["ou"]
     thread = Thread(target = threaded_func, args = (link,total))
     thread.start()
     thread.join()
     total+=1

Printing link gives:

http://media.web.britannica.com/eb-media/89/89689-004-4C85E0F0.jpg

Have you tried printing the contents of a.text? If so, please provide that output. Have you also tried debugging the issue you are having?

@GiannisSpiliopoulos, I tried printing the contents of .text. It shows up blank in my terminal. I assumed the text might be unicode, so I tried converting it to JSON using json.dumps(). That did not work either.

Print what? It is not clear to me how converting a unicode string to JSON would help you... In any case, try printing the source of the whole page (print(driver.page_source)) and check whether your assumptions are correct.

@SatishGarg, here it is:

This does not work for me. Please provide the snippet of code where you made the changes. That would be really helpful. Thanks.

Even after correcting it to for a in driver.find_elements_by_xpath(...), I still get the error: TypeError: expected string or buffer

On which line?

Sorry, your code does not work :( :( The line is: link = json.loads(a)["ou"] and the error is: TypeError: expected string or buffer

Can you add the traceback of the error you received?
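
Given the reports above that .text prints blank and that the XPath variant raises a TypeError, one alternative worth trying is to read each element's textContent instead. WebElement.text only returns text that is visible on the page, so it comes back empty for hidden elements, while get_attribute("textContent") returns the node's text regardless of visibility. The sketch below assumes that the rg_meta divs are hidden and still present in the results page, which the thread does not confirm; the function names extract_original_urls and urls_from_results_page are hypothetical.

```python
import json


def extract_original_urls(elements):
    """Return the "ou" value from each element's JSON payload.

    `elements` is any iterable of objects exposing get_attribute(name),
    such as Selenium WebElements. Elements whose payload is empty, is
    not valid JSON, or has no "ou" key are skipped.
    """
    urls = []
    for element in elements:
        # textContent is available even for hidden elements, whereas
        # WebElement.text only returns text that is visible on the page.
        raw = element.get_attribute("textContent")
        if not raw:
            continue
        try:
            meta = json.loads(raw)
        except ValueError:
            continue
        if isinstance(meta, dict) and "ou" in meta:
            urls.append(meta["ou"])
    return urls


def urls_from_results_page(driver):
    """Collect image URLs from a driver that is already on the results page."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    # Assumes the page still marks its metadata divs with class "rg_meta".
    return extract_original_urls(driver.find_elements(By.CLASS_NAME, "rg_meta"))
```

With the driver from the question, urls_from_results_page(driver) could be called after the time.sleep(10) line, and each returned URL passed to the download function. Note that find_elements returns WebElement objects rather than strings, so passing one directly to json.loads would be expected to fail with a type error like the one quoted above.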