
Iterating through Google search result pages in Python


I'm working on a larger piece of code that will pull up the links from a Google newspaper search and then analyze those links for certain keywords, context, and data. I've gotten everything in that part working, and I'm now running into a problem when I try to iterate through the pages of results. I don't know how to do this without an API, and I don't know how to use one. I just need to be able to iterate through multiple pages of search results so that I can apply my analysis to them. It seems like there should be a simple solution for iterating through the result pages, but I'm not seeing it.

Are there any suggestions on how to approach this? I'm somewhat new to Python and have been teaching myself all of these scraping techniques, so I'm not sure whether I'm just missing something simple. I know this may be an issue with Google restricting automated searches, but even pulling in the first 100 or so links would be beneficial. I've seen examples of this for regular Google search, but not for Google's newspaper search.

Below is the body of the code. If you have any suggestions, that would be very helpful. Thanks in advance.

import csv
import requests
from lxml import html

def get_page_tree(url):
    page = requests.get(url=url, verify=False)
    return html.fromstring(page.text)

def find_other_news_sources(initial_url):
    forwarding_identifier = '/url?q='
    google_news_search_url = "https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro"
    google_news_search_tree = get_page_tree(url=google_news_search_url)
    other_news_sources_links = [a_link.replace(forwarding_identifier, '').split('&')[0] for a_link in google_news_search_tree.xpath('//a//@href') if forwarding_identifier in a_link]
    return other_news_sources_links

links = find_other_news_sources("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro")

with open('textanalysistest.csv', 'wt') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for row in links:
        print(row)
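
As for iterating through the result pages without an API: one common approach, sketched below as an assumption rather than something taken from the question, is that Google result pages accept a start offset in the query string (roughly 10 results per page), so the same search URL can be requested with &start=10, &start=20, and so on. The helper below reuses get_page_tree from above; the start parameter, the assumed page size of 10, and the find_news_links_across_pages name are my own, and Google may still throttle or block rapid automated requests.

from time import sleep

def find_news_links_across_pages(base_search_url, max_pages=10):
    """Hypothetical helper: collect outbound links from several consecutive
    result pages by appending a start offset (assumed 10 results per page)."""
    forwarding_identifier = '/url?q='
    all_links = []
    for page_number in range(max_pages):
        offset = page_number * 10  # assumed page size of 10 results per page
        paged_url = base_search_url + "&start=" + str(offset)
        tree = get_page_tree(url=paged_url)  # reuses the helper defined above
        all_links.extend(
            a_link.replace(forwarding_identifier, '').split('&')[0]
            for a_link in tree.xpath('//a//@href')
            if forwarding_identifier in a_link)
        sleep(2)  # pause between requests; Google may block rapid automated queries
    return all_links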

I've been looking into building a parser for a site with a structure similar to Google's (i.e. a set of consecutive result pages, each containing a table of content of interest).

The combination of the Selenium package (for page-element-based site navigation) and BeautifulSoup (for html parsing) seems to be the weapon of choice for harvesting written content. You may find them useful too, although I have no idea what kinds of defenses Google has in place to deter scraping.

A possible implementation for Mozilla Firefox using selenium, beautifulsoup and geckodriver:

from bs4 import BeautifulSoup, SoupStrainer
from bs4.diagnose import diagnose
from os.path import isfile
from time import sleep
import codecs
from selenium import webdriver

def first_page(link):
    """Takes a link, and scrapes the desired tags from the html code"""
    driver = webdriver.Firefox(executable_path = 'C://example/geckodriver.exe')#Specify the appropriate driver for your browser here
    counter=1
    driver.get(link)
    html = driver.page_source
    filter_html_table(html)
    counter +=1
    return driver, counter


def nth_page(driver, counter, max_iter):
    """Takes a driver instance, a counter to keep track of iterations, and max_iter for maximum number of iterations. Looks for a page element matching the current iteration (how you need to program this depends on the html structure of the page you want to scrape), navigates there, and calls mine_page to scrape."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter)) #For other strategies to retrieve elements from a page, see the selenium documentation
        pageLink.click()
        scrape_page(driver)
        counter+=1
    else:
        print("Done scraping")
    return


def scrape_page(driver):
    """Takes a driver instance, extracts html from the current page, and calls function to extract tags from html of total page"""
    html = driver.page_source #Get html from page
    filter_html_table(html) #Call function to extract desired html tags
    return


def filter_html_table(html):
    """Takes a full page of html, filters the desired tags using beautifulsoup, calls function to write to file"""
    only_td_tags = SoupStrainer("td")#Specify which tags to keep
    filtered = BeautifulSoup(html, "lxml", parse_only=only_td_tags).prettify() #Specify how to represent content
    write_to_file(filtered) #Function call to store extracted tags in a local file.
    return


def write_to_file(output):
    """Takes the scraped tags, opens a new file if the file does not exist, or appends to existing file, and writes extracted tags to file."""
    fpath = "<path to your output file>"
    if isfile(fpath):
        f = codecs.open(fpath, 'a') #using 'codecs' to avoid problems with utf-8 characters in ASCII format. 
        f.write(output)
        f.close()
    else:
        f = codecs.open(fpath, 'w') #using 'codecs' to avoid problems with utf-8 characters in ASCII format. 
        f.write(output)
        f.close()
    return
I wrote this recently to do just that:
link = <link to site to scrape>
driver, n_iter = first_page(link)
nth_page(driver, n_iter, 1000) # the 1000 lets us scrape 1000 of the result pages
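
One small addition, which is my own suggestion rather than part of the original answer: quit the browser once the loop finishes, otherwise each run leaves a Firefox window and geckodriver process behind.

driver.quit()  # close the browser and end the geckodriver process when scraping is done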