
Python: fetching the Google AdSense earnings report


I need a Python script that fetches Google AdSense earnings, and I found adsense_scraper: it uses Twill and html5lib to scrape the earnings data from Google AdSense. When I run it, I get the following error message:

Traceback (most recent call last):
  File "adsense_scraper.py", line 163, in <module>
    data = main()
  File "adsense_scraper.py", line 154, in main
    b = get_adsense(login, password)
  File "adsense_scraper.py", line 128, in get_adsense
    b.submit()
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\browser.py", line 467, in submit
    self._journey('open', request)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\browser.py", line 523, in _journey
    r = func(*args, **kwargs)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_mechanize.py", line 212, in open
    return self._mech_open(url, data)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_mechanize.py", line 238, in _mech_open
    response = UserAgentBase.open(self, request, data)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_opener.py", line 192, in open
    response = meth(req, response)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_http.py", line 590, in http_response
   "http", request, response, code, msg, hdrs)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_opener.py", line 209, in error
    result = apply(self._call_chain, args)
  File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_http.py", line 135, in http_error_302
    return self.parent.open(new)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_mechanize.py", line 212, in open
    return self._mech_open(url, data)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_mechanize.py", line 238, in _mech_open
    response = UserAgentBase.open(self, request, data)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_opener.py", line 192, in open
    response = meth(req, response)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\utils.py", line 442, in http_response
    "refresh", msg, hdrs)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_opener.py", line 209, in error
    result = apply(self._call_chain, args)
  File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_http.py", line 135, in http_error_302
    return self.parent.open(new)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_mechanize.py", line 212, in open
    return self._mech_open(url, data)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_mechanize.py", line 238, in _mech_open
    response = UserAgentBase.open(self, request, data)
  File "c:\python26\lib\site-packages\twill-0.9-py2.6.egg\twill\other_packages\_mechanize_dist\_opener.py", line 181, in open
    response = urlopen(self, req, data)
  File "C:\Python26\lib\urllib2.py", line 406, in _open
    'unknown_open', req)
  File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "C:\Python26\lib\urllib2.py", line 1163, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: 'http>

Can anyone tell me where the error is? Is there a better way to fetch this data with Python? Thanks.

The package has several bugs; you only mentioned the first one.

1) The twill package does not handle Google's redirect correctly. Adding

    newurl = newurl.strip( "'" )

to twill/other_packages/_mechanize_dist/_http.py immediately before line 108, which reads

    newurl = _rfc3986.clean_url(newurl, "latin-1")

fixes it.
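To see why the traceback ends in `unknown url type: 'http`: the Location header of Google's redirect is wrapped in literal single quotes, so the scheme that urllib2 extracts starts with a quote character. A minimal, self-contained illustration (the URL below is invented for the example):

```python
# Google's redirect sends a Location value wrapped in single quotes,
# e.g. 'http://...' -- note the literal quotes inside the header value.
location = "'http://www.google.com/adsense/gaiaauth2'"

# urllib2 takes everything before the first ':' as the URL scheme, so it
# sees the bogus scheme "'http" and raises URLError("unknown url type: 'http").
scheme = location.split(':', 1)[0]

# The one-line patch strips the stray quotes before the URL is cleaned,
# restoring a valid scheme.
fixed = location.strip("'")
```

With the quotes stripped, `fixed.split(':', 1)[0]` is the valid scheme `http` and the redirect can be followed.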

2) You have to set the correct language in AdSense - English.

3) The original adsense_scraper has several problems:

#!/usr/bin/env python
"""Scrapes Google AdSense data with Python using Twill

Current canonical location of this module is here:
http://github.com/etrepum/adsense_scraper/tree/master


Usage::

    from adsense_scraper import get_adsense, get_time_period
    b = get_adsense('YOUR_ADSENSE_LOGIN', 'YOUR_ADSENSE_PASSWORD')
    rows = get_time_period(b, 'yesterday')
    # The summary data is always the first row with channel == ''
    print 'I earned this much yesterday: $%(earnings)s' % rows[0]

"""
# requires html5lib, twill
import sys
import pprint
import decimal
from cStringIO import StringIO
from xml.etree import cElementTree

try:
    from html5lib import HTMLParser
    import twill.commands
except ImportError:
    print >>sys.stderr, """\
adsense_scraper has dependencies::

    Twill 0.9 http://twill.idyll.org/
    html5lib 0.11 http://code.google.com/p/html5lib/

Try this::

    $ easy_install twill html5lib
"""
    raise SystemExit()

__version__ = '0.5'

SERVICE_LOGIN_BOX_URL = "https://www.google.com/accounts/ServiceLogin?service=adsense&rm=hide&fpui=3&nui=15&alwf=true&ltmpl=adsense&passive=true&continue=https%3A%2F%2Fwww.google.com%2Fadsense%2Fgaiaauth2&followup=https%3A%2F%2Fwww.google.com%2Fadsense%2Fgaiaauth2&hl=en_US"
OVERVIEW_URL = "https://www.google.com/adsense/report/overview?timePeriod="

TIME_PERIODS = [
    'today',
    'yesterday',
    'thismonth',
    'lastmonth',
    'sincelastpayment',
]


def parse_decimal(s):
    """Return an int or decimal.Decimal given a human-readable number

    """
    light_stripped = s.strip(u'\u20ac')
    stripped = light_stripped.replace(',', '.').rstrip('%').lstrip('$')
    try:
        int(stripped)
        return light_stripped
    except ValueError:
        pass
    try:
        float(stripped)
        return light_stripped
    except ValueError:
        return decimal.Decimal(stripped)


def parse_summary_table(doc):
    """
    Parse the etree doc for summarytable, returns::

        [{'channel': unicode,
          'impressions': int,
          'clicks': int,
          'ctr': decimal.Decimal,
          'ecpm': decimal.Decimal,
          'earnings': decimal.Decimal}]

    """
    for t in doc.findall('.//table'):
        if t.attrib.get('id') == 'summarytable':
            break
    else:
        raise ValueError("summary table not found")

    res = []
    FIELDS = ['impressions', 'clicks', 'ctr', 'ecpm', 'earnings']
    for row in t.findall('.//tr'):
        celltext = []
        for c in row.findall('td'):
            tail = ''
            # adsense inserts an empty span if a row has a period in it, so
            # get the children and find the tail element to append to the text
            if c.find('a') and c.find('a').getchildren():
                tail = c.find('a').getchildren()[0].tail or ''
            celltext.append('%s%s' % ((c.text or c.findtext('a') or '').strip(), tail.strip()))

        celltext = filter( lambda x: x != "" , celltext )
        if len(celltext) != len(FIELDS):
            continue
        try:
            value_cols = map(parse_decimal, celltext)
        except decimal.InvalidOperation:
            continue
        res.append(dict(zip(FIELDS, value_cols)))

    return res


def get_adsense(login, password):
    """Returns a twill browser instance after having logged in to AdSense
    with *login* and *password*.

    The returned browser will have all of the appropriate cookies set but may
    not be at the exact page that you want data from.

    """
    b = twill.commands.get_browser()
    b.go(SERVICE_LOGIN_BOX_URL)
    for form in b.get_all_forms():
        try:
            form['Email'] = login
            form['Passwd'] = password
        except ValueError:
            continue
        else:
            break
    else:
        raise ValueError("Could not find login form on page")
    b._browser.select_form(predicate=lambda f: f is form)
    b.submit()
    return b


def get_time_period(b, period):
    """Returns the parsed summarytable for the time period *period* given
    *b* which should be the result of a get_adsense call. *period* must be
    a time period that AdSense supports:
    ``'today'``, ``'yesterday'``, ``'thismonth'``,
    ``'lastmonth'``, ``'sincelastpayment'``.

    """
    b.go(OVERVIEW_URL + period)
    # The cElementTree treebuilder doesn't work reliably enough
    # to use directly, so we parse and then dump into cElementTree.
    doc = cElementTree.fromstring(HTMLParser().parse(b.get_html()).toxml())
    return parse_summary_table(doc)


def main():
    try:
        login, password = sys.argv[1:]
    except ValueError:
        raise SystemExit("usage: %s LOGIN PASSWORD" % (sys.argv[0],))
    twill.set_output(StringIO())
    twill.commands.reset_browser()
    b = get_adsense(login, password)
    data = {}
    for period in TIME_PERIODS:
        data[period] = get_time_period(b, period)
    pprint.pprint(data)
    twill.set_output(None)
    return data

if __name__ == '__main__':
    data = main()
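Since the table parsing is independent of the login step, it can be sanity-checked offline. The sketch below is my own snippet, not part of the package, and the HTML fragment is invented; it mirrors what parse_summary_table does: locate the table with id="summarytable" and collect the cell texts of each row.

```python
# Standalone sketch of the summarytable extraction, runnable without
# AdSense credentials.  The HTML fragment is made up for illustration.
from xml.etree import ElementTree

SNIPPET = (
    '<html><body>'
    '<table id="summarytable">'
    '<tr><td>1234</td><td>56</td><td>4.54%</td>'
    '<td>$1.23</td><td>$0.69</td></tr>'
    '</table>'
    '</body></html>'
)

FIELDS = ['impressions', 'clicks', 'ctr', 'ecpm', 'earnings']

doc = ElementTree.fromstring(SNIPPET)

# Same lookup the scraper performs: find the table by its id attribute.
for t in doc.findall('.//table'):
    if t.attrib.get('id') == 'summarytable':
        break
else:
    raise ValueError("summary table not found")

rows = []
for row in t.findall('.//tr'):
    cells = [(c.text or '').strip() for c in row.findall('td')]
    # Keep only rows with a full set of data cells, as the scraper does.
    if len(cells) == len(FIELDS):
        rows.append(dict(zip(FIELDS, cells)))
```

If this snippet extracts the row correctly but the real script still fails, the problem is in the login or redirect handling, not the parsing.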

How do I invoke it? First I ran "python setup.py install" (setup.py ships in the adsense_scraper download; it installed html5lib and twill). Then I ran: python adsense_scraper.py LOGIN PASSWORD (my password contains special characters too, but I think the login succeeds, because with a wrong password I get a different error.)

I don't see anything obviously wrong with the code. The URL looks okay. Is the last line of the error message transcribed correctly? It seems strange that there is no closing quote; the code on the preceding line looks normal.

There is sample invocation code in the adsense_scraper.py docstring; you could try calling it that way and see whether that helps.

Please post the code you actually ran, not the link you downloaded it from. Here is the code:
(identical to the listing in the answer above)