Sending a form request to SciHub no longer works using urllib/urllib2 in Python


I have a little script that I'm very happy with: it reads one or more bibliography references from the clipboard, gets information on the academic papers from Google Scholar, and then feeds that into SciHub to get the PDF. For some reason it has stopped working, and I have spent ages trying to work out why.

Testing shows that the Google (scholarly.py) part of the program is working correctly; the problem is in the SciHub part.

Any ideas?

Here is an example reference: Appleyard, S.J., Angeloni, J. and Watkins, R. (2006) Arsenic-rich groundwater in an urban area experiencing drought and increasing population density, Perth, Australia. Applied Geochemistry 21(1), 83-97.

It works now: I added a 'User-Agent' header and reworked the urllib stuff, and what it is doing now seems much more obvious. It was a process of trial and error, trying lots of different code snippets gathered from around the web. I hope my boss doesn't ask me what I achieved today. Someone should create a forum where people can get answers to coding questions...

    '''Program to automatically find and download items from a bibliography
    or references list. Here are some journal papers in bibliographic format;
    just copy the text to the clipboard and run the script.

    Ghaffour, N., T. M. Missimer and G. L. Amy (2013). "Technical review and evaluation of the economics of water desalination: Current and future challenges for better water supply sustainability." Desalination 309(0): 197-207.

    Gutiérrez Ortiz, F. J., P. G. Aguilera and P. Ollero (2014). "Biogas desulfurization by adsorption on thermally treated sewage-sludge." Separation and Purification Technology 123(0): 200-213.

    This program uses the 'scihub' website to obtain the full-text paper where
    available; if no entry is found the paper is ignored, and the failed
    downloads are listed at the end.'''

    import scholarly
    import win32clipboard
    import urllib
    import urllib2
    import webbrowser
    import re


    '''Select and then copy the bibliography entries you want to download
    papers for; Python reads them from the clipboard.'''
    win32clipboard.OpenClipboard()
    c = win32clipboard.GetClipboardData()
    win32clipboard.EmptyClipboard()
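    # the clipboard is emptied here so the cleaned-up text can be written back below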

    '''Clean up the text: remove line endings, double spaces etc.'''
    c = c.replace('\n', ' ')
    c = c.replace('\r', ' ')
    while c.find('  ') != -1:
        c = c.replace('  ', ' ')
    win32clipboard.SetClipboardText(c)
    win32clipboard.CloseClipboard()
    print "Working..."

    '''A bit of regex to extract the title of the paper.
    IMPORTANT: the bibliography has to be in author-date format or you will
    need to revise this. At the moment it looks for the date in brackets,
    then copies all the text until it reaches a full stop, assuming that this
    is the paper title. If it is not, the script will either fail or will be
    searching on inappropriate terms.'''

    paper_info = re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)", c)
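    # e.g. for "Appleyard, S.J. et al. (2006) Arsenic-rich groundwater in an
    # urban area experiencing drought and increasing population density,
    # Perth, Australia. Applied Geochemistry 21(1), 83-97", group 4 (i[3])
    # captures the title text between the bracketed year and the full stop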
    print "Analysing titles"
    print "The following titles found:"
    print "*************************"
    list_of_titles = []
    for i in paper_info:
        print '%s...' % (i[3][:50])
        paper_title = str(i[3])
        list_of_titles.append(paper_title)
    paper_number = 0
    failed = []
    for title in list_of_titles:
        try:
            search_query = scholarly.search_pubs_query(title)

            info = next(search_query)
            paper_number += 1
            print "Querying Google Scholar"
            print "**********************"
            print "Looking up paper title:"
            print title
            print "**********************"

            # scholarly returns the paper's metadata as a dict in info.bib;
            # 'url' is the landing page Google Scholar reports for the paper
            url = info.bib['url']
            print "Journal URL found "
            print url
            #url=next(search_query)
            print "Sending URL: ", url

            site = 'http://sci-hub.cc/'

            # build a POST to the SciHub front page: the article URL goes in
            # the 'request' form field, and a browser-style User-Agent header
            # stops the server rejecting the script with HTTP 403
            r = urllib2.Request(url=site)
            r.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
            r.add_data(urllib.urlencode({'request': url}))
            res = urllib2.urlopen(r)

            with open("results.html", "w") as f:
                f.write(res.read())


            webbrowser.open_new("results.html")
            if paper_number < len(list_of_titles):
                print "Next title"

        except Exception as e:
            print repr(e)
            paper_number+=1
            print "**********************"
            print "No valid journal found for:"
            print title
            print "**********************"
            print "Continuing..."
            failed.append(title)

    if not failed:
        print 'Complete'
    else:
        print '*************************************'
        print 'The following titles did not download:'
        print '*************************************'
        print failed
        print "Please check that these are valid entries"

You have a bare except: block in your code that is swallowing every exception and replacing it with a useless error message. Try removing it and see what the actual problem is.

Thanks Blender, I get HTTP Error 403: Forbidden. I think I need to fake the headers so it doesn't look like a Python script, but I can't quite get it to work. I am currently rewriting the offending part using 'Requests' instead of urllib and urllib2. It's confusing, because it had been working for weeks....
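
For what it's worth, the Requests rewrite mentioned above might look something like the sketch below. This is only a sketch, not the poster's actual code: it reuses the sci-hub.cc address, the 'request' form field and the User-Agent string from the script above, and the article URL is a hypothetical placeholder.

    import requests

    # placeholder: in the real script this comes from scholarly (info.bib['url'])
    url = 'http://example.com/article-landing-page'

    # same browser-style User-Agent the urllib2 version sends
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
                             '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

    # Requests encodes the dict as the POST body, like urllib.urlencode did
    res = requests.post('http://sci-hub.cc/', data={'request': url},
                        headers=headers)

    # write the raw response bytes; 'wb' avoids newline/encoding mangling
    with open('results.html', 'wb') as f:
        f.write(res.content)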