Web scraping: How to scrape a website protected by ColdFusion?


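Extracting the PDF URL from the following web page is straightforward.

But when I wget it, the output contains something like the following instead of the PDF file being downloaded:

<p>OSA has implemented a process that requires you to enter the letters and/or numbers below before you can download this article.</p>

Since the website sets the cookie cfid, it is presumably protected with ColdFusion. Does anyone know how to scrape such a web page? Thanks.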


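Edit: The wget solution provided by Sev Roberts does not work. I checked Chrome DevTools (in a new incognito window), and many requests are sent after the first one. My guess is that because wget does not send those requests, the subsequent wget (with cookies) fails. Can anyone tell which of those requests are necessary? Thanks.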

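There are several methods a website can use to counter this kind of scraping, direct linking or embedding. The old, basic methods include:

  • Checking the user's cookies: at minimum, checking that the user already has a session from a previous page on the site. Some sites go further and look for specific cookies or session variables to verify a genuine path through the site.
  • Checking the cgi.http_referer variable to see whether the user arrived from the expected source.
  • Checking whether cgi.http_user_agent looks like a known human browser, or checking that it does not look like a known bot.
  • Other, smarter methods exist, of course, but in my experience if you need more than the above you have reached the point of requiring a captcha and/or requiring users to register and log in.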

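Obviously, the referer and user-agent checks are easy to spoof by setting the headers manually. For the cookie check, if you are using cfhttp or its equivalent in another language, you need to make sure that the cookies returned in the Set-Cookie header of the site's response are sent back in the headers of your subsequent requests, using cfhttpparam. Various cfhttp wrappers and alternative libraries (such as Java wrappers that bypass the cfhttp layer) are available to do this. But if you want a simple example of how this works, Ben Nadel has an old but good one: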

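With the PDF URL linked in your question, a few minutes of tinkering in Chrome shows that if I drop the cookies from the previous page and keep the http_referer, I get the captcha challenge, but if I keep the cookies and drop the http_referer, I go straight through to the PDF. This confirms that they care about the cookies, not the referer.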
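When experimenting like this from the command line, it helps to check whether a download really is a PDF or the captcha page saved under a .pdf name. A minimal sketch, assuming the response was written to a local file; the helper name is_pdf is hypothetical, and the check relies only on the fact that PDF files begin with the bytes %PDF:

```shell
#!/bin/sh
# Sketch: report whether a downloaded file is really a PDF.
# PDF files start with the signature "%PDF"; a captcha page is HTML instead.
# is_pdf is a hypothetical helper name, not part of wget or any other tool.
is_pdf() {
    # Read the first four bytes of the file and compare them with the signature.
    [ "$(head -c 4 "$1" 2>/dev/null)" = "%PDF" ]
}
```

For example, `is_pdf article.pdf && echo "got the PDF" || echo "got something else"` after each attempt shows quickly whether a given combination of cookies and headers got through.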

A copy of Ben's example, for completeness on SO:

    <cffunction
        name="GetResponseCookies"
        access="public"
        returntype="struct"
        output="false"
        hint="This parses the response of a CFHttp call and puts the cookies into a struct.">
     
        <!--- Define arguments. --->
        <cfargument
            name="Response"
            type="struct"
            required="true"
            hint="The response of a CFHttp call."
            />
        <!---
            Create the default struct in which we will hold
            the response cookies. This struct will contain structs
            and will be keyed on the name of the cookie to be set.
        --->
        <cfset LOCAL.Cookies = StructNew() />
        <!---
            Get a reference to the cookies that were returned
            from the page request. This will give us a numerically
            indexed struct of cookie strings (which we will have
            to parse out for values). BUT, check to make sure
            that cookies were even sent in the response. If they
            were not, then there is no work to be done.
        --->
        <cfif NOT StructKeyExists(
            ARGUMENTS.Response.ResponseHeader,
            "Set-Cookie"
            )>
            <!---
                No cookies were sent back in the response. Just
                return the empty cookies structure.
            --->
            <cfreturn LOCAL.Cookies />
        </cfif>
        <!---
            ASSERT: We know that cookies were returned in the page
            response and that they are available at the key,
            "Set-Cookie" of the response header.
        --->
        <!---
            Now that we know that the cookies were returned, get
            a reference to the struct as described above.
        --->
        <!---
            The cookies might be coming back as a struct or they
            might be coming back as a string. If there is only
            ONE cookie being returned, then it comes back as a
            string. If that is the case, then re-store it as a
            struct.
        --->
        <cfif IsSimpleValue(ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ])>
            <cfset LOCAL.ReturnedCookies = {} />
            <cfset LOCAL.ReturnedCookies[1] = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
        <cfelse>
            <cfset LOCAL.ReturnedCookies = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
        </cfif>
        <!--- Loop over the returned cookies struct. --->
        <cfloop
            item="LOCAL.CookieIndex"
            collection="#LOCAL.ReturnedCookies#">
            <!---
                As we loop through the cookie struct, get
                the cookie string we want to parse.
            --->
            <cfset LOCAL.CookieString = LOCAL.ReturnedCookies[ LOCAL.CookieIndex ] />
            <!---
                For each of these cookie strings, we are going to
                need to parse out the values. We can treat the
                cookie string as a semi-colon delimited list.
            --->
            <cfloop
                index="LOCAL.Index"
                from="1"
                to="#ListLen( LOCAL.CookieString, ';' )#"
                step="1">
                <!--- Get the name-value pair. --->
                <cfset LOCAL.Pair = ListGetAt(
                    LOCAL.CookieString,
                    LOCAL.Index,
                    ";"
                    ) />
                <!---
                    Get the name as the first part of the pair
                    separated by the equals sign.
                --->
                <cfset LOCAL.Name = ListFirst( LOCAL.Pair, "=" ) />
                <!---
                    Check to see if we have a value part. Not all
                    cookies are going to send values of length,
                    which can throw off ColdFusion.
                --->
                <cfif (ListLen( LOCAL.Pair, "=" ) GT 1)>
                    <!--- Grab the rest of the list. --->
                    <cfset LOCAL.Value = ListRest( LOCAL.Pair, "=" ) />
                <cfelse>
                    <!---
                        Since ColdFusion did not find more than one
                        value in the list, just get the empty string
                        as the value.
                    --->
                    <cfset LOCAL.Value = "" />
                </cfif>
                <!---
                    Now that we have the name-value data values,
                    we have to store them in the struct. If we are
                    looking at the first part of the cookie string,
                    this is going to be the name of the cookie and
                    its struct index.
                --->
                <cfif (LOCAL.Index EQ 1)>
                    <!---
                        Create a new struct with this cookie's name
                        as the key in the return cookie struct.
                    --->
                    <cfset LOCAL.Cookies[ LOCAL.Name ] = StructNew() />
                    <!---
                        Now that we have the struct in place, let's
                        get a reference to it so that we can refer
                        to it in subsequent loops.
                    --->
                    <cfset LOCAL.Cookie = LOCAL.Cookies[ LOCAL.Name ] />
                    <!--- Store the value of this cookie. --->
                    <cfset LOCAL.Cookie.Value = LOCAL.Value />
                    <!---
                        Now, this cookie might have more than just
                        the first name-value pair. Let's create an
                        additional attributes struct to hold those
                        values.
                    --->
                    <cfset LOCAL.Cookie.Attributes = StructNew() />
                <cfelse>
                    <!---
                        For all subsequent calls, just store the
                        name-value pair into the established
                        cookie's attributes struct.
                    --->
                    <cfset LOCAL.Cookie.Attributes[ LOCAL.Name ] = LOCAL.Value />
                </cfif>
            </cfloop>
        </cfloop>
        <!--- Return the cookies. --->
        <cfreturn LOCAL.Cookies />
    </cffunction>
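For readers who do not use CFML, the same idea can be sketched in shell: treat one Set-Cookie header value as a semicolon-delimited list, take the first pair as the cookie's name and value, and treat the remaining parts as attributes. This is only a rough equivalent under stated assumptions: the helper name parse_set_cookie and its line-based output format are hypothetical, and unlike the CFML function it handles a single header string and does not build a nested structure.

```shell
#!/bin/sh
# Sketch: split one Set-Cookie header value into name, value and attributes.
# parse_set_cookie is a hypothetical helper; the body runs in a subshell
# (parentheses instead of braces) so its variables stay local.
parse_set_cookie() (
    header=$1
    # The first semicolon-delimited part is the "name=value" pair.
    first=${header%%;*}
    printf 'name=%s\n' "${first%%=*}"
    # Everything after the first "=" is the value; it may be empty.
    case $first in
        *=*) printf 'value=%s\n' "${first#*=}" ;;
        *)   printf 'value=\n' ;;
    esac
    # Whatever follows the first semicolon is the attribute list.
    case $header in
        *\;*) rest=${header#*;} ;;
        *)    rest= ;;
    esac
    while [ -n "$rest" ]; do
        part=${rest%%;*}
        case $rest in
            *\;*) rest=${rest#*;} ;;
            *)    rest= ;;
        esac
        # Strip leading spaces from the attribute.
        part=${part#"${part%%[! ]*}"}
        if [ -n "$part" ]; then
            printf 'attr:%s\n' "$part"
        fi
    done
)
```

Calling `parse_set_cookie "CFID=12345; path=/; HttpOnly"` prints the name, the value, and one line per attribute, which is enough to rebuild a Cookie header by hand for a follow-up request.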
    
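Edit: If you are using wget rather than cfhttp, you could try the approach from the answer to this question, but without posting a username and password, since you do not actually need the login form.

e.g. the two wget commands reproduced at the end of this page.

...although, as others have pointed out, you may be violating the source's terms of service, so I cannot recommend actually doing this.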

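I don't think it is protected by ColdFusion. Please provide the command line you used with WGET.

It isn't simple, but I could probably do it with ColdFusion code if I tried hard enough.

Edit - It seems they have set up a process to detect and prevent automated downloads like this. It doesn't sound CF-specific. If this does not violate their TOS, I would guess it requires making one request to obtain the cookie values, extracting the cookie values, and then submitting a second request with those cookies.

@SOS The problem is that many requests are sent. How do I figure out which of them are necessary?

Well, trial and error. That's all part of the fun and adventure of screen scraping. No two sites are the same...

...Yes, but if they are going to that much trouble, the TOS probably doesn't allow screen scraping :-)

I don't use CF code for scraping, and I don't understand CF code. I only want to send raw HTTP requests for scraping. Could you provide the raw wget commands (and any other commands used to compute additional parameters) so that I can check whether it works? Thanks.

Ah - when I read the question I had mentally autocorrected wget. I don't use wget for anything other than simple download scripts for Docker. I could translate the CF to Java, C# or PHP if necessary, but I don't know how to programmatically extract the headers from one wget response to feed into another wget request. If you are doing this in a shell script, you are best off looking in the man page for wget to see whether it has built-in options for persisting cookies, and if not, choosing another command-line browser that does...

I edited my answer to add a wget example. It is untested, but it should work for your case, since they are using 302 redirects, which wget can follow. It would not work for javascript redirects; for future reference, if that were the case you would have to use the final destination of the redirect as the last wget URL, and hope they are not using dynamically generated, time-limited unique URLs :-)

@SevRoberts I think the site does more to prevent a simple wget command with cookies. I think the key is to figure out the minimal set of HTTP requests that must be sent. Are you able to inspect the HTTP requests sent by the ColdFusion code and keep only the minimal set (along with the dependencies between them, since a later request may need information from an earlier one) while still being able to download the PDF file? Only that way can the required wget commands be determined (once the HTTP requests are known, writing the wget commands is trivial).
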
    <cfloop item="strCookie" collection="#cookieStruct#">
        <cfhttpparam type="COOKIE" name="#strCookie#" value="#cookieStruct[strCookie].Value#" />
    </cfloop>
    
    # Get a session.
    wget --save-cookies cookies.txt \
         --keep-session-cookies \
         --delete-after \
         'https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745'
    
    # Now grab the page or pages we care about.
    # You may also need to add valid http_referer or http_user_agent headers
    wget --load-cookies cookies.txt \
         'https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0'
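
If the second request still returns the captcha page, two things are worth checking: whether the first request actually stored any session cookies, and whether adding the Referer and User-Agent headers changes the outcome. The sketch below is untested against the site and makes several assumptions: the helper names cookie_names and fetch_pdf are hypothetical, the user-agent string is an arbitrary placeholder, and the cookie jar is assumed to be in the tab-separated Netscape format that `wget --save-cookies` writes.

```shell
#!/bin/sh
# Sketch: list the cookie names stored in a Netscape-format cookie jar,
# to confirm that session cookies (for example CFID) were captured.
# cookie_names is a hypothetical helper name.
cookie_names() {
    # Fields are tab-separated:
    # domain, subdomain flag, path, secure flag, expiry, name, value.
    # Lines starting with "#" are comments, except for the "#HttpOnly_"
    # prefix that some tools (such as curl) put in front of the domain.
    awk -F '\t' '
        /^#HttpOnly_/ { sub(/^#HttpOnly_/, "") }
        /^#/ || NF < 7 { next }
        { print $6 }
    ' "$1"
}

# Sketch: repeat the second request with explicit Referer and User-Agent
# headers. $1 is the cookie jar, $2 is the output file. The user-agent value
# is a placeholder, not something the site is known to require.
fetch_pdf() {
    wget --load-cookies "$1" \
         --referer='https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745' \
         --user-agent='Mozilla/5.0' \
         --output-document="$2" \
         'https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0'
}
```

Running `cookie_names cookies.txt` after the first wget shows which cookies were saved; an empty list means the session was never established, so the second request has nothing useful to send.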