Web scraping: How to scrape a ColdFusion-protected website?
Tags: web-scraping, coldfusion, google-chrome-devtools, wget
It is straightforward to extract the PDF URL from the following webpage. But when I request it, the output shows something like the following instead of downloading the PDF file:
<p>OSA has implemented a process that requires you to enter the letters and/or numbers below before you can download this article.</p>
Since the website uses the cookie cfid, it is presumably protected by ColdFusion. Does anyone know how to scrape such a webpage? Thanks.
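Edit: The wget solution provided by Sev Roberts does not work. I checked Chrome DevTools (in a new incognito window), and many requests are sent after the first request. I guess that because wget does not send these requests, the subsequent wget (with cookies) fails. Can anyone tell which of these requests are necessary? Thanks.
There are several methods that websites can use to counter this kind of scraping, direct linking, or embedding. The older basic methods include:
Checking the cgi.http_referer variable to see whether the user arrived from the expected source.
Checking whether cgi.http_user_agent looks like a known human browser, or checking whether the user agent looks like a known bot.
If you are using cfhttp or an equivalent, you need to use cfhttpparam to ensure that the cookies returned in the Set-Cookie header of the site's response are sent back in the headers of your subsequent requests. Various cfhttp wrappers and alternative libraries (such as Java wrappers that bypass the cfhttp layer) can do this. But if you want a simple example of how this works, Ben Nadel has an old but good one: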
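Using the PDF URL linked in your question, a few minutes of tinkering in Chrome showed that if I drop the cookies from the previous page and keep the http_referer, I get the captcha challenge, but if I keep the cookies and drop the http_referer, I go straight to the PDF. This confirms that they care about the cookies, not the referrer.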
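The cookie-first behaviour described above can be sketched outside ColdFusion as well. The following is a minimal Python sketch using only the standard library, assuming the site needs nothing more than the session cookies from an earlier page; the function name fetch_with_session and its parameters are hypothetical, and it has not been tested against the real site.

```python
import urllib.request
from http.cookiejar import CookieJar


def fetch_with_session(session_url, target_url, extra_headers=None):
    """Visit session_url to collect cookies, then fetch target_url with them.

    extra_headers is an optional dict of headers (for example Referer or
    User-Agent) sent with both requests. Returns the body of target_url
    and the names of the cookies that were stored.
    """
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Caller-supplied headers go first so they take precedence over
    # urllib's default User-agent header.
    opener.addheaders = list((extra_headers or {}).items()) + opener.addheaders

    # First request: its only purpose is to receive the Set-Cookie headers.
    with opener.open(session_url) as response:
        response.read()

    # Second request: the opener sends the stored cookies back automatically.
    with opener.open(target_url) as response:
        body = response.read()

    return body, [cookie.name for cookie in jar]
```

If the body returned for the second URL is still the captcha page, the site is checking more than the cookies set by the first page, and further requests would have to be replayed.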
A copy of Ben's example, for completeness on SO:
<cffunction
name="GetResponseCookies"
access="public"
returntype="struct"
output="false"
hint="This parses the response of a CFHttp call and puts the cookies into a struct.">
<!--- Define arguments. --->
<cfargument
name="Response"
type="struct"
required="true"
hint="The response of a CFHttp call."
/>
<!---
Create the default struct in which we will hold
the response cookies. This struct will contain structs
and will be keyed on the name of the cookie to be set.
--->
<cfset LOCAL.Cookies = StructNew() />
<!---
Get a reference to the cookies that were returned
from the page request. This will give us a numerically
indexed struct of cookie strings (which we will have
to parse out for values). BUT, check to make sure
that cookies were even sent in the response. If they
were not, then there is no work to be done.
--->
<cfif NOT StructKeyExists(
ARGUMENTS.Response.ResponseHeader,
"Set-Cookie"
)>
<!---
No cookies were sent back in the response. Just
return the empty cookies structure.
--->
<cfreturn LOCAL.Cookies />
</cfif>
<!---
ASSERT: We know that cookies were returned in the page
response and that they are available at the key,
"Set-Cookie" of the response header.
--->
<!---
Now that we know that the cookies were returned, get
a reference to the struct as described above.
--->
<!---
The cookies might be coming back as a struct or they
might be coming back as a string. If there is only
ONE cookie being returned, then it comes back as a
string. If that is the case, then re-store it as a
struct.
--->
<!---<cfdump var="#arguments#" label="Line 305 - arguments for function GetResponseCookies" output="D:\web\safenet_GetResponseCookies.html" FORMAT="HTML">--->
<cfif IsSimpleValue(ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ])>
<cfset LOCAL.ReturnedCookies = {} />
<cfset LOCAL.ReturnedCookies[1] = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
<cfelse>
<cfset LOCAL.ReturnedCookies = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
</cfif>
<!--- Loop over the returned cookies struct. --->
<cfloop
item="LOCAL.CookieIndex"
collection="#LOCAL.ReturnedCookies#">
<!---
As we loop through the cookie struct, get
the cookie string we want to parse.
--->
<cfset LOCAL.CookieString = LOCAL.ReturnedCookies[ LOCAL.CookieIndex ] />
<!---
For each of these cookie strings, we are going to
need to parse out the values. We can treat the
cookie string as a semi-colon delimited list.
--->
<cfloop
index="LOCAL.Index"
from="1"
to="#ListLen( LOCAL.CookieString, ';' )#"
step="1">
<!--- Get the name-value pair. --->
<cfset LOCAL.Pair = ListGetAt(
LOCAL.CookieString,
LOCAL.Index,
";"
) />
<!---
Get the name as the first part of the pair
separated by the equals sign.
--->
<cfset LOCAL.Name = ListFirst( LOCAL.Pair, "=" ) />
<!---
Check to see if we have a value part. Not all
cookies are going to send values of length,
which can throw off ColdFusion.
--->
<cfif (ListLen( LOCAL.Pair, "=" ) GT 1)>
<!--- Grab the rest of the list. --->
<cfset LOCAL.Value = ListRest( LOCAL.Pair, "=" ) />
<cfelse>
<!---
Since ColdFusion did not find more than one
value in the list, just get the empty string
as the value.
--->
<cfset LOCAL.Value = "" />
</cfif>
<!---
Now that we have the name-value data values,
we have to store them in the struct. If we are
looking at the first part of the cookie string,
this is going to be the name of the cookie and
its struct index.
--->
<cfif (LOCAL.Index EQ 1)>
<!---
Create a new struct with this cookie's name
as the key in the return cookie struct.
--->
<cfset LOCAL.Cookies[ LOCAL.Name ] = StructNew() />
<!---
Now that we have the struct in place, let's
get a reference to it so that we can refer
to it in subsequent loops.
--->
<cfset LOCAL.Cookie = LOCAL.Cookies[ LOCAL.Name ] />
<!--- Store the value of this cookie. --->
<cfset LOCAL.Cookie.Value = LOCAL.Value />
<!---
Now, this cookie might have more than just
the first name-value pair. Let's create an
additional attributes struct to hold those
values.
--->
<cfset LOCAL.Cookie.Attributes = StructNew() />
<cfelse>
<!---
For all subsequent calls, just store the
name-value pair into the established
cookie's attributes struct.
--->
<cfset LOCAL.Cookie.Attributes[ LOCAL.Name ] = LOCAL.Value />
</cfif>
</cfloop>
</cfloop>
<!--- Return the cookies. --->
<cfreturn LOCAL.Cookies />
</cffunction>
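For readers who do not use ColdFusion, the same parsing idea can be sketched in Python. This is a minimal sketch rather than a line-by-line translation of the function above; the name parse_set_cookie_headers and the sample header values are hypothetical. Unlike the CFML version, it trims whitespace around names and values.

```python
def parse_set_cookie_headers(headers):
    """Parse Set-Cookie header strings into a dict keyed by cookie name.

    Each entry has the shape {"value": str, "attributes": dict}, which
    mirrors the Value/Attributes structs built by the CFML function.
    """
    cookies = {}
    for header in headers:
        # Treat the header as a semicolon-delimited list of pairs.
        parts = [part.strip() for part in header.split(";") if part.strip()]
        if not parts:
            continue

        # The first pair is the cookie's own name and value. Splitting on
        # the first "=" only keeps values that themselves contain "=".
        name, _, value = parts[0].partition("=")

        # Every later pair is an attribute such as Path or Expires.
        # Flag attributes without "=" (for example HttpOnly) get "".
        attributes = {}
        for part in parts[1:]:
            key, _, attr_value = part.partition("=")
            attributes[key.strip()] = attr_value.strip()

        cookies[name.strip()] = {
            "value": value.strip(),
            "attributes": attributes,
        }
    return cookies


if __name__ == "__main__":
    sample = ["CFID=123; Path=/; HttpOnly", "CFTOKEN=abc=def; Path=/"]
    for cookie_name, cookie in parse_set_cookie_headers(sample).items():
        print(cookie_name, cookie["value"], cookie["attributes"])
```

In practice Python's http.cookiejar module can store and resend cookies without manual parsing; the sketch only shows what the CFML function does with each header.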
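Edit: If you are using wget rather than cfhttp, you could try the approach from the answers to this question, but without posting a username and password, since you don't actually need a login form.
E.g. (the wget commands appear at the end of this page)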
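...although, as others have pointed out, you may be violating the source site's terms of service, so I don't recommend that you actually do this.
I don't think it is protected by ColdFusion. Please provide the command line you used with wget.
It isn't simple, but if I try hard enough I could probably do it with ColdFusion code. Edit - It seems they have set up a process to detect and prevent automated downloads like this.
That doesn't sound CF-specific. If this doesn't violate their TOS, I'd guess it requires making one request to get the cookie values, extracting them, and then submitting a second request with those cookies.
@SOS The problem is that many requests are sent. How can I find out which of them are necessary?
Good old trial and error. It's all part of the fun and adventure of screen scraping. No two sites are the same.
... Yes, but if they are going to this much trouble, the TOS probably doesn't allow screen scraping :-)
I don't use CF code for scraping, and I don't understand CF code. I just want to send raw HTTP requests to scrape. Could you provide the raw wget commands (and any other commands used to compute additional parameters) so that I can check whether they work? Thanks.
Ah - when I read the question I had mentally autocorrected wget. I don't use wget apart from simple download scripts for Docker. If necessary I could translate the CF to Java, C# or PHP, but I don't know how to programmatically extract the headers from one wget response in order to feed them to another wget request. If you are doing this in a shell script, then it would be best to look at the man page for wget to see whether it has built-in options for persisting cookies and, if not, pick another command-line browser that does...
I have edited my answer to add a wget example. It is untested, but it should work for your case because they are using a 302 redirect, which wget can follow. It would not work for a JavaScript redirect; for future reference, if that were the case you would have to use the final destination of the redirect as the last wget URL, and hope they aren't using dynamically generated, time-limited unique URLs :-)
@SevRoberts I think the site does more to prevent a simple wget command with cookies. I think the key is to figure out the minimal set of HTTP requests that must be sent. Are you able to inspect the HTTP requests sent by the ColdFusion code and keep only the minimal set (along with the dependencies between them, since a later request may need information from an earlier one) while still being able to download the PDF file? Only then can the required wget commands be determined (once the HTTP requests are known, working out the wget commands is trivial).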
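The trial-and-error search for the necessary requests that the comments discuss can be made systematic. Below is a minimal Python sketch of one greedy way to do it: drop one captured request at a time and keep the drop only if the download still succeeds. The names minimal_requests and download_succeeds are hypothetical; download_succeeds stands for a callback you would write yourself that replays the given requests in order from a fresh session and reports whether the PDF arrived.

```python
def minimal_requests(requests, download_succeeds):
    """Greedily remove requests that are not needed for the download.

    requests: the captured requests, in the order the browser sent them.
    download_succeeds: callback taking a list of requests, replaying them
        in order, and returning True if the final download worked.
    Returns a sublist from which no single request can be removed
    without breaking the download.
    """
    if not download_succeeds(list(requests)):
        raise ValueError("the full request sequence does not succeed")

    needed = list(requests)
    index = 0
    while index < len(needed):
        # Try the sequence without the request at this position.
        trial = needed[:index] + needed[index + 1:]
        if download_succeeds(trial):
            needed = trial   # the request was unnecessary, drop it
        else:
            index += 1       # the request is required, keep it
    return needed
```

The result is only guaranteed to be minimal in the sense that removing any single remaining request breaks the download; it needs one replay per captured request, so it is best run sparingly and only where the site's terms allow it.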
<cfloop item="strCookie" collection="#cookieStruct#">
<cfhttpparam type="COOKIE" name="#strCookie#" value="#cookieStruct[strCookie].Value#" />
</cfloop>
# Get a session.
wget --save-cookies cookies.txt \
     --keep-session-cookies \
     --delete-after \
     'https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745'
# Now grab the page or pages we care about.
# You may also need to add valid http_referer or http_user_agent headers.
# The URL is quoted so that the shell does not interpret the & character.
wget --load-cookies cookies.txt \
     'https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0'