Python BeautifulSoup: get all URLs from a site


I am trying to get all the hyperlinks on a site, but the output I get is the HTML below. What am I doing wrong here?

 <html>
<body onload="document.acsForm.submit();">

    <form name="acsForm" action="https://www.searspartsdirect.com/partsdirect/j_acegi_cas_security_check?ssonofail=true" method="post">
        <div style="display: none">

            <textarea rows=10 cols=80 name="logonPassword"></textarea>

            <textarea rows=10 cols=80 name="loginId"></textarea>

            <textarea rows=10 cols=80 name="screenName"></textarea>

            <textarea rows=10 cols=80 name="errorCode"></textarea>

        </div>
      </form>
</body>
 </html>

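Your problem has nothing to do with BeautifulSoup: the source of the index page uses JavaScript to redirect to another URL, so simply downloading the HTML gives you a rather uninteresting page.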

The redirect happens because of:

<body onload="document.acsForm.submit();">

...which submits the following form:

<form name="acsForm"
action="https://www.searspartsdirect.com/partsdirect/j_acegi_cas_security_check?ssonofail=true" 
method="post">

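If you simply try to scrape the page the browser gets redirected to, you get a blank page, so I guess you need to make a POST request to the "action" URL and possibly store the cookie it sets.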

In the POST data, you may need to include values for the following fields:

<textarea rows=10 cols=80 name="logonPassword"></textarea>
<textarea rows=10 cols=80 name="loginId"></textarea>
<textarea rows=10 cols=80 name="screenName"></textarea>
<textarea rows=10 cols=80 name="errorCode"></textarea>

...something like

{'logonPassword': '', 'loginId': '', ...}

serialized and passed as the POST data in the request.
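As a rough sketch of that step, the field names can be read out of the redirect page and form-encoded. This uses only the standard library (html.parser and urllib.parse) so it runs without extra packages; PAGE is a trimmed copy of the HTML from the question, FormFieldCollector is a name made up for this example, and sending empty values is an assumption, since the site may expect real ones.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Trimmed copy of the redirect page quoted in the question.
PAGE = """
<html>
<body onload="document.acsForm.submit();">
  <form name="acsForm" action="https://www.searspartsdirect.com/partsdirect/j_acegi_cas_security_check?ssonofail=true" method="post">
    <textarea rows=10 cols=80 name="logonPassword"></textarea>
    <textarea rows=10 cols=80 name="loginId"></textarea>
    <textarea rows=10 cols=80 name="screenName"></textarea>
    <textarea rows=10 cols=80 name="errorCode"></textarea>
  </form>
</body>
</html>
"""


class FormFieldCollector(HTMLParser):
    """Hypothetical helper: record the form action and textarea names."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "textarea" and "name" in attrs:
            self.field_names.append(attrs["name"])


collector = FormFieldCollector()
collector.feed(PAGE)

# Assumption: empty values, mirroring the empty textareas in the page.
payload = {name: "" for name in collector.field_names}
post_body = urlencode(payload).encode("ascii")

print(collector.action)
print(post_body)
```

The resulting post_body is what would be sent as the body of the POST request to the form's action URL.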

Then, using the cookie, you can make a request to

http://www.searspartsdirect.com/partsdirect/index.action

or similar, and your BeautifulSoup code should work as expected.
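The POST-then-GET flow might look roughly like the sketch below, which uses the standard library's urllib.request with a cookie jar. The helper names build_post_request and fetch_index_html are invented for illustration, the empty field values are an assumption, and whether the site still responds this way has not been checked. Only the request is built here; fetch_index_html needs network access and is not called.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

# URLs taken from the form and the answer; field values are assumed empty.
ACTION_URL = "https://www.searspartsdirect.com/partsdirect/j_acegi_cas_security_check?ssonofail=true"
INDEX_URL = "http://www.searspartsdirect.com/partsdirect/index.action"
FIELDS = {"logonPassword": "", "loginId": "", "screenName": "", "errorCode": ""}


def build_post_request(url, fields):
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode(fields).encode("ascii")
    return Request(url, data=body, method="POST")


def fetch_index_html(action_url=ACTION_URL, index_url=INDEX_URL, fields=FIELDS):
    """POST the form, keep any cookies it sets, then GET the index page."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # The opener stores cookies from this response in the jar...
    opener.open(build_post_request(action_url, fields)).close()
    # ...and sends them back automatically on the next request.
    with opener.open(index_url) as response:
        return response.read().decode("utf-8", errors="replace")


request = build_post_request(ACTION_URL, FIELDS)
print(request.get_method(), request.full_url)
print(request.data)
```

The string returned by fetch_index_html is what you would then hand to BeautifulSoup in place of the redirect page.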

A library such as mechanize might make all of this a bit simpler; the example on its homepage is basically what you want.


What do you expect the output to be? You are using print contents to print the whole content, and the content does not appear to contain any <a> tags, so findAll('a') is empty. Why not try running a regular expression over the page and return all URL-like strings?

@zenopy: If you notice, I am also printing the variable "a" to find the hyperlinks.

Can you please give me a sample script for the site you mentioned? I also get the message mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt when I do br = mechanize.Browser() and then br.open("…").

@Rajeev

br.set_handle_robots(False)

Now I get the following error:

    response1 = br.follow_link()
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.5-py2.6.egg/mechanize/_mechanize.py", line 569, in follow_link
    return self.open(self.click_link(link, **kwds))
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.5-py2.6.egg/mechanize/_mechanize.py", line 553, in click_link
    link = self.find_link(**kwds)
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.5-py2.6.egg/mechanize/_mechanize.py", line 620, in find_link
    raise LinkNotFoundError()
mechanize._mechanize.LinkNotFoundError

@Rajeev Not sure; it is probably worth asking that as a separate question on SO.
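For the regular-expression idea raised in the comments, here is a small sketch run against an abbreviated copy of the page from the question. It uses the standard library only; AnchorCollector and URL_PATTERN are names chosen for this example, and the pattern is a loose approximation of "URL-like", not a full URL grammar. It also shows why a search for <a> tags comes back empty on this page.

```python
import re
from html.parser import HTMLParser

# Abbreviated copy of the page the question's script actually downloaded.
PAGE = '''<html>
<body onload="document.acsForm.submit();">
<form name="acsForm" action="https://www.searspartsdirect.com/partsdirect/j_acegi_cas_security_check?ssonofail=true" method="post">
</form>
</body>
</html>'''


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags, roughly what findAll('a') inspects."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Illustrative pattern: "http://" or "https://" followed by characters that
# are not whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")

anchors = AnchorCollector()
anchors.feed(PAGE)
url_like = URL_PATTERN.findall(PAGE)

print(anchors.hrefs)  # empty: the redirect page has no <a> tags
print(url_like)       # the form's action URL is the only URL-like string
```

On the real index page, once the redirect has been handled, the same two approaches would presumably return many more results.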