Python 3.x: BeautifulSoup not fetching data

python-3.x web-scraping


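I am trying to scrape data from the page below. However, the variable soup does not contain any of the fields I need, such as name, nature of business, phone and email. What should I add to the code below to get this data?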

import requests 
import pandas as pd
from bs4 import BeautifulSoup
page = "http://www.pmas.sg/page/members-directory"
pages = requests.get(page)
soup = BeautifulSoup(pages.content, 'html.parser')
print(soup)

The output I get from the code above is:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

<html>
<head>
<title>WebKnight Application Firewall Alert</title>
<meta content="NOINDEX" name="ROBOTS"/>
</head>
<body bgcolor="#ffffff" link="#FF3300" text="#000000" vlink="#FF3300">
<table cellpadding="3" cellspacing="5" width="410">
<tr>
<td align="left">
<font face="Verdana,Arial,Helvetica" size="2">
<font size="3"><b>WebKnight Application Firewall Alert</b></font><br/><br/><br/>
Your request triggered an alert! If you feel that you have received this page in error, please contact the administrator of this web site.
<br/>
<hr/>
<br/><b>What is WebKnight?</b><br/>
AQTRONIX WebKnight is an application firewall for web servers and is released under the GNU General Public License. It is an ISAPI filter for securing web servers by blocking certain requests. If an alert is triggered WebKnight will take over and protect the web server.<br/><br/>
<hr/>
<br/>For more information on WebKnight: <a href="http://www.aqtronix.com/webknight/">http://www.aqtronix.com/WebKnight/</a><br/><br/>
<b><font color="#FF3300">AQTRONIX</font> WebKnight</b></font>
</td>
</tr>
</table>
</body>
</html>


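WebKnight is an application firewall. The server administrator configures rules that are applied to incoming requests and determine whether a request is blocked. In this case the rules include expectations about which User-Agent headers are allowed (and required). While experimenting, I noticed:

"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)", or variants of 5.0, trigger the alert.
"Mozilla/4.0 (Windows NT 10.0; WOW64)", "AppleWebKit/537.36 (KHTML, like Gecko)", "Chrome/79.0.3945.79" and "Safari/537.36" are all fine, so the list on the server may need updating.

Note that indexing is flagged as not wanted, but I could not find any T&C, and there is no robots.txt file governing scraping.

For example, see the shorter of the two snippets below, which only sends a User-Agent header and prints the page.

Comments:

It's all about the headers. You are actually getting response code 999, so all you need is a User-Agent. Please check my answer below.

Thanks for your reply, but the output now comes back as text. How can I extract the name, email, phone, etc. from that text? Sorry, I know very little about Python.

@renu I believe you want it in CSV format. Now you have it.

Thank you very much!! This really helped me a lot.

Glad to help, @renu.

I'll let Ahmed handle this. If I see a different approach to your scrape that is worth noting, I'll edit the answer.
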
import requests
from bs4 import BeautifulSoup
import csv
import regex  # third-party "regex" package; the stdlib re module lacks \K and [[:blank:]]

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0"
}
r = requests.get('http://www.pmas.sg/page/members-directory', headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')

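data = []
# Each 'col-md-4' div is treated as one member entry.
for item in soup.find_all('div', {'class': 'col-md-4'}):
    row = []
    for p in item.find_all('p'):
        # Keep only the text after an optional "Label: " prefix on the first line.
        matches = regex.findall(
            r"^(?:.*?:[[:blank:]]+\K)?.*", p.text, regex.MULTILINE)
        row.append(matches[0])
    # Skip divs that contain no <p> elements.
    if row:
        print(row)
        data.append(row)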


with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Nature of Business',
                     'Address', 'Contact', 'Phone#', 'Fax', 'Website', 'Email'])
    writer.writerows(data)
    print("Done")
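
The pattern in the script above relies on the third-party regex package. As a rough sketch of what it appears to do (drop an optional "Label: " prefix and keep the rest of the first line), the same step can be written with the standard re module. The helper name strip_label and the sample strings are made up for illustration and are not real directory entries.

```python
import re

# Optional "Label:<spaces or tabs>" prefix, then the rest of the first line.
# This mirrors the \K-based pattern above using a capture group instead.
LABELLED_LINE = re.compile(r"(?:.*?:[ \t]+)?(.*)")


def strip_label(text: str) -> str:
    """Return the first line of text without a leading 'Label: ' prefix."""
    # Every part of the pattern is optional, so match() always succeeds.
    return LABELLED_LINE.match(text).group(1)


# Hypothetical sample paragraphs (not taken from the real page).
samples = [
    "Email: someone@example.com",
    "Website: http://www.example.com",
    "A line without any label",
]

for sample in samples:
    print(strip_label(sample))
```

The colon inside "http://" is not followed by a space, so only the "Website: " label is removed and the URL stays intact.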

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
}

r = requests.get('http://www.pmas.sg/page/members-directory', headers=headers)
print(r.text)
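
Since a blocked request still returns an HTML page, it may be worth checking for the firewall alert before parsing. The following is a minimal sketch that only looks at the page title shown in the question's output. The helper names page_title and is_firewall_alert are hypothetical, and the "Members Directory" title in the sample is invented.

```python
import re

# Title of the block page shown in the question's output.
ALERT_TITLE = "WebKnight Application Firewall Alert"


def page_title(html: str) -> str:
    """Return the text of the first <title> element, or "" if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def is_firewall_alert(html: str) -> bool:
    """True if the HTML looks like the WebKnight alert page."""
    return page_title(html) == ALERT_TITLE


# Inline samples so the sketch runs without network access.
blocked = ("<html><head><title>WebKnight Application Firewall Alert</title>"
           "</head><body></body></html>")
allowed = "<html><head><title>Members Directory</title></head></html>"

print(is_firewall_alert(blocked))  # True
print(is_firewall_alert(allowed))  # False
```

With the request above, is_firewall_alert(r.text) could be checked before the text is handed to BeautifulSoup, so that a blocked request is reported instead of silently producing empty results.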