Beautiful Soup does not scrape all visible website data (Python 3)

python python-3.x

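I am trying to download all the visible text from a number of different websites into a .txt file. Unfortunately, I am not getting all of the text these sites contain. A working example of my code is below: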

import requests
from bs4 import BeautifulSoup
from collections import Counter


urls = ['https://www304.americanexpress.com/credit-card/compare']

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content, 'html.parser')
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)

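If you test this code, you only get a fraction of the text on the page.

How exactly do I get the rest of the visible data on this page? From my research, I am fairly sure it has to do with my soup.findAll('p') argument, but I don't know what I should add to get the rest of the data.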

Instead of searching for paragraphs, get .text from the body:

print(soup.body.text, file=outfile)
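To illustrate why the paragraph search misses text, here is a minimal sketch on a made-up HTML snippet (SAMPLE_HTML is a hypothetical stand-in, not the actual American Express page). Only one of its three text fragments sits inside a p tag, so a search restricted to p returns just that one, while walking the whole body returns all three:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: only one of the three text fragments is in a <p>.
SAMPLE_HTML = """
<html><body>
<h1>Compare Cards</h1>
<p>Intro paragraph.</p>
<div><span>Annual fee: $0</span></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text found only inside <p> tags, as in the question's code.
paragraph_text = [p.get_text() for p in soup.find_all("p")]

# Every non-empty text fragment under <body>, with surrounding whitespace stripped.
body_text = list(soup.body.stripped_strings)

print(paragraph_text)
print(body_text)
```

Using stripped_strings here rather than .text is just a convenience for the illustration: it drops the whitespace-only fragments between tags.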

If you want to keep the contents of script tags out of the result, you can find all tags at the top level (see recursive=False) and join their text:

print(''.join([element.text for element in soup.body.find_all(lambda tag: tag.name != 'script', recursive=False)]))
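Note that a filter applied with recursive=False only inspects the direct children of body, so a script nested deeper in the tree is not filtered out by it. One possible alternative, sketched here under the assumption that script and style tags can simply be discarded, is to remove them from the tree before extracting text. The helper name text_without_scripts and the sample markup are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with one nested and one top-level <script>.
SAMPLE_HTML = """
<html><body>
<div><p>Visible text.</p><script>var NAV = {};</script></div>
<style>p { color: red; }</style>
<script>console.log("top level");</script>
</body></html>
"""


def text_without_scripts(html):
    """Return the body text after dropping every <script> and <style> tag."""
    soup = BeautifulSoup(html, "html.parser")
    # decompose() removes the tag and its contents from the tree,
    # regardless of how deeply it is nested.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # strip=True trims each fragment and skips the whitespace-only ones.
    return soup.body.get_text(separator="\n", strip=True)


cleaned = text_without_scripts(SAMPLE_HTML)
print(cleaned)
```

This modifies the parsed tree in place, which is fine when the soup object is only used for text extraction.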

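Hi Alecx, I thought of that, but it gives me all the data on the page, and a lot of it is useless (e.g. if(NAV==null||typeof(NAV)=="undefined"){var NAV=new Object()}NAV.RWD={body:document.getElementsByTagName). Is there a compromise between these two methods?

@user3682157 OK, right, but you cannot easily and reliably tell whether an element is "visible" or not using BeautifulSoup. You can at least skip the script tags. Alternatively, you can switch to selenium, which really knows what is visible and what is not.

@user3682157 I have updated the answer to include skipping the contents of script tags.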
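As a middle ground between "paragraphs only" and "everything in the body", a common heuristic is to keep every text node except comments and those whose parent tag is never rendered. The sketch below is only that, a heuristic: the names HIDDEN_PARENTS, looks_visible and visible_text are made up for illustration, and the list of tags is an assumption. As the comments above point out, BeautifulSoup does not evaluate CSS or JavaScript, so text hidden with display:none still comes through.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumption: text directly under these tags is never rendered as page content.
HIDDEN_PARENTS = {"script", "style", "head", "title", "meta", "noscript", "[document]"}


def looks_visible(node):
    """Heuristic only: reject HTML comments and text under non-rendered tags."""
    if isinstance(node, Comment):
        return False
    return node.parent.name not in HIDDEN_PARENTS


def visible_text(html):
    """Return the stripped, non-empty text fragments that pass the heuristic."""
    soup = BeautifulSoup(html, "html.parser")
    pieces = (s.strip() for s in soup.find_all(string=True) if looks_visible(s))
    return [piece for piece in pieces if piece]


# Hypothetical sample markup, including one paragraph hidden via inline CSS.
SAMPLE_HTML = """<html><head><title>Cards</title><style>p{}</style></head>
<body><!-- hidden note --><p>Shown</p><script>var x = 1;</script>
<p style="display:none">Hidden by CSS</p></body></html>"""

result = visible_text(SAMPLE_HTML)
print(result)
```

The "Hidden by CSS" paragraph is still returned, which is exactly the limitation that makes a real browser driven by selenium the more reliable option when true visibility matters.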