Getting all visible text from a web page with Beautiful Soup and Python

Edit

It works very well.

I used this solution, found elsewhere on Stack Overflow, to get the text of a web page with Beautiful Soup:

import requests
import urllib3
from bs4 import BeautifulSoup

# error handling: silence the warning triggered by verify=False below
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# settings
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "http://imfuna.com"

# fetch the page (certificate verification is disabled)
response = requests.get(url, headers=headers, verify=False)
soup = BeautifulSoup(response.text, "lxml")

# drop script and style elements so their contents are not treated as text
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()

# strip each line, split on double spaces, and drop empty chunks
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

front_text_count = len(text.split(" "))
print(front_text_count)
print(text)

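For most sites this works fine, but for the example URL above (imfuna.com) it retrieves only 6 words, even though the page contains far more (for example "Digital inspections for residential or commercial property surveyors").

Those example words are missing from the text this code outputs, yet in the page source they sit inside p/h1 tags, and I don't understand why the code isn't extracting them.

Can anyone suggest a way to simply read the plain text of a web page and collect all of it correctly?

Thanks.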

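This site seems to load its content dynamically. If you want to see what a browser would see, run mechanize. Otherwise, yes, those are the only words present on the page.

Comments:

I did everything just as in your script, except that I used html.parser instead of lxml, and I got all the text on the page just fine.

A) mechanize made no difference for me, still only 6 words, but B) …randomdude999, can you post your code?

@randomdude999 Can you post your html.parser code? If it works I'll mark it as correct :)

@the_t_test_1 It's the same as yours, but with "lxml" replaced by "html.parser". You should also try it yourself, though, to confirm that this is an lxml problem.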
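
One way to tell the two explanations in this thread apart (content loaded dynamically versus a parser problem) is to check whether the missing phrase appears in the raw HTML at all, and then compare what each parser extracts from the same markup. The following is only a minimal sketch, assuming bs4 is installed: `visible_text` and `diagnose` are hypothetical helper names, not part of Beautiful Soup, and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def visible_text(html, parser):
    """Return the page text with <script> and <style> contents removed."""
    soup = BeautifulSoup(html, parser)
    for tag in soup(["script", "style"]):
        tag.extract()
    # Split on any whitespace and re-join, so the result is one clean line.
    return " ".join(soup.get_text(separator=" ").split())


def diagnose(html, phrase, parsers=("html.parser", "lxml")):
    """Report, per parser, the word count and whether `phrase` was extracted."""
    report = {"in_raw_html": phrase in html}
    for parser in parsers:
        try:
            text = visible_text(html, parser)
        except FeatureNotFound:
            # The parser library (for example lxml) is not installed.
            report[parser] = None
            continue
        report[parser] = (len(text.split()), phrase in text)
    return report


if __name__ == "__main__":
    # Hypothetical sample markup; in practice pass response.text from requests.
    sample = (
        "<html><head><style>p {color: red}</style></head>"
        "<body><h1>Digital inspections</h1>"
        "<script>var x = 1;</script><p>for property surveyors</p></body></html>"
    )
    print(diagnose(sample, "Digital inspections"))
```

If `in_raw_html` is false, the phrase never arrived in the response, so no parser can recover it. If it is true but only one parser finds it, the difference lies in how the parsers handle that page's markup.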