Python 2.7 BeautifulSoup: ignore <a></a> tags and get all inner text of <p></p>

python-2.7 web-scraping

I want to get all the text inside every <p> tag that belongs to news1.

import requests
from bs4 import BeautifulSoup
r1  = requests.get("http://www.metalinjection.net/shocking-revelations/machine-heads-robb-flynn-addresses-controversial-photo-from-his-past-in-the-wake-of-charlottesville")
data1 = r1.text
soup1 = BeautifulSoup(data1, "lxml")
news1 = soup1.find_all("div", {"class": "article-detail"})

for x in news1:
    print x.find("p").text

This gets the text of the first <p> only. When I call find_all instead, it gives the following error:

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

So I built a list instead, but I still get the same kind of error:

text1 = []
for x in news1:
    text1.append(x.find_all("p").text)

print text1

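The error I get when running your code is:

AttributeError: 'ResultSet' object has no attribute 'text'

which makes sense, because a bs4 ResultSet is basically a list of Tag elements. If you loop over every "p" tag, you can get the text of each one: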

text1 = []
for x in news1:
    for i in x.find_all("p"):
        text1.append(i.text)

Or as a one-liner, using a list comprehension:

text1 = [i.text for x in news1 for i in x.find_all("p")]
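
To make the ResultSet/Tag distinction concrete, here is a minimal offline sketch. The HTML string and the variable names (html, news, paragraphs) are made up for illustration and stand in for the downloaded page; it also uses the built-in "html.parser" rather than "lxml" so it runs without an extra dependency. Note that get_text() (like .text) returns the text of nested tags too, so the markup of an <a> inside a <p> is dropped while its text is kept.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the example runs offline.
html = """
<div class="article-detail">
  <p>First paragraph with a <a href="#">link</a> inside.</p>
  <p>Second paragraph.</p>
</div>
<div class="sidebar"><p>Not part of the article.</p></div>
"""

# Built-in parser used here; the question itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet, which behaves like a list of Tag objects.
news = soup.find_all("div", {"class": "article-detail"})

# Call find_all() on each Tag, never on the ResultSet itself.
paragraphs = [p.get_text() for div in news for p in div.find_all("p")]

for text in paragraphs:
    print(text)
```

Only the two paragraphs inside the "article-detail" div are collected; the sidebar paragraph is skipped because the outer loop iterates over the matched divs only.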

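OK... I have another problem. I keep getting \n2019, \n203... as parts of the text. I used replace("\n2019", "") to fix them, but sometimes that doesn't work. Is there a solution?

Do you mean "\u2019" and so on? Those are Unicode characters. You can drop them with:

i.text.encode('ascii', 'ignore')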
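
As a rough illustration of the suggestion above, the sketch below shows what encode('ascii', 'ignore') does to a made-up sample string, alongside one possible alternative that maps typographic quotes to plain ASCII instead of deleting them. The sample text and the replacements table are assumptions for the example, not taken from the scraped article. One likely reason a plain replace fails in Python 2.7 is that "\u2019" without the u prefix is a byte string in which \u is not treated as an escape, so the literal has to be written u"\u2019".

```python
# -*- coding: utf-8 -*-

# Made-up sample containing a curly apostrophe and curly double quotes.
text = u"It\u2019s a \u201ctest\u201d"

# Suggested approach: silently drop every non-ASCII character.
dropped = text.encode("ascii", "ignore")

# Alternative sketch: replace common typographic characters with ASCII ones.
# The u prefix matters in Python 2.7; this table is illustrative, not exhaustive.
replacements = {
    u"\u2018": u"'",
    u"\u2019": u"'",
    u"\u201c": u'"',
    u"\u201d": u'"',
}
cleaned = text
for src, dst in replacements.items():
    cleaned = cleaned.replace(src, dst)

print(repr(dropped))
print(cleaned)
```

Dropping characters is the simpler option but loses the apostrophe entirely ("Its" instead of "It's"), so the mapping approach may be preferable when the text is meant to be read.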