Python web scraping: get only the main content

python text


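The problem is the last line of def keyInfo: it prints a lot of extra material (tags, headings), and I only want the main content, the text. How can I achieve this?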

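The revised keyInfo at the end of this page does a better job of extracting the content for this specific site. The original code from the question is: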

import numpy as np
import json 
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.npr.org/sections/thetwo-way/2017/03/06/518805720/turkey-germany-relations-at-new-low-after-erdogan-makes-nazi-comparison"

html = urlopen(url)
bsObj = BeautifulSoup(html, 'lxml')


def keyInfo(div):
  print(div.find("h1").get_text())
  print(div.find("span", {"class":"date"}).get_text())
  print(div.find("a", {"rel":"author"}).get_text().strip())
  print(div.findAll("p")) # Problem here

keyInfo(bsObj)

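Approach: after looking at the structure of the content with the Chrome developer tools, I noticed that the story content is in

article>div[id=storytext]

but

div[id=storytext]

also includes some asides and divs with non-article content. Removing those leaves only the paragraphs of the article.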
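The approach can be tried offline on a small inline document. This is only a minimal sketch: SAMPLE_HTML is a hypothetical, simplified stand-in for the page's markup (not the real NPR HTML), extract_story_paragraphs is an illustrative helper name, and the built-in html.parser is used instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page's markup (not the real NPR HTML).
SAMPLE_HTML = """
<html><body>
<article>
  <h1>Sample headline</h1>
  <div id="storytext">
    <p>First paragraph.</p>
    <aside><p>Related story teaser.</p></aside>
    <div class="caption"><p>Photo caption.</p></div>
    <p>Second paragraph.</p>
  </div>
</article>
</body></html>
"""


def extract_story_paragraphs(html):
    """Return the text of the paragraphs inside article > div#storytext."""
    soup = BeautifulSoup(html, "html.parser")
    story = soup.select_one("article > div#storytext")
    if story is None:
        # The page does not follow the assumed structure.
        return []
    # Detach asides and nested divs so only the article's own paragraphs remain.
    for tag in story.find_all(["aside", "div"]):
        tag.extract()
    return [p.get_text(strip=True) for p in story.find_all("p")]


paragraphs = extract_story_paragraphs(SAMPLE_HTML)
print("\n".join(paragraphs))
```

Returning a list of paragraph strings rather than printing inside the function makes it easier to join, filter, or store the text afterwards.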

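Looking for something more general?

If you are looking for something more general, you might want to consider something like Boilerpipe. A Python wrapper for Boilerpipe is available.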
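To give a feel for what a site-independent extractor has to do, here is a much cruder heuristic. It is not Boilerpipe's algorithm and does not use its API: it simply picks the element whose direct p children hold the most text. guess_main_text and SAMPLE are hypothetical names made up for this sketch, and real pages will often defeat so simple a rule.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with navigation, content, and footer blocks.
SAMPLE = """
<html><body>
<nav><p>Home</p></nav>
<div id="content">
<p>This is the first long paragraph of the story.</p>
<p>This is the second long paragraph of the story.</p>
</div>
<footer><p>Copyright notice</p></footer>
</body></html>
"""


def guess_main_text(html):
    """Naive heuristic: return the text of the element whose direct <p>
    children contain the most characters."""
    soup = BeautifulSoup(html, "html.parser")
    best_text = ""
    for container in soup.find_all(True):  # True matches every tag
        paragraphs = container.find_all("p", recursive=False)
        text = "\n".join(p.get_text(" ", strip=True) for p in paragraphs)
        if len(text) > len(best_text):
            best_text = text
    return best_text


main_text = guess_main_text(SAMPLE)
print(main_text)
```

A dedicated library is still the better choice for production use; the sketch only shows why "the block with the most paragraph text" is a common starting point.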

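Please review how to ask a question on Stack Overflow so that your question is well received by the community. Also, make sure you are familiar with how to [...]. Remember that explicit help is provided here for programming issues within the problem you are trying to solve. As it stands, this is too broad, because you have not provided enough information for readers to help you effectively.

After the revision, is this clear enough?

Works great. I only changed the last line:

print(divText.get_text().replace('\n', ""))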
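One caveat with stripping newlines outright: if the extracted text has no space before a line break, the last word of one paragraph ends up glued to the first word of the next. The sketch below contrasts the two behaviours on a made-up string; collapse_whitespace is an illustrative helper, not part of BeautifulSoup.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (newlines included) with one space."""
    return " ".join(text.split())


# Made-up text shaped like get_text() output for two paragraphs.
raw = "First paragraph.\n\nSecond paragraph.\n"

glued = raw.replace("\n", "")     # newlines removed, words run together
clean = collapse_whitespace(raw)  # newlines become single spaces

print(repr(glued))
print(repr(clean))
```

With BeautifulSoup itself, calling get_text(" ", strip=True) is another option: it strips each text fragment and joins the fragments with a space.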

def keyInfo(div):
  print(div.find("h1").get_text())
  # The story body lives in article > div#storytext
  article = div.find("article")
  divText = article.find("div", id="storytext")
  # Detach asides and nested divs, which hold non-article content
  for tag in divText.findAll(["aside", "div"]):
    tag.extract()
  print(divText.get_text())