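How to remove an unwanted <div> when extracting text using Python

I am dynamically extracting text from web pages, but I also get the text from the div at the bottom of the page, such as: ASP.NET, jQuery, AJAX, JSP, Servlets, Log4J, iBATIS, Hibernate, JDBC, Struts, HTML5, SQL, MySQL, C++, UNIX. Copyright © 2014 by tutorialspoint. All Rights Reserved.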

python


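I am extracting the relevant text I need. The problem is that it also extracts the text located at the bottom of the page, as described above. I tried using an array so that I could skip all of that. I managed to get rid of the other links, but failed here. My code: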

import urllib
from bs4 import BeautifulSoup

url = "http://www.tutorialspoint.com/cplusplus/index.htm"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style","a","<div id="bottom" ">): # "<div id="bottom" is it correct?? or whats the correct way?
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print text

I'm not sure I understood your question correctly, but if the problem is this line:
for script in soup(["script", "style","a","<div id="bottom" ">):

then I think this should help you:
text = soup.select('.content')[0].get_text()
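
To answer the inline question in the loop directly: calling `soup([...])` is shorthand for `find_all`, which matches tag names, so a string such as `<div id="bottom">` will not match anything. The following is a minimal sketch, assuming the footer really is a `div` whose `id` is `bottom` (that name comes from the question and has not been checked against the live page). `SAMPLE_HTML` and `strip_unwanted` are hypothetical names, and the inline HTML only stands in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><head><style>p {color: red}</style></head>
<body>
  <div class="content"><p>C++ is a middle-level language.</p></div>
  <div id="bottom"><a href="/about">About</a> Copyright 2014</div>
  <script>var x = 1;</script>
</body></html>
"""


def strip_unwanted(html):
    soup = BeautifulSoup(html, "html.parser")
    # Tag names go in the list: this removes every <script>, <style> and <a>.
    for tag in soup(["script", "style", "a"]):
        tag.extract()
    # An element with a specific id is matched by attribute, not by a tag string.
    # The id "bottom" is an assumption taken from the question.
    for div in soup.find_all("div", id="bottom"):
        div.extract()
    # Same whitespace clean-up idea as in the question: drop blank lines.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


cleaned_text = strip_unwanted(SAMPLE_HTML)
print(cleaned_text)
```

Removing the footer and keeping everything else is the opposite approach to `soup.select('.content')`, which keeps only one container; which one fits better depends on how the target page is laid out.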

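Thanks. If you visit the site at the link I posted, you will easily understand: I only want to get the relevant text. I only want to extract the reading material.

Thanks, this works, but will it also work for other pages? Because when I put in the "cplusplus.com/doc/tutorial/program_structure/" link, it gave me this error: "Traceback (most recent call last): File "C:\Users\DELL\Desktop\python\s\fyp\data extraction.py", line 13, text = soup.select('.content')[0].get_text() IndexError: list index out of range" - giantmalik

I just edited my answer. The solution has to be tuned to the HTML of each specific page. For pages on cplusplus.com the selector should be different: .select('.C_doc'). If you are dealing with a limited set of sites, that is fine. If you want a general solution that finds the relevant content in any markup, then that is a fairly complex task. Perhaps this article will help you: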
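
The `IndexError` in the comments arises because `select()` returns an empty list when the selector matches nothing, so `[0]` fails. For a limited set of sites, one option is to try a list of known selectors in turn. This is only a sketch: `CANDIDATE_SELECTORS` and `extract_main_text` are hypothetical names, the two selectors are the ones mentioned in this thread, and the inline pages are made-up stand-ins for the real sites.

```python
from bs4 import BeautifulSoup

# Selectors mentioned in this thread; any other site would need its own entry.
CANDIDATE_SELECTORS = [".content", ".C_doc"]


def extract_main_text(html, selectors=CANDIDATE_SELECTORS):
    """Return the text of the first selector that matches, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        node = soup.select_one(selector)  # None when nothing matches
        if node is not None:
            return node.get_text(" ", strip=True)
    return None


# Hypothetical pages standing in for the two sites and an unknown one.
page_a = '<div class="content"><p>Tutorial text</p></div><div id="bottom">footer</div>'
page_b = '<div class="C_doc"><p>Structure of a program</p></div>'
page_c = '<div class="other">nothing known</div>'

print(extract_main_text(page_a))
print(extract_main_text(page_b))
print(extract_main_text(page_c))
```

Returning `None` for an unknown layout lets the caller decide whether to skip the page or fall back to the whole-document text, instead of crashing on an empty result.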