
Crawling Wikipedia with Python's urllib2 and BeautifulSoup


I'm trying to crawl Wikipedia to get some data for text mining. I'm using Python's urllib2 and BeautifulSoup. My question: is there an easy way to strip the unnecessary tags (such as the link 'a' tags or 'span' tags) from the text I read?

For this scenario:

import urllib2
from BeautifulSoup import *
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open("http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes")
pool = BeautifulSoup(infile.read())
res=pool.findAll('div',attrs={'class' : 'mw-content-ltr'}) # to get to content directly
paragrapgs=res[0].findAll("p") #get all paragraphs
the paragraphs I get contain lots of reference tags, e.g.:

paragrapgs[0]=

<p><b>Data mining</b> (the analysis step of the <b>knowledge discovery in databases</b> process,<sup id="cite_ref-Fayyad_0-0" class="reference"><a href="#cite_note-Fayyad-0"><span>[</span>1<span>]</span></a></sup> or KDD), a relatively young and interdisciplinary field of <a href="/wiki/Computer_science" title="Computer science">computer science</a><sup id="cite_ref-acm_1-0" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup><sup id="cite_ref-brittanica_2-0" class="reference"><a href="#cite_note-brittanica-2"><span>[</span>3<span>]</span></a></sup> is the process of discovering new patterns from large <a href="/wiki/Data_set" title="Data set">data sets</a> involving methods at the intersection of <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a>, <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a>, <a href="/wiki/Statistics" title="Statistics">statistics</a> and <a href="/wiki/Database_system" title="Database system">database systems</a>.<sup id="cite_ref-acm_1-1" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup> The goal of data mining is to extract knowledge from a data set in a human-understandable structure<sup id="cite_ref-acm_1-2" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup> and involves database and <a href="/wiki/Data_management" title="Data management">data management</a>, <a href="/wiki/Data_Pre-processing" title="Data Pre-processing">data preprocessing</a>, <a href="/wiki/Statistical_model" title="Statistical model">model</a> and <a href="/wiki/Statistical_inference" title="Statistical inference">inference</a> considerations, interestingness metrics, <a href="/wiki/Computational_complexity_theory" title="Computational complexity theory">complexity</a> considerations, post-processing of found structure, <a href="/wiki/Data_visualization" title="Data visualization">visualization</a> and <a href="/wiki/Online_algorithm" title="Online algorithm">online updating</a>.<sup id="cite_ref-acm_1-3" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup></p>


Do you have any idea how to remove these and get the plain text?
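
(For what it's worth, the stripping can also be done directly on the BeautifulSoup objects from the snippet above. The sketch below assumes that snippet has already run; extract() drops the citation superscripts in place, and findAll(text=True) collects the remaining text nodes.)

# Sketch only: strip <sup class="reference"> nodes, then join the remaining text nodes.
for p in paragrapgs:
    for sup in p.findAll('sup', attrs={'class': 'reference'}):
        sup.extract()                               # remove the citation marker in place
    plain_text = ''.join(p.findAll(text=True))      # text content without any markup
    print plain_text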

Here's how you could do it with lxml (and the lovely requests):

import requests
import lxml.html as lh
from BeautifulSoup import UnicodeDammit

URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
HEADERS = {'User-agent': 'Mozilla/5.0'}

def lhget(*args, **kwargs):
    r = requests.get(*args, **kwargs)
    html = UnicodeDammit(r.content).unicode   # detect the encoding
    tree = lh.fromstring(html)
    return tree

def remove(el):
    el.getparent().remove(el)   # drop an element from its parent in place

tree = lhget(URL, headers=HEADERS)

el = tree.xpath("//div[@class='mw-content-ltr']/p")[0]

for ref in el.xpath("//sup[@class='reference']"):   # citation superscripts
    remove(ref)

print lh.tostring(el, pretty_print=True)

print el.text_content()

Alternatively, you can use api.php instead of index.php:

#!/usr/bin/env python
import sys
import time
import urllib, urllib2
import xml.etree.cElementTree as etree

# prepare request
maxattempts = 5 # how many times to try the request before giving up
maxlag = 5 # seconds http://www.mediawiki.org/wiki/Manual:Maxlag_parameter
params = dict(action="query", format="xml", maxlag=maxlag,
              prop="revisions", rvprop="content", rvsection=0,
              titles="data_mining")
request = urllib2.Request(
    "http://en.wikipedia.org/w/api.php?" + urllib.urlencode(params), 
    headers={"User-Agent": "WikiDownloader/1.2",
             "Referer": "http://stackoverflow.com/q/8044814"})
# make request
for _ in range(maxattempts):
    response = urllib2.urlopen(request)
    if response.headers.get('MediaWiki-API-Error') == 'maxlag':
        t = response.headers.get('Retry-After', 5)
        print "retrying in %s seconds" % (t,)
        time.sleep(float(t))
    else:
        break # ready to read
else: # exhausted all attempts
    sys.exit(1)

# download & parse xml 
tree = etree.parse(response)

# find rev data 
rev_data = tree.findtext('.//rev')
if not rev_data:
    print 'MediaWiki-API-Error:', response.headers.get('MediaWiki-API-Error')
    tree.write(sys.stdout)
    print
    sys.exit(1)

print(rev_data)
Output:

{{Distinguish|analytics|information extraction|data analysis}}

'''Data mining''' (the analysis step of the '''knowledge discovery in databases..

These seem to work on Beautiful Soup tag nodes. parentNode is modified in place, so the matching tags are removed; the tags that were found are also returned to the caller as a list.

@staticmethod
def seperateCommentTags(parentNode):
    commentTags = []
    for descendant in parentNode.descendants:
        if isinstance(descendant, element.Comment):
            commentTags.append(descendant)
    for commentTag in commentTags:
        commentTag.extract()
    return commentTags

@staticmethod
def seperateScriptTags(parentNode):
    scripttags = parentNode.find_all('script')
    scripts = []
    for scripttag in scripttags:
        script = scripttag.extract()
        if script is not None:
            scripts.append(script)
    return scripts
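
A hypothetical usage sketch, assuming these staticmethods sit on a helper class (called TagSeparator here, a made-up name) and that bs4 is installed; element.Comment comes from "from bs4 import element":

from bs4 import BeautifulSoup

html = open('page.html').read()                       # any saved HTML document
soup = BeautifulSoup(html, 'html.parser')

comments = TagSeparator.seperateCommentTags(soup)     # strips <!-- ... --> comment nodes
scripts = TagSeparator.seperateScriptTags(soup)       # strips <script> tags
print len(comments), len(scripts)
print soup.get_text()                                 # remaining text, scripts/comments gone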

Thanks for your answer. Do you know how to remove all the tags that come after the references using xpath's remove function? Basically, after grabbing the whole content with el=tree.xpath("//div[@class='mw-content-ltr']"), how do I remove the remaining tags that follow a given tag? -- Updated to remove the references.
requests and BeautifulSoup are completely unnecessary here; lxml.html.parse() accepts a URL.
requests is used to set the User-Agent string, as in the OP's snippet. BeautifulSoup is used to detect the document's encoding, since it isn't specified in the document's metadata, so lxml wouldn't know what to do with it.
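
For illustration, a minimal sketch of that suggestion (letting lxml fetch the page itself; note it skips the custom User-Agent header and the UnicodeDammit-based encoding detection used in the answer above):

import lxml.html as lh

# Sketch only: lxml.html.parse() accepts a URL directly.
doc = lh.parse("http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes")
root = doc.getroot()
for ref in root.xpath("//sup[@class='reference']"):
    ref.getparent().remove(ref)
print root.xpath("//div[@class='mw-content-ltr']/p")[0].text_content()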