Python: Beautiful Soup on Wikipedia


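I'm having trouble getting this script to pull information from a series of Wikipedia articles.

What I'm trying to do is iterate over a series of wiki URLs and pull out the page links listed on each wiki portal category page.

I know that every wiki page I'm going through has a page links section.
However, when I try to iterate over them, I get the following error message: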

Traceback (most recent call last):
  File "./wiki_parent.py", line 37, in <module>
    cleaned = pages.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

The categories are stored in the port_ID dictionary, which looks like this:

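{1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3: 'Category:Academic_disciplines', 4: 'Category:Activism', 5: 'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics', 8: 'Category:Aerospace_engineering', 9: 'Category:Aesthetics', 10: 'Category:Agnosticism', 11: 'Category:Agriculture', ...}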

The desired output is:

parent_num, page_ID, page_num

I realize the code is a bit rough, but I'm just trying to get it working:

#!/usr/bin/env python
import os,re,nltk
from bs4 import BeautifulSoup
from urllib import urlopen
url = "https://en.wikipedia.org/wiki/"+'Category:Furniture'

rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'

reg = re.compile('[\w]+:[\w]+')
number=1
port_ID = {}
for root,dirs,files in os.walk(rootdir):
    for file in files:
        if reg.match(file):
            port_ID[number]=file
            number+=1


test_file = open('test_file.csv', 'w')

for key, value in port_ID.iteritems():

    url = "https://en.wikipedia.org/wiki/"+str(value)
    raw = urlopen(url).read()
    soup=BeautifulSoup(raw)
    pages = soup.find("div" , { "id" : "mw-pages" })
    cleaned = pages.get_text()
    cleaned = cleaned.encode('utf-8')
    pages = cleaned.split('\n')
    pages = pages[4:-2]
    test = test = port_ID.items()[0]

    page_ID = 1
    for item in pages:
        test_file.write('%s %s %s\n' % (test[0],item,page_ID))
        page_ID+=1
    page_ID = 1

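You are scraping several pages in a loop, but some of those pages may not have a <div id="mw-pages"> tag. For such a page soup.find() returns None, so you get an AttributeError at the line: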

cleaned = pages.get_text()

You can guard against this with an if check, such as:

if pages is not None:
    cleaned = pages.get_text()
    # do stuff

Alternatively, you can use a try/except block to handle it:

try:
    cleaned = pages.get_text()
    # do stuff
except AttributeError:
    # pages is None: this page has no mw-pages div
    pass  # e.g. log the URL and move on to the next one
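
Putting the if check into a loop, a minimal Python 3 sketch might look like the following. It parses two made-up HTML snippets rather than fetching live pages, so the sample markup, the SAMPLE_PAGES dictionary, and the extract_page_titles helper are all assumptions for illustration and are not taken from the question's script or from Wikipedia's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample documents standing in for downloaded category pages.
SAMPLE_PAGES = {
    "Category:With_pages": (
        '<html><body><div id="mw-pages"><ul>'
        '<li><a href="/wiki/Chair">Chair</a></li>'
        '<li><a href="/wiki/Table">Table</a></li>'
        '</ul></div></body></html>'
    ),
    "Category:Without_pages": "<html><body><p>No page list here.</p></body></html>",
}


def extract_page_titles(html):
    """Return the link texts inside <div id="mw-pages">, or [] if the div is absent."""
    soup = BeautifulSoup(html, "html.parser")
    pages = soup.find("div", {"id": "mw-pages"})
    if pages is None:  # find() returns None when nothing matches
        return []
    return [link.get_text() for link in pages.find_all("a")]


results = {name: extract_page_titles(html) for name, html in SAMPLE_PAGES.items()}
for name, titles in results.items():
    print(name, titles)
```

Returning an empty list for a page without the div keeps the loop running, so one category with no page list does not abort the whole scrape.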

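Comments:

So, in your code, pages is bound to None. You might want to double-check how you're using soup.find().

Sorry, I'm a very novice programmer, but what does it mean for pages to be bound to None in this case? Is there an easy way to fix it?

@jdv12 Can you give a better example of what you are scraping and what the desired output should look like?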
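
On the question of what "bound to None" means: a name ends up holding None when the lookup that produced it found nothing, and None has no get_text method. The library-free sketch below only mimics that find-or-None behaviour; FakeTag and fake_find are hypothetical stand-ins, not part of Beautiful Soup.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag with a get_text() method."""

    def __init__(self, text):
        self.text = text

    def get_text(self):
        return self.text


def fake_find(tags, wanted_id):
    """Return the tag stored under wanted_id, or None if there is none."""
    return tags.get(wanted_id)


tags = {"mw-pages": FakeTag("Chair\nTable")}

found = fake_find(tags, "mw-pages")            # a FakeTag object
missing = fake_find(tags, "mw-subcategories")  # None: nothing matched

# Calling a method on None raises the error from the traceback.
try:
    missing.get_text()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Checking for None first skips the missing result instead of crashing.
texts = [tag.get_text() for tag in (found, missing) if tag is not None]
```

The simplest fix is therefore to test the result against None before calling any method on it.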
