Python data manager: Wikipedia protection policy


When I run this, I get the error below. I believe it is a Wikipedia protection feature. How do I get around it? I'm basically trying to scrape a wiki page and search it for links in my code. I apologise for my poor code and for any silly mistakes I've made; I'm new to Python and a lot of it has been cut, copied and pasted.

> Traceback (most recent call last):
>   File "C:\Users\MICHAEL\Desktop\Project X\dataprod.py", line 51, in <module>
>     page = urlopen(pg)
>   File "C:\Program Files (x86)\Python36-32\lib\urllib\request.py", line 223, in urlopen
>     return opener.open(url, data, timeout)
>   File "C:\Program Files (x86)\Python36-32\lib\urllib\request.py", line 511, in open
>     req = Request(fullurl, data)
>   File "C:\Program Files (x86)\Python36-32\lib\urllib\request.py", line 329, in __init__
>     self.full_url = url
>   File "C:\Program Files (x86)\Python36-32\lib\urllib\request.py", line 355, in full_url
>     self._parse()
>   File "C:\Program Files (x86)\Python36-32\lib\urllib\request.py", line 384, in _parse
>     raise ValueError("unknown url type: %r" % self.full_url)
> ValueError: unknown url type: '/wiki/Wikipedia:Protection_policy#semi'
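
The ValueError itself is raised while urlopen() is parsing the URL, before any request is made: the link it was given ('/wiki/Wikipedia:Protection_policy#semi') is site-relative and has no scheme or host. As a minimal sketch (not part of the original script), urllib.parse.urljoin can resolve such a link against the page it was scraped from:

from urllib.parse import urljoin

base = 'https://en.wikipedia.org/wiki/Dog'        #page the links were scraped from
href = '/wiki/Wikipedia:Protection_policy#semi'   #relative link from the error message

print(urljoin(base, href))
#https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi
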
The code is as follows:

##DataFile. Access info -> Store Info
import shelve

#Saving data in raw txt format
f = open("data.txt", 'w')
print("...")

from urllib.request import urlopen

###############
#Data Scraping#
###############

#Importing relevant libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer
import httplib2
import warnings
import requests
import contextlib

#Specifying URL(s)

quote_page = 'https://en.wikipedia.org/wiki/Dog'

#
requests.packages.urllib3.disable_warnings()
response = requests.get(quote_page, verify=False)
response.status_code
#
http = httplib2.Http()
status, response = http.request(quote_page)

quotes = []
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        quotes.append(link['href'])
#        print(link['href'])

#for loop
info = []
for pg in quotes:

#querying the page and pulling html format
    page = urlopen(pg)

#store and convert using BeautifulSoup into 'soup'
    soup = BeautifulSoup(page, 'html.parser')

#Find the root <html> element
    name_box = soup.find('html')

#Extract the text content
    name = name_box.text.strip()

#Append the extracted text
    info.append((name))

#Displaying data grabbed
    print("PULLED DATA                                         .")

#Saving data as CSV
import csv
from datetime import datetime

# open a csv file with append, so old data will not be erased
with open("index.csv", 'a', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)

#for loop
    for name in info:
        writer.writerow([name])
f.write(name)
print(f, name)


Exit=input("Press '1' to save and close: ")

if Exit == 1:
    f.close()
    exit()

You need to add a user agent to the request to identify your script as a bot. Change the request to:

response = requests.get(quote_page, verify=False, headers= {'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
Try it like this:

##DataFile. Access info -> Store Info
import shelve

#Saving data in raw txt format
f = open("data.txt", 'w')
print("...")


###############
#Data Scraping#
###############

#Importing relevant libraries
from bs4 import BeautifulSoup
import warnings
import requests
import contextlib

#Specifying URL(s)

quote_page = 'https://en.wikipedia.org/wiki/Dog'

#
requests.packages.urllib3.disable_warnings()
response = requests.get(quote_page , verify=False, headers= {'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
status = response.status_code
#


quotes = []
linkL = BeautifulSoup(response.content, 'html.parser')
for link in linkL.find_all("a"):
    if link.has_attr('href'):
        quotes.append(link['href'])
#        print(link['href'])

#for loop
info = []
for pg in quotes:

#querying the page and pulling html format
    page = requests.get(pg, verify=False, headers={'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})

#store and convert using BeautifulSoup into 'soup'
    soup = BeautifulSoup(page.content, 'html.parser')

#Find the root <html> element
    name_box = soup.find('html')

#Extract the text content
    name = name_box.text.strip()

#Append the extracted text
    info.append((name))

#Displaying data grabbed
    print("PULLED DATA                                         .")

#Saving data as CSV
import csv
from datetime import datetime

# open a csv file with append, so old data will not be erased
with open("index.csv", 'a', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)

#for loop
    for name in info:
        writer.writerow([name])
f.write(name)
print(f, name)


Exit=input("Press '1' to save and close: ")

if Exit == '1':
    f.close()
    exit()
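
A small variation, sketched here only and not part of the answer above: a requests.Session can carry the User-Agent header for every call instead of repeating the headers dict, and urljoin (as sketched earlier) turns the relative '/wiki/...' links into absolute URLs before they are fetched.

#Sketch: one Session carries the User-Agent, urljoin makes scraped links absolute
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

quote_page = 'https://en.wikipedia.org/wiki/Dog'

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})

response = session.get(quote_page)
soup = BeautifulSoup(response.content, 'html.parser')

#Collect every href and resolve it against the page it came from
quotes = [urljoin(quote_page, a['href']) for a in soup.find_all('a', href=True)]

#Fetch the first few links only, as an illustration
for pg in quotes[:5]:
    if pg.startswith('http'):
        page = session.get(pg)
        print(pg, page.status_code)

The startswith('http') check simply skips non-HTTP schemes such as mailto: links, which requests cannot fetch.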

Follow-up comments:

- Python comes with IDLE, which makes interactive work easy. Have you checked the return value of http.request(quote_page)? Did it download the right page, or a warning message?
- Still new to this, so I don't know where to start. I'm getting the error "ValueError: too many values to unpack (expected 2)". Is that on the line status, response = http.request(quote_page), or somewhere else?
- I'm now getting the error "TypeError: get() missing 1 required positional argument: 'url'".
- You need to include the line where that happens.
- My apologies: Traceback (most recent call last): File "C:\Users\MICHAEL\Desktop\Project Y\Test.py", line 25, in <module> response = requests.get(quote_page, verify=False, headers={'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}) TypeError: get() missing 1 required positional argument: 'url'
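
On the "ValueError: too many values to unpack (expected 2)" mentioned above, note that the two HTTP libraries return different things, which is the most likely cause once they are mixed: httplib2's request() returns a (response, content) pair, while requests.get() returns a single Response object, so the old two-name unpacking no longer fits. A minimal sketch of the difference, assuming the same quote_page URL:

#Sketch: httplib2 returns a 2-tuple, requests returns one Response object
import httplib2
import requests

quote_page = 'https://en.wikipedia.org/wiki/Dog'

#httplib2: unpacking into two names is correct here
http = httplib2.Http()
headers, content = http.request(quote_page)
print(headers.status, len(content))

#requests: one object; writing "status, response = requests.get(...)" would raise
#"ValueError: too many values to unpack (expected 2)"
response = requests.get(quote_page)
print(response.status_code, len(response.content))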