Python 3.x: Problem scraping Wikipedia images with Python

python-3.x regex web-scraping

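I wrote a Python program that scrapes the first image link for a query on Wikipedia,
similar to the following image:

[image]

My Python program requires the following libraries:

requests
bs4
html
re

When I run the code and pass it an argument, it returns one of the errors I defined ('Image-Not-Found'). Please help me solve this problem. My Python program's source code: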

import argparse
import requests
import bs4
import re
import html

# Create the parser
my_parser = argparse.ArgumentParser(description='Wikipedia Image Grabber')

# Add the arguments
my_parser.add_argument('Phrase',
                       metavar='Phrase',
                       type=str,
                       help='Phrase to Search')

# Execute the parse_args() method
args = my_parser.parse_args()
Phrase = args._get_kwargs()[0][1]
if '.' in Phrase or '-' in Phrase:
    if '.' in Phrase and '-' in Phrase:
        Phrase = str(Phrase).replace('-',' ')
    elif '-' in Phrase and not '.' in Phrase:
        Phrase = str(Phrase).replace('-',' ')

    Phrase = html.escape(Phrase)
request = requests.get('https://fa.wikipedia.org/wiki/Special:Search?search=%s&go=Go&ns0=1' % Phrase).text
parser = bs4.BeautifulSoup(request, 'html.parser')
none_search_finder = parser.find_all('p', attrs = {'class':'mw-search-nonefound'})
if len(none_search_finder)==1:
    print('No-Result')
    exit()
else:
    search_results = parser.find_all('div' , attrs = {'class':'mw-search-result-heading'})
    if len(search_results)==0:
        search_result = parser.find_all('h1', attrs = {'id':'firstHeading'})
        if len(search_result)!=0:
            
            link = 'https://fa.wikipedia.org/wiki/'+str(Phrase)

        else:
            print('Result-Error')
            exit()
    else:

        selected_result = search_results[0]
        regex_exp = r".*<a href=\"(.*)\" title="
        regex_get_uri = re.findall(regex_exp, str(selected_result))
        regex_result = str(regex_get_uri[0])
        link = 'https://fa.wikipedia.org'+regex_result
    
    #---------------
    second_request = requests.get(link)
    second_request_source = second_request.text
    second_request_parser = bs4.BeautifulSoup(second_request_source, 'html.parser')
    image_finder = second_request_parser.find_all('a', attrs = {'class':'image'})
    if len(image_finder) == 0:
        print('No-Image')
        exit()
    else:
        image_finder_e = image_finder[0]
        second_regex = r".*src=\"(.*)\".*decoding=\"async\""
        regex_finder = re.findall(second_regex, str(image_finder_e))
        if len(regex_finder)!=0:
            regexed_uri = str(regex_finder[0])
            img_link = regexed_uri.replace('//','https://')
            print(img_link)
        else:
            print("Image-Not-Found")

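You can do this without regex. The reason your code does not work is that decoding="async" is not in the same position in the browser and in the response, so the pattern, which expects decoding="async" to come after src, never matches.
Here is a solution without regex: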
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Google'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# First <a class="image"> on the page; it wraps the <img> tag we want
imglinks = soup.find_all('a', attrs={'class': 'image'})[0]
for img in imglinks.find_all('img'):
    # src is protocol-relative (starts with //), so add the scheme
    print(img['src'].replace('//', 'https://'))

Output:
https://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Google_2015_logo.svg/196px-Google_2015_logo.svg.png
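
To make the attribute-order problem concrete, here is a small sketch that runs without network access. It applies the pattern from the question to two serializations of the same tag, then reads src through a parser instead. The two tag strings and the example.org URL are made up for illustration, and the standard-library HTMLParser stands in for the BeautifulSoup lookup above only so the snippet has no third-party dependencies. The helper names (FirstImgSrc, first_img_src, to_https) are hypothetical.

```python
import re
from html.parser import HTMLParser

# Pattern from the question: it only matches when decoding="async"
# appears somewhere AFTER the src attribute in the tag text.
PATTERN = r".*src=\"(.*)\".*decoding=\"async\""

# Two made-up serializations of the same <img> tag (hypothetical URL).
tag_src_first = '<img alt="logo" src="//example.org/logo.png" decoding="async" width="100"/>'
tag_decoding_first = '<img alt="logo" decoding="async" src="//example.org/logo.png" width="100"/>'


class FirstImgSrc(HTMLParser):
    """Records the src attribute of the first <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so order does not matter
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_src(html_text):
    parser = FirstImgSrc()
    parser.feed(html_text)
    return parser.src


def to_https(src):
    # Prefix protocol-relative URLs (//host/path) with the scheme.
    return "https:" + src if src.startswith("//") else src


for tag in (tag_src_first, tag_decoding_first):
    print(re.findall(PATTERN, tag), to_https(first_img_src(tag)))
```

The regex finds the URL only in the first string, while the attribute lookup returns it for both, which is why reading img['src'] from the parsed tag is the more robust choice here.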

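
A side note on the question's code, separate from the reported error: html.escape produces HTML entities, not URL encoding, so a phrase containing & or a space would end up in the search URL unencoded or mangled. A minimal sketch of building the query string with the standard library follows. The function name build_search_url and the sample phrase are hypothetical. Passing the same dictionary to requests.get(url, params=...) should give an equivalent result.

```python
import html
from urllib.parse import urlencode

SEARCH_URL = "https://fa.wikipedia.org/wiki/Special:Search"


def build_search_url(phrase):
    """Return the search URL with the phrase encoded for a query string."""
    query = urlencode({"search": phrase, "go": "Go", "ns0": "1"})
    return f"{SEARCH_URL}?{query}"


phrase = "Tom & Jerry"
# html.escape targets HTML text: the ampersand becomes an entity,
# which would still split the query string at "&".
print(html.escape(phrase))
# urlencode percent-encodes the ampersand and encodes spaces as "+".
print(build_search_url(phrase))
```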