Python: encoding Arabic results from a Google search

I wrote this function to get the top 10 results from a Google search:

def google_search(self, query):
    """
    Return the URLs of the top 10 Google search results for a keyword.
    """
    params = {'q': query}
    url = 'https://www.google.com/search?' + urllib.urlencode(params)
    result = urlfetch.fetch(url=url)
    content = result.content
    soup = BeautifulSoup(content)
    # Each result is expected in an <li class="g"> element.
    items = soup.findAll("li", {'class': 'g'})
    urls = []
    for item in items:
        # The first link of a result holds a relative Google redirect URL.
        link = item.findAll('a')[0]
        url = 'https://www.google.com' + link['href']
        urls.append(url.encode('utf-8'))
    return urls

Then I wrote another function that uses the Google search to find related Wikipedia articles:

def wikipedia_search(self, query, language='en'):
    """
    Return a list of URLs and titles of the top Wikipedia search results for a keyword.
    """
    q = query + u' site:%s.wikipedia.org' % language
    urls = self.google_search(q.encode('utf-8'))
    results = []
    for url in urls:
        # The article title is the part of the URL after /wiki/.
        title = re.findall(r'/wiki/(.*)&s', url.encode('utf-8'))[0].replace("_", " ")
        link = re.findall(r'q=(.*)&s', url)[0]
        url_tag = {'url': link, 'title': title}
        results.append(url_tag)
    return results

But when I try searching in Arabic, the results look like this:

{'title': '%25D8%25A8%25D9%25A9%2583%25D9%25D9%2585%25D8%25A9', 'url': ''}, {'title': '%25D8%25A8%25D9%25A8%25AA%25D9%25D8%25D9%2586%25D8%25AF%25D8%25B3%25D9%25D8%25B1', 'url': ''}

I basically can't make any use of this.

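The data is UTF-8 encoded bytes that have been escaped with URL quoting, so you want to unquote and then decode it:

url = urllib.unquote(url).decode('utf8')

Demo:

(The post is quoted directly, since I cannot comment yet.)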
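
For readers on Python 3, where `urllib.unquote` has moved to `urllib.parse.unquote`, a rough equivalent of the one-liner above might look like the sketch below. The sample URL is a made-up illustration, not taken from the question.

```python
from urllib.parse import unquote

# Hypothetical sample: the Arabic word "كتاب" percent-encoded as UTF-8 bytes.
quoted = 'https://ar.wikipedia.org/wiki/%D9%83%D8%AA%D8%A7%D8%A8'

# In Python 3, unquote() decodes the escapes as UTF-8 by default and
# returns a str, so no separate .decode('utf8') step is needed.
decoded = unquote(quoted)
print(decoded)
```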

This works great for the url, but the title, which is just a simple operation on the url, still has the same problem.
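
One likely reason the titles still look escaped: in the output from the question, each escape begins with `%25`, which is itself the percent-encoding of `%`. That suggests the title was percent-encoded twice, so a single unquote only turns `%25D8` into `%D8`. Below is a minimal sketch for Python 3 that unquotes until the string stops changing. The helper name `fully_unquote`, the `max_rounds` parameter, and the sample value are all assumptions made for illustration.

```python
from urllib.parse import unquote


def fully_unquote(value, max_rounds=5):
    """Unquote repeatedly until the value stops changing (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


# Made-up sample: "كتاب" percent-encoded twice, so every '%' became '%25'.
double_quoted = '%25D9%2583%25D8%25AA%25D8%25A7%25D8%25A8'

# Same post-processing as in the question: underscores become spaces.
title = fully_unquote(double_quoted).replace('_', ' ')
print(title)
```

In the Python 2 code from the question, the same idea would presumably be `urllib.unquote(urllib.unquote(title)).decode('utf8')`. Note that unquoting in a loop would over-decode a title that legitimately contains a literal percent escape, so a fixed two rounds may be the safer choice if the double encoding is consistent.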

>>> import urllib 
>>> url='example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> urllib.unquote(url).decode('utf8') 
u'example.com?title=\u043f\u0440\u0430\u0432\u043e\u0432\u0430\u044f+\u0437\u0430\u0449\u0438\u0442\u0430'
>>> print urllib.unquote(url).decode('utf8')
example.com?title=правовая+защита