Python: encoding Arabic results from a Google search
I wrote this function to get the top 10 results from a Google search:
def google_search(self, query):
    """
    This function returns the urls of top 10 of google search result for a keyword
    """
    params = {'q': query}
    url = 'https://www.google.com/search?' + urllib.urlencode(params)
    result = urlfetch.fetch(url=url)
    content = result.content
    soup = BeautifulSoup(content)
    list = soup.findAll("li", {'class': 'g'})
    urls = []
    for item in list:
        link = item.findAll('a')[0]
        url = 'https://www.google.com' + link['href']
        urls.append(url.encode('utf-8'))
    return urls
Then I wrote another function that uses the Google search to find related Wikipedia articles:
def wikipedia_search(self, query, language='en'):
    """
    This function returns a list of urls and titles of the top Wikipedia search results for a keyword
    """
    q = query + u' site:%s.wikipedia.org' % language
    urls = self.google_search(q.encode('utf-8'))
    list = []
    for url in urls:
        title = re.findall(r'/wiki/(.*)&s', url.encode('utf-8'))[0].replace("_", " ")
        link = re.findall(r'q=(.*)&s', url)[0]
        url_tag = {'url': link, 'title': title}
        list.append(url_tag)
    return list
But when I try searching in Arabic, the results I get look like this:
{'title':'%25D8%25A8%25D9%25A9%2583%25D9%25D9%2585%25D8%25A9','url':''},{'title':'%25D8%25A8%25D9%25A8%25AA%25D9%25D8%25D9%2586%25D8%25AF%25D8%25B3%25D9%25D8%25B1','url':''}
This is basically unreadable to me.

The data is UTF-8 encoded bytes escaped with URL quoting, so you want to unquote it and then decode it: url = urllib.unquote(url).decode('utf8')

(Post quoted directly, since I can't comment yet.)

This works perfectly for the url, but the title, which is just a simple manipulation of the url, still has the same problem.

Demo:
>>> import urllib
>>> url='example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> urllib.unquote(url).decode('utf8')
u'example.com?title=\u043f\u0440\u0430\u0432\u043e\u0432\u0430\u044f+\u0437\u0430\u0449\u0438\u0442\u0430'
>>> print urllib.unquote(url).decode('utf8')
example.com?title=правовая+защита
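
On the title still coming out escaped: the titles in the question contain `%25`, which is itself an escaped `%`, so the title looks like it was escaped twice and a single unquote would only peel off one layer. Below is a minimal sketch of that idea, not a drop-in fix for the code above. It is written for Python 3, where `urllib.parse.unquote` decodes as UTF-8 by default and takes the place of `urllib.unquote(...).decode('utf8')`. The helper names `unquote_fully` and `title_from_redirect`, and the sample redirect link with its `&sa=U` suffix, are assumptions made up for illustration.

```python
from urllib.parse import parse_qs, quote, unquote, urlsplit


def unquote_fully(text, max_rounds=5):
    """Apply unquote repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text


def title_from_redirect(redirect_url):
    """Return (target_url, title) from a redirect link carrying a q= parameter."""
    # parse_qs splits the query string and undoes one level of escaping.
    target = parse_qs(urlsplit(redirect_url).query)["q"][0]
    # Everything after /wiki/ is the article title, still escaped once.
    raw_title = target.split("/wiki/", 1)[1]
    return target, unquote_fully(raw_title).replace("_", " ")


# Hypothetical redirect link, built here so the double escaping is explicit.
article = "https://ar.wikipedia.org/wiki/" + quote("كلمة_طيبة")
redirect = "https://www.google.com/url?q=" + quote(article, safe=":/") + "&sa=U"

target, title = title_from_redirect(redirect)
print(target)
print(title)
```

One caveat with unquoting in a loop: a title that legitimately contains a literal sequence such as `%41` would be decoded as well, so if the number of escaping layers is known it may be safer to call `unquote` exactly that many times.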