Python: replacing a loop with NumPy's dot product
I am trying to replace a loop-based dot product with something faster, such as NumPy. I have researched dot products and understand them well enough to work with toy data in a few ways, but not 100% when it comes to real use with a data frame. I have looked at other threads on this with no luck.
I am hoping to do something like this, which works on toy numbers in np arrays:
import numpy as np
u1 = np.array([1, 2, 3])
u2 = np.array([2, 3, 4])
u1.dot(u2)
20
This is the current, working get_dot_product.
I would like to do this without the for loop:
def get_dot_product(self, courseid1, courseid2, unit_vectors):
    u1 = unit_vectors[courseid1]
    u2 = unit_vectors[courseid2]
    dot_product = 0.0
    for dimension in u1:
        if dimension in u2:
            dot_product += u1[dimension] * u2[dimension]
    return dot_product
**Code**
#!/usr/bin/env python
# coding: utf-8


class SearchRecommendationSystem:

    def __init__(self):
        pass

    def get_bag_of_words(self, titles_lines):
        bag_of_words = {}
        for index, row in titles_lines.iterrows():
            courseid, course_bag_of_words = self.get_course_bag_of_words(row)
            for word in course_bag_of_words:
                word = str(word).strip()  # added
                if word not in bag_of_words:
                    bag_of_words[word] = course_bag_of_words[word]
                else:
                    bag_of_words[word] += course_bag_of_words[word]
        return bag_of_words

    def get_course_bag_of_words(self, line):
        course_bag_of_words = {}
        courseid = line['courseid']
        title = line['title'].lower()
        description = line['description'].lower()
        wordlist = title.split() + description.split()
        if len(wordlist) >= 10:
            for word in wordlist:
                word = str(word).strip()  # added
                if word not in course_bag_of_words:
                    course_bag_of_words[word] = 1
                else:
                    course_bag_of_words[word] += 1
        return courseid, course_bag_of_words

    def get_sorted_results(self, d):
        kv_list = d.items()
        vk_list = []
        for kv in kv_list:
            k, v = kv
            vk = v, k
            vk_list.append(vk)
        vk_list.sort()
        vk_list.reverse()
        k_list = []
        for vk in vk_list[:10]:
            v, k = vk
            k_list.append(k)
        return k_list

    def get_keywords(self, titles_lines, bag_of_words):
        n = sum(bag_of_words.values())
        keywords = {}
        for index, row in titles_lines.iterrows():
            courseid, course_bag_of_words = self.get_course_bag_of_words(row)
            term_importance = {}
            for word in course_bag_of_words:
                word = str(word).strip()  # extra
                tf_course = (float(course_bag_of_words[word]) / sum(course_bag_of_words.values()))
                tf_overall = float(bag_of_words[word]) / n
                term_importance[word] = tf_course / tf_overall
            keywords[str(courseid)] = self.get_sorted_results(term_importance)
        return keywords

    def get_inverted_index(self, keywords):
        inverted_index = {}
        for courseid in keywords:
            for keyword in keywords[courseid]:
                # Normalise the key before the membership test so that the
                # test and the insertion use the same string.
                keyword = str(keyword).strip()  # added
                if keyword not in inverted_index:
                    inverted_index[keyword] = []
                inverted_index[keyword].append(courseid)
        return inverted_index

    def get_search_results(self, query_terms, keywords, inverted_index):
        search_results = {}
        for term in query_terms:
            term = str(term).strip()
            if term in inverted_index:
                for courseid in inverted_index[term]:
                    if courseid not in search_results:
                        search_results[courseid] = 0.0
                    search_results[courseid] += (
                        1 / float(keywords[courseid].index(term) + 1) *
                        1 / float(query_terms.index(term) + 1)
                    )
        sorted_results = self.get_sorted_results(search_results)
        return sorted_results

    def get_titles(self, titles_lines):
        titles = {}
        for index, row in titles_lines.iterrows():
            titles[row['courseid']] = row['title'][:60]
        return titles

    def get_unit_vectors(self, keywords, categories_lines):
        norm = 1.884
        cat = {}
        subcat = {}
        for line in categories_lines[1:]:
            courseid_, category, subcategory = line.split('\t')
            cat[courseid_] = category.strip()
            subcat[courseid_] = subcategory.strip()
        unit_vectors = {}
        for courseid in keywords:
            u = {}
            if courseid in cat:
                u[cat[courseid]] = 1 / norm
                u[subcat[courseid]] = 1 / norm
            for keyword in keywords[courseid]:
                u[keyword] = (1 / float(keywords[courseid].index(keyword) + 1) / norm)
            unit_vectors[courseid] = u
        return unit_vectors

    def get_dot_product(self, courseid1, courseid2, unit_vectors):
        u1 = unit_vectors[courseid1]
        u2 = unit_vectors[courseid2]
        dot_product = 0.0
        for dimension in u1:
            if dimension in u2:
                dot_product += u1[dimension] * u2[dimension]
        return dot_product

    def get_recommendation_results(self, seed_courseid, keywords, inverted_index, unit_vectors):
        courseids = []
        seed_courseid = str(seed_courseid).strip()
        for keyword in keywords[seed_courseid]:
            for courseid in inverted_index[keyword]:
                if courseid not in courseids and courseid != seed_courseid:
                    courseids.append(courseid)
        dot_products = {}
        for courseid in courseids:
            dot_products[courseid] = self.get_dot_product(seed_courseid, courseid, unit_vectors)
        sorted_results = self.get_sorted_results(dot_products)
        return sorted_results

    def Final(self):
        print("Reading Title file.......")
        # NOTE: get_bag_of_words, get_keywords and get_titles call .iterrows(),
        # which needs a pandas DataFrame; readlines() returns a plain list of strings.
        titles_lines = open('s2-titles.txt', encoding="utf8").readlines()
        print("Reading Category file.......")
        categories_lines = open('s2-categories.tsv', encoding="utf8").readlines()
        print("Getting Supported Functions Data")
        bag_of_words = self.get_bag_of_words(titles_lines)
        keywords = self.get_keywords(titles_lines, bag_of_words)
        inverted_index = self.get_inverted_index(keywords)
        titles = self.get_titles(titles_lines)
        print("Getting Unit Vectors")
        unit_vectors = self.get_unit_vectors(keywords=keywords, categories_lines=categories_lines)

        # Search part
        print("\n ############# Started Search Query System ############# \n")
        query = input('Input your search query: ')
        while query != '':
            query_terms = query.split()
            search_sorted_results = self.get_search_results(query_terms, keywords, inverted_index)
            print(f"==> search results for query: {query.split()}")
            for search_result in search_sorted_results:
                print(f"{search_result.strip()} - {str(titles[search_result]).strip()}")
            # Ask again for a query, or leave the while loop if no query is given
            query = input('Input your search query [hit return to finish]: ')

        print("\n ############# Started Recommendation Algorithm System ############# \n")
        # Recommendation algorithm part
        seed_courseid = (input('Input your seed courseid: '))
        while seed_courseid != '':
            seed_courseid = str(seed_courseid).strip()
            recom_sorted_results = self.get_recommendation_results(seed_courseid, keywords, inverted_index, unit_vectors)
            print('==> recommendation results:')
            for rec_result in recom_sorted_results:
                print(f"{rec_result.strip()} - {str(titles[rec_result]).strip()}")
                get_dot_product_ = self.get_dot_product(seed_courseid, str(rec_result).strip(), unit_vectors)
                print(f"Dot Product Value: {get_dot_product_}")
            seed_courseid = (input('Input seed courseid [hit return to finish]:'))


if __name__ == '__main__':
    obj = SearchRecommendationSystem()
    obj.Final()
s2-categories.tsv (tab-separated)
courseid  category              subcategory
21526     Design                3D & Animation
153082    Marketing             Advertising
225436    Marketing             Affiliate Marketing
19482     Office Productivity   Apple
33883     Office Productivity   Apple
59526     IT & Software         Operating Systems
29219     Personal Development  Career Development
35057     Personal Development  Career Development
40751     Personal Development  Career Development
65210     Personal Development  Career Development
234414    Personal Development  Career Development
Sample of what s2-titles.txt looks like
courseidXXXYYYZZZtitleXXXYYYZZZdescription
3586XXXYYYZZZLearning Tools for Mrs B's Science Classes This is a series of lessons that will introduce students to the learning tools that will be utilized throughout the schoXXXYYYZZZThis is a series of lessons that will introduce students to the learning tools that will be utilized throughout the school year The use of these tools serves multiple purposes 1 Allow the teacher to give immediate and meaningful feedback on work that is in progress 2 Allow students to have access to content and materials when outside the classroom 3 Provide a variety of methods for students to experience learning materials 4 Provide a variety of methods for students to demonstrate learning 5 Allow for more time sensitive correction grading and reflections on concepts that are assessed
Improved method
def get_dot_product(self, courseid1, courseid2, unit_vectors):
    # u1 = unit_vectors[courseid1]
    # u2 = unit_vectors[courseid2]
    # dimensions = set(u1).intersection(set(u2))
    # dot_product = sum(u1[dimension] * u2.get(dimension, 0) for dimension in dimensions)
    u1 = unit_vectors[courseid1]
    u2 = unit_vectors[courseid2]
    # Iterate over u1's keys; u2.get() yields 0 for dimensions missing from u2.
    # (Iterating over u2 while indexing u1[dimension] raises KeyError for keys only in u2.)
    dot_product = sum(u1[dimension] * u2.get(dimension, 0) for dimension in u1)
    return dot_product
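Evidently unit_vectors is a dictionary, from which you extract 2 values, u1 and u2.
But what are those? Evidently dicts as well (the key-based iteration in your loop would not make sense with a list).
But what is u1[dimension]? A list? An array?
Normally a dict is accessed by key, as you do here. There is no numpy-style "vectorization" for that. vals = list(u1.values()) gets a list of all the values, and conceivably those could be made into an array (if the elements are right).
An np.dot(arr1, arr2) might then work.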
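A minimal sketch of that idea, assuming both vectors are plain {key: number} dicts. Note that list(u1.values()) on its own does not line the two dicts up by key, so the sketch first builds a shared key order; the helper name dicts_to_arrays and the toy data are hypothetical and not from the question. Building the arrays is itself a Python loop, so for a single pair of vectors this is unlikely to be faster than the original loop.

```python
import numpy as np

def dicts_to_arrays(u1, u2):
    """Align two {key: number} dicts on a shared, sorted key order."""
    keys = sorted(set(u1) | set(u2))  # union of all dimensions
    a1 = np.array([u1.get(k, 0.0) for k in keys], dtype=float)
    a2 = np.array([u2.get(k, 0.0) for k in keys], dtype=float)
    return a1, a2

# Hypothetical toy data, not taken from the course files.
u1 = {'a': 1, 'b': 2, 'c': 3}
u2 = {'a': 2, 'c': 4, 'z': 5}

a1, a2 = dicts_to_arrays(u1, u2)
result = float(np.dot(a1, a2))  # 1*2 + 3*4 = 14.0
print(result)
```

Keys missing from one dict become zeros in its array, so they contribute nothing to the sum, which matches the behaviour of the "if dimension in u2" test in the loop.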
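You will get the best answers if you give a small, concrete example with real working data (and skip the complex generating code). Focus on the core of the issue, so that we can grasp it in a 30-second read.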
===
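Looking in more depth at the dot function; this replicates the core (I think). At first I overlooked the fact that you aren't iterating over the u2 keys, but rather looking for matching ones.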
def foo(dd):
    x = 0
    u1 = dd['u1']
    u2 = dd['u2']
    for k in u1:
        if k in u2:
            x += u1[k]*u2[k]
    return x
And make a dictionary of dictionaries:
In [30]: keys=list('abcde'); values=[1,2,3,4,5]
In [31]: adict = {k:v for k,v in zip(keys,values)}
In [32]: dd = {'u1':adict, 'u2':adict}
In [41]: dd
Out[41]:
{'u1': {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5},
'u2': {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}}
In [42]: foo(dd)
Out[42]: 55
In this case the subdictionaries match, so we get the same value with a simple array dot:
In [43]: np.dot(values,values)
Out[43]: 55
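But if u2 is different, with different key/value pairs and possibly different keys, the result will be different. I don't see a way around the iterative access by key. The sum-of-products part of the job is minor compared to the dictionary access.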
In [44]: dd['u2'] = {'e':3, 'f':4, 'a':3}
In [45]: foo(dd)
Out[45]: 18
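As a sketch of what staying with dictionaries could look like, the loop can at least be reduced to one pass over the smaller dict. This is plain Python, not NumPy; the function name sparse_dot is hypothetical, and the data repeats the session above.

```python
def sparse_dot(u1, u2):
    """Dot product of two {key: number} dicts over their shared keys."""
    # Iterate over the smaller dict so the number of lookups is minimal.
    if len(u1) > len(u2):
        u1, u2 = u2, u1
    return sum(value * u2[key] for key, value in u1.items() if key in u2)

u1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
u2 = {'e': 3, 'f': 4, 'a': 3}
result = sparse_dot(u1, u2)  # shared keys 'a' and 'e': 1*3 + 5*3 = 18
print(result)
```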
We could construct other data structures that are more suitable for a fast dot calculation. But that's another topic.
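One such structure, sketched here under assumptions: all vectors are stacked once into a dense 2-D array with one column per dimension, after which the dot products of a seed against every other vector become a single matrix-vector product. The helper build_matrix and the toy vectors are hypothetical. For a large vocabulary a dense array may not fit in memory, in which case a sparse matrix (for example from scipy.sparse) would likely be the better fit.

```python
import numpy as np

def build_matrix(unit_vectors):
    """Stack {id: {dimension: weight}} into a dense (n_ids, n_dims) array."""
    ids = sorted(unit_vectors)
    dims = sorted({d for vec in unit_vectors.values() for d in vec})
    col = {d: j for j, d in enumerate(dims)}  # dimension -> column index
    matrix = np.zeros((len(ids), len(dims)))
    for i, cid in enumerate(ids):
        for d, w in unit_vectors[cid].items():
            matrix[i, col[d]] = w
    return ids, matrix

# Hypothetical toy vectors, not taken from the course files.
unit_vectors = {
    'c1': {'python': 1.0, 'numpy': 0.5},
    'c2': {'python': 0.5, 'excel': 1.0},
    'c3': {'numpy': 1.0},
}

ids, matrix = build_matrix(unit_vectors)
seed = ids.index('c1')
scores = matrix @ matrix[seed]  # dot product of the seed with every row
print(dict(zip(ids, scores.tolist())))
```

The matrix is built once with Python loops; the payoff only comes when many dot products are computed against it afterwards.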
Can you update your question to show how you run get_dot_product? Fast numpy dot works with numeric arrays and does most of its work in compiled libraries (such as BLAS). It also works with some object dtype arrays, but more slowly.

I just tried that, but it threw an error: TypeError: unsupported operand type(s) for *: 'dict' and 'dict'. That happens on the line dot_product = u1.dot(u2). I have just updated the code; the commented-out lines also work.

I read in the complete code and it works as-is; it doesn't use anything other than Python. It has worked in Python 3+ for 5 years, but I am trying to update it to a more modern approach. I don't know how to extract the values of unit_vectors[courseid1] and unit_vectors[courseid2]. I just wanted to add those values to u1 and u2 so as to have a concise question.