Python: replacing a loop with NumPy's dot product

Tags: python, python-3.x, numpy, nlp, self

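I'm trying to replace a loop-based dot product with something faster, such as NumPy.

I've researched the dot product and understand it well enough to get it working in several ways on toy data, but I'm not 100% there when it comes to applying it to real usage with a data frame.

I've looked at these threads and others, with no luck.

I'd like to do something like this, which works on toy numbers in NumPy arrays: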

import numpy as np

u1 = np.array([1, 2, 3])
u2 = np.array([2, 3, 4])
u1.dot(u2)  # 20
This is the current, working get_dot_product.

I'd like to do this without the for loop:

def get_dot_product(self, courseid1, courseid2, unit_vectors):
    u1 = unit_vectors[courseid1]
    u2 = unit_vectors[courseid2]
    dot_product = 0.0
    for dimension in u1:
        if dimension in u2:
            dot_product += u1[dimension] * u2[dimension]
    return dot_product
**Code**



    #!/usr/bin/env python
    # coding: utf-8
    
    
    
    class SearchRecommendationSystem:
    
        def __init__(self):
            pass
    
 
        def get_bag_of_words(self, titles_lines):
            bag_of_words = {}
            for index, row in titles_lines.iterrows():
                courseid, course_bag_of_words = self.get_course_bag_of_words(row)
                for word in course_bag_of_words:
                    word = str(word).strip()  # added
                    if word not in bag_of_words:
                        bag_of_words[word] = course_bag_of_words[word]
                    else:
                        bag_of_words[word] += course_bag_of_words[word]
            return bag_of_words

        def get_course_bag_of_words(self, line):
            course_bag_of_words = {}
            courseid = line['courseid']
            title = line['title'].lower()
            description = line['description'].lower()
            wordlist = title.split() + description.split()
            if len(wordlist) >= 10:
                for word in wordlist:
                    word = str(word).strip()  # added
                    if word not in course_bag_of_words:
                        course_bag_of_words[word] = 1
                    else:
                        course_bag_of_words[word] += 1
            return courseid, course_bag_of_words

        def get_sorted_results(self, d):
            kv_list = d.items()
            vk_list = []
            for kv in kv_list:
                k, v = kv
                vk = v, k
                vk_list.append(vk)
            vk_list.sort()
            vk_list.reverse()
            k_list = []
            for vk in vk_list[:10]:
                v, k = vk
                k_list.append(k)
            return k_list

        def get_keywords(self, titles_lines, bag_of_words):
            n = sum(bag_of_words.values())
            keywords = {}
            for index, row in titles_lines.iterrows():
                courseid, course_bag_of_words = self.get_course_bag_of_words(row)
                term_importance = {}
                for word in course_bag_of_words:
                    word = str(word).strip()  # extra
                    tf_course = (float(course_bag_of_words[word]) / sum(course_bag_of_words.values()))
                    tf_overall = float(bag_of_words[word]) / n
                    term_importance[word] = tf_course / tf_overall
                keywords[str(courseid)] = self.get_sorted_results(term_importance)
            return keywords

        def get_inverted_index(self, keywords):
            inverted_index = {}
            for courseid in keywords:
                for keyword in keywords[courseid]:
                    if keyword not in inverted_index:
                        keyword = str(keyword).strip()  # added
                        inverted_index[keyword] = []
                    inverted_index[keyword].append(courseid)
            return inverted_index

        def get_search_results(self, query_terms, keywords, inverted_index):
            search_results = {}
            for term in query_terms:
                term = str(term).strip()
                if term in inverted_index:
                    for courseid in inverted_index[term]:
                        if courseid not in search_results:
                            search_results[courseid] = 0.0
                        search_results[courseid] += (
                                1 / float(keywords[courseid].index(term) + 1) *
                                1 / float(query_terms.index(term) + 1)
                        )
            sorted_results = self.get_sorted_results(search_results)
            return sorted_results

        def get_titles(self, titles_lines):
            titles = {}
            for index, row in titles_lines.iterrows():
                titles[row['courseid']] = row['title'][:60]
            return titles
    
        def get_unit_vectors(self, keywords, categories_lines):
            norm = 1.884
            cat = {}
            subcat = {}
            for line in categories_lines[1:]:
                courseid_, category, subcategory = line.split('\t')
                cat[courseid_] = category.strip()
                subcat[courseid_] = subcategory.strip()
            unit_vectors = {}
            for courseid in keywords:
                u = {}
                if courseid in cat:
                    u[cat[courseid]] = 1 / norm
                    u[subcat[courseid]] = 1 / norm
                for keyword in keywords[courseid]:
                    u[keyword] = (1 / float(keywords[courseid].index(keyword) + 1) / norm)
                unit_vectors[courseid] = u
            return unit_vectors
    
        def get_dot_product(self, courseid1, courseid2, unit_vectors):
            u1 = unit_vectors[courseid1]
            u2 = unit_vectors[courseid2]
            dot_product = 0.0
            for dimension in u1:
                if dimension in u2:
                    dot_product += u1[dimension] * u2[dimension]
            return dot_product
    
        def get_recommendation_results(self, seed_courseid, keywords, inverted_index, unit_vectors):
            courseids = []
            seed_courseid = str(seed_courseid).strip()
            for keyword in keywords[seed_courseid]:
                for courseid in inverted_index[keyword]:
                    if courseid not in courseids and courseid != seed_courseid:
                        courseids.append(courseid)
    
            dot_products = {}
            for courseid in courseids:
                dot_products[courseid] = self.get_dot_product(seed_courseid, courseid, unit_vectors)
            sorted_results = self.get_sorted_results(dot_products)
            return sorted_results
    
    
        def Final(self):
            print("Reading Title file.......")
            titles_lines = open('s2-titles.txt', encoding="utf8").readlines()
            print("Reading Category file.......")
            categories_lines = open('s2-categories.tsv', encoding = "utf8").readlines()
            print("Getting Supported Functions Data")
            bag_of_words = self.get_bag_of_words(titles_lines)
            keywords = self.get_keywords(titles_lines, bag_of_words)
            inverted_index = self.get_inverted_index(keywords)
            titles = self.get_titles(titles_lines)
    
            print("Getting Unit Vectors")
            unit_vectors = self.get_unit_vectors(keywords=keywords, categories_lines=categories_lines)
    
            #Search Part
            print("\n ############# Started Search Query System ############# \n")
            query = input('Input your search query: ')
            while query != '':
                query_terms = query.split()
                search_sorted_results = self.get_search_results(query_terms, keywords, inverted_index)
                print(f"==> search results for query: {query.split()}")
    
                for search_result in search_sorted_results:
                    print(f"{search_result.strip()} - {str(titles[search_result]).strip()}")
    
                #ask again for query or quit the while loop if no query is given
                query = input('Input your search query [hit return to finish]: ')
    
    
            print("\n ############# Started Recommendation Algorithm System ############# \n")
            # Recommendation ALgorithm Part
            seed_courseid = (input('Input your seed courseid: '))
            while seed_courseid != '':
                seed_courseid = str(seed_courseid).strip()
                recom_sorted_results = self.get_recommendation_results(seed_courseid, keywords, inverted_index, unit_vectors)
                print('==> recommendation results:')
                for rec_result in recom_sorted_results:
                    print(f"{rec_result.strip()} - {str(titles[rec_result]).strip()}")
                    get_dot_product_ = self.get_dot_product(seed_courseid, str(rec_result).strip(), unit_vectors)
                    print(f"Dot Product Value: {get_dot_product_}")
                seed_courseid = (input('Input seed courseid [hit return to finish]:'))
    
    
    if __name__ == '__main__':
        obj = SearchRecommendationSystem()
        obj.Final()
s2-categories.tsv

    courseid    category    subcategory
    21526   Design  3D & Animation
    153082  Marketing   Advertising
    225436  Marketing   Affiliate Marketing
    19482   Office Productivity Apple
    33883   Office Productivity Apple
    59526   IT & Software   Operating Systems
    29219   Personal Development    Career Development
    35057   Personal Development    Career Development
    40751   Personal Development    Career Development
    65210   Personal Development    Career Development
    234414  Personal Development    Career Development
Sample of what s2-titles.txt looks like

courseidXXXYYYZZZtitleXXXYYYZZZdescription
3586XXXYYYZZZLearning Tools for Mrs  B's Science Classes This is a series of lessons that will introduce students to the learning tools that will be utilized throughout the schoXXXYYYZZZThis is a series of lessons that will introduce students to the learning tools that will be utilized throughout the school year  The use of these tools serves multiple purposes       1  Allow the teacher to give immediate and meaningful feedback on work that is in progress    2  Allow students to have access to content and materials when outside the classroom    3  Provide a variety of methods for students to experience learning materials    4  Provide a variety of methods for students to demonstrate learning    5  Allow for more time sensitive correction  grading and reflections on concepts that are assessed  
Improved method

def get_dot_product(self, courseid1, courseid2, unit_vectors):
    # Alternative: restrict the sum to the dimensions shared by both vectors.
    # u1 = unit_vectors[courseid1]
    # u2 = unit_vectors[courseid2]
    # dimensions = set(u1).intersection(set(u2))
    # dot_product = sum(u1[dimension] * u2.get(dimension, 0) for dimension in dimensions)

    u1 = unit_vectors[courseid1]
    u2 = unit_vectors[courseid2]
    # Use .get() on u1: a dimension of u2 may be missing from u1.
    dot_product = sum(u1.get(dimension, 0.0) * u2[dimension] for dimension in u2)
    return dot_product

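Evidently `unit_vectors` is a dictionary from which you extract two values, `u1` and `u2`.

But what are those? Evidently dicts as well (the `for dimension in u1` / `if dimension in u2` iteration would not make sense for a list).

But what is `u1[dimension]`? A list? An array?

Normally a `dict` is accessed by key, as you do here; there is no NumPy-style "vectorization" for that. `vals = list(u1.values())` gets a list of all the values, and conceivably that could be made into an array (if the elements are suitable), in which case an `np.dot(arr1, arr2)` might work.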
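A minimal sketch of that idea, assuming `u1` and `u2` are plain dicts mapping a dimension name to a number, as the question's loop suggests. Taking `list(u.values())` from each dict only lines up when both dicts have the same keys in the same order, so this sketch first fixes a shared key order and fills missing dimensions with 0. The names `dict_dot` and `dims` are illustrative, not from the original code.

```python
import numpy as np

def dict_dot(u1, u2):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts."""
    # Shared, ordered list of dimensions so positions in both arrays match.
    dims = sorted(set(u1) | set(u2))
    arr1 = np.array([u1.get(d, 0.0) for d in dims], dtype=float)
    arr2 = np.array([u2.get(d, 0.0) for d in dims], dtype=float)
    return float(np.dot(arr1, arr2))

u1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
u2 = {'e': 3, 'f': 4, 'a': 3}
print(dict_dot(u1, u2))  # 18.0
```

Building the two arrays still loops over the keys in Python, so for a single pair of dicts this is unlikely to be faster than the original loop; it only shows how the values would have to be aligned before `np.dot` gives the same answer.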

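You'll get the best answers if you give a small, concrete example with real working data (skipping the complex generating code). Focus on the core of the problem, so that we can grasp it within 30 seconds.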

===

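Looking more closely at your dot function, this replicates its core (I think). At first I overlooked the fact that you aren't simply iterating over the `u2` keys, but looking for matching ones.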

def foo(dd):
    x = 0
    u1 = dd['u1']
    u2 = dd['u2']
    for k in u1:
        if k in u2:
            x += u1[k]*u2[k]
    return x
Then make a dictionary:

In [30]: keys=list('abcde'); values=[1,2,3,4,5]
In [31]: adict = {k:v for k,v in zip(keys,values)}
In [32]: dd = {'u1':adict, 'u2':adict}

In [41]: dd
Out[41]: 
{'u1': {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5},
 'u2': {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}}
In [42]: foo(dd)
Out[42]: 55
In this case the sub-dictionaries match, so a simple array `dot` gives the same value:

In [43]: np.dot(values,values)
Out[43]: 55
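But if `u2` is different, with different key/value pairs and possibly different keys, the result will be different. I don't see a way around the iterative access by key. The sum-of-products part of the job is minor compared with the dictionary access.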

In [44]: dd['u2'] = {'e':3, 'f':4, 'a':3}
In [45]: foo(dd)
Out[45]: 18
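If the dicts are kept, the key matching can at least be written without an explicit `for`/`if` pair. The sketch below is plain Python rather than NumPy, and `dot_common_keys` is an illustrative name; it relies on the fact that `dict.keys()` views support set operations such as `&`.

```python
def dot_common_keys(u1, u2):
    """Sum of products over the keys present in both dicts."""
    # dict.keys() views behave like sets; & yields only the shared keys.
    return sum(u1[k] * u2[k] for k in u1.keys() & u2.keys())

u1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
u2 = {'e': 3, 'f': 4, 'a': 3}
print(dot_common_keys(u1, u2))  # 18
```

This still visits the shared keys one at a time, so it is a tidier spelling of the same work rather than a vectorized one.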

We could construct other data structures that are better suited to a fast `dot` calculation. But that's another topic.
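As one possible illustration of such a structure (a sketch under assumptions, not part of the original code): pack every course's dict into one row of a dense 2-D array over a shared list of dimensions, then get all dot products against a seed course with a single matrix-vector product. The helper `build_matrix` and the toy `unit_vectors` values are made up for this example.

```python
import numpy as np

def build_matrix(unit_vectors):
    """Pack {courseid: {dimension: weight}} into a dense 2-D array."""
    courseids = list(unit_vectors)
    # One column per distinct dimension, in a fixed (sorted) order.
    dims = sorted({d for u in unit_vectors.values() for d in u})
    col = {d: j for j, d in enumerate(dims)}
    matrix = np.zeros((len(courseids), len(dims)))
    for i, cid in enumerate(courseids):
        for d, w in unit_vectors[cid].items():
            matrix[i, col[d]] = w
    return courseids, matrix

# Toy data standing in for the real unit_vectors (values are made up).
unit_vectors = {
    '101': {'python': 0.5, 'numpy': 0.25},
    '102': {'python': 0.5, 'design': 0.5},
    '103': {'design': 1.0},
}
courseids, matrix = build_matrix(unit_vectors)
seed = courseids.index('101')
scores = matrix @ matrix[seed]  # one dot product per course, no Python loop
print(dict(zip(courseids, scores.tolist())))
# {'101': 0.3125, '102': 0.25, '103': 0.0}
```

The dict-to-array conversion is paid once up front; after that each recommendation query is a single NumPy call. With a large vocabulary the dense array would be mostly zeros, so a sparse matrix type such as those in `scipy.sparse` might be a better fit.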


Could you update your question to show how you run `get_dot_product`? Fast NumPy `dot` works with numeric arrays and does most of the work in compiled libraries (such as `BLAS`). It also works for some object-dtype arrays, but more slowly.

I just tried that, but it throws the error `TypeError: unsupported operand type(s) for *: 'dict' and 'dict'` on the line `dot_product = u1.dot(u2)`. I've just updated the code; the commented-out lines work fine too.

The full code works as is, and it uses nothing beyond plain Python. It has worked in Python 3+ for 5 years, but I'm trying to update it to a more modern approach. I don't know how to extract the values of unit_vectors[courseid1] and unit_vectors[courseid2]. I just want to put those values into u1 and u2 so that the question is concise.