Python: Using word2vec to extract the main features of a paragraph


I have just gotten hold of Google's word2vec model and am quite new to the concept. I am trying to extract the main feature of a paragraph using the following:

from gensim.models.keyedvectors import KeyedVectors
import re

model = KeyedVectors.load_word2vec_format('../../usr/myProject/word2vec/GoogleNews-vectors-negative300.bin', binary=True)

...

for para in paragraph_array:
    para_name = "para_" + file_name + '{0}'
    sentence_array = d[para_name.format(number_of_paragraphs)] = []

    # Split the paragraph into sentences on '.', '?' or '!'.
    for l in re.split(r"\.|\?|\!", para):
        # Append each sentence; word-level splitting is left commented out.
        sentence_array.append(l)
        #sentence_array.append(l.split(" "))

    # Passes the whole paragraph string as one "word", causing the KeyError below.
    print(model.wv.most_similar(positive=para, topn=1))
But I get the error below, which says that the paragraph being checked is not a word in the vocabulary:

KeyError: "word 'The Republic of Ghana is a country in West Africa. It borders Côte d'Ivoire (also known as Ivory Coast) to the west, Burkina Faso to the north, Togo to the east and the Gulf of Guinea to the south. The word "Ghana" means "Warrior King", Jackson, John G. Introduction to African Civilizations, 2001. Page 201. It was the source of the name "Guinea" (via French Guinoye) used to refer to the West African coast (as in Gulf of Guinea).' not in vocabulary"

Now I understand that the most_similar() function expects an array of words. But I would like to know how to translate this into extracting one main feature, or word, with the word2vec model that represents the main concept of the paragraph.

Edited

I modified the above code to pass arrays of words into the most_similar() method, and I get the following error:

Traceback (most recent call last):
  File "/home/manuelanatarajeyaraj/PycharmProjects/ChatbotWord2Vec/new_approach.py", line 108, in <module>
    print(model.wv.most_similar(positive=word_array, topn=1))
  File "/home/manuelanatarajeyaraj/usr/myProject/my_project/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 361, in most_similar
    for word, weight in positive + negative:
ValueError: too many values to unpack (expected 2)

Modified implementation

for sentence in sentence_array:
    if sentence:
        for w in re.split(r"\.|\?|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\-", sentence):
            split_word = w.split(" ")
            if split_word:
                # split_word is itself a list, so word_array becomes a list of lists.
                word_array.append(split_word)
print(model.wv.most_similar(positive=word_array, topn=1))
for file_name in files:
    file_identifier = file_name
    file_array = file_dictionary[file_identifier] = []
    #file_array = file_dictionary[file_name.format(file_count)] = []
    file_path = directory_path + '/' + file_name

    with open(file_path) as f:
        #Level 2 Intents : Each file's main intent (One for each file)
        first_line = f.readline()
        print()
        print("Level 2 Intent for ", c, " : ", first_line)

        #Level 3 Intents : Each paragraph's main intent (one for each para)

        paragraph_count = 0

        data = f.read()
        splat = data.split("\n")
        paragraph_array = []

        for number, paragraph in enumerate(splat, 1):
            paragraph_identifier = file_name + "_paragraph_" + str(paragraph_count)
            #print(paragraph_identifier)
            paragraph_array = paragraph_dictionary[paragraph_identifier.format(paragraph_count)] = []
            if paragraph:
                paragraph_array.append(paragraph)
            paragraph_count += 1
            if len(paragraph_array) > 0:
                file_array.append(paragraph_array)

            # Level 4 Intents : Each sentence's main intent (one for each sentence)

            sentence_count = 0
            sentence_array = []

            for sentence in paragraph_array:
                for line in re.split(r"\.|\?|\!", sentence):
                    sentence_identifier = paragraph_identifier + "_sentence_" + str(sentence_count)
                    sentence_array = sentence_dictionary[sentence_identifier.format(sentence_count)] = []
                    if line:
                        sentence_array.append(line)
                        sentence_count += 1

                    # Level 5 Intents : Each word with a certain level of prominence (one for each prominent word)

                    word_count = 0
                    word_array = []

                    for words in sentence_array:
                        for word in re.split(r" ", words):
                            word_identifier = sentence_identifier + "_word_" + str(word_count)
                            word_array = word_dictionary[word_identifier.format(word_count)] = []

                            if word:
                                word_array.append(word)
                                word_count += 1

Any suggestions in this regard would be greatly appreciated.

Your error indicates that you are looking up the entire string (
"The Republic of Ghana is a country in West Africa. It borders Côte d'Ivoire (also known as Ivory Coast) to the west, Burkina Faso to the north, Togo to the east and the Gulf of Guinea to the south. The word "Ghana" means "Warrior King", Jackson, John G. Introduction to African Civilizations, 2001. Page 201. It was the source of the name "Guinea" (via French Guinoye) used to refer to the West African coast (as in Gulf of Guinea)."
) as if it were a single word, and that word does not exist.

The most_similar() method can take a list of positive-example words, but you would have to tokenize your string into the words that are likely to be inside the word-vector set (which might require breaking on both whitespace and punctuation, to match whatever Google did when preparing that word-vector set).
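For example, here is a minimal sketch of that tokenization step, assuming model is the KeyedVectors object loaded in the question and using a simple regex split as a stand-in for Google's actual (unpublished) preprocessing:

import re

paragraph = "The Republic of Ghana is a country in West Africa. ..."

# Naive tokenization: split on runs of non-word characters.
tokens = [t for t in re.split(r"\W+", paragraph) if t]

# Keep only the tokens the model actually contains, to avoid a KeyError.
known_tokens = [t for t in tokens if t in model]

print(model.most_similar(positive=known_tokens, topn=1))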

In that case, most_similar() will average together the vectors of all the given words, and return other words close to that average.
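Continuing the sketch above, a roughly equivalent computation can be written out by hand with numpy and gensim's similar_by_vector(), which makes the averaging explicit:

import numpy as np

# Average the vectors of the known tokens...
mean_vec = np.mean([model[t] for t in known_tokens], axis=0)

# ...then return the vocabulary word whose vector lies closest to that average.
print(model.similar_by_vector(mean_vec, topn=1))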


Whether or not this truly captures the "main concept" of a text is unclear. While word vectors may be helpful in identifying the concepts of texts, that is not their primary or only function, and it does not happen automatically. You might want to filter the set of words down to those that are distinctive in other ways, for example words that are less common overall, or more impactful under some corpus-dependent measure (such as TF-IDF).
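As an illustration of that kind of filtering, here is a hypothetical pre-filtering step using scikit-learn's TfidfVectorizer (scikit-learn is not part of the original code; paragraph_array and model are the objects from the question):

from sklearn.feature_extraction.text import TfidfVectorizer

# Score terms across all paragraphs so the weighting is corpus-dependent.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(paragraph_array)

# Take the five highest-scoring terms of the first paragraph.
row = tfidf[0].toarray().ravel()
terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
top_terms = [terms[i] for i in row.argsort()[::-1][:5]]

# Use only those distinctive, in-vocabulary terms as positive examples.
positives = [t for t in top_terms if t in model]
print(model.most_similar(positive=positives, topn=1))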


I rewrote the entire code, adding checkpoints to avoid storing empty strings at every level of objects, from paragraphs through sentences down to words.

Working version

for sentence in sentence_array:
    if sentence:
        for w in re.split(r"\.|\?|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\-", sentence):
            split_word = w.split(" ")
            if split_word:
                word_array.append(split_word)
print(model.wv.most_similar(positive=word_array, topn=1))
for file_name in files:
    file_identifier = file_name
    file_array = file_dictionary[file_identifier] = []
    #file_array = file_dictionary[file_name.format(file_count)] = []
    file_path = directory_path + '/' + file_name

    with open(file_path) as f:
        #Level 2 Intents : Each file's main intent (One for each file)
        first_line = f.readline()
        print()
        print("Level 2 Intent for ", c, " : ", first_line)

        #Level 3 Intents : Each paragraph's main intent (one for each para)

        paragraph_count = 0

        data = f.read()
        splat = data.split("\n")
        paragraph_array = []

        for number, paragraph in enumerate(splat, 1):
            paragraph_identifier = file_name + "_paragraph_" + str(paragraph_count)
            #print(paragraph_identifier)
            paragraph_array = paragraph_dictionary[paragraph_identifier.format(paragraph_count)] = []
            if paragraph:
                paragraph_array.append(paragraph)
            paragraph_count += 1
            if len(paragraph_array) > 0:
                file_array.append(paragraph_array)

            # Level 4 Intents : Each sentence's main intent (one for each sentence)

            sentence_count = 0
            sentence_array = []

            for sentence in paragraph_array:
                for line in re.split(r"\.|\?|\!", sentence):
                    sentence_identifier = paragraph_identifier + "_sentence_" + str(sentence_count)
                    sentence_array = sentence_dictionary[sentence_identifier.format(sentence_count)] = []
                    if line:
                        sentence_array.append(line)
                        sentence_count += 1

                    # Level 5 Intents : Each word with a certain level of prominence (one for each prominent word)

                    word_count = 0
                    word_array = []

                    for words in sentence_array:
                        for word in re.split(r" ", words):
                            word_identifier = sentence_identifier + "_word_" + str(word_count)
                            word_array = word_dictionary[word_identifier.format(word_count)] = []

                            if word:
                                word_array.append(word)
                                word_count += 1
Code to access the dictionary items

#Accessing any paragraph array can be done as follows
print (paragraph_dictionary['S08_set4_a5.txt.clean_paragraph_4'])

#Accessing any sentence corresponding to a paragraph
print (sentence_dictionary['S08_set4_a5.txt.clean_paragraph_4_sentence_1'])

#Accessing any word corresponding to a sentence
print (word_dictionary['S08_set4_a5.txt.clean_paragraph_4_sentence_1_word_3'])
Output

['Celsius was born in Uppsala in Sweden. He was professor of astronomy at Uppsala University from 1730 to 1744, but traveled from 1732 to 1735 visiting notable observatories in Germany, Italy and France.']
[' He was professor of astronomy at Uppsala University from 1730 to 1744, but traveled from 1732 to 1735 visiting notable observatories in Germany, Italy and France']
['of']


Thank you for the explanation. So, in the case of passing a list of positive examples to the most_similar() method, would it be possible to use a list object where every word in the paragraph is a list item, and then pass that list to most_similar()? Yes, although you may also want to discard any words the model does not know, to avoid triggering a KeyError for words that aren't present. Thanks. I will try this approach and get back to you. I modified the code as you suggested and am facing another error; I have added the modified code and the error to the question above. Your advice in this regard would be greatly appreciated.