Python: Get item positions as list index divided by length


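I have a collection of texts. Each text has been normalized and tokenized into a list (I will post the code below), so what I have is a list of lists, each of which is one text. What I would like to do is get all the positions of every word in a text.

For example, take "Here is a text; it is not a long text." Counting from 1, "here" is at position 1 and "is" is at positions 2 and 6.

Raw positions like these are not comparable across texts of different lengths, however, so I would like to normalize them by dividing by the length of the text:

here: 0.1
is:   0.2, 0.6

From there, I would like to collect all instances of such words across a set of texts and average their positions, to see whether particular words regularly occur in particular parts of a text. That is the goal. I am trying to do this in Python: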

# =-=-=-=-=-=-=-=-=-=-=
# Data Load & Tokenize
# =-=-=-=-=-=-=-=-=-=-= 

import pandas
import re
from nltk.tokenize import WhitespaceTokenizer

# LOAD
colnames = ['author', 'title', 'date' , 'length', 'text']
df = pandas.read_csv('../data/talks_3.csv', names=colnames)
talks = df.text.tolist()
authors = df.author.tolist()
dates = df.date.tolist()
years = [re.sub('[A-Za-z ]', '', item) for item in dates]
authordate = [author+" "+year for author, year in zip(authors, years)]

# TOKENIZE
tokenizer = WhitespaceTokenizer()
texts = []
for talk in talks:
    # Strip punctuation (keep word characters, apostrophes and whitespace), then lowercase
    raw = re.sub(r"[^\w'\s]+", '', talk).lower()
    tokens = tokenizer.tokenize(raw)
    texts.append(tokens)

This is where I stumble; it goes very quickly from working code to pseudocode:

def get_word_placement(listname):
    wordplaces = {}
    for word in listname:
        # Pseudocode from here on:
        # get the word
        # get its location: index of word in listname / len(listname)
        # attach those locations to word
        pass

If you enumerate the list, you have each index and can divide it by the length to get the relative position:

Code:

word_list = 'Here is a text it is not a long text'.split()
print(word_list)

word_with_position = [
    (word, float(i)/len(word_list)) for i, word in enumerate(word_list)]
print(word_with_position)

Results:

['Here', 'is', 'a', 'text', 'it', 'is', 'not', 'a', 'long', 'text']

[('Here', 0.0), ('is', 0.1), ('a', 0.2), ('text', 0.3), ('it', 0.4),
 ('is', 0.5), ('not', 0.6), ('a', 0.7), ('long', 0.8), ('text', 0.9)]

As a dict:

from collections import defaultdict

word_with_positions = defaultdict(list)
for i, word in enumerate(word_list):
    word_with_positions[word].append(float(i)/len(word_list))

# Convert to a plain dict so the printout shows only the mapping
print(dict(word_with_positions))

Results:

{'Here': [0.0], 'is': [0.1, 0.5], 'a': [0.2, 0.7], 'text': [0.3, 0.9],
 'it': [0.4], 'not': [0.6], 'long': [0.8]}
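
Wrapping the dict version in a function gives one possible way to fill in the `get_word_placement` stub from the question. This is only a minimal sketch, assuming each text is already a list of tokens. The `one_based` flag is a hypothetical addition (it is not part of the answer above) that shifts the count to start at 1, which reproduces the 0.1 and 0.2, 0.6 values from the question's example.

```python
from collections import defaultdict


def get_word_placement(tokens, one_based=False):
    """Map each word to the list of its relative positions in tokens."""
    # one_based is a hypothetical convenience flag: when True, the first
    # token sits at 1/len(tokens) instead of 0.0.
    offset = 1 if one_based else 0
    n = len(tokens)
    wordplaces = defaultdict(list)
    for i, word in enumerate(tokens):
        wordplaces[word].append((i + offset) / n)
    return dict(wordplaces)


tokens = "here is a text it is not a long text".split()
placement = get_word_placement(tokens, one_based=True)
print(placement["here"])  # [0.1]
print(placement["is"])    # [0.2, 0.6]
```

With the default `one_based=False` the values match the zero-based output shown above, so the last word of a text never reaches 1.0.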

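Nice. Okay, I will give this a try and see whether I can compile the list of tuples so that each word appears only once, with multiple positions; I will have to do that either for a single text or for the whole corpus. You beat me to it. Thanks for taking that off my hands!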
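
For the corpus-wide step mentioned here, the same enumerate-and-divide idea can be run over every text and the collected positions averaged per word. The sketch below is an illustration under assumptions: `corpus_word_placement` and `sample_texts` are made-up names, and the two toy texts stand in for the real `texts` list of token lists built in the question.

```python
from collections import defaultdict
from statistics import mean


def corpus_word_placement(texts):
    """Collect the relative positions of every word across all tokenized texts."""
    positions = defaultdict(list)
    for tokens in texts:
        n = len(tokens)
        for i, word in enumerate(tokens):
            # Normalize by this text's own length so texts of
            # different sizes become comparable.
            positions[word].append(i / n)
    return positions


# Two toy texts standing in for the real list of token lists.
sample_texts = [
    "here is a text it is not a long text".split(),
    "a short text is here".split(),
]
positions = corpus_word_placement(sample_texts)

# Average relative position of each word over the whole corpus.
average_position = {word: mean(places) for word, places in positions.items()}
print(average_position["here"])
```

Averaging hides how the positions are spread, so a word that appears at both the very start and the very end of texts will look like a mid-text word; keeping the raw lists in `positions` leaves room to inspect the distribution as well.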