Python: Formatting and counting tuples

python formatting nlp

I have a tuple made up of unigrams, bigrams, and trigrams that looks like this:

('be',)
('true',)
('But',)
('I',)
('And', 'but')
('but', 'my')
('my', 'Noble')
('For', 'thy', 'escape')
('thy', 'escape', 'would')
('escape', 'would', 'teach')
('would', 'teach', 'me')

I need to find all the duplicates, remove all but one of each, and format the result to look like this:

I 2
am 2
STOP 2
I am 2
am Sam 1
Sam I 1
Sam STOP 1
* Sam 1
* I 1
am STOP 1
* * I 1
* I am 1
I am Sam 1
am Sam STOP 1

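The number at the end is how many times that n-gram occurred (the number of duplicates), and an asterisk marks a position that was substituted in after a period, that is, after a sentence end.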

Here is my current code:

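# `file` is the path to the input text file
with open(file, "r") as filestring:
    # Flatten the text: drop newlines and punctuation, mark sentence ends with <STOP>
    data = filestring.read().replace('\n', '').replace(',', ' ').replace('.', '    <STOP>').replace("'", '').replace(':', ' ')
txtlist = data.split()
# Build n-grams by zipping the token list against shifted copies of itself
uni = zip(*[txtlist[i:] for i in range(1)])
bi = zip(*[txtlist[i:] for i in range(2)])
tri = zip(*[txtlist[i:] for i in range(3)])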
with open("output.txt", "w") as myfile:
    for item in uni:
        myfile.write(str(item)+"\n")
    for item in bi:
        myfile.write(str(item)+"\n")
    for item in tri:
        myfile.write(str(item)+"\n")

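Your current input contains none of am, Sam, or STOP, so how do you get your result?

That is just an example of how I want the output to be formatted.

Which ones are the duplicates: the tokens in the n-grams, or the tuples?

The tokens in the n-grams, so if am Sam is repeated 5 times, it would show up as am Sam 5.

A simple way to remove or avoid duplicates is to add the items to a set in the first place. If you want to count unique hashable objects (such as tuples), roll your own counter from a dict or a collections.defaultdict.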
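
The suggestion above could look roughly like the sketch below. It assumes the tokens are already in a list and uses the "I am Sam" sample from the question as input. The helper names `ngrams`, `count_ngrams`, and `format_counts` are made up for illustration. It does not insert the asterisk padding, so n-grams that span a sentence boundary (for example STOP Sam) are counted as well.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """Return the n-grams of `tokens` as a list of tuples."""
    return list(zip(*[tokens[i:] for i in range(n)]))

def count_ngrams(tokens, max_n=3):
    """Count every 1..max_n-gram; tuples are hashable, so they work as dict keys."""
    counts = defaultdict(int)
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            counts[gram] += 1  # each distinct tuple is stored once, with its count
    return counts

def format_counts(counts):
    """Render each n-gram as 'tok1 tok2 ... count', one string per n-gram."""
    return [" ".join(gram) + " " + str(c) for gram, c in counts.items()]

# Sample tokens taken from the desired output in the question (assumed input).
tokens = "I am Sam STOP Sam I am STOP".split()
counts = count_ngrams(tokens)
lines = format_counts(counts)

with open("output.txt", "w") as myfile:
    myfile.write("\n".join(lines) + "\n")
```

Because each tuple is a dictionary key, every n-gram appears only once in `counts`, so deduplication and counting happen in the same pass. If only the unique n-grams are needed, without counts, `set(ngrams(tokens, n))` would be enough.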