Python: writing a recursive function that filters the words in the documents stored at each node of a tree, based on a condition
python, recursion, tree

I have a tree structure in which every node has a list of 5 documents attached. Each document contains a certain number of words. At each node I want to keep all the words that are in the majority in that node and in the minority in the other nodes, in other words the words that appear in more than 60% of the documents of that node and in fewer than 40% of the documents of its sibling nodes. For example, A is the parent node and B, C are its children, each with a list of 5 documents:
B = [['a','b','c','d','m'],['b','d','m','n'],['c','d','e','o'],['c','e','f','n'],['b','c','e','g']]
C = [['a','m','n'],['a','m','o'],['b','c','m','n'],['c','n','o'],['b','n','o','g']]
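So I want to keep b, c, d in B, since they are in the majority in B and in the minority in C, and likewise m, n, o in C. In the end, B and C should look like: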
B = [['b','c','d'],['b','d'],['c','d'],['c'],['b','c']]
C = [['m','n'],['m','o'],['m','n'],['n','o'],['n','o']]
I have solved the problem above with the following code:
B = [['a','b','c','d','m'],['b','d','m','n'],['c','d','e','o'],['c','e','f','n'],['b','c','e','g']]
C = [['a','m','n'],['a','m','o'],['b','c','m','n'],['c','n','o'],['b','n','o','g']]
# 1. Retrieve the set of all words
wordSet = set([word for words in B+C for word in words])
# 2. Compute the occurrences of each word in each node
occurB = {word:0 for word in wordSet}
occurC = {word:0 for word in wordSet}
for word in wordSet:
    for document in B:
        if word in document:
            occurB[word] += 1
    for document in C:
        if word in document:
            occurC[word] += 1
# 3. Filter the nodes using majority and minority
majorityB, minorityB = int(0.6 * len(B)), int(0.4 * len(B))
majorityC, minorityC = int(0.6 * len(C)), int(0.4 * len(C))
newB = [[word for word in document if occurB[word] >= majorityB and occurC[word] <= minorityC] for document in B]
newC = [[word for word in document if occurC[word] >= majorityC and occurB[word] <= minorityB] for document in C]
print(newB) # [['b', 'c', 'd'], ['b', 'd'], ['c', 'd', 'e'], ['c', 'e'], ['b', 'c', 'e']]
print(newC) # [['m', 'n'], ['m', 'o'], ['m', 'n'], ['n', 'o'], ['n', 'o']]
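
The counting step (step 2) can also be written more compactly with `collections.Counter`. The following is only a sketch of that idea, not part of the original code; the helper name `doc_freq` is hypothetical. A `Counter` returns 0 for words it has never seen, so no shared word set has to be built first.

```python
from collections import Counter

def doc_freq(documents):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for document in documents:
        # set() makes a word repeated inside one document count only once
        counts.update(set(document))
    return counts

B = [['a','b','c','d','m'],['b','d','m','n'],['c','d','e','o'],['c','e','f','n'],['b','c','e','g']]
C = [['a','m','n'],['a','m','o'],['b','c','m','n'],['c','n','o'],['b','n','o','g']]

occurB = doc_freq(B)
occurC = doc_freq(C)
print(occurB['c'], occurC['c'])
```

The resulting `occurB` and `occurC` can be used in step 3 exactly like the dictionaries built by the explicit loops.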
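I would like to write a recursive function that adapts the code above so that it works on the whole tree. In addition, a given node can have more than two children: a parent node A can have children B, C, D, E and so on, and B, C, D, E can in turn have their own children in the tree. Please help me with how to do this.

I assume that:
the tree is encoded as a dictionary
for nodes that contain documents, the number of documents per node is always exactly 5
a word is kept only if it satisfies the minority criterion for all sibling nodes
Here is my suggested code. It is not optimized at all, but it seems to do the job on my side: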
def RecFilterWords(d):
    # Recurse into the sub-trees first, then filter the document lists at this level once
    for v in d.values():
        if isinstance(v, dict):
            RecFilterWords(v)
    FilterWords(d)

def FilterWords(d):
    # 1. Retrieve the set of all words
    wordSet = []
    nb = 0
    for v in d.values():
        if isinstance(v, list):
            nb += 1
            wordSet += [word for words in v for word in words]
    if nb <= 1:
        # Fewer than two sibling document lists: nothing to compare against
        return
    wordSet = set(wordSet)

    # 2. Compute the occurrences of each word in each node
    occur = {}
    for k, v in d.items():
        if not isinstance(v, list):
            continue  # sub-trees are handled by the recursion
        occur[k] = {word: 0 for word in wordSet}
        for word in wordSet:
            for document in v:
                if word in document:
                    occur[k][word] += 1

    # 3. Filter the nodes using majority and minority
    # (thresholds assume exactly 5 documents per node: 60% -> 3, 40% -> 2)
    majority = 3
    minority = 2
    newD = d.copy()
    for k, v in d.items():
        if isinstance(v, list):
            newV = v[:]
            for k1, v1 in d.items():
                if isinstance(v1, list):
                    if k == k1:
                        newV = [[word for word in document if occur[k1][word] >= majority] for document in newV]
                    else:
                        newV = [[word for word in document if occur[k1][word] <= minority] for document in newV]
            newD[k] = newV
    d.clear()
    d.update(newD)

# Example with B, C, D as child nodes
B = [['a','b','c','d','m'],['b','d','m','n'],['c','d','e','o'],['c','e','f','n'],['b','c','e','g']]
C = [['a','m','n'],['a','m','o'],['b','c','m','n'],['c','n','o'],['b','n','o','g']]
D = [['b','m','n'],['b'],['g'],['o','b','g'],['b','g']]
# The tree is coded through a dictionary type
d = {"A1": {"B": {"B": B,
                  "C": C,
                  "D": D},
            "C": C},
     "A2": {"B": B,
            "C": C}}
RecFilterWords(d)
print(d)
# It prints:
# {'A1': {'B': {'B': [['c', 'd'], ['d'], ['c', 'd', 'e'], ['c', 'e'], ['c', 'e']],
# 'C': [['m', 'n'], ['m', 'o'], ['m', 'n'], ['n', 'o'], ['n', 'o']],
# 'D': [[], [], ['g'], ['g'], ['g']]},
# 'C': [['a', 'm', 'n'], ['a', 'm', 'o'], ['b', 'c', 'm', 'n'], ['c', 'n', 'o'], ['b', 'n', 'o', 'g']]},
# 'A2': {'B': [['b', 'c', 'd'], ['b', 'd'], ['c', 'd', 'e'], ['c', 'e'], ['b', 'c', 'e']],
# 'C': [['m', 'n'], ['m', 'o'], ['m', 'n'], ['n', 'o'], ['n', 'o']]}
# }
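
If the number of documents per node is not always 5, the hard-coded thresholds 3 and 2 no longer apply. Below is a minimal sketch of one possible variant, not the code above: it compares the fraction of documents containing a word against the 60% and 40% ratios, and it returns a filtered copy instead of modifying the tree in place. The names `filter_tree`, `doc_fraction`, `majority_ratio` and `minority_ratio` are assumptions introduced for this illustration.

```python
from collections import Counter

def doc_fraction(docs):
    """Map each word to the fraction of documents in `docs` that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    return {word: n / len(docs) for word, n in counts.items()}

def filter_tree(node, majority_ratio=0.6, minority_ratio=0.4):
    """Return a filtered copy of a nested dict whose list values are document lists."""
    # Sub-trees are filtered recursively; list values at this level are siblings
    result = {k: filter_tree(v, majority_ratio, minority_ratio)
              for k, v in node.items() if isinstance(v, dict)}
    siblings = {k: v for k, v in node.items() if isinstance(v, list)}
    freq = {k: doc_fraction(docs) for k, docs in siblings.items()}

    for k, docs in siblings.items():
        if len(siblings) == 1:
            # A lone document list has no sibling to compare with: keep it as is
            result[k] = [list(doc) for doc in docs]
            continue
        others = [o for o in siblings if o != k]
        kept = {w for w, f in freq[k].items()
                if f >= majority_ratio
                and all(freq[o].get(w, 0) <= minority_ratio for o in others)}
        result[k] = [[w for w in doc if w in kept] for doc in docs]
    return result

B = [['a','b','c','d','m'],['b','d','m','n'],['c','d','e','o'],['c','e','f','n'],['b','c','e','g']]
C = [['a','m','n'],['a','m','o'],['b','c','m','n'],['c','n','o'],['b','n','o','g']]
D = [['b','m','n'],['b'],['g'],['o','b','g'],['b','g']]
tree = {"A1": {"B": {"B": B, "C": C, "D": D}, "C": C},
        "A2": {"B": B, "C": C}}

filtered = filter_tree(tree)
print(filtered["A1"]["B"]["D"])
```

On the example tree this gives the same lists as the output printed above, while leaving `tree` itself untouched.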
@LaurentH: I have posted the question here. Please help. Thanks for your help.
@AnkitaPatnaik: If this answer works for you, please accept it.