Python: how to find and count multiple intersections between a list and a text?

python python-3.x text

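I am currently writing a program in Python that counts the anglicisms in a German text. I would like to know how many times anglicisms occur in the whole text. For that I have made a list of all anglicisms in the German language, which looks like this: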

abchecken
abchillen
abdancen
abdimmen
abfall-container
abflug-terminal

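And the list goes on... Then I checked the intersection between this list and the text to be analyzed, but that only gives me a list of all the words that appear in both texts, for example: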

Anglicisms  :  4:{'abdancen', 'abchecken', 'terminal'}

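What I would really like is for the program to output how many times each of these words appears (ideally sorted by frequency), for example: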

Anglicisms: abdancen(5), abchecken(2), terminal(1)

This is the code I have so far:

# counters to zero
lines, blanklines, sentences, words = 0, 0, 0, 0

print('-' * 50)

while True:
    try:
        # def text file
        filename = input("Please enter filename: ")
        textf = open(filename, 'r')
        break
    except IOError:
        print('Cannot open file "%s" ' % filename)

# reads one line at a time
for line in textf:
    print(line, )  # test
    lines += 1

    if line.startswith('\n'):
        blanklines += 1
    else:
        # sentence ends with . or ! or ?
        # count these characters
        sentences += line.count('.') + line.count('!') + line.count('?')

        # create a list of words
        # use None to split at any whitespace regardless of length
        tempwords = line.split(None)
        print(tempwords)

        # total words
        words += len(tempwords)

        # anglicisms
        words1 = set(open(filename).read().split())
        words2 = set(open("anglicisms.txt").read().split())

        duplicates = words1.intersection(words2)


textf.close()
print('-' * 50)
print("Lines       : ", lines)
print("Blank lines : ", blanklines)
print("Sentences   : ", sentences)
print("Words       : ", words)
print("Anglicisms  :  %d:%s" % (len(duplicates), duplicates))

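My second problem is that this does not count anglicisms that occur inside other words. For example, if "big" is in the list of anglicisms and "bigfoot" is in the text, that occurrence is ignored. How can I solve this?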

Kind regards from Switzerland

I would do it like this:

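from collections import Counter

# a set makes the "word in anglicisms" test fast
with open("anglicisms.txt") as f:
    anglicisms = set(f.read().split())

matches = []
for line in textf:  # textf is the open text file from the question
    matches.extend(word for word in line.split() if word in anglicisms)

anglicismsInText = Counter(matches)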
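
To get the output format asked for in the question, `Counter.most_common()` returns the counts ordered from most to least frequent. Here is a minimal sketch, assuming the text and the anglicism list are already in memory. The helper name `format_counts` and the sample strings are made up for illustration.

```python
from collections import Counter

def format_counts(text, anglicisms):
    """Count whole-word matches and format them sorted by frequency."""
    wanted = set(anglicisms)
    counts = Counter(word for word in text.split() if word in wanted)
    # most_common() yields (word, count) pairs, highest count first
    return ", ".join(f"{word}({n})" for word, n in counts.most_common())

# made-up sample data, not taken from a real text
sample_text = "abdancen abchecken abdancen terminal abdancen abchecken abdancen abdancen"
sample_anglicisms = ["abchecken", "abdancen", "terminal"]
print("Anglicisms:", format_counts(sample_text, sample_anglicisms))
```

Note that `split()` only splits on whitespace, so a word followed directly by punctuation (for example `terminal.`) would not match in this sketch.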

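As for the second question, I find it a bit harder. Taking your example, "big" is an anglicism and "bigfoot" should match, but what about "Abigail"? Or "overbig"? Should it match whenever an anglicism is found anywhere in the string? Only at the start? Only at the end? Once you know that, you should build a regular expression that matches it.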

Edit: to match strings that start with an anglicism, do the following:

def derivatesFromAnglicism(word):
    # True if the word starts with any anglicism from the list
    return any(word.startswith(a) for a in anglicisms)

matches.extend([word for word in line.split() if derivatesFromAnglicism(word)])
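
If the counts should be grouped by the anglicism a word derives from (so that "bigfoot" is counted under "big"), the prefix test can feed a `Counter` directly. This is a rough sketch, assuming prefix matching is the rule you want. The name `count_prefix_matches`, the lowercasing step and the sample sentence are assumptions for illustration only.

```python
from collections import Counter

def count_prefix_matches(words, anglicisms):
    """Count words that start with an anglicism, keyed by that anglicism."""
    counts = Counter()
    # try longer anglicisms first so the most specific prefix wins
    ordered = sorted(anglicisms, key=len, reverse=True)
    for word in words:
        lowered = word.lower()  # assumption: the anglicism list is lowercase
        for a in ordered:
            if lowered.startswith(a):
                counts[a] += 1
                break  # count each word at most once
    return counts

# made-up sample sentence
sample_words = "Bigfoot will den big Abigail checken".split()
result = count_prefix_matches(sample_words, ["big", "check"])
print(result.most_common())
```

Here "Abigail" is not counted, because "big" appears in the middle of the word rather than at the start.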

This will solve your first question:

anglicisms = ["a", "b", "c"]
words = ["b", "b", "b", "a", "a", "b", "c", "a", "b", "c", "c", "c", "c"]

# map() returns an iterator in Python 3, so build a list before sorting
results = list(map(lambda angli: (angli, words.count(angli)), anglicisms))
results.sort(key=lambda p: -p[1])

The result looks like this:

[('b', 5), ('c', 5), ('a', 3)]

For your second question, I think the right approach is to use regular expressions.
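
One possible shape for such a regular expression is sketched below. It assumes that a word should count when it starts with an anglicism and that case can be ignored. The helper name `build_prefix_pattern` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def build_prefix_pattern(anglicisms):
    """Compile one regex matching any word that starts with an anglicism."""
    # longest first, so a longer alternative is tried before its own prefix
    ordered = sorted(anglicisms, key=len, reverse=True)
    alternatives = "|".join(re.escape(a) for a in ordered)
    # \b anchors at a word start, the group captures the anglicism,
    # and \w* consumes the rest of the word
    return re.compile(r"\b(" + alternatives + r")\w*", re.IGNORECASE)

pattern = build_prefix_pattern(["big", "abchecken"])
# made-up sample sentence
sample_text = "Bigfoot und Abigail wollen das big Terminal abchecken."
counts = Counter(m.group(1).lower() for m in pattern.finditer(sample_text))
print(counts.most_common())
```

Because of the `\b` anchor, "Abigail" does not match, and the trailing full stop after "abchecken" does not get in the way as it would with a plain `split()`.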

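Something like: sorted([{w:text.count(w)} for w in words]). Is that what you want?

Matching only at the start is enough, since most anglicisms are declined at the end, e.g. anime -> animes

@boban added how to match strings that start with an anglicism. If you have too many anglicisms, it may be faster to prebuild a regular expression, or to split the anglicism list into separate lists (e.g. by first character)

Regular expressions are scary!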
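
The idea of splitting the anglicism list by first character could look roughly like this. It is a sketch under the assumption that only prefix matches matter. The function names and sample words are made up, and whether this is actually faster depends on the size of the list.

```python
from collections import defaultdict

def bucket_by_first_char(anglicisms):
    """Group anglicisms by first character so each word checks fewer candidates."""
    buckets = defaultdict(list)
    for a in anglicisms:
        if a:  # skip empty strings
            buckets[a[0]].append(a)
    return buckets

def starts_with_anglicism(word, buckets):
    """True if the word starts with an anglicism sharing its first character."""
    if not word:
        return False
    # .get() avoids creating empty buckets for unseen characters
    return any(word.startswith(a) for a in buckets.get(word[0], ()))

buckets = bucket_by_first_char(["abchecken", "abdancen", "big"])
# made-up sample words
hits = [w for w in "bigfoot abigail abdancend".split() if starts_with_anglicism(w, buckets)]
print(hits)
```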