Python: How to find and count multiple intersections between a list and a text?
Tags: python, python-3.x, text, text-files, intersection
I am currently writing a Python program that counts anglicisms in a German text. I want to know how many times anglicisms occur in the whole text. To do this, I made a list of all anglicisms used in German, which looks like this:
abchecken
abchillen
abdancen
abdimmen
abfall-container
abflug-terminal
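And the list goes on...
I then checked for the intersection between this list and the text to be analyzed, but that only gives me a set of all words that occur in both files, e.g.: Anglicisms : 4:{'abdancen','abchecken','terminal'}
What I would really like is for the program to output how many times each of these words occurs (preferably sorted by frequency), e.g.: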
Anglicisms: abdancen(5), abchecken(2), terminal(1)
This is the code I have so far:
#counters to zero
lines, blanklines, sentences, words = 0, 0, 0, 0

print('-' * 50)

while True:
    try:
        #def text file
        filename = input("Please enter filename: ")
        textf = open(filename, 'r')
        break
    except IOError:
        print('Cannot open file "%s" ' % filename)

#reads one line at a time
for line in textf:
    print(line, )  # test
    lines += 1

    if line.startswith('\n'):
        blanklines += 1
    else:
        #sentence ends with . or ! or ?
        #count these characters
        sentences += line.count('.') + line.count('!') + line.count('?')

        #create a list of words
        #use None to split at any whitespace regardless of length
        tempwords = line.split(None)
        print(tempwords)

        #total words
        words += len(tempwords)

#anglicisms
words1 = set(open(filename).read().split())
words2 = set(open("anglicisms.txt").read().split())

duplicates = words1.intersection(words2)

textf.close()

print('-' * 50)
print("Lines : ", lines)
print("Blank lines : ", blanklines)
print("Sentences : ", sentences)
print("Words : ", words)
print("Anglicisms : %d:%s" % (len(duplicates), duplicates))
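My second problem is that this does not catch anglicisms that occur inside other words. For example, if "big" is in the anglicism list and "bigfoot" is in the text, that occurrence is ignored. How can I solve this?
Kind regards from Switzerland

I would do it like this: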
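from collections import Counter

# a set keeps the membership test below fast, even for a long word list
anglicisms = set(open("anglicisms.txt").read().split())

matches = []
for line in textf:  # textf is the open text file from the question's code
    matches.extend([word for word in line.split() if word in anglicisms])

anglicismsInText = Counter(matches)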
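
To get from the Counter to the output format asked for in the question, one option is `Counter.most_common()`, which returns the (word, count) pairs sorted by descending count. The following is only a minimal sketch: the word set and the sample text are made-up stand-ins for anglicisms.txt and the real input file.

```python
from collections import Counter

# Hypothetical sample data standing in for anglicisms.txt and the German text.
anglicisms = {"abchecken", "abdancen", "terminal"}
text = """abdancen abdancen terminal
wir abchecken das abdancen und abchecken
abdancen abdancen"""

# Same matching logic as above, applied to the sample text line by line.
matches = [word for line in text.splitlines()
           for word in line.split() if word in anglicisms]
counts = Counter(matches)

# most_common() yields (word, count) pairs sorted by descending count.
summary = ", ".join(f"{word}({n})" for word, n in counts.most_common())
print("Anglicisms:", summary)
# prints: Anglicisms: abdancen(5), abchecken(2), terminal(1)
```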
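The second question is a bit harder, I think. Taking your example, "big" is an anglicism and "bigfoot" should match, but what about "Abigail"? Or "overbig"? Should it match whenever an anglicism is found anywhere in a string? Only at the beginning? Only at the end? Once you know that, you should build a regular expression that matches it.
Edit: To match strings that start with an anglicism, do the following: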
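
To make the difference between these choices concrete, here is a small sketch using the standard `re` module. The candidate words and the three policy names ("anywhere", "start", "end") are illustrative assumptions, not part of the original program.

```python
import re

anglicism = "big"  # example anglicism from the question
candidates = ["bigfoot", "Abigail", "overbig", "big"]  # illustrative words

# Three possible matching policies, each compiled case-insensitively.
policies = {
    "anywhere": re.compile(re.escape(anglicism), re.IGNORECASE),
    "start": re.compile(r"^" + re.escape(anglicism), re.IGNORECASE),
    "end": re.compile(re.escape(anglicism) + r"$", re.IGNORECASE),
}

# For each policy, collect the candidate words it accepts.
results = {
    name: [w for w in candidates if pattern.search(w)]
    for name, pattern in policies.items()
}
for name, hits in results.items():
    print(name, hits)
# anywhere ['bigfoot', 'Abigail', 'overbig', 'big']
# start ['bigfoot', 'big']
# end ['overbig', 'big']
```

Matching anywhere also accepts "Abigail", which is probably not wanted; anchoring at the start avoids that.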
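def derivatesFromAnglicism(word):
    # True if the word starts with any of the known anglicisms
    return any(word.startswith(a) for a in anglicisms)

# inside the loop over the lines, instead of the plain membership test
matches.extend([word for word in line.split() if derivatesFromAnglicism(word)])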
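
With prefix matching, it may be more useful to count derived forms under the anglicism they start with rather than under the exact word found. A possible sketch, assuming lowercase entries in the anglicism list; the helper name `base_anglicism` and the sample data are hypothetical:

```python
from collections import Counter

# Hypothetical data; in the real program these come from anglicisms.txt
# and the text file.
anglicisms = ["big", "anime", "abchecken"]
words = "bigfoot animes anime abchecken Abigail big".split()

def base_anglicism(word, anglicisms):
    """Return the longest anglicism that `word` starts with, or None."""
    lowered = word.lower()
    hits = [a for a in anglicisms if lowered.startswith(a)]
    return max(hits, key=len) if hits else None

# Count every word under its base anglicism, skipping words without one.
counts = Counter(
    base for base in (base_anglicism(w, anglicisms) for w in words)
    if base is not None
)
print(counts.most_common())
```

Here "bigfoot" and "big" are both counted under "big", "animes" and "anime" under "anime", and "Abigail" is ignored because it does not start with any entry.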

This will solve your first problem:
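anglicisms = ["a", "b", "c"]
words = ["b", "b", "b", "a", "a", "b", "c", "a", "b", "c", "c", "c", "c"]

# count each anglicism, then sort by descending count
# (in Python 3, map() returns an iterator, which has no .sort() method)
results = [(angli, words.count(angli)) for angli in anglicisms]
results.sort(key=lambda p: -p[1])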
The result looks like this:
[('b', 5), ('c', 5), ('a', 3)]
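For your second question, I think the right approach is to use regular expressions.

Something like sorted([{w: text.count(w)} for w in words]): is that what you want?
Matching only at the beginning is enough, because most anglicisms are inflected at the end, e.g. anime -> animes
@boban Added how to match strings that start with an anglicism. If you have a lot of anglicisms, it may be faster to prebuild a regular expression, or to split the anglicism list into several lists (e.g. by first character).
Regular expressions are scary!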
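
As a rough illustration of the prebuilt regular expression mentioned in the comments, the sketch below joins all anglicisms into one alternation and scans the text in a single pass. The word list and the sample sentence are invented for the example, and whether this is actually faster than `startswith` would have to be measured on the real data.

```python
import re
from collections import Counter

# Hypothetical data standing in for anglicisms.txt and the analysed text.
anglicisms = ["anime", "big", "abflug-terminal"]
text = "Die Animes im Abflug-Terminal sind big, sagt Abigail. Bigfoot mag Anime."

# Longest alternatives first, so that a longer anglicism wins over a shorter
# one sharing the same prefix; re.escape protects characters such as "-".
alternatives = "|".join(
    re.escape(a) for a in sorted(anglicisms, key=len, reverse=True)
)
# \b anchors the match at the start of a word; \w* swallows any ending.
pattern = re.compile(r"\b(" + alternatives + r")\w*", re.IGNORECASE)

# Count each hit under the anglicism it starts with (group 1).
counts = Counter(m.group(1).lower() for m in pattern.finditer(text))
print(counts.most_common())
```

Because of the `\b` anchor, "Abigail" is not counted, while "Animes" and "Bigfoot" are counted under "anime" and "big". Unlike `line.split()`, this also ignores punctuation attached to a word, such as the comma after "big".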