Comparing the first column of two CSV files in Python and printing the matches

I have two CSV files, each containing ngrams that look like this:
drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8
Each line is a three-word phrase followed by a frequency count and then a relative frequency.

I want to write a script that finds the ngrams that appear in both CSV files, divides their relative frequencies, and prints the results to a new CSV file. It should find a match whenever a three-word phrase in one file matches a three-word phrase in the other, then divide the phrase's relative frequency from the first CSV file by its relative frequency from the second CSV file, and print the phrase together with that quotient to a new CSV file.

Here is what I have so far. My script compares rows, but it only finds a match when the entire row (including the frequency and relative frequency) is identical. I realize this is because I am taking the intersection of the two complete row sets, but I don't know how to do it differently. Please forgive me; I am new to coding. Any help that gets me a little closer would be greatly appreciated.
import csv
import io

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)

with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))
matches = set(first_set).intersection(secnd_set)

c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)
print matches
print len(matches)
Here is a version that leaves out the dump of res into a new file (tedious). The first element of each row is the phrase and the other two elements are the frequencies, so use a dict rather than a set to do the matching and the mapping together:
import csv
import io

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)

with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

f_dict = {e[0]: e[1:] for e in alist}
s_dict = {e[0]: e[1:] for e in blist}

res = {}
for k, v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1]) / float(s_dict[k][1])
print(res)
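Run against the question's sample rows and the dummy second file used later in this thread, this dict approach produces clean ratios (each phrase in the first file has exactly 2x or 3x the relative frequency of the second). A minimal sketch, assuming Python 3 and with the rows inlined instead of read from CSV files:

```python
# Minimal sketch (Python 3) of the dict-based matching, using the thread's
# sample rows inline instead of reading them from CSV files.
rows_a = [
    ("drinks while strutting", "4", "1.435486010883783160220299732E-8"),
    ("and since that", "6", "4.306458032651349480660899195E-8"),
    ("the state face", "3", "2.153229016325674740330449597E-8"),
]
rows_b = [
    ("and since that", "3", "2.1532290163256747e-08"),
    ("the state face", "1", "7.1774300544189156e-09"),
    ("drinks while strutting", "2", "7.1774300544189156e-09"),
    ("some silly ngram", "99", "1.235492312e-09"),
]

# Key each row on the phrase; the value keeps the remaining columns.
f_dict = {row[0]: row[1:] for row in rows_a}
s_dict = {row[0]: row[1:] for row in rows_b}

# Divide file A's relative frequency by file B's for every shared phrase.
res = {}
for k, v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1]) / float(s_dict[k][1])

print(res)
```

Phrases missing from either file (like "some silly ngram") simply never enter res, which is exactly the matching behavior the question asks for.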
My script compares rows, but it only finds a match when the entire row (including the frequency and relative frequency) is identical. I realize this is because I am taking the intersection of the two complete sets, but I don't know how to do it differently.
This is exactly what dictionaries are for: use them when you have a separate key and value (or when part of each value is the key), keying each row on its phrase.
Now, you can't use set methods directly on a dictionary. Python 3 gives you some help here (its dict views support set operations), but you're using 2.7, so you have to write it explicitly:
matches = {key for key in a_dict if key in b_dict}
Or you could build sets from the keys and intersect them. But you really don't need the sets at all; all you have to do here is iterate. So:
for key in a_dict:
    if key in b_dict:
        a_values = a_dict[key]
        b_values = b_dict[key]
        do_stuff_with(a_values[2], b_values[2])
As a side note, you really don't need to build the lists in the first place just to turn them into sets or dicts. Just build the sets or dicts directly:
a_set = set()
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_set.add(tuple(row))

a_dict = {}
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_dict[row[0]] = row
Also, if you know comprehensions, all three of these versions are crying out to be converted into them:
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    # Now any of these (pick one; a csv.reader can only be consumed once)
    a_list = list(reader)
    a_set = {tuple(row) for row in reader}
    a_dict = {row[0]: row for row in reader}
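The snippets in this answer are Python 2 (binary-mode file handles, print statements). Under Python 3 the same comprehensions work once the file is opened in text mode with newline=''; a minimal sketch, writing a throwaway sample file first so it runs on its own:

```python
import csv

# Throwaway sample file in the question's three-column layout,
# written here only so the sketch is self-contained.
with open("ngrams.csv", "w", newline="") as f:
    f.write("drinks while strutting,4,1.435486010883783160220299732E-8\n"
            "and since that,6,4.306458032651349480660899195E-8\n")

# In Python 3, open CSV files in text mode with newline="" instead of "rb".
with open("ngrams.csv", newline="") as fileA:
    a_dict = {row[0]: row for row in csv.reader(fileA)}

print(sorted(a_dict))
```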
You can store the relative frequencies from the first file in a dictionary, then iterate over the second file and, whenever the first column matches anything from the first file, write the result straight to the output file:
import csv

tmp = {}
# if 1 file is much larger than the other, load the smaller one here
# make sure it will fit into the memory
with open("ngrams.csv", "rb") as fr:
    # using tuple unpacking to extract fixed number of columns from each row
    for txt, abs, rel in csv.reader(fr):
        # converting strings like "1.435486010883783160220299732E-8"
        # to float numbers
        tmp[txt] = float(rel)

with open("matchedngrams.csv", "wb") as fw:
    writer = csv.writer(fw)
    # the 2nd input file will be processed per 1 line to save memory
    # the order of items from this file will be preserved
    with open("ngramstest.csv", "rb") as fr:
        for txt, abs, rel in csv.reader(fr):
            if txt in tmp:
                # not sure what you want to do with absolute, I use 0 here:
                writer.writerow((txt, 0, tmp[txt] / float(rel)))
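Run against the thread's sample data, this streaming approach yields one output row per shared phrase, in the second file's order. A minimal sketch assuming Python 3 (text-mode files, and the loop variable renamed from abs to absolute so it doesn't shadow the built-in):

```python
import csv

# Sample inputs from this thread, written here so the sketch is self-contained.
with open("ngrams.csv", "w", newline="") as f:
    f.write("drinks while strutting,4,1.435486010883783160220299732E-8\n"
            "and since that,6,4.306458032651349480660899195E-8\n"
            "the state face,3,2.153229016325674740330449597E-8\n")
with open("ngramstest.csv", "w", newline="") as f:
    f.write("and since that,3,2.1532290163256747e-08\n"
            "the state face,1,7.1774300544189156e-09\n"
            "drinks while strutting,2,7.1774300544189156e-09\n"
            "some silly ngram,99,1.235492312e-09\n")

# Load the first file's relative frequencies into a dict keyed on the phrase.
tmp = {}
with open("ngrams.csv", newline="") as fr:
    for txt, absolute, rel in csv.reader(fr):
        tmp[txt] = float(rel)

# Stream the second file and write a ratio row for every shared phrase.
with open("matchedngrams.csv", "w", newline="") as fw:
    writer = csv.writer(fw)
    with open("ngramstest.csv", newline="") as fr:
        for txt, absolute, rel in csv.reader(fr):
            if txt in tmp:
                writer.writerow((txt, 0, tmp[txt] / float(rel)))

# Read the output back to inspect it.
with open("matchedngrams.csv", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Only three of the four phrases in ngramstest.csv appear in ngrams.csv, so the output has three rows.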
Avoid storing very small numbers as they are, because they run into underflow problems, and dividing one tiny number by another invites even more underflow trouble. So preprocess the relative frequencies by taking their logarithm:
>>> import math
>>> num = 1.435486010883783160220299732E-8
>>> logged = math.log(num)
>>> logged
-18.0591772685384
>>> math.exp(logged)
1.4354860108837844e-08
Now read the CSV files. Since you are only manipulating the relative frequencies, the second column doesn't matter, so skip it and store the first column (the phrase) as the key and the third column (the relative frequency) as the value:
import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')

# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'
ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))
Now for the tricky part: you want to divide the relative frequency of each phrase in ngramdict2 by the relative frequency of the same phrase in ngramdict1, i.e.:
if phrase_from_ngramdict1 == phrase_from_ngramdict2:
    relfreq = relfreq_from_ngramdict2 / relfreq_from_ngramdict1
Since we keep the relative frequencies in log space, we don't need to divide; we can simply subtract, i.e.:
if phrase_from_ngramdict1 == phrase_from_ngramdict2:
    logrelfreq = logrelfreq_from_ngramdict2 - logrelfreq_from_ngramdict1
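The division-versus-subtraction equivalence is just the identity log(b/a) = log(b) - log(a); a quick check with two of the sample numbers:

```python
import math

rel1 = 1.435486010883783160220299732E-8   # from the first file
rel2 = 7.1774300544189156e-09             # from the second file

ratio = rel2 / rel1                        # plain division
logdiff = math.log(rel2) - math.log(rel1)  # subtraction in log space

# exp() of the log-space difference recovers the plain ratio.
assert abs(math.exp(logdiff) - ratio) < 1e-12
print(logdiff)
```

Here rel2 is half of rel1, so the printed difference is ln(1/2), about -0.6931.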
To get the phrases that occur in both files, you don't need to check them one by one: convert dictionary.keys() to a set and take set1.intersection(set2), as in the full code below:
[out]:
set(['drinks while strutting', 'the state face', 'and since that'])
Now we print them out together with their relative frequencies:
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)]) + '\n')
The resulting ngramcombined.csv looks like this:
drinks while strutting,-0.69314718056
the state face,-1.09861228867
and since that,-0.69314718056
Here is the full code:
import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')

# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'
ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))

# Find the intersecting phrases.
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)]) + '\n')
Comments:
"for k,v in f_dict" gives ValueError: too many values to unpack, and res[k] = v[1]/s_dict[k][1] gives TypeError: unsupported operand type(s) for /: 'str' and 'str'.
@Aprillion I fixed it; to use tuple unpacking there you need to change f_dict to f_dict.items().
@Aprillion Sorry, I wrote it without testing and reloaded too quickly...
It's a good lecture, but the question is about dividing the relative frequencies; I'm not sure why you work with the absolute ones, and I'm a bit confused how sets are supposed to help here...
@Aprillion: The sets are in the OP's question. The point is to illustrate that he should use a dict rather than a set, and what he should do instead of a set intersection. So I'm not sure why you think the sets should be useful, when the whole answer is about throwing them away.

And if you prefer super-unreadable but short (in line count) code:
import csv, math

# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'
ngramdict1 = {row[0]: math.log(float(row[2])) for row in csv.reader(open(ngramfile1, 'r'), delimiter=',')}
ngramdict2 = {row[0]: math.log(float(row[2])) for row in csv.reader(open(ngramfile2, 'r'), delimiter=',')}

# Find the intersecting phrases.
overlap_phrases = set(ngramdict1.keys()).intersection(set(ngramdict2.keys()))

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        fout.write(",".join([p, str(ngramdict2[p] - ngramdict1[p])]) + '\n')
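As a sanity check, the pipeline above applied to the dummy data should give log-ratios of ln(1/2) for two phrases and ln(1/3) for the third, matching the ngramcombined.csv output shown earlier. A self-contained sketch (assuming Python 3, with the dummy rows inlined instead of written to files):

```python
import csv, math

# Dummy inputs from the answer above, inlined so the sketch is self-contained.
rows1 = {"drinks while strutting": 1.435486010883783160220299732E-8,
         "and since that": 4.306458032651349480660899195E-8,
         "the state face": 2.153229016325674740330449597E-8}
rows2 = {"and since that": 2.1532290163256747e-08,
         "the state face": 7.1774300544189156e-09,
         "drinks while strutting": 7.1774300544189156e-09,
         "some silly ngram": 1.235492312e-09}

# Store log relative frequencies, keyed on the phrase.
ngramdict1 = {p: math.log(r) for p, r in rows1.items()}
ngramdict2 = {p: math.log(r) for p, r in rows2.items()}

# Intersect the key sets, then subtract in log space.
overlap = set(ngramdict1) & set(ngramdict2)
combined = {p: ngramdict2[p] - ngramdict1[p] for p in overlap}

for p in sorted(combined):
    print(p, combined[p])
```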