How do I store csv field values in an array in Python?
Suppose I have two csv files. file1.csv:
event_id, polarity
1124, 0.3763
36794, 0.638
dhejjd, 0.3627
file2.csv:
event_id, tallies
61824, 0.3
36794, 0.8
dhejjd, 0.9
dthdnb, 0.66
I want to access the polarity and tallies for each event_id. How can I read these files into two arrays so that, for each event_id, I get its polarity and tallies and can then perform a calculation with those two values?
I tried this, but could not find my mistake:
for event_id, polarity in file1reader: ValueError: need more than 1 value to unpack
My code:
import csv
file1reader = csv.reader(open("file1.csv"), delimiter=",")
file2reader = csv.reader(open("file2.csv"), delimiter=",")
header1 = file1reader.next() #header
header2 = file2reader.next() #header
for event_id, polarity in file1reader:
    #if event_id and polarity in file1reader:
    for event_id, tallies in file2reader:
        #if event_id in file2reader:
        if file1reader.event_id == file2reader.event_id:
            print event_id, polarity, tallies
            break
file1reader.close()
file2reader.close()
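A common cause of that ValueError is a row that unpacks to fewer than two values, for example a trailing blank line in the file. A minimal reproduction with hypothetical inline data:

```python
import csv
import io

# A blank line in the CSV yields an empty row [], so unpacking it as
# `for event_id, polarity in reader` fails with the ValueError above.
data = "event_id, polarity\n1124, 0.3763\n\n36794, 0.638\n"
reader = csv.reader(io.StringIO(data), delimiter=",")
next(reader)  # skip the header row
rows = list(reader)
print(rows)  # the blank line shows up as []
```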
You don't need to loop over the two csvreader objects. You can first chain the two readers with itertools.chain, then use a dictionary (with setdefault) to store each event_id as a key and its polarity/tallies values in a list:
import csv
from itertools import chain

d = {}
with open('a1.txt') as csvfile1, open('ex.txt') as csvfile2:
    spamreader1 = csv.reader(csvfile1, delimiter=',')
    spamreader2 = csv.reader(csvfile2, delimiter=',')
    next(spamreader1)  # skip header
    next(spamreader2)  # skip header
    sp = chain(spamreader1, spamreader2)
    for i, j in sp:
        d.setdefault(i, []).append(j)
print(d)
Result:
{'36794': ['0.638', '0.8'],
'61824': ['0.3'],
'1124': ['0.3763'],
'dthdnb': ['0.66'],
'dhejjd': ['0.3627', '0.9']}
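The question's stated goal was to compute something from both values per event_id; from the grouped dict this is a short loop. A sketch with the dict above copied inline so it runs on its own:

```python
# Multiply polarity by tallies for event_ids present in both files,
# i.e. the keys whose list collected two values.
d = {'36794': ['0.638', '0.8'], '61824': ['0.3'],
     '1124': ['0.3763'], 'dthdnb': ['0.66'],
     'dhejjd': ['0.3627', '0.9']}
for event_id, values in sorted(d.items()):
    if len(values) == 2:
        polarity, tallies = map(float, values)
        print(event_id, polarity * tallies)
```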
You can group them with a dict:
import csv
from collections import defaultdict

d = defaultdict(list)
with open("file1.csv") as f1, open("file2.csv") as f2:
    next(f1), next(f2)  # skip headers
    r1 = csv.reader(f1, skipinitialspace=True)
    r2 = csv.reader(f2, skipinitialspace=True)
    for row in r1:
        d[row[0]].append(float(row[1]))
    for row in r2:
        d[row[0]].append(float(row[1]))
defaultdict(<class 'list'>, {'36794': [0.638, 0.8], '61824': [0.3], '1124': [0.3763], 'dthdnb': [0.66], 'dhejjd': [0.3627, 0.9]})
from operator import mul
for k, v in filter(lambda x: len(x[1]) == 2, d.items()):
    print(mul(*v))
0.5104
0.32643
Input:
event_id, polarity
1124, 0.3763
36794, 0.638
dhejjd, 0.3627
file2.csv
event_id, tallies
61824, 0.3
36794, 0.8
dhejjd, 0.9
dthdnb, 0.66
Output:
defaultdict(<class 'list'>, {'36794': [0.638, 0.8], '61824': [0.3], '1124': [0.3763], 'dthdnb': [0.66], 'dhejjd': [0.3627, 0.9]})
The first time you loop over file2, the reader hits StopIteration and the file pointer stays at the end. To read it more than once you would have to open it more than once; the whole approach is wasteful anyway. Assuming all the data fits in memory, just read it into dicts:
import csv

file1 = {}
file2 = {}
with open('file1.csv', 'r') as input1:
    reader = csv.reader(input1)
    next(reader)  # skip header
    for row in reader:
        file1[row[0]] = row[1]
with open('file2.csv', 'r') as input2:
    reader = csv.reader(input2)
    next(reader)  # skip header
    for row in reader:
        file2[row[0]] = row[1]
# And now we can directly compare without looping through file 2 every time
for key in file1:
    # try/except is more pythonic.
    try:
        print(key, file1[key], file2[key])
    except KeyError:
        pass
This saves processing time, because you avoid the nested loop and you don't open and close the file again on every iteration over file1.
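The exhaustion problem is easy to see in isolation: a csv reader yields its rows only once. A minimal sketch with inline data:

```python
import csv
import io

data = "a,1\nb,2\n"
reader = csv.reader(io.StringIO(data))
first_pass = list(reader)   # consumes every row
second_pass = list(reader)  # the reader is now exhausted
print(first_pass)   # [['a', '1'], ['b', '2']]
print(second_pass)  # []
```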
Note: I originally used DictReader in this example, on the assumption that you have many columns, which I now think was wrong; in that case plain list indexing is enough. If you do have multiple named columns, possibly in a different order, then you should use DictReader, with code like this:
import csv

file1 = {}
file2 = {}
with open('file1.csv', 'r') as input1:
    reader = csv.DictReader(input1)
    # Don't skip the first row; DictReader uses the headers as keys
    for row in reader:
        file1[row['event_id']] = row['polarity']
with open('file2.csv', 'r') as input2:
    reader = csv.DictReader(input2)
    # Don't skip the first row; DictReader uses the headers as keys
    for row in reader:
        file2[row['event_id']] = row['tallies']
# And now we can directly compare without looping through file 2 every time
for key in file1:
    # try/except is more pythonic.
    try:
        print(key, file1[key], file2[key])
    except KeyError:
        pass
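As an aside, the try/except lookup can also be written as a set intersection of the two dicts' keys, which iterates only over event_ids present in both files. A sketch with the sample data inlined:

```python
# The dict contents mirror file1.csv and file2.csv from the question.
file1 = {'1124': '0.3763', '36794': '0.638', 'dhejjd': '0.3627'}
file2 = {'61824': '0.3', '36794': '0.8', 'dhejjd': '0.9', 'dthdnb': '0.66'}
# dict.keys() views support set operations, so & gives the shared keys.
for key in sorted(file1.keys() & file2.keys()):
    print(key, file1[key], file2[key])
```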
Use a DataFrame instead of numpy arrays:
import pandas as pd
df = pd.read_csv("file1.csv", index_col="event_id", skipinitialspace=True)
df2 = pd.read_csv("file2.csv", index_col="event_id", skipinitialspace=True)
df = df.merge(df2, how='outer', left_index=True, right_index=True)
P.S. I corrected the code so that it runs. The 'outer' join means that if only polarity or only tallies exists for a given event_id, the missing value is encoded as NaN. The output is:
polarity tallies
event_id
1124 0.3763 NaN
36794 0.6380 0.80
61824 NaN 0.30
dhejjd 0.3627 0.90
dthdnb NaN 0.66
If you only want rows where both values are present, use how='inner'.
P.P.S. To work further with this DataFrame you can, for example, replace the NaNs with some value, say 0:
df.fillna(0, inplace=True)
You can select elements by label:
df.loc["dhejjd","polarity"]
df.loc[:,"tallies"]
or by integer position:
df.iloc[0:3,:]
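To carry out the original per-event calculation on the merged frame, column arithmetic does it in one line, since NaN propagates through the product. A sketch that rebuilds the merged data inline (values taken from the merge output shown above) so it runs stand-alone:

```python
import pandas as pd

# Rebuild the merged frame: NaN (None) where an event_id is missing
# from one of the two files.
df = pd.DataFrame(
    {"polarity": [0.3763, 0.638, None, 0.3627, None],
     "tallies": [None, 0.80, 0.30, 0.90, 0.66]},
    index=pd.Index(["1124", "36794", "61824", "dhejjd", "dthdnb"],
                   name="event_id"),
)
# NaN * anything is NaN, so the product exists only where both columns do
df["product"] = df["polarity"] * df["tallies"]
print(df["product"].dropna())
```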
If you have never used pandas, it takes some time to learn and get used to, but it is worth every second.

I suggest storing the data from both files in a dictionary of dictionaries, which collections.defaultdict makes easy to create:
import csv
from collections import defaultdict
import json  # just for pretty printing resulting data structure

event_data = defaultdict(dict)
filename1 = "file1.csv"
filename2 = "file2.csv"
with open(filename1) as file1:
    file1reader = csv.reader(file1, delimiter=",", skipinitialspace=True)
    next(file1reader)  # skip over header
    for event_id, polarity in file1reader:
        event_data[event_id]['polarity'] = float(polarity)
with open(filename2) as file2:
    file2reader = csv.reader(file2, delimiter=",", skipinitialspace=True)
    next(file2reader)  # skip over header
    for event_id, tallies in file2reader:
        event_data[event_id]['tallies'] = float(tallies)
print('event_data:', json.dumps(event_data, indent=4))
print()
# print as table (use !s so missing values format as "None")
for event_id in sorted(event_data):
    print('event_id: {!r:<8} polarity: {!s:<8} tallies: {!s:<8}'.format(
        event_id,
        event_data[event_id].get('polarity', None),
        event_data[event_id].get('tallies', None)))
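With event_data built this way, both values for an event are one lookup away. A small usage sketch; the dict literal mirrors the structure the code above would build from the sample files:

```python
event_data = {
    '1124':   {'polarity': 0.3763},
    '36794':  {'polarity': 0.638, 'tallies': 0.8},
    'dhejjd': {'polarity': 0.3627, 'tallies': 0.9},
    '61824':  {'tallies': 0.3},
    'dthdnb': {'tallies': 0.66},
}
fields = event_data['dhejjd']
# Only compute when the event appeared in both files.
if 'polarity' in fields and 'tallies' in fields:
    result = fields['polarity'] * fields['tallies']
    print(result)
```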
Comments:
- What doesn't work? Be specific. Do you get an error?
- Yes: for event_id, polarity in file1reader: ValueError: need more than 1 value to unpack
- @MEH, does your file have blank lines or extra spaces? Can you post the actual format of your csv files?
- I get: Traceback (most recent call last): File "combined5.py", line 14, in file1[row['event_id']] = row['polarity'] KeyError: 'polarity'
- Check that the column header names match the keys; capitalization, extra spaces, etc. all matter. You don't actually need DictReader for this, you can use list indexing; I have changed my answer to match.
- My actual keys are event_id and polarity, and event_id and tallies.
- If you switch file1[row['event_id']] = row['polarity'] to print row.keys(), what keys does it print? The error says 'polarity' is not exactly what is in your header row.
- Not my downvote, but this won't even run, and you don't link to pandas (which isn't built in) or explain what it is doing.
- @Padraic Cunningham: thanks for the constructive criticism. I corrected and extended my answer accordingly.