How to store csv field values in an array in Python

Suppose I have two csv files. file1.csv:

event_id, polarity
   1124,   0.3763
  36794,   0.638
 dhejjd,   0.3627
file2.csv:

event_id, tallies
   61824,   0.3
   36794,   0.8
   dhejjd,   0.9
   dthdnb,   0.66
I want to access the polarity and the tallies for each event_id. How can I read these files into two arrays so that for each [event_id] I can look up its polarity and tallies and then do a calculation with the two values? I tried this, but I cannot see what I am doing wrong; I get this error:

 for event_id, polarity in file1reader: ValueError: need more than 1 value to unpack
My code:

import csv

file1reader = csv.reader(open("file1.csv"), delimiter=",")
file2reader = csv.reader(open("file2.csv"), delimiter=",")

header1 = file1reader.next() #header
header2 = file2reader.next() #header

for event_id, polarity in file1reader:
    #if event_id and polarity in file1reader:
      for event_id, tallies in file2reader:
        #if event_id in file2reader:
          if file1reader.event_id == file2reader.event_id:
            print event_id, polarity, tallies   
            break   
file1reader.close()
file2reader.close() 
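For reference, that ValueError means csv.reader produced a row with only a single field (for example a line without a comma, or a delimiter mismatch), so unpacking it into event_id, polarity fails. A minimal sketch, assuming the file layout shown above, that skips such rows while keeping the rest of the loop:

import csv

with open("file1.csv") as f:
    file1reader = csv.reader(f, delimiter=",", skipinitialspace=True)
    next(file1reader)                  # skip the header row
    for row in file1reader:
        if len(row) < 2:               # blank or malformed line: nothing to unpack
            continue
        event_id, polarity = row[0], row[1]
        print event_id, polarity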

You don't need to loop over the two csvreader objects against each other. You can first join the two readers with itertools.chain and then use a dict (via dict.setdefault) to store the event_id as the key and the polarity / tallies as the values:

import csv
from itertools import chain

d = {}
with open('file1.csv', 'rb') as csvfile1, open('file2.csv', 'rb') as csvfile2:
    spamreader1 = csv.reader(csvfile1, delimiter=',', skipinitialspace=True)
    spamreader2 = csv.reader(csvfile2, delimiter=',', skipinitialspace=True)
    spamreader1.next()  # skip header
    spamreader2.next()  # skip header
    sp = chain(spamreader1, spamreader2)  # iterate over both files in one pass
    for i, j in sp:
        d.setdefault(i, []).append(j)     # collect every value seen for each event_id
print d
Result:

{'36794': ['0.638', '0.8'], 
 '61824': ['0.3'], 
 '1124': ['0.3763'], 
 'dthdnb': ['0.66'], 
 'dhejjd': ['0.3627', '0.9']}
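As a follow-up sketch (assuming the dict d built above, whose values are still strings), you can then do a calculation with the two values for every id that occurs in both files, for example multiply them:

for event_id, values in d.items():
    if len(values) == 2:               # id present in both files
        polarity, tallies = map(float, values)
        print event_id, polarity * tallies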

You can group them with a dict:

import csv
from collections import defaultdict

d = defaultdict(list)

with open("file1.csv") as f1, open("file2.csv") as f2:
    next(f1), next(f2)  # skip the header line of each file
    r1 = csv.reader(f1, skipinitialspace=True)
    r2 = csv.reader(f2, skipinitialspace=True)
    for row in r1:
        d[row[0]].append(float(row[1]))  # event_id -> [polarity]
    for row in r2:
        d[row[0]].append(float(row[1]))  # event_id -> [polarity, tallies]

defaultdict(<type 'list'>, {'36794': [0.638, 0.8], '61824': [0.3], '1124': [0.3763], 'dthdnb': [0.66], 'dhejjd': [0.3627, 0.9]})

To do something with the two values, e.g. multiply them for every event_id that appears in both files:

from operator import mul
for k, v in filter(lambda x: len(x[1]) == 2, d.items()):  # only ids with both values
    print(mul(*v))

0.5104
0.32643
Input:

event_id, polarity
   1124,   0.3763
  36794,   0.638
 dhejjd,   0.3627
file2.csv

event_id, tallies
   61824,   0.3
   36794,   0.8
   dhejjd,   0.9
   dthdnb,   0.66
Output:

defaultdict(<type 'list'>, {'36794': [0.638, 0.8], '61824': [0.3], '1124': [0.3763], 'dthdnb': [0.66], 'dhejjd': [0.3627, 0.9]})

When you loop over the file2 reader for the first time, it is exhausted (StopIteration) and the file pointer stays at the end, so the inner loop only ever runs once. To read it repeatedly you would have to reopen it every time, and the whole nested loop is wasted effort anyway. Assuming you can fit all the data in memory, just read the data into dicts:

import csv

file1 = {}
file2 = {}

with open('file1.csv', 'r') as input1:
    reader = csv.reader(input1)
    reader.next()  # skip the header row
    for row in reader:
        file1[row[0]] = row[1]

with open('file2.csv', 'r') as input2:
    reader = csv.reader(input2)
    reader.next()  # skip the header row
    for row in reader:
        file2[row[0]] = row[1]

# And now we can directly compare without looping through file 2 every time
for key in file1:
    # try/except is more pythonic.
    try:
        print key, file1[key], file2[key]
    except KeyError:
        pass
This saves processing time because you loop far less, and you don't have to open and close file2 every time you move to the next row of file1.
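Building on that, a small sketch (assuming the file1 and file2 dicts built above, whose values are still strings) of the calculation the question asks for, done per shared event_id:

for key in file1:
    if key in file2:                   # only events present in both files
        print key, float(file1[key]) * float(file2[key])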

Note: I originally used DictReader in this example, but that was based on the assumption that you have more than two columns, which I now think was wrong. In that case just use list indexing, as above.

If you do have several columns, possibly with the same names in a different order, you can use DictReader. If that is your situation, the code looks like this:

import csv

file1 = {}
file2 = {}

with open('file1.csv', 'r') as input1:
    reader = csv.DictReader(input1)
    # Don't use next so we can use the headers as keys
    for row in reader:
        file1[row['event_id']] = row['polarity']

with open('file2.csv', 'r') as input2:
    reader = csv.DictReader(input2)
    # Don't use next so we can use the headers as keys
    for row in reader:
        file2[row['event_id']] = row['tallies']

# And now we can directly compare without looping through file 2 every time
for key in file1:
    # try/except is more pythonic.
    try:
        print key, file1[key], file2[key]
    except KeyError:
        pass
Use a pandas DataFrame instead of numpy arrays:

import pandas as pd
df = pd.read_csv("file1.csv", index_col="event_id", skipinitialspace=True)
df2 = pd.read_csv("file2.csv", index_col="event_id", skipinitialspace=True)
df = df.merge(df2, how='outer', left_index=True, right_index=True)
P.S. I corrected the code so that it runs. The 'outer' join means that if only the polarity or only the tallies exists for a given event_id, the missing value is encoded as NaN. The output is:

          polarity  tallies
event_id                   
1124        0.3763      NaN
36794       0.6380     0.80
61824          NaN     0.30
dhejjd      0.3627     0.90
dthdnb         NaN     0.66
If you only need rows where both values are present, use how='inner' instead.
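As a follow-up sketch (building on the outer-merged df above; the 'product' column name is just an example), dropping the NaN rows leaves exactly the events present in both files, and the calculation becomes a one-liner:

both = df.dropna().copy()              # events that have both polarity and tallies
both["product"] = both["polarity"] * both["tallies"]
print(both)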

P.P.S. To work further with this DataFrame you can, for example, replace the NaNs with some value, say 0:
df.fillna(0, inplace=True)
You can select elements by label:

df.loc["dhejjd","polarity"]
df.loc[:,"tallies"]
or by integer position:

df.iloc[0:3,:]

If you have never used pandas, it takes some time to learn it and get used to it, but it is worth every second.

I suggest storing the data from both files in a dictionary of dictionaries, which is easy to build with collections.defaultdict:

import csv
from collections import defaultdict
import json  # just for pretty printing resulting data structure

event_data = defaultdict(dict)

filename1 = "file1.csv"
filename2 = "file2.csv"

with open(filename1, "rb") as file1:
    file1reader = csv.reader(file1, delimiter=",", skipinitialspace=True)
    next(file1reader)  # skip over header
    for event_id, polarity in file1reader:
        event_data[event_id]['polarity'] = float(polarity)

with open(filename2, "rb") as file2:
    file2reader = csv.reader(file2, delimiter=",", skipinitialspace=True)
    next(file2reader)  # skip over header
    for event_id, tallies in file2reader:
        event_data[event_id]['tallies'] = float(tallies)

print 'event_data:', json.dumps(event_data, indent=4)
print

# print as table
for event_id in sorted(event_data):
    print 'event_id: {!r:<8} polarity: {:<8} tallies: {:<8}'.format(
        event_id,
        event_data[event_id].get('polarity', None),
        event_data[event_id].get('tallies', None))
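A small usage sketch (assuming the event_data structure built above): combine the two values for events that carry both keys, for example by multiplying them:

for event_id, data in sorted(event_data.items()):
    if 'polarity' in data and 'tallies' in data:   # event appears in both files
        print event_id, data['polarity'] * data['tallies']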

Comments:

What doesn't work? Be specific. Do you get an error? Have you actually looked?
Yes: for event_id, polarity in file1reader: ValueError: need more than 1 value to unpack
@MEH, does the file have stray whitespace in it?
@MEH, can you post the actual format of your csv files? Your question is a mess.
I get: Traceback (most recent call last): File "combined5.py", line 14, in file1[row['event_id']] = row['polarity'] KeyError: 'polarity'
Check that the column header names match the keys. Note: you don't need DictReader for this, you can use list indexing, and I have changed my answer to match. Watch out for capitalization, extra spaces and so on; they all matter. You can print row to see what the actual keys look like.
My actual keys are event_id and polarity, and likewise event_id and tallies.
If you switch file1[row['event_id']] = row['polarity'] to print row.keys(), what keys get printed? The error suggests that 'polarity' is not exactly what appears in the header row.
To whoever downvoted: if our answers are really that bad, how are we supposed to find out what we are doing wrong?
Not my downvote, but this doesn't even run, so I'm not too surprised it was downvoted; you also don't link to pandas, which is not built in, or explain what it is doing.
@Padraic Cunningham: thanks for the constructive criticism. I corrected and extended my answer accordingly.
@Vignesh Kalai: thanks for the support.
@lanenok that's the Stack Overflow spirit.