Show each user id in a two-column CSV file once, with its meetings collected into a list - python


The link above explains this well, but my case is slightly different:

user     meetings
178787    287750
178787    151515
178787    158478
576585    896352
576585    985639
576585    456988

The expected result is:

user       meetings
178787   "[287750,151515,158478]"
576585   "[896352,985639,456988]"

How can I achieve this with Python and the code above? Thanks in advance.

You can read the file row by row and add each meeting to a dictionary keyed by user. This can be done quite neatly:

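from collections import defaultdict
import csv

inpath = ''  # Path to input CSV file
outpath = ''  # Path to output CSV file

output = defaultdict(list)  # Dictionary like {user_id: [meetings]}

# Assumes a comma-delimited file whose header row is "user,meetings"
with open(inpath, newline='') as f:
    for row in csv.DictReader(f):
        output[row['user']].append(row['meetings'])

with open(outpath, 'w') as f:
    f.write('user,meetings\n')
    for user, meetings in output.items():
        # Quote the list so its commas are not read as column separators
        row = '{},"[{}]"\n'.format(user, ','.join(meetings))
        f.write(row)
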
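To try the grouping without touching any files, here is a minimal sketch that applies the same defaultdict idea to an in-memory string and lets csv.writer decide when a field needs quoting. The helper name group_csv and the sample text are made up for illustration, and the sketch assumes comma-delimited input with a user,meetings header row.

```python
import csv
import io
from collections import defaultdict


def group_csv(text):
    """Group the 'meetings' column by 'user' and return CSV text.

    Hypothetical helper: assumes comma-delimited input with a
    'user,meetings' header row.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row['user']].append(row['meetings'])

    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['user', 'meetings'])
    for user, meetings in grouped.items():
        # A joined field that contains commas is quoted by csv.writer.
        writer.writerow([user, '[' + ','.join(meetings) + ']'])
    return out.getvalue()


sample = "user,meetings\n178787,287750\n178787,151515\n576585,896352\n"
print(group_csv(sample))
```

Note that a user with a single meeting produces a field without commas, so csv.writer leaves it unquoted under its default quoting rule.
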
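We can then write this dictionary back to the same file, using tabs so that everything lines up.

So, assuming your file is named f.csv, the code looks like this: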

d = {}
with open('f.csv') as f:
    for l in f.read().splitlines()[1:]:  # skip the header row
        if not l.strip():
            continue  # ignore blank lines
        u, m = l.split()
        d.setdefault(u, []).append(m)

with open('f.csv', 'w') as f:
    f.write('user\tmeetings\n')
    for u, m in d.items():
        f.write(u + '\t' + str(m) + '\n')

This produces the following output:

user    meetings
178787  ['287750', '151515', '158478']
576585  ['896352', '985639', '456988']
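The output above is Python's list repr, with quotes around each id, which differs slightly from the expected result in the question. A minimal sketch of formatting each row as user "[a,b,c]" instead might look like this; the function name format_row and the sample lines are invented for illustration.

```python
def format_row(user, meetings, sep='\t'):
    """Return one output line such as: 178787<TAB>"[287750,151515]".

    Hypothetical helper; `meetings` is assumed to be a list of strings.
    """
    return '{}{}"[{}]"'.format(user, sep, ','.join(meetings))


# Group a few sample lines with dict.setdefault, as in the answer above.
d = {}
for line in ["178787 287750", "178787 151515", "576585 896352"]:
    u, m = line.split()
    d.setdefault(u, []).append(m)

rows = [format_row(u, m) for u, m in d.items()]
print('\n'.join(rows))
```
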
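
Since the user will be the key, let's build a dictionary. Note: this ends up loading the whole file into memory once, but it does not require the file to be sorted by user first. Also note that the output is not sorted: dict.items() returns entries in insertion order on Python 3.7+ and in no guaranteed order on older versions.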

output = {}
with open('input.csv') as f:
    for line in f:
        fields = line.split()  # split() on whitespace also drops the newline
        if len(fields) != 2 or fields[0] == 'user':
            continue  # skip blank or malformed lines and the header row
        user, meeting = fields

        if user not in output:
            # the user was not found in the dict
            output[user] = [meeting]  # add the user, with the first meeting
        else:  # user already exists in dict
            output[user].append(meeting)  # add meeting to user entry

# print output header
print("user meetings")  # I used a single space, feel free to use '\t' etc.
# let's retrieve all meetings per user
for user, meetings in output.items():  # in python2, use .iteritems() instead
    meetings = ','.join(meetings)  # format ["1","2","3"] to "1,2,3"
    print('{} "[{}]"'.format(user, meetings))

Fancier: sorted output. I sort the keys first. Note that this uses more memory, because I am also building a list of the keys.

# same as before
output = {}
with open('input.csv') as f:
    for line in f:
        fields = line.split()  # split() on whitespace also drops the newline
        if len(fields) != 2 or fields[0] == 'user':
            continue  # skip blank or malformed lines and the header row
        user, meeting = fields

        if user not in output:
            # the user was not found in the dict
            output[user] = [meeting]  # add the user, with the first meeting
        else:  # user already exists in dict
            output[user].append(meeting)  # add meeting to user entry

# print output header
print("user meetings")  # I used a single space, feel free to use '\t' etc.

# sort my dict keys before printing them:
for user in sorted(output.keys()):
    meetings = ','.join(output[user])
    print('{} "[{}]"'.format(user, meetings))

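One caveat with sorting the keys: they are strings, so they sort lexicographically. That is fine when every id has the same number of digits, as in the sample data, but ids of different lengths would come out in a surprising order. A small sketch, assuming every user id is a numeric string (the sample dict below is invented for illustration):

```python
# Invented sample data with ids of different lengths.
output = {
    '576585': ['896352'],
    '99': ['111'],
    '178787': ['287750', '151515'],
}

# Plain string sort compares character by character, so '99' lands last.
string_order = sorted(output)

# Numeric sort: compare the ids as integers instead.
numeric_order = sorted(output, key=int)

print("user meetings")
for user in numeric_order:
    print('{} "[{}]"'.format(user, ','.join(output[user])))
```
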

Pandas offers a nice solution:

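import pandas as pd

# read_csv has no `columns` argument; usecols selects the two columns.
# For whitespace-separated data, also pass sep=r'\s+'.
df = pd.read_csv('myfile.csv', usecols=['user', 'meetings'])
df_grouped = df.groupby('user')['meetings'].apply(list).astype(str).reset_index()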

Please post your current code.

I like the use of dict.setdefault here.

@cowbert Yes, that is a neat use of it.

Thank you very much jp_data_analysis, it works well.