How to group subsequent rows with the same key in a CSV file in Python

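I am trying to merge the col3 values whenever col1 equals the value in the previous row, and then write the output to a new file. I have a CSV file that looks like this: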

col1,col2,col3
a,12,"hello "
a,13,"good day"
a,14,"nice weather"
b,1,"cat"
b,2,"dog and cat"
c,2,"animals are cute"
The output I want:

col1,col3
a,"hello good day nice weather"
b,"cat dog and cat"
c,"animals are cute"
This is what I have tried:

import csv

with open('myfile.csv', 'rb') as inputfile, open('outputfile.csv','wb') as outputfile:
    reader=csv.reader(inputfile)
    writer=csv.writer(outputfile)
    next(reader)
    for row in reader:
        while row[0]==row[0]:
            concat_text=" ".join(row[2])
        print concat_text
        writer.writerow((row[0],concat_text))

It runs, but I get no output. Any help is appreciated.

If you are interested in using pandas, you can group the DataFrame and then output the unique values:

import pandas as pd

df = pd.read_csv('test.txt')
print(df)
Original DataFrame:

  col1  col2              col3
0    a    12            hello 
1    a    13          good day
2    a    14      nice weather
3    b     1               cat
4    b     2       dog and cat
5    c     2  animals are cute
Second DataFrame:

df2 = df.groupby(df['col1'])
df2 = df2['col3'].unique()
df2 = df2.reset_index()

print(df2)
This results in:

  col1                              col3
0    a  [hello , good day, nice weather]
1    b                [cat, dog and cat]
2    c                [animals are cute]
To concatenate the third column, use apply:

df2['col3'] = df2['col3'].apply(lambda x: ' '.join(s.strip() for s in x))

  col1                          col3
0    a   hello good day nice weather
1    b               cat dog and cat
2    c              animals are cute

Full code:

import pandas as pd

df = pd.read_csv('test.txt')
df2 = df.groupby(df['col1'])

df2 = df2['col3'].unique()
df2 = df2.reset_index()

df2['col3'] = df2['col3'].apply(lambda x: ' '.join(s.strip() for s in x))

# index=False keeps the row index out of the file, so the header is col1,col3
df2.to_csv('output.csv', index=False)
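
Two things to keep in mind with the approach above: unique() drops a text that occurs more than once within a group, and groupby on col1 merges every row with the same key, not only rows that follow each other. If only adjacent rows should be merged and repeated texts should be kept, something like the following sketch might work. The names join_runs, run and CSV_TEXT are made up for illustration, and the sample data is inlined instead of being read from test.txt.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
CSV_TEXT = '''col1,col2,col3
a,12,"hello "
a,13,"good day"
a,14,"nice weather"
b,1,"cat"
b,2,"dog and cat"
c,2,"animals are cute"
'''


def join_runs(df):
    """Join col3 for each run of adjacent rows that share the same col1."""
    # A new run starts wherever col1 differs from the row before it.
    run_id = (df['col1'] != df['col1'].shift()).cumsum().rename('run')
    joined = (
        df.groupby([run_id, df['col1']], sort=False)['col3']
        .agg(lambda s: ' '.join(v.strip() for v in s))  # keeps repeated texts
    )
    # Turn col1 back into a column and discard the helper run counter.
    return joined.reset_index(level='col1').reset_index(drop=True)


df = pd.read_csv(io.StringIO(CSV_TEXT))
print(join_runs(df).to_csv(index=False))
```

For the data in the question this gives the same rows as the unique() version. The two only differ when a key reappears later in the file or a text repeats within a group.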

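The problem is that you are comparing a row with itself. This version compares the previous row with the current row. The output is not quote-delimited, but it is correct. Contents of script.py: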

#!/usr/bin/env python

import csv

with open('myfile.csv', 'rb') as inputfile, open('outputfile.csv','wb') as outputfile:
    reader=csv.reader(inputfile)
    writer=csv.writer(outputfile)
    next(reader)
    lastRow = None
    # assumes data is in order on first column
    for row in reader:
        if not lastRow:
            # start processing line with the first column and third column
            concat_text = row[2].strip()
            lastRow = row
            print concat_text
        else:
            if lastRow[0]==row[0]:
                # add to line
                concat_text = concat_text + ' ' + row[2].strip()
                print concat_text
            else:
                # end processing
                print concat_text
                writer.writerow((lastRow[0],concat_text))
                # start processing
                concat_text = row[2].strip()
                print concat_text
            lastRow = row
    # write out last element
    print concat_text
    writer.writerow((lastRow[0],concat_text))
Contents of outputfile.csv after running ./script.py:

a,hello good day nice weather
b,cat dog and cat
c,animals are cute
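
The script above is written for Python 2 (print statements, files opened in 'rb'/'wb' mode). As a rough Python 3 alternative, the same previous-row comparison can be left to itertools.groupby, which only groups adjacent items with equal keys. This is a sketch, not a drop-in replacement: group_consecutive and SAMPLE are illustrative names, and it assumes every data row has at least three columns.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def group_consecutive(infile, outfile):
    """Merge col3 of consecutive rows that share the same col1 value."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile, lineterminator='\n')
    next(reader, None)                  # skip the input header, if any
    writer.writerow(('col1', 'col3'))
    # groupby only merges adjacent rows with equal keys, so the file
    # does not need to be sorted, just grouped as in the question.
    for key, rows in groupby(reader, key=itemgetter(0)):
        writer.writerow((key, ' '.join(row[2].strip() for row in rows)))


# In-memory demo using the data from the question.
SAMPLE = '''col1,col2,col3
a,12,"hello "
a,13,"good day"
a,14,"nice weather"
b,1,"cat"
b,2,"dog and cat"
c,2,"animals are cute"
'''
out = io.StringIO()
group_consecutive(io.StringIO(SAMPLE), out)
print(out.getvalue())

# With real files in Python 3, open both in text mode with newline='':
# with open('myfile.csv', newline='') as src, \
#         open('outputfile.csv', 'w', newline='') as dst:
#     group_consecutive(src, dst)
```

As with the script above, the fields are not quoted, because the csv module only quotes when it has to. Passing quoting=csv.QUOTE_ALL to csv.writer would quote every field, including col1 and the header.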

while row[0]==row[0]: … never advances; it is an infinite loop.
That is because hello has a trailing space in the original data.
@Leb remember to add df2.to_csv('somefile.csv').
@Ilja added. Thanks.
Thanks, I think pandas is another good way.
You're welcome. This answer is just meant as an option for you and any possible future viewers. If you cannot use pandas, the other answers here will be spot on.