Pandas: handling newline characters in a DataFrame (pandas, Amazon Athena)


When I group by a column, I get unexpected values that come from another field.

Here is the sample data:

This is how I import it:

import pandas as pd
df = pd.read_csv('s3://todel162/bigd/test.csv', header=None, escapechar='\\')
df.columns=['id', 'client', 'code', 'm_text', 'atpt', 'date']
df.groupby('id')['id'].count()

The output is:

id
1234                                1
3456                                1
5432                              118
report it as soon as possible"      1
Name: id, dtype: int64

Basically, all the text between two double quotes should be part of a single cell. For example:

"this is line one
and some text on line two"

Is there a way to import data like this correctly (without modifying the source file)?

In this particular case, you can use the skipinitialspace parameter:

df = pd.read_csv('Book1.csv', header=None, skipinitialspace=True, escapechar='\\')
df.loc[115:]

        0          1      2  \
115  5432  some_code  case0   
116  5432  some_code  case0   
117  5432  some_code  case0   
118  1234  some_code  case1   
119  3456   new_code  case2   

                                                     3  4               5  
115                                         this is ok  6  20181201031613  
116                                         this is ok  6  20181201031613  
117                                         this is ok  6  20181201031613  
118  welcome to this new bug and \nreport it as soo...  3  20181201031613  
119  this is another newline \nfollowed by a back s...  4  20181201031613
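
The likely reason this works is that the source file has a space between the comma and the opening double quote, so by default the parser does not treat the quote as the start of a quoted field and the embedded newline ends the record early. The sketch below illustrates that on a small in-memory sample. The SAMPLE text and the load helper are hypothetical, modeled on the rows shown above, and are not the original file.

```python
import io

import pandas as pd

# Hypothetical sample modeled on the rows shown above. Note the space
# between the comma and the opening double quote in the second record.
SAMPLE = (
    '5432,some_code,case0,this is ok,6,20181201031613\n'
    '1234,some_code,case1, "welcome to this new bug and \n'
    'report it as soon as possible",3,20181201031613\n'
    '3456,new_code,case2,plain text,4,20181201031613\n'
)


def load(skip_space):
    """Parse SAMPLE with or without skipping spaces after the delimiter."""
    return pd.read_csv(
        io.StringIO(SAMPLE), header=None, skipinitialspace=skip_space
    )


broken = load(False)  # the quoted field is split at the embedded newline
fixed = load(True)    # the quoted field stays in a single cell

print(len(broken), len(fixed))
print(repr(fixed.loc[1, 3]))
```

With skipinitialspace=True the leading space is dropped, the double quote is recognized as opening a quoted field, and the newline stays inside the cell instead of starting a new row.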

If you want to remove the \n from the strings, simply use:

df[3] = df[3].str.replace('\n', '')
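
Removing the newline outright can join two words or leave a stray space, depending on the text around it. As a possible variation, the sketch below replaces each line break, together with any surrounding whitespace, with a single space. The sample Series and the regular expression are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical cell values containing embedded line breaks.
text = pd.Series([
    'welcome to this new bug and \nreport it as soon as possible',
    'line one\r\nline two',
    'this is ok',
])

# Collapse each line break (LF or CRLF) and the whitespace around it
# into one space; regex=True is passed explicitly because the default
# has differed between pandas versions.
cleaned = text.str.replace(r'\s*\r?\n\s*', ' ', regex=True)

print(cleaned.tolist())
```

Applied to the frame above, the same call would be df[3] = df[3].str.replace(r'\s*\r?\n\s*', ' ', regex=True).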


I was almost certain this could not be done without some pre- or post-processing. Very nice, @Chris, +1 :)