Pandas: concatenate certain columns if other columns are empty
I have a CSV file that should look like this:
ID, years_active, issues
-------------------------------
'Truck1', 8, 'In dire need of a paintjob'
'Car 5', 3, 'To small for large groups'
However, the CSV is slightly malformed and currently looks like this:
ID, years_active, issues
------------------------
'Truck1', 8, 'In dire need'
'','', 'of a'
'','', 'paintjob'
'Car 5', 3, 'To small for'
'', '', 'large groups'
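Now, I can identify the faulty rows by their missing 'ID' and 'years_active' values, and I would like to append the 'issues' value of such a row to the last preceding row that does have 'ID' and 'years_active' values.
I'm not very experienced with pandas, but I came up with the following code: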
for index, row in df.iterrows():
    if row['years_active'] == None:
        df.loc[index-1]['issues'] += row['issues']
However, the if condition never triggers.
Is what I'm trying to do possible? If so, does anyone know what I'm doing wrong?

Given your example input:
df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})
You can use:
new_df = df.groupby(df.ID.mask(df.ID.eq('')).ffill()).agg({'years_active': 'first', 'issues': ' '.join})
This will give you:
        years_active                      issues
ID
Car 5              3   To small for large groups
Truck1             8  In dire need of a paintjob
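The grouping key used above can be inspected on its own, which makes it easier to see why the rows collapse the way they do. The following is a minimal sketch that rebuilds the example frame and prints the forward-filled ID series before aggregating; the names `key` and `merged` are illustrative only, and `mask` followed by `ffill` is just one way to express the forward fill.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups'],
})

# Blank out the empty IDs, then carry the last real ID forward.
key = df['ID'].mask(df['ID'] == '').ffill()
print(key.tolist())

# Every row now carries the ID of the record it belongs to,
# so grouping on the key gathers the fragments of each record.
merged = df.groupby(key).agg({'years_active': 'first', 'issues': ' '.join})
print(merged)
```

Because `groupby` sorts the group keys by default, 'Car 5' is listed before 'Truck1' in the result.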
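So what we've done here is forward-fill the non-empty IDs into the empty IDs that follow them, and use those IDs to group the related rows. We then aggregate, taking the first occurrence of years_active and joining the issues values together in the order they appear to create a single result.

The following uses a for loop to find and append the strings (with the data frame from Jon Clements's answer):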
Output:
       ID                      issues  years_active
0  Truck1  In dire need of a paintjob             8
3   Car 5   To small for large groups             3
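A loop along those lines could be sketched as follows. This is an illustration written under assumptions, not the code from that answer: it assumes the empty cells are empty strings (as in the example frame) and that the first row is always a complete one, and `last_full` and `result` are hypothetical names. It also sidesteps the two problems in the question's attempt: an empty cell is never equal to `None`, and writing through `df.loc[index-1]['issues']` is chained indexing, which may update a temporary copy instead of the frame.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups'],
})

last_full = None  # index label of the most recent row that has an ID
for index, row in df.iterrows():
    if row['ID'] == '':
        # Continuation row: append its text to the last complete row.
        # Assumes a complete row has already been seen.
        df.at[last_full, 'issues'] += ' ' + row['issues']
    else:
        last_full = index

# Keep only the rows that carried an ID of their own.
result = df[df['ID'] != '']
print(result)
```

The surviving rows keep their original index labels (0 and 3); calling `reset_index(drop=True)` on the result would renumber them if that matters.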
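It may be worth mentioning, in the context of this question, that there is an often-overlooked way of dealing with awkward input: the StringIO library. The important point is that read_csv can read from a StringIO "file".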
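As a minimal illustration of that point, the sketch below hands read_csv an in-memory buffer instead of a path. The CSV text is a made-up, already-cleaned sample and `text` is a hypothetical name.

```python
from io import StringIO

import pandas as pd

# Hypothetical, already-cleaned CSV text held in memory rather than on disk.
text = "ID,years_active,issues\nTruck1,8,In dire need of a paintjob\n"

# read_csv accepts any file-like object, including a StringIO buffer.
df = pd.read_csv(StringIO(text))
print(df)
```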
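In this case I discard the single quotes and the runs of commas, which would confuse read_csv, and append the second and subsequent input lines of each record to the first, so that read_csv gets complete, regular CSV lines.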
This is the result once read_csv has parsed the cleaned-up input:
       ID  years_active                      issues
0  Truck1             8  In dire need of a paintjob
1   Car 5             3   To small for large groups
The code is ugly, but easy to follow.
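import pandas as pd
from io import StringIO

for_pd = StringIO()
with open('jasper.txt') as jasper:
    # Copy the header, dropping the spaces after the commas.
    print(jasper.readline().rstrip().replace(', ', ','), file=for_pd)
    jasper.readline()  # skip the dashed separator line
    complete_record = ''
    for line in jasper:
        # Normalise the separators and drop the single quotes.
        line = line.rstrip().replace(', ', ',').replace("'", '')
        if line.startswith(','):
            # Continuation row: keep only its text, prefixed with a space.
            complete_record += line.replace(',,', ',').replace(',', ' ')
        else:
            # A new record starts: flush the previous one first.
            if complete_record:
                print(complete_record, file=for_pd)
            complete_record = line
    # Flush the last record.
    if complete_record:
        print(complete_record, file=for_pd)

for_pd.seek(0)  # rewind so read_csv starts at the beginning

df = pd.read_csv(for_pd)
print(df)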
Another option is:
df.groupby(df.ID.str.len().gt(0).cumsum()).agg({'issues': ' '.join, 'years_active': 'first'})
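To show what that grouping key contains, here is a small sketch on the example frame. The key is renamed to `record` (a hypothetical name) so that it does not clash with the ID column, and `'ID': 'first'` is added to the aggregation as an assumption about what is wanted, so that the identifier survives in the result.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups'],
})

# True where a row has its own ID; the running sum then numbers the records,
# so continuation rows share the number of the row that precedes them.
key = df['ID'].str.len().gt(0).cumsum().rename('record')
print(key.tolist())

merged = df.groupby(key).agg({
    'ID': 'first',
    'years_active': 'first',
    'issues': ' '.join,
})
print(merged)
```

Unlike grouping on the forward-filled ID, this keeps the records in their original order and keeps two records apart even if they happen to share the same ID.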