Python: dynamically adding a calculated column to a DataFrame while iterating over the rows of a CSV file?
I have a large space-delimited input file, input.csv, that does not fit in memory:
## Header
# More header here
A B
1 2
3 4
If I pass the iterator=True argument to pandas.read_csv, it returns a TextFileReader/TextParser object. This lets me filter the file on the fly and keep only the rows where column A is greater than 2.
But how can I add a third column to the DataFrame on the fly, without looping over all the data a second time?
Specifically, I want column C to equal column A multiplied by the value in dictionary d whose key is the value in column B, i.e. C = A*d[B].
Currently I have the following code:
import pandas
d = {2: 2, 4: 3}
TextParser = pandas.read_csv('input.csv', sep=' ', iterator=True, comment='#')
df = pandas.concat([chunk[chunk['A'] > 2] for chunk in TextParser])
print(df)
which prints this output:
   A  B
1  3  4
How can I get it to print output that also includes column C (C = A*d[B])?
You can use a generator to process one chunk at a time:
Code:
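import pandas as pd

def on_the_fly(the_csv):
    # Lookup table: value in column B -> multiplier for column A
    d = {2: 2, 4: 3}
    # iterator=True returns a TextFileReader that can be iterated over
    chunked_csv = pd.read_csv(
        the_csv, sep=r'\s+', iterator=True, comment='#')
    for chunk in chunked_csv:
        # Boolean mask of the rows to keep
        rows_idx = chunk['A'] > 2
        # Compute C = A * d[B] for the kept rows only
        chunk.loc[rows_idx, 'C'] = chunk[rows_idx].apply(
            lambda x: x.A * d[x.B], axis=1)
        yield chunk[rows_idx]
Test code:
from io import StringIO
data = StringIO(u"""#
A B
1 2
3 4
4 4
""")
df = pd.concat([c for c in on_the_fly(data)])
print(df)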
Results:
   A  B     C
1  3  4   9.0
2  4  4  12.0
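As a possible variation on the generator above, here is a minimal sketch that filters first and then computes C with a vectorized Series.map instead of a row-wise apply. The names filtered_chunks and lookup, and the chunksize value of 2, are illustrative assumptions and not part of the original answer. As far as I can tell, a reader created with iterator=True but without chunksize hands back the rest of the file as a single chunk when iterated, so this sketch passes chunksize explicitly to bound how many rows are held in memory at once.

```python
from io import StringIO

import pandas as pd


def filtered_chunks(source, lookup, chunksize=2):
    """Yield filtered chunks with an extra column C = A * lookup[B].

    `filtered_chunks`, `lookup` and the default `chunksize` are
    hypothetical names/values chosen for this sketch.
    """
    # chunksize makes read_csv return an iterator of DataFrames,
    # each holding at most `chunksize` rows.
    reader = pd.read_csv(source, sep=r"\s+", comment="#", chunksize=chunksize)
    for chunk in reader:
        # Keep only the rows where A is greater than 2.
        kept = chunk[chunk["A"] > 2]
        # Translate B through the dictionary, then multiply by A.
        # assign() returns a new DataFrame, so `chunk` is not modified.
        yield kept.assign(C=kept["A"] * kept["B"].map(lookup))


data = StringIO("#\nA B\n1 2\n3 4\n4 4\n")
df = pd.concat(filtered_chunks(data, {2: 2, 4: 3}))
print(df)
```

One difference in behavior to be aware of: if column B contains a value that is not a key of the dictionary, map produces NaN for that row, whereas d[x.B] in the apply version raises a KeyError.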