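Python: I have been trying to separate the columns of my CSV file. There should be 6 columns, but some rows have only 5, and this affects my results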
2018-03-06 12:23:13;read;country_5;2458291108;Australia
2018-03-06 12:23:16;read;country_7;2458291957;Reddit;Australia
2018-03-06 12:23:19;read;country_7;2458203986;Asia
2018-03-06 12:23:25;read;country_6;2458291958;Reddit;Asia
2018-03-06 12:23:25;read;country_7;2458286097;Asia
2018-03-06 12:23:31;read;country_5;2458291846;Australia
2018-03-06 12:23:33;read;country_5;2458291959;Reddit;Asia
2018-03-06 12:23:38;read;country_4;2458291960;AdWords;Asia
2018-03-06 12:23:42;read;country_2;2458199318;Europe
If the above is the content of your file, you can use basic Python operations:
column1 = []
column2 = []
column3 = []
column4 = []
column5 = []
with open("filename.csv", "r") as fopen:
    for line in fopen:
        # Drop the trailing newline, then split the row on the ';' delimiter
        fields = line.strip().split(";")
        column1.append(fields[0])
        column2.append(fields[2])
        # ... append the other fields to the remaining lists in the same way
Now you have five lists, which you can convert to Series and combine into a DataFrame.
I assume your input file looks like this:
2018-03-06 12:23:13;read;country_5;2458291108;Australia
2018-03-06 12:23:16;read;country_7;2458291957;Reddit;Australia
2018-03-06 12:23:19;read;country_7;2458203986;Asia
2018-03-06 12:23:25;read;country_6;2458291958;Reddit;Asia
2018-03-06 12:23:25;read;country_7;2458286097;Asia
2018-03-06 12:23:31;read;country_5;2458291846;Australia
2018-03-06 12:23:33;read;country_5;2458291959;Reddit;Asia
2018-03-06 12:23:38;read;country_4;2458291960;AdWords;Asia
2018-03-06 12:23:42;read;country_2;2458199318;Europe
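The problem is the last two columns: rows with only five fields are missing the fifth column, so their last value ends up one position to the left.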
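To see the mismatch concretely, here is a small sketch that counts how many fields each row has. It assumes the `;` delimiter shown above and uses `io.StringIO` with three of the sample rows as a stand-in for the real file; the names `sample` and `field_counts` are only illustrative.

```python
import csv
import io
from collections import Counter

# Three of the sample rows from above; io.StringIO stands in for the real file.
sample = io.StringIO(
    "2018-03-06 12:23:13;read;country_5;2458291108;Australia\n"
    "2018-03-06 12:23:16;read;country_7;2458291957;Reddit;Australia\n"
    "2018-03-06 12:23:19;read;country_7;2458203986;Asia\n"
)

# Count how many rows have each number of fields.
field_counts = Counter(len(row) for row in csv.reader(sample, delimiter=';'))
print(dict(field_counts))  # {5: 2, 6: 1}
```

Any count other than 6 marks rows that need padding before the columns can be separated.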
If you do the following (the file here is named input.csv):
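A minimal sketch of the explicit loop form of this step, assuming it does the same thing as the shorter version that follows. To keep it runnable on its own, `io.StringIO` with two sample rows stands in for the opened input.csv; the name `sample` is only illustrative.

```python
import csv
import io

# io.StringIO stands in for the opened input.csv so the sketch runs on its own.
sample = io.StringIO(
    "2018-03-06 12:23:13;read;country_5;2458291108;Australia\n"
    "2018-03-06 12:23:16;read;country_7;2458291957;Reddit;Australia\n"
)

lines = []
for line in csv.reader(sample, delimiter=';'):
    if len(line) == 5:  # fifth column missing: insert a placeholder before the last field
        line = line[:4] + ['NO INPUT', line[4]]
    lines.append(line)

for line in lines:
    print(line)
```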
or, shorter:
import csv

with open('input.csv', 'r', newline='') as file:
    lines = [line[:4] + ['NO INPUT', line[4]] if len(line) == 5 else line
             for line in csv.reader(file, delimiter=';')]
you get the following aligned rows:
[
['2018-03-06 12:23:13', 'read', 'country_5', '2458291108', 'NO INPUT', 'Australia'],
['2018-03-06 12:23:16', 'read', 'country_7', '2458291957', 'Reddit', 'Australia'],
['2018-03-06 12:23:19', 'read', 'country_7', '2458203986', 'NO INPUT', 'Asia'],
['2018-03-06 12:23:25', 'read', 'country_6', '2458291958', 'Reddit', 'Asia'],
['2018-03-06 12:23:25', 'read', 'country_7', '2458286097', 'NO INPUT', 'Asia'],
['2018-03-06 12:23:31', 'read', 'country_5', '2458291846', 'NO INPUT', 'Australia'],
['2018-03-06 12:23:33', 'read', 'country_5', '2458291959', 'Reddit', 'Asia '],
['2018-03-06 12:23:38', 'read', 'country_4', '2458291960', 'AdWords', 'Asia'],
['2018-03-06 12:23:42', 'read', 'country_2', '2458199318', 'NO INPUT', 'Europe']
]
I filled the missing 5th column with 'NO INPUT'. Of course, you can use whatever seems appropriate here (e.g. None).
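As a sketch of that flexibility, the padding step can be wrapped in a small helper with a configurable fill value. The function name `pad_row` and its parameters `fill` and `width` are hypothetical, introduced only for illustration.

```python
def pad_row(row, fill=None, width=6):
    """Return the row with `fill` inserted before the last field if exactly one field is missing."""
    if len(row) == width - 1:
        return row[:-1] + [fill, row[-1]]
    return row


short = ['2018-03-06 12:23:13', 'read', 'country_5', '2458291108', 'Australia']
full = ['2018-03-06 12:23:16', 'read', 'country_7', '2458291957', 'Reddit', 'Australia']

print(pad_row(short))             # placeholder None in the fifth position
print(pad_row(short, fill='NA'))  # any other placeholder works the same way
print(pad_row(full))              # complete rows are returned unchanged
```

Using None rather than a string makes it easier to tell real values from missing ones later on.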
If you really want the columns separated, this:
import csv

columns = [[] for _ in range(6)]
with open('input.csv', 'r', newline='') as file:
    csv_file = csv.reader(file, delimiter=';')
    for line in csv_file:
        if len(line) == 5:  # If input for column 5 is missing: insert placeholder
            line = line[:4] + ['NO INPUT', line[4]]
        for i, item in enumerate(line):
            columns[i].append(item)
or, a bit more compactly:
import csv

with open('input.csv', 'r', newline='') as file:
    lines = [line[:4] + ['NO INPUT', line[4]] if len(line) == 5 else line
             for line in csv.reader(file, delimiter=';')]
columns = [[lines[i][j] for i in range(len(lines))] for j in range(6)]
gives you:
[
['2018-03-06 12:23:13', '2018-03-06 12:23:16', '2018-03-06 12:23:19', '2018-03-06 12:23:25', '2018-03-06 12:23:25', '2018-03-06 12:23:31', '2018-03-06 12:23:33', '2018-03-06 12:23:38', '2018-03-06 12:23:42'],
['read', 'read', 'read', 'read', 'read', 'read', 'read', 'read', 'read'],
['country_5', 'country_7', 'country_7', 'country_6', 'country_7', 'country_5', 'country_5', 'country_4', 'country_2'],
['2458291108', '2458291957', '2458203986', '2458291958', '2458286097', '2458291846', '2458291959', '2458291960', '2458199318'],
['NO INPUT', 'Reddit', 'NO INPUT', 'Reddit', 'NO INPUT', 'NO INPUT', 'Reddit', 'AdWords', 'NO INPUT'],
['Australia', 'Australia', 'Asia', 'Asia', 'Asia', 'Australia', 'Asia ', 'Asia', 'Europe']
]
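Once the rows are aligned, the transposition can also be written with `zip`, which is a common idiom for turning rows into columns. The sketch below assumes every row already has six fields (`zip` silently truncates to the shortest row otherwise). The column names in `names` are hypothetical, since the original data has no header.

```python
# Two already aligned rows, as produced by the padding step.
lines = [
    ['2018-03-06 12:23:13', 'read', 'country_5', '2458291108', 'NO INPUT', 'Australia'],
    ['2018-03-06 12:23:16', 'read', 'country_7', '2458291957', 'Reddit', 'Australia'],
]

# zip(*lines) groups the i-th item of every row, i.e. it transposes rows into columns.
columns = [list(column) for column in zip(*lines)]

# Hypothetical column names; the original file does not provide any.
names = ['timestamp', 'event', 'country', 'user_id', 'source', 'region']
table = dict(zip(names, columns))

print(table['source'])  # ['NO INPUT', 'Reddit']
```

A dictionary of named lists like `table` is also a convenient starting point if you later want to build a DataFrame from the columns.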