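Python: I've been trying to separate the columns of my CSV file. There should be 6 columns, but some rows have only 5, and this throws off my results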

2018-03-06 12:23:13;read;country_5;2458291108;Australia
2018-03-06 12:23:16;read;country_7;2458291957;Reddit;Australia
2018-03-06 12:23:19;read;country_7;2458203986;Asia
2018-03-06 12:23:25;read;country_6;2458291958;Reddit;Asia
2018-03-06 12:23:25;read;country_7;2458286097;Asia
2018-03-06 12:23:31;read;country_5;2458291846;Australia
2018-03-06 12:23:33;read;country_5;2458291959;Reddit;Asia
2018-03-06 12:23:38;read;country_4;2458291960;AdWords;Asia
2018-03-06 12:23:42;read;country_2;2458199318;Europe

If the above is the content of the file, you can do this with basic Python operations:

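column1 = []
column2 = []
column3 = []
column4 = []
column5 = []
# The with-block closes the file automatically
with open("filename.csv", "r") as fopen:
    for line in fopen:
        # Drop the trailing newline, then split on the delimiter
        fields = line.rstrip("\n").split(";")
        column1.append(fields[0])
        column2.append(fields[2])
        # ... append the remaining fields to the other lists in the same way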

Now you have five lists, which you can convert to Series and use to build a DataFrame.
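
One way to prepare such lists for a DataFrame is to gather them into a dict keyed by column name. This is only a minimal sketch: the column names (timestamp, country, user_id) and the shortened sample values are hypothetical, and it assumes the lists are already the same length. A dict of equal-length lists like this can be passed to pandas.DataFrame.

```python
# Hypothetical, already-aligned column lists (shortened sample values)
column1 = ["2018-03-06 12:23:13", "2018-03-06 12:23:19"]
column2 = ["country_5", "country_7"]
column3 = ["2458291108", "2458203986"]

# Hypothetical column names; pick whatever matches your data
names = ["timestamp", "country", "user_id"]
columns = [column1, column2, column3]

# Every column must have the same number of entries before building a table
lengths = {len(col) for col in columns}
if len(lengths) != 1:
    raise ValueError(f"columns have different lengths: {sorted(lengths)}")

# Column name -> list of values
table = dict(zip(names, columns))
# With pandas installed, pandas.DataFrame(table) would turn this into a DataFrame
print(table["country"])
```

The length check matters here because rows with a missing field would otherwise leave some lists shorter than others.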

I assume your input file looks like this:

2018-03-06 12:23:13;read;country_5;2458291108;Australia
2018-03-06 12:23:16;read;country_7;2458291957;Reddit;Australia
2018-03-06 12:23:19;read;country_7;2458203986;Asia
2018-03-06 12:23:25;read;country_6;2458291958;Reddit;Asia
2018-03-06 12:23:25;read;country_7;2458286097;Asia
2018-03-06 12:23:31;read;country_5;2458291846;Australia
2018-03-06 12:23:33;read;country_5;2458291959;Reddit;Asia 
2018-03-06 12:23:38;read;country_4;2458291960;AdWords;Asia
2018-03-06 12:23:42;read;country_2;2458199318;Europe
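The problem lies in the last two columns: rows with only five fields are missing the fifth one, so their last value ends up in the wrong column.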
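
A quick way to confirm this is to count the fields in each row. The sketch below is a minimal check, assuming the semicolon delimiter shown above. It reads three of the sample rows from memory instead of input.csv so that it runs on its own.

```python
import csv
import io
from collections import Counter

# Three sample rows from the input above, held in memory so the sketch is self-contained
sample = io.StringIO(
    "2018-03-06 12:23:13;read;country_5;2458291108;Australia\n"
    "2018-03-06 12:23:16;read;country_7;2458291957;Reddit;Australia\n"
    "2018-03-06 12:23:19;read;country_7;2458203986;Asia\n"
)

# Map "number of fields" -> "number of rows with that many fields"
field_counts = Counter(len(row) for row in csv.reader(sample, delimiter=";"))
print(field_counts)
```

For these three rows, two have five fields and one has six.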

If you do this (the file name here is input.csv)
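
A minimal sketch of the explicit loop, assuming the semicolon-delimited layout above. It reads three sample rows from memory so that it runs on its own; with the real file you would pass open('input.csv', newline='') to csv.reader instead.

```python
import csv
import io

# Three sample rows held in memory; use open('input.csv', newline='') for the real file
sample = io.StringIO(
    "2018-03-06 12:23:13;read;country_5;2458291108;Australia\n"
    "2018-03-06 12:23:16;read;country_7;2458291957;Reddit;Australia\n"
    "2018-03-06 12:23:19;read;country_7;2458203986;Asia\n"
)

lines = []
for line in csv.reader(sample, delimiter=';'):
    if len(line) == 5:
        # The 5th field is missing: insert a placeholder before the last field
        line = line[:4] + ['NO INPUT', line[4]]
    lines.append(line)

for line in lines:
    print(line)
```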

or, shorter,

import csv

with open('input.csv', 'r') as file:
    lines = [line[:4] + ['NO INPUT', line[4]] if len(line) == 5 else line
             for line in csv.reader(file, delimiter=';')]
you get the following aligned rows:

[
 ['2018-03-06 12:23:13', 'read', 'country_5', '2458291108', 'NO INPUT', 'Australia'],
 ['2018-03-06 12:23:16', 'read', 'country_7', '2458291957', 'Reddit', 'Australia'],
 ['2018-03-06 12:23:19', 'read', 'country_7', '2458203986', 'NO INPUT', 'Asia'],
 ['2018-03-06 12:23:25', 'read', 'country_6', '2458291958', 'Reddit', 'Asia'],
 ['2018-03-06 12:23:25', 'read', 'country_7', '2458286097', 'NO INPUT', 'Asia'],
 ['2018-03-06 12:23:31', 'read', 'country_5', '2458291846', 'NO INPUT', 'Australia'],
 ['2018-03-06 12:23:33', 'read', 'country_5', '2458291959', 'Reddit', 'Asia '],
 ['2018-03-06 12:23:38', 'read', 'country_4', '2458291960', 'AdWords', 'Asia'],
 ['2018-03-06 12:23:42', 'read', 'country_2', '2458199318', 'NO INPUT', 'Europe']
]
I filled the missing 5th column with 'NO INPUT'. Of course, you can use whatever seems appropriate here (e.g. None).
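
As an illustration of swapping the placeholder, here is a minimal sketch with a hypothetical helper, pad_row, that takes the fill value as a parameter and defaults to None. The two rows are sample values from the input above.

```python
def pad_row(row, placeholder=None):
    """Return a 6-field row, inserting `placeholder` as the 5th field of a 5-field row."""
    if len(row) == 5:
        return row[:4] + [placeholder, row[4]]
    return row

short_row = ['2018-03-06 12:23:13', 'read', 'country_5', '2458291108', 'Australia']
full_row = ['2018-03-06 12:23:16', 'read', 'country_7', '2458291957', 'Reddit', 'Australia']

padded = pad_row(short_row)
print(padded)
print(padded[4] is None)              # missing values can be tested with "is None"
print(pad_row(full_row) == full_row)  # 6-field rows pass through unchanged
```

Using None instead of a string keeps the placeholder from being mistaken for a real value later on.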

If you really want the columns separated, this

import csv

columns = [[] for _ in range(6)]
with open('input.csv', 'r') as file:
    csv_file = csv.reader(file, delimiter=';')
    for line in csv_file:
        if len(line) == 5: # If input for column 5 is missing: insert placeholder
            line = line[:4] + ['NO INPUT', line[4]]
        for i, item in enumerate(line):
            columns[i].append(item)
or, a bit more compact,

import csv

with open('input.csv', 'r') as file:
    lines = [line[:4] + ['NO INPUT', line[4]] if len(line) == 5 else line
             for line in csv.reader(file, delimiter=';')]
# Transpose the rows: column j collects the j-th field of every row
columns = [[lines[i][j] for i in range(len(lines))] for j in range(6)]
gives you

[
 ['2018-03-06 12:23:13', '2018-03-06 12:23:16', '2018-03-06 12:23:19', '2018-03-06 12:23:25', '2018-03-06 12:23:25', '2018-03-06 12:23:31', '2018-03-06 12:23:33', '2018-03-06 12:23:38', '2018-03-06 12:23:42'],
 ['read', 'read', 'read', 'read', 'read', 'read', 'read', 'read', 'read'],
 ['country_5', 'country_7', 'country_7', 'country_6', 'country_7', 'country_5', 'country_5', 'country_4', 'country_2'],
 ['2458291108', '2458291957', '2458203986', '2458291958', '2458286097', '2458291846', '2458291959', '2458291960', '2458199318'],
 ['NO INPUT', 'Reddit', 'NO INPUT', 'Reddit', 'NO INPUT', 'NO INPUT', 'Reddit', 'AdWords', 'NO INPUT'],
 ['Australia', 'Australia', 'Asia', 'Asia', 'Asia', 'Australia', 'Asia ', 'Asia', 'Europe']
]
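
The same transposition can also be written with the built-in zip, as an alternative to the nested comprehension. This is a minimal sketch, assuming every row has already been padded to six fields: zip stops at the shortest row, so an unpadded row would silently truncate the result. The two rows are sample values.

```python
# Two already-aligned sample rows (six fields each)
lines = [
    ['2018-03-06 12:23:13', 'read', 'country_5', '2458291108', 'NO INPUT', 'Australia'],
    ['2018-03-06 12:23:16', 'read', 'country_7', '2458291957', 'Reddit', 'Australia'],
]

# zip(*lines) groups the i-th field of every row; list() turns each tuple into a list
columns = [list(column) for column in zip(*lines)]

print(columns[4])
print(columns[5])
```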