Python: filter on one column of a .dat file and return the matching values from another column
I'm new to Python and have been practicing with sample data. I created 150 rows of student ID numbers, grades, ages, class codes, area codes, and so on. I want to do more than filter on a particular column such as grade or age: I also want to build a list of the values from a different column (the student ID) for the rows that match. I have worked out how to isolate the column I need to search for specific values, but not how to build the list of values to return.

Here is a sample of the data:
1/A/15/13/43214
2/I/15/21/58322
3/C/17/89/68470
4/I/18/6/57362
5/I/14/4/00000
6/A/16/23/34567
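I need a list of the student IDs in the first column, based on a filter on the grade in the second column (and eventually on the third column, the fourth column, and so on). If I can see how it is done for the second column, I think I can work out the rest. Note: the .dat file has no header row.

I worked out how to isolate and view the second column: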
import numpy
data = numpy.genfromtxt('/testdata.dat', delimiter='/', dtype='unicode')
grades = data[:,1]
print (grades)
This prints:
['A' 'I' 'C' 'I' 'I' 'A']
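But how do I now pull the first-column values that correspond to A, C, and I into separate lists? I would like to see lists like these, with the first-column integers for A, C, and I separated by commas: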
list from A = [1, 6]
list from C = [3]
list from I = [2, 4, 5]
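Again, if I can see how this is done for the second column with a single value such as A, I think I can work out B, C, D, and so on, and probably other columns too. I just need one example of how the syntax is applied.

Also, I have been using numpy, but I have read about pandas and csv too, and I think those libraries could do this as well. As I said, I have been using numpy for .dat files; I don't know whether the other libraries are easier to use.

You can go through the grades and build a boolean mask that selects the rows matching each grade. This may need some refinement: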
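import numpy as np

# Read every field as text; the file has no header row
grades = np.genfromtxt('data.txt', delimiter='/', skip_header=0, dtype='unicode')
res = {}
for grade in set(grades[:, 1].tolist()):
    # Boolean mask on column 1 (grade), then keep column 0 (student ID)
    res[grade] = grades[grades[:, 1] == grade][:, 0].tolist()
print(res)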
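The boolean-mask idea can also be written as a single dictionary comprehension. The following is only a sketch, not a definitive implementation: it assumes the six sample rows from the question, reads them from an in-memory string (the `SAMPLE` name is hypothetical) instead of a file, and converts the IDs to integers so the output matches the lists asked for.

```python
import io

import numpy as np

# Hypothetical in-memory copy of the six sample rows from the question
SAMPLE = """1/A/15/13/43214
2/I/15/21/58322
3/C/17/89/68470
4/I/18/6/57362
5/I/14/4/00000
6/A/16/23/34567
"""

# Read every field as text so that values such as 00000 keep their zeros
data = np.genfromtxt(io.StringIO(SAMPLE), delimiter="/", dtype=str)

ids = data[:, 0].astype(int)  # first column: student IDs as integers
grades = data[:, 1]           # second column: grades

# For each distinct grade, the mask `grades == g` selects the matching rows
ids_by_grade = {str(g): ids[grades == g].tolist() for g in np.unique(grades)}

print(ids_by_grade)
# {'A': [1, 6], 'C': [3], 'I': [2, 4, 5]}
```

To filter on another column, only the column index used to build the mask needs to change.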
Pandas solution:
import pandas as pd
df = pd.read_csv('data.txt', header=None, sep='/')
dfs = {k:v for k,v in df.groupby(1)}
This gives us a dictionary of DataFrames:
In [59]: dfs.keys()
Out[59]: dict_keys(['I', 'C', 'A'])

In [60]: dfs['I']
Out[60]:
   0  1   2   3      4
1  2  I  15  21  58322
3  4  I  18   6  57362
4  5  I  14   4      0

In [61]: dfs['C']
Out[61]:
   0  1   2   3      4
2  3  C  17  89  68470

In [62]: dfs['A']
Out[62]:
   0  1   2   3      4
0  1  A  15  13  43214
5  6  A  16  23  34567
If you want the first column of each group as a list:
In [67]: dfs['I'].iloc[:, 0].tolist()
Out[67]: [2, 4, 5]
In [68]: dfs['C'].iloc[:, 0].tolist()
Out[68]: [3]
In [69]: dfs['A'].iloc[:, 0].tolist()
Out[69]: [1, 6]
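If only the ID lists are needed, the intermediate dictionary of DataFrames can be skipped. The sketch below is one possible way to do that, assuming the same six sample rows held in a hypothetical `SAMPLE` string. It also passes `dtype={4: str}`, an addition not in the answer above, so that the last column keeps `00000` instead of being parsed as the integer 0.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the six sample rows from the question
SAMPLE = """1/A/15/13/43214
2/I/15/21/58322
3/C/17/89/68470
4/I/18/6/57362
5/I/14/4/00000
6/A/16/23/34567
"""

# header=None because the file has no header row; columns are named 0..4
df = pd.read_csv(io.StringIO(SAMPLE), header=None, sep="/", dtype={4: str})

# Group by the grade column (1) and collect the ID column (0) into lists
ids_by_grade = df.groupby(1)[0].apply(list).to_dict()

print(ids_by_grade)
# {'A': [1, 6], 'C': [3], 'I': [2, 4, 5]}
```

Grouping by a different column only requires changing the argument of `groupby`.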
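For a task this simple you don't really need any extra modules. A pure Python solution is to read the file line by line and "parse" each line with str.split(), which gives you a list for every row; you can then filter on just about any field. For example: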
students = {}  # store for our students by grade
with open("testdata.dat", "r") as f:  # open the file
    for line in f:  # read the file line by line
        row = line.strip().split("/")  # split the line into individual columns
        # you can now directly filter your row, or you can store the row in a list for later
        # let's split them by grade:
        grade = row[1]  # second column of our row is the grade
        # create/append to the sublist in our `students` dict keyed by the grade
        students.setdefault(grade, []).append(row)

# now your students dict contains all students split by grade, e.g.:
a_students = students["A"]
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']]

# if you want only to collect the A-grade student IDs, you can get a list of them as:
student_ids = [entry[0] for entry in students["A"]]
# ['1', '6']
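But let's take a few steps back. If you want a more general solution, store the rows in a list and write a function that filters them by the parameters you pass in: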
# define a filter function
# filters should contain a list of filters whereas a filter would be defined as:
# [position, [values]]
# and you can define as many as you want
def filter_sublists(source, filters=None):
    result = []  # store for our result
    filters = filters or []  # in case no filters are passed
    for element in source:  # go through every element of our source data
        try:
            if all(element[f[0]] in f[1] for f in filters):  # check if all our filters match
                result.append(element)  # add the element
        except IndexError:  # invalid filter position or data position, ignore
            pass
    return result  # return the result

# now we can use it to filter our data, first let's load our data:
with open("testdata.dat", "r") as f:  # open the file
    students = [line.strip().split("/") for line in f]  # store all our students as a list

# now we have all the data in the `students` list and we can filter it by any element
a_students = filter_sublists(students, [[1, ["A"]]])
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']]

# or again, if you just need the IDs:
a_student_ids = [entry[0] for entry in filter_sublists(students, [[1, ["A"]]])]
# ['1', '6']

# but you can filter by any parameter, for example:
age_15_students = filter_sublists(students, [[2, ["15"]]])
# [['1', 'A', '15', '13', '43214'], ['2', 'I', '15', '21', '58322']]

# or you can get all I-grade students aged 14 or 15:
i_students = filter_sublists(students, [[1, ["I"]], [2, ["14", "15"]]])
# [['2', 'I', '15', '21', '58322'], ['5', 'I', '14', '4', '00000']]
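Since the question also mentions the csv library, here is a small sketch of the same grouping using the standard library's `csv.reader` and `collections.defaultdict`. The `ids_by_column` helper and the `SAMPLE` string are hypothetical names introduced for illustration; the sketch assumes the first column always holds an integer ID and reads from an in-memory string rather than from testdata.dat.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory copy of the six sample rows from the question
SAMPLE = """1/A/15/13/43214
2/I/15/21/58322
3/C/17/89/68470
4/I/18/6/57362
5/I/14/4/00000
6/A/16/23/34567
"""


def ids_by_column(lines, column, delimiter="/"):
    """Group the first-column IDs by the value found in `column`."""
    groups = defaultdict(list)
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) <= column:  # skip blank or short lines
            continue
        groups[row[column]].append(int(row[0]))
    return dict(groups)


by_grade = ids_by_column(io.StringIO(SAMPLE), column=1)
by_age = ids_by_column(io.StringIO(SAMPLE), column=2)

print(by_grade["A"])  # [1, 6]
print(by_age["15"])   # [1, 2]
```

With a real file, the `io.StringIO(SAMPLE)` argument would be replaced by the file object returned by `open`.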
I have been trying out the different solutions posted so far, and I like your solution. It prints res as a set of lists. I have looked and am still searching, but is there a way to separate the lists from the set? That is, can I get the "A" grade list out of res, the "C" grade list out of res, and so on? All I have found covers adding lists to a set, removing a list from a list, or subsets of sets and sublists of lists. I can't find anything about a set that holds multiple lists.
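One possible reading of this question: in the numpy answer, `res` is created with `res = {}` and filled with `res[grade] = ...`, so it is a dictionary rather than a set, and each list can be retrieved by its grade key. The sketch below hard-codes the values that answer would produce for the six sample rows (strings, because the file was read as text), which is an assumption made here for illustration.

```python
# Assumed contents of `res` for the six sample rows; the IDs are strings
# because the file was read with a text dtype
res = {"A": ["1", "6"], "C": ["3"], "I": ["2", "4", "5"]}

a_ids = res["A"]          # look up one list by its key
c_ids = res.get("C", [])  # .get() avoids a KeyError for a missing grade
b_ids = res.get("B", [])  # no B grades in the sample, so this is empty

# Convert to integers if numeric IDs are needed
a_ids_int = [int(x) for x in a_ids]

# Walk over every grade and its list
for grade, ids in sorted(res.items()):
    print(grade, ids)
# A ['1', '6']
# C ['3']
# I ['2', '4', '5']
```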