Organizing a dataset in Python
I have a .csv dataset containing a large number of idioms. Each line contains three comma-separated elements that I want to separate: 1) the index number (0, 1, 2, 3, ...), 2) the idiom itself, and 3) whether the idiom is positive/negative/neutral. Here is a small sample of the .csv file:
0,"I did touch them one time you see but of course there was nothing doing, he wanted me.",neutral
1,We find that choice theorists admit that they introduce a style of moral paternalism at odds with liberal values.,neutral
2,"Well, here I am with an olive branch.",positive
3,"Its rudder and fin were both knocked out, and a four-foot-long gash in the shell meant even repairs on the bank were out of the question.",negative
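As you can see, sometimes the idiom is wrapped in quotes and sometimes it is not. However, I don't think that will be hard to sort out.
I think the best way to organize this in Python is a dictionary, like this: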
example_dict = {0: ['This is an idiom.', 'neutral']}
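So how do I split each line into three separate strings (based on the commas), then use the first string as the key and the last two as the corresponding list items in the dict?
My initial idea was to try splitting on the commas with this code: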
for line in file:
    new_item = ','.join(line.split(',')[1:])
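But all it does is remove everything up to the first comma in a line, and I don't think running a series of iterations over it would be efficient.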
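I would appreciate some advice on the best way to organize the data like this.

Python has a module dedicated to handling csv files: csv. In this case, you can use it to create a list of lists from the file. For now, let's call your file idioms.csv: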
import csv

with open('idioms.csv', newline='') as idioms_file:
    reader = csv.reader(idioms_file, delimiter=',', quotechar='"')
    idioms_list = [line for line in reader]

# Now you have a list that looks like this (csv.reader returns every field as a string):
# [['0', 'I did touch them...', 'neutral'],
#  ['1', 'We find that choice...', 'neutral'],
#  ...
# ]
Now you can sort or organize the data however you like.
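For the dictionary layout asked about in the question, one possible approach is to unpack each parsed row directly into a dict comprehension. The sketch below is only an illustration: the helper name build_idiom_dict is hypothetical, and io.StringIO stands in for the real idioms.csv so the snippet runs on its own. It assumes every row has exactly three fields and no blank lines.

```python
import csv
import io

# Two sample rows copied from the question; io.StringIO stands in for the
# real file so this sketch is self-contained.
sample = '''0,"I did touch them one time you see but of course there was nothing doing, he wanted me.",neutral
2,"Well, here I am with an olive branch.",positive
'''


def build_idiom_dict(lines):
    """Map each integer index to [idiom, sentiment].

    Hypothetical helper: assumes every row has exactly three fields.
    """
    reader = csv.reader(lines, delimiter=',', quotechar='"')
    # csv.reader handles the quoted commas, so each row unpacks cleanly.
    return {int(index): [idiom, sentiment] for index, idiom, sentiment in reader}


idiom_dict = build_idiom_dict(io.StringIO(sample))
print(idiom_dict[2])  # ['Well, here I am with an olive branch.', 'positive']
```

With the real file, the same helper could be called as build_idiom_dict(idioms_file) inside the with open(...) block. Converting the index with int() gives integer keys, matching example_dict in the question.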