Python list of lists: replacing and adding up items of sublists
I have a list of lists, let's say:
tripInfo_csv = [['1','2',6,2], ['a','h',4,2], ['1','4',6,1], ['1','8',18,3], ['a','8',2,1]]
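Consider the sublists as trips: [start point, end point, number of adults, number of children]
My goal is to get a list in which trips with the same start and end points are merged, with their third and fourth values summed. Start and end values should always be numbers from 1 to 8. If they are letters, they should be replaced with the corresponding number (a=1, b=2, and so on).
This is my code. It works, but I am sure it can be improved. The main issue for me is performance: I have a lot of lists like this one, with many more sublists.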
dicPoints = {'a':'1','b':'2','c':'3', 'd':'4', 'e':'5', 'f':'6', 'g':'7', 'h':'8'}

def getTrips (trips):
    okTrips = []
    for trip in trips:
        if not trip[0].isdigit():
            trip[0] = dicPoints[trip[0]]
        if not trip[1].isdigit():
            trip[1] = dicPoints[trip[1]]

        if len(okTrips) == 0:
            okTrips.append(trip)
        else:
            for i, stop in enumerate(okTrips):
                if stop[0] == trip[0] and stop[1] == trip[1]:
                    stop[2] += trip[2]
                    stop[3] += trip[3]
                    break
                else:
                    if i == len(okTrips)-1:
                        okTrips.append(trip)
    return okTrips
As eguaio mentioned, the code above has a bug. It should be like this:
import datetime

def getTrips (trips):
    okTrips = []
    # Timestamp printed for rough timing of the run
    print(datetime.datetime.now())
    for trip in trips:
        if not trip[0].isdigit():
            trip[0] = dicPoints[trip[0]]
        if not trip[1].isdigit():
            trip[1] = dicPoints[trip[1]]

        if len(okTrips) == 0:
            okTrips.append(trip)
        else:
            flag = 0
            for i, stop in enumerate(okTrips):
                if stop[0] == trip[0] and stop[1] == trip[1]:
                    stop[2] += trip[2]
                    stop[3] += trip[3]
                    flag = 1
                    break
            # Append only after the loop has finished without a match
            if flag == 0:
                okTrips.append(trip)
    return okTrips
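The flag variable in the corrected version can also be expressed with Python's for ... else clause, whose else branch runs only when the loop finishes without hitting break. The following is a minimal sketch and not the author's code: the name merge_trips is made up for illustration, it assumes the point labels have already been converted to digits, and it is written for Python 3.

```python
def merge_trips(trips):
    """Sum adults/children of trips sharing (start, end). Hypothetical helper."""
    merged = []
    for start, end, adults, children in trips:
        for stop in merged:
            if stop[0] == start and stop[1] == end:
                stop[2] += adults
                stop[3] += children
                break
        else:
            # Runs only if the inner loop did not break: no match was found.
            merged.append([start, end, adults, children])
    return merged


trips = [['1', '2', 6, 2], ['1', '8', 4, 2], ['1', '4', 6, 1],
         ['1', '8', 18, 3], ['1', '8', 2, 1]]
print(merge_trips(trips))
```

Because the append happens after the inner loop has ended, and a fresh list is appended instead of the input row, the input trips are left unmodified.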
Thanks to eguaio's answer I got an improved version, which I would like to share. This is the script I wrote based on his answer. My data and requirements turned out to be more complex than I was first told, so I made some changes. The CSV files look like this:
LineT;Line;Route;Day;Start_point;End_point;Adults;Children;First_visit
SM55;5055;3;Weekend;15;87;21;4;0
SM02;5002;8;Weekend;AF3;89;5;0;1
...
The script:
import os, csv, psycopg2

folder = "F:/route_project/routes"

# Day type
dicDay = {'Weekday':1,'Weekend':2,'Holiday':3}

# Dictionary with the start and end points of each route
# built from a PostgreSQL table (with columns: line_route, start, end)
conn = psycopg2.connect (database="test", user="test", password="test", host="###.###.#.##")
cur = conn.cursor()
cur.execute('select id_linroute, start_p, end_p from route_ends')
recs = cur.fetchall()
dicPoints = {rec[0]: rec[1:] for rec in recs}

# When point labels are text, replace them with a number label in dicPoints
# Text is not important: they are special text labels for start and end
# of routes (for athletes), so we replace them with labels for start or
# the end of each route
def convert_point(line, route, point, i):
    if point.isdigit():
        return point
    else:
        return dicPoints["%s_%s" % (line,route)][i]

# Points with text labels mean athletes made the whole or part of this route,
# we keep them as adults but also keep this number as an extra value
# for further purposes
def num_athletes(start_p, end_p, adults):
    if not start_p.isdigit() or not end_p.isdigit():
        return adults
    else:
        return 0

# Data is taken from CSV files in subfolders
for root, dirs, files in os.walk(folder):
    for file in files:
        if file.endswith(".csv"):
            file_path = (os.path.join(root, file))
            with open(file_path, newline='') as csvfile:
                rows = csv.reader(csvfile, delimiter=';', quotechar='"')
                # Skips the CSV header row
                next(rows)
                # linT is not used, yet it's found in every CSV file
                # There's an unused last column in every file, I take advantage of it
                # to store the number of athletes in the generator
                gen = ((lin, route, dicDay[tday],
                        convert_point(lin,route,s_point,0),
                        convert_point(lin,route,e_point,1),
                        adults, children,
                        num_athletes(s_point,e_point,adults))
                       for linT, lin, route, tday, s_point, e_point, adults, children, athletes in rows)
                dicCSV = {}
                for lin, route, tday, s_point, e_point, adults, children, athletes in gen:
                    key = ("%s_%s_%s" % (lin,route,s_point), "%s_%s_%s" % (lin,route,e_point), tday)
                    visitors = dicCSV.get(key, (0, 0, 0))
                    dicCSV[key] = (visitors[0] + int(adults), visitors[1] + int(children), visitors[2] + int(athletes))
                for k,v in dicCSV.items():
                    print(k, v)
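The script above needs a database and a folder of files, so it is hard to try out as it stands. The sketch below shows the same aggregation step on an in-memory CSV, under loud assumptions: ROUTE_ENDS is a hypothetical stand-in for the route_ends table, the third data row is made up so that one key occurs twice, the key is a flat tuple instead of the formatted strings used above, and the names aggregate, SAMPLE and DAY_CODES are illustrative only. It is written for Python 3.

```python
import csv
import io
from collections import defaultdict

# Two rows from the sample above plus one made-up row (the last one) so that
# a key occurs twice.
SAMPLE = """LineT;Line;Route;Day;Start_point;End_point;Adults;Children;First_visit
SM55;5055;3;Weekend;15;87;21;4;0
SM02;5002;8;Weekend;AF3;89;5;0;1
SM55;5055;3;Weekend;15;87;2;1;0
"""

DAY_CODES = {'Weekday': 1, 'Weekend': 2, 'Holiday': 3}
# Hypothetical stand-in for the route_ends table: "line_route" -> (start, end)
ROUTE_ENDS = {'5002_8': ('12', '89')}


def convert_point(line, route, point, i):
    """Return numeric labels unchanged; map text labels to the route's end i."""
    return point if point.isdigit() else ROUTE_ENDS['%s_%s' % (line, route)][i]


def aggregate(csv_text):
    totals = defaultdict(lambda: [0, 0, 0])  # adults, children, athletes
    reader = csv.reader(io.StringIO(csv_text), delimiter=';')
    next(reader)  # skip the header row
    for _line_t, line, route, day, start, end, adults, children, _last in reader:
        # A text label on either end means the adults on this row are athletes.
        athletes = 0 if (start.isdigit() and end.isdigit()) else int(adults)
        key = (line, route, convert_point(line, route, start, 0),
               convert_point(line, route, end, 1), DAY_CODES[day])
        totals[key][0] += int(adults)
        totals[key][1] += int(children)
        totals[key][2] += athletes
    return {k: tuple(v) for k, v in totals.items()}


for key, value in aggregate(SAMPLE).items():
    print(key, value)
```

Using defaultdict removes the need for the get(..., (0, 0, 0)) default, at the cost of storing mutable lists while the totals are being built.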
See if this helps:
trips = [['1','2',6,2], ['a','h',4,2], ['1','2',6,1], ['1','8',18,3], ['a','h',2,1]]

# To get the equivalent value
def x(n):
    if '1' <= n <= '8':
        return int(n)
    # Letters map to a=1, b=2, and so on
    return ord(n) - ord('a') + 1

# To group lists with similar start and end points
from collections import defaultdict

groups = defaultdict(list)
for trip in trips:
    # Grouping based on start and end point.
    groups[(x(trip[0]), x(trip[1]))].append(trip)

grouped_trips = groups.values()

result = []
for group in grouped_trips:
    # Use the converted labels so letters never reach the result
    start = x(group[0][0])
    end = x(group[0][1])
    adults = group[0][2]
    children = group[0][3]
    for trip in group[1:]:
        adults += trip[2]
        children += trip[3]
    result += [[start, end, adults, children]]

print(result)
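Assume the start and end points take values between 0 and n.
Then the result okTrips has at most n^2 elements, so the second loop in the function has complexity O(n^2). If space complexity is not a problem, you can reduce that complexity to O(n).
First, create a dict containing n lists, such that the k-th sublist holds the trips that start at 'k'.
When you check whether there is another trip with the same start and end points, you then only need to search the corresponding sublist instead of all elements.
The idea comes from sparse matrix storage techniques.
I could not check the validity of the following code.
The code is as follows: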
dicPoints = {'a':'1','b':'2','c':'3', 'd':'4', 'e':'5', 'f':'6', 'g':'7', 'h':'8'}
Temp = {'1':[],'2':[],'3':[],'4':[],'5':[],'6':[],'7':[],'8':[]}

def getTrips (trips):
    okTrips = []
    for trip in trips:
        if not trip[0].isdigit():
            trip[0] = dicPoints[trip[0]]
        if not trip[1].isdigit():
            trip[1] = dicPoints[trip[1]]

        # Only the bucket for this start point is searched
        for stop in Temp[trip[0]]:
            if stop[1] == trip[1]:
                stop[2] += trip[2]
                stop[3] += trip[3]
                break
        else:
            # No trip with this end point in the bucket yet (also covers
            # the empty bucket); appending here, after the loop, avoids
            # growing the list while it is being iterated
            Temp[trip[0]].append(trip)
    print(Temp)
    for key in Temp:
        okTrips = okTrips + Temp[key]
    return okTrips
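The bucket-by-start-point idea can be taken one step further by making each bucket a dict keyed on the end point, so that no inner search loop is needed at all. This is a sketch of that variation, not the answer's code: merge_by_start and LETTER_TO_DIGIT are made-up names, and it assumes Python 3.

```python
from collections import defaultdict

LETTER_TO_DIGIT = {letter: str(i) for i, letter in enumerate('abcdefgh', 1)}


def merge_by_start(trips):
    """Bucket trips by start point, then by end point. Hypothetical helper."""
    buckets = defaultdict(dict)  # start -> {end: [adults, children]}
    for start, end, adults, children in trips:
        start = LETTER_TO_DIGIT.get(start, start)
        end = LETTER_TO_DIGIT.get(end, end)
        totals = buckets[start].setdefault(end, [0, 0])
        totals[0] += adults
        totals[1] += children
    return [[start, end] + totals
            for start, ends in buckets.items()
            for end, totals in ends.items()]


tripInfo_csv = [['1', '2', 6, 2], ['a', 'h', 4, 2], ['1', '4', 6, 1],
                ['1', '8', 18, 3], ['a', '8', 2, 1]]
print(merge_by_start(tripInfo_csv))
```

Each trip now costs two dict lookups, so the whole pass is roughly linear in the number of trips, and the input rows are not modified.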
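To handle this more efficiently, it is best to sort the input list by start and end point, so that rows with matching start and end points end up next to each other. We can then use the groupby function to process those groups efficiently.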
from operator import itemgetter
from itertools import groupby

tripInfo_csv = [
    ['1', '2', 6, 2],
    ['a', 'h', 4, 2],
    ['1', '4', 6, 1],
    ['1', '8', 18, 3],
    ['a', '8', 2, 1],
]

# Used to convert alphabetic point labels to numeric form
dicPoints = {v:str(i) for i, v in enumerate('abcdefgh', 1)}

def fix_points(seq):
    return [dicPoints.get(p, p) for p in seq]

# Ensure that all point labels are numeric
for row in tripInfo_csv:
    row[:2] = fix_points(row[:2])

# Sort on point labels
keyfunc = itemgetter(0, 1)
tripInfo_csv.sort(key=keyfunc)

# Group on point labels and sum corresponding adult & child numbers
newlist = []
for k, g in groupby(tripInfo_csv, key=keyfunc):
    g = list(g)
    row = list(k) + [sum(row[2] for row in g), sum(row[3] for row in g)]
    newlist.append(row)

# Print the condensed list
for row in newlist:
    print(row)
Output
['1', '2', 6, 2]
['1', '4', 6, 1]
['1', '8', 24, 6]
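The sort matters because itertools.groupby only groups consecutive items with equal keys. The small demonstration below uses made-up rows to show that, without sorting, the same (start, end) key can come out as two separate groups.

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows: the first and last share the key ('1', '2').
rows = [['1', '2', 6, 2], ['1', '8', 4, 2], ['1', '2', 6, 1]]
key = itemgetter(0, 1)

unsorted_keys = [k for k, _ in groupby(rows, key=key)]
sorted_keys = [k for k, _ in groupby(sorted(rows, key=key), key=key)]

print(unsorted_keys)  # the key ('1', '2') shows up twice
print(sorted_keys)    # each key shows up once
```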
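The following gives much better times than yours when merging large lists: for tripInfo_csv*500000 the times were 2 seconds versus 1 minute. By using a dict, whose keys have constant lookup time, we get almost linear complexity. It is also more elegant. Note that tg is a generator, so creating it costs no significant time or memory.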
def newGetTrips(trips):

    def convert(l):
        return l if l.isdigit() else dicPoints[l]

    tg = ((convert(a), convert(b), c, d) for a, b, c, d in trips)
    okt = {}
    for a, b, c, d in tg:
        # a trick to get (0,0) as default if (a,b) is not a key of the dictionary yet
        t = okt.get((a,b), (0,0))
        okt[(a,b)] = (t[0] + c, t[1] + d)
    return [[a,b,c,d] for (a,b), (c,d) in okt.items()]
Also, as a side effect your function modifies the trips list, whereas this function leaves it untouched.
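Both claims, that the generator does no work until it is consumed and that the input list is left alone, can be observed directly. The sketch below is illustrative only: the calls list and LETTER_TO_DIGIT are made-up helpers added to record when convert actually runs, and the two rows are taken from the question's sample.

```python
LETTER_TO_DIGIT = {letter: str(i) for i, letter in enumerate('abcdefgh', 1)}
calls = []


def convert(label):
    calls.append(label)  # record each call so laziness is observable
    return label if label.isdigit() else LETTER_TO_DIGIT[label]


trips = [['1', '2', 6, 2], ['a', 'h', 4, 2]]
tg = ((convert(a), convert(b), c, d) for a, b, c, d in trips)
calls_before = len(calls)   # nothing has been converted yet
converted = list(tg)        # iterating triggers the conversions
calls_after = len(calls)

print(calls_before, calls_after)
print(converted)
print(trips)                # the input list is unchanged
```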
Also, you have a bug. The first item considered for each (start, end) pair, except for the very first pair, is summed twice. I could not find the reason, but when running the example with your getTrips I get:
[['1', '2', 6, 2], ['1', '8', 28, 8], ['1', '4', 12, 2]]
and with newGetTrips I get:
[['1', '8', 24, 6], ['1', '2', 6, 2], ['1', '4', 6, 1]]
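One possible explanation for the double counting, offered as a sketch and not taken from the answer: the original loop appends to okTrips while iterating over it, so the loop goes on to visit the trip it has just appended, finds that it matches itself, and adds its values to itself. The reduced example below reproduces this with a single existing trip; the variable names are illustrative.

```python
ok_trips = [['1', '2', 6, 2]]
trip = ['1', '8', 4, 2]

for i, stop in enumerate(ok_trips):
    if stop[0] == trip[0] and stop[1] == trip[1]:
        # On the second pass, stop is the very object that was just appended.
        stop[2] += trip[2]
        stop[3] += trip[3]
        break
    else:
        if i == len(ok_trips) - 1:
            ok_trips.append(trip)  # the list grows while it is being iterated

print(ok_trips)
```

The appended trip ends up with 8 adults and 4 children instead of 4 and 2. The very first trip is spared because it is appended outside the loop, which matches the pattern described above.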
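Be specific. Simplify your question.
Nice. Two comments. Lines longer than 79 characters are discouraged; you should read PEP 8. Also, if you use filter instead of the "if" statement, you lower the cyclomatic complexity, the code becomes more readable, and you get to use more of those 79 characters :) Instead of if file.endswith(".csv") you can write files = filter(lambda f: f.endswith(".csv"), files)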
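For comparison, here is the filter suggestion next to an equivalent list comprehension. The file names are made up, and the sketch assumes Python 3, where filter returns a lazy iterator and therefore needs list() if an actual list is wanted.

```python
files = ['a.csv', 'notes.txt', 'b.csv']  # made-up file names

with_filter = list(filter(lambda f: f.endswith('.csv'), files))
with_comprehension = [f for f in files if f.endswith('.csv')]

print(with_filter)
print(with_comprehension)
```

Both forms remove one level of nesting from the loop body; which one reads better is largely a matter of taste.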
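Sorting is not actually needed; it was added because groupby is used. Sorting introduces O(n*log(n)) time complexity. This solution is much faster than the one posted in the question, but it can still be improved.
Thanks! I learned a lot from your code (mainly operator.itemgetter and itertools.groupby). I will use it in the future. I like it, but I found eguaio's to be faster and to need fewer lines of code. Unfortunately I cannot mark this answer as useful (not enough reputation).
Great! Powerful, fast, and very few lines. I did not know about generator expressions, or even generators. It is really useful. My data and requirements became more complex, but I managed to fit it all in anyway.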