Python: zip two lists together based on matching dates in strings


I have two lists of files pulled from an FTP folder using:

sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')
This produces two lists like the following:

sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']

oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']
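With my real dataset, something like

for sales, orders in zip(sorted(sFiles), sorted(oFiles)):
    df = pd.concat(...)
gives the desired result. Occasionally something goes wrong, though, and the two files do not both make it into the right FTP folder, so I would like some code that builds an iterable of matching orders and sales file names based on the date.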

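The following works... I'm not sure how I would grade it. Readability is poor, but it is a comprehension, so I assume there is a performance gain?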

[(sales, orders) for sales in sFiles for orders in oFiles if re.search(r'\d+',sales).group(0) == re.search(r'\d+',orders).group(0)]

You can use a dictionary:

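import collections

# Map each date to the files found for it.
d = collections.defaultdict(dict)

sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')
# Loop over each list separately: zip() stops at the shorter list,
# so it would drop entries if the lists had different lengths.
for sale in sFiles:
    d[sale.split("_")[0]]["sales"] = sale
for order in oFiles:
    d[order.split("_")[0]]["orders"] = order
print(dict(d))
Your data is now structured as:
{"date": {"sales": "sales filename", "orders": "orders filename"}}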

Output:

{'20170828': {'sales': '20170828_sales.csv'}, '20170822': {'sales': '20170822_sales.csv', 'orders': '20170822_orders.csv'}, '20170823': {'orders': '20170823_orders.csv'}, '20170824': {'sales': '20170824_sales.csv', 'orders': '20170824_orders.csv'}, '20170825': {'sales': '20170825_sales.csv', 'orders': '20170825_orders.csv'}, '20170826': {'sales': '20170826_sales.csv', 'orders': '20170826_orders.csv'}, '20170827': {'sales': '20170827_sales.csv', 'orders': '20170827_orders.csv'}}
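Once the files are grouped by date, the matched pairs can be pulled out by keeping only the dates that have both keys. The following is a minimal sketch, assuming the file names always start with the date followed by an underscore; the shortened sample lists and the name `pairs` are illustrative and not part of the original answer.

```python
from collections import defaultdict

# Shortened sample data in the same naming scheme as the question.
sFiles = ['20170822_sales.csv', '20170824_sales.csv', '20170828_sales.csv']
oFiles = ['20170822_orders.csv', '20170823_orders.csv', '20170824_orders.csv']

# Group file names by their date prefix (the part before the underscore).
d = defaultdict(dict)
for name in sFiles:
    d[name.split("_")[0]]["sales"] = name
for name in oFiles:
    d[name.split("_")[0]]["orders"] = name

# Keep only dates that have both a sales and an orders file, in date order.
pairs = [(files["sales"], files["orders"])
         for day, files in sorted(d.items())
         if "sales" in files and "orders" in files]

for sales, orders in pairs:
    print(sales, orders)
```

Dates with only one of the two files (20170823 and 20170828 in this sample) are skipped, which is the behaviour the question asks for.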
Edit:

Building on the suggested list-comprehension solution and using dictionaries, you can try this:

import re
final_data = [{"sold": sold, "order": order} for sold in sFiles for order in oFiles if re.findall(r"\d+", sold)[0] == re.findall(r"\d+", order)[0]]
Output:

[{'sold': '20170822_sales.csv', 'order': '20170822_orders.csv'}, {'sold': '20170824_sales.csv', 'order': '20170824_orders.csv'}, {'sold': '20170825_sales.csv', 'order': '20170825_orders.csv'}, {'sold': '20170826_sales.csv', 'order': '20170826_orders.csv'}, {'sold': '20170827_sales.csv', 'order': '20170827_orders.csv'}]

Using the DataFrame's index:

import pandas as pd
sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']

# Parse the leading YYYYMMDD of each file name into a timestamp.
s_dates = [pd.to_datetime(file[:8], format='%Y%m%d') for file in sFiles]
s_df = pd.DataFrame({'sFiles': sFiles}, index=s_dates)

o_dates = [pd.to_datetime(file[:8], format='%Y%m%d') for file in oFiles]
o_df = pd.DataFrame({'oFiles': oFiles}, index=o_dates)

# An outer join keeps every date; missing files show up as NaN.
df = s_df.join(o_df, how='outer')
Which gives:

>>> print(df)
                        sFiles               oFiles
2017-08-22  20170822_sales.csv  20170822_orders.csv
2017-08-23                 NaN  20170823_orders.csv
2017-08-24  20170824_sales.csv  20170824_orders.csv
2017-08-25  20170825_sales.csv  20170825_orders.csv
2017-08-26  20170826_sales.csv  20170826_orders.csv
2017-08-27  20170827_sales.csv  20170827_orders.csv
2017-08-28  20170828_sales.csv                  NaN
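If only the dates that have both files are wanted, one option is an inner join instead of an outer one. This is a sketch rather than part of the original answer; it assumes the first eight characters of every file name are a YYYYMMDD date, and the names `matched` and `pairs` are illustrative.

```python
import pandas as pd

sFiles = ['20170822_sales.csv', '20170824_sales.csv', '20170825_sales.csv',
          '20170826_sales.csv', '20170827_sales.csv', '20170828_sales.csv']
oFiles = ['20170822_orders.csv', '20170823_orders.csv', '20170824_orders.csv',
          '20170825_orders.csv', '20170826_orders.csv', '20170827_orders.csv']

# Index each list of file names by the date parsed from its prefix.
s_df = pd.DataFrame({'sFiles': sFiles},
                    index=pd.to_datetime([f[:8] for f in sFiles], format='%Y%m%d'))
o_df = pd.DataFrame({'oFiles': oFiles},
                    index=pd.to_datetime([f[:8] for f in oFiles], format='%Y%m%d'))

# An inner join keeps only the dates present in both indexes.
matched = s_df.join(o_df, how='inner')

# Plain (sales, orders) tuples, ready to loop over.
pairs = list(matched.itertuples(index=False, name=None))
for sales, orders in pairs:
    print(sales, orders)
```

Calling `dropna()` on the outer-joined frame would be another way to reach the same rows.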

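Just because comprehensions exist doesn't mean you should use them for everything. This is fine:

import re

date = re.compile(r'\d+')
for sales in sFiles:
    salesDate = date.search(sales).group(0)
    for orders in oFiles:
        orderDate = date.search(orders).group(0)
        if salesDate == orderDate:
            print(sales, orders)
Can it be made faster? Yes. But you don't need to force it into a list comprehension just because you can. Sometimes writing more code is better, simply because it spreads the complexity out a little.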

Here is an incremental improvement that makes the algorithm O(n):
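One way such a linear-time pass could look, as a sketch only: index the orders files by date once, then look each sales file up in that index. It assumes the date is always the first eight characters of the file name; `orders_by_date` and `pairs` are illustrative names.

```python
sFiles = ['20170822_sales.csv', '20170824_sales.csv', '20170825_sales.csv',
          '20170826_sales.csv', '20170827_sales.csv', '20170828_sales.csv']
oFiles = ['20170822_orders.csv', '20170823_orders.csv', '20170824_orders.csv',
          '20170825_orders.csv', '20170826_orders.csv', '20170827_orders.csv']

# One pass over oFiles: index the orders files by their date prefix.
orders_by_date = {name[:8]: name for name in oFiles}

# One pass over sFiles, with an average constant-time dict lookup per file.
pairs = [(sales, orders_by_date[sales[:8]])
         for sales in sFiles
         if sales[:8] in orders_by_date]

for sales, orders in pairs:
    print(sales, orders)
```

The pairs come out in the order of `sFiles`; sorting that list first gives date order at the cost of an O(n log n) sort.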


This creates a generator that yields the matching pairs in date order:

from collections import defaultdict

def match(sales, orders):
    # When a key is referenced for the first time, the value
    # will default to the result of the lambda.
    d = defaultdict(lambda: [None, None])

    # Set sales files on the first entry in the value.
    for sale in sales:
        d[sale[:8]][0] = sale
    # Set orders files on the second entry.
    for order in orders:
        d[order[:8]][1] = order

    for k in sorted(d):
        # Both values need to exist.
        # If you want the singles, remove the if.
        if all(d[k]):
            yield d[k]

sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']

for s,o in match(sFiles,oFiles):
    print(s,o)
Output:

20170822_sales.csv 20170822_orders.csv
20170824_sales.csv 20170824_orders.csv
20170825_sales.csv 20170825_orders.csv
20170826_sales.csv 20170826_orders.csv
20170827_sales.csv 20170827_orders.csv

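What are you doing here? Is it possible to do this with a dictionary comprehension? Check the desired solution in the edited version of my question.

@YaleNewman Please see my recent edit and let me know whether my solution matches what you had in mind.

Yes, that is what I originally had in mind. However, I'm fairly confident that leveraging the index produces the most efficient solution. I could be wrong.

@YaleNewman Regarding efficiency, the first zip is O(n); the list comprehension, however, is O(n^2). In this case, with a smaller input, the dictionary is more efficient. However, Pandas is faster when accessing data by column.

@Ajax, actually, Pandas doesn't pay off until the data is much larger. I added some timings in a comment below the question.

20170823_orders.csv was missing. It is fixed now.

@Hazzles I think this solution is probably the fastest, since pandas uses C or something else faster than Python under the hood?

For a large number of items, it is certainly faster than the nested for-loop approach. I suspect most of the speedup comes from turning the problem into a set-style operation, i.e. taking the union of two indexes, and that the C-under-the-hood speedup is a second-order effect.

@Hazzles Is there any solid reading material on big O / how to make programs more efficient?

Your example is O(n²), so it will be inefficient for large datasets. A regex is overkill: sales[:8]==orders[:8] is fine if the naming is consistent.

Yes, I know the nested loop isn't ideal. I was hoping there was a way to use a lambda function inside the zip function. Also, the naming convention of the files will always be consistent.

Interestingly, here are some timings for the original data and for 1000 pairs of dated files: listcomp with re: 117us (original data only), listcomp with [:8] slice: 10.2us/249ms, my solution: 13.5us/1.63ms, pandas solution: 2.41ms/50.2ms. So the listcomp with the [:8] slice is fastest for small data but scales very poorly. pandas is actually the worst, but it is only 20x slower on the large data, whereas my Python solution is 120x slower on the large data, so pandas may end up faster for large datasets. Moral of the story: measure!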
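A comparison like the one in the last comment can be reproduced with the standard-library `timeit` module. The sketch below is not the commenter's benchmark: the generated file names, the two functions `nested` and `indexed`, and the repeat count are all assumptions made for illustration, and the numbers it prints will differ from machine to machine.

```python
import timeit
from datetime import date, timedelta

# Build 1000 matching pairs of hypothetical file names.
start = date(2017, 1, 1)
days = [(start + timedelta(days=i)).strftime('%Y%m%d') for i in range(1000)]
sFiles = [d + '_sales.csv' for d in days]
oFiles = [d + '_orders.csv' for d in days]

def nested(sales_files, orders_files):
    # O(n^2): compare every sales file against every orders file.
    return [(s, o) for s in sales_files for o in orders_files if s[:8] == o[:8]]

def indexed(sales_files, orders_files):
    # O(n): index the orders files by date, then look each sales file up.
    by_date = {o[:8]: o for o in orders_files}
    return [(s, by_date[s[:8]]) for s in sales_files if s[:8] in by_date]

# Both approaches should agree before their speed is compared.
assert nested(sFiles, oFiles) == indexed(sFiles, oFiles)

for fn in (nested, indexed):
    seconds = timeit.timeit(lambda: fn(sFiles, oFiles), number=3) / 3
    print(f'{fn.__name__}: {seconds * 1e3:.2f} ms per call')
```

Changing the `range(1000)` to a smaller or larger size shows how each approach scales with the number of files.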