Python 有没有办法优化os.walk的运行时间？_Python_Optimization_Os.walk

Python 有没有办法优化os.walk的运行时间？

python optimization

Python 有没有办法优化os.walk的运行时间？,python,optimization,os.walk,Python,Optimization,Os.walk,我寻找一种方法来优化我的代码，使其在7秒内运行。目前它的运行时间为20秒。有什么线索吗 if(platform.system() == "Windows"): app = string + "." + extension for root, dirs, files in os.walk("C:\\"): if app in files: path = os.path.join(root, app) os.startfile(

我寻找一种方法来优化我的代码，使其在7秒内运行。目前它的运行时间为20秒。有什么线索吗

if(platform.system() == "Windows"):
    app = string + "." + extension
    for root, dirs, files in os.walk("C:\\"):
        if app in files:
            path = os.path.join(root, app)
        os.startfile(path)

我在网上发现了一个有趣的话题，说你可以把速度提高5-9倍：取自

您正在搜索整个'C:\`但是..如果应用程序的扩展名从未更改，为什么要检查app.endswith.exe？请将其设置为string+。+扩展很抱歉，我修复了它。可能是重复的用户特别要求使用Python。如果有人使用的是传统的Python应用程序，那么切换到另一种语言是没有帮助的。您的新答案是关于实现更快的os.walk的建议。同样，这对提出原始问题的人没有帮助。我在上面的评论中提到了一个相同的问题；真正的答案就在这里：要么升级到Python 3.5，要么安装scandir。

I've noticed that os.walk() is a lot slower than it needs to be because it
does an os.stat() call for every file/directory. It does this because it uses
listdir(), which doesn't return any file attributes.

So instead of using the file information provided by FindFirstFile() /
FindNextFile() and readdir() -- which listdir() calls -- os.walk() does a
stat() on every file to see whether it's a directory or not. Both
FindFirst/FindNext and readdir() give this information already. Using it would
basically bring the number of system calls down from O(N) to O(log N).

I've written a proof-of-concept (see [1] below) using ctypes and
FindFirst/FindNext on Windows, showing that for sizeable directory trees it
gives a 4x to 6x speedup -- so this is not a micro-optimization!

I started trying the same thing with opendir/readdir on Linux, but don't have
as much experience there, and wanted to get some feedback on the concept
first. I assume it'd be a similar speedup by using d_type & DT_DIR from
readdir().

The problem is even worse when you're calling os.walk() and then doing your
own stat() on each file, for example, to get the total size of all files in a
tree -- see [2]. It means it's calling stat() twice on every file, and I see
about a 9x speedup in this scenario using the info FindFirst/Next provide.

So there are a couple of things here:

1) The simplest thing to do would be to keep the APIs exactly the same, and
get the ~5x speedup on os.walk() -- on Windows, unsure of the exact speedup on
Linux. And on OS's where readdir() doesn't return "is directory" information,
obviously it'd fall back to using the stat on each file.

2) We could significantly improve the API by adding a listdir_stat() or
similar function, which would return a list of (filename, stat_result) tuples
instead of just the names. That might be useful in its own right, but of
course os.walk() could use it to speed itself up. Then of course it might be
good to have walk_stat() which you could use to speed up the "summing sizes"
cases.

Other related improvements to the listdir/walk APIs that could be considered
are:

* Using the wildcard/glob that FindFirst/FindNext take to do filtering -- this
would avoid fnmatch-ing and might speed up large operations, though I don't
have numbers. Obviously we'd have to simulate this with fnmatch on non-Windows
OSs, but this kind of filtering is something I've done with os.walk() many
times, so just having the API option would be useful ("glob" keyword arg?).

* Changing listdir() to yield instead of return a list (or adding yieldir?).
This fits both the FindNext/readdir APIs, and would address issues like [3].

Anyway, cutting a long story short -- do folks think 1) is a good idea? What
about some of the thoughts in 2)? In either case, what would be the best way
to go further on this?

Thanks,
Ben.