Making my awk shell script more efficient (parsing Python)


python bash shell awk


I was recently tasked with auditing all the Python modules my team uses across our entire production codebase.

I came up with the following:

find ~/svn/ -name '*.py' |
  xargs grep -hn "^import\|^from" |
  awk -F ":" '{print $2}' |
  awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
  sed 's/,\| /\n/g' |
  sort |
  uniq > /tmp/pythonpkgs.txt

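In order, it:

1. Finds all Python files.
2. Of those, greps for lines that begin with `import` or `from`.
3. Splits on the `:` character and keeps what follows, so the file name and line number are not included.
4. If the line has the form `from foo import bar`, prints `foo`; otherwise, if the line has the form `import foo`, prints `foo`.
5. For lines like `import a, b, c`, replaces the commas and spaces with newlines.
6. Sorts the output and keeps the unique values.

I cobbled this together myself, but I figure it could be better. How would you do it? Consolidate the `awk`s?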
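
The steps listed above can be sketched in Python as a cross-check of what the pipeline is meant to produce. This is only an illustration under assumptions: the function name `modules_from_lines` and the sample lines are made up here, and it works on lines already in memory rather than on files found under ~/svn.

```python
import re


def modules_from_lines(lines):
    """Mimic the pipeline: keep import lines, extract names, de-duplicate, sort."""
    found = set()
    for line in lines:
        if line.startswith("from "):
            # "from foo import bar" -> "foo"
            parts = line.split()
            if len(parts) > 1:
                found.add(parts[1])
        elif line.startswith("import "):
            # "import a, b, c" -> "a", "b", "c" (split on commas and spaces)
            for name in re.split(r"[,\s]+", line[len("import "):].strip()):
                if name:
                    found.add(name)
    return sorted(found)


# Hypothetical sample input, standing in for the grep output.
sample = ["import os, sys", "from foo import bar", "x = 1", "import os"]
print(modules_from_lines(sample))
```

Like the shell version, this splits on spaces as well as commas, so a line such as `import a as b` would also report `as` and `b`.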
This is a pretty clever setup to begin with, but there are a few places where it could be cleaned up:
1: find ~/svn/ -name '*.py' |
2:   xargs grep -hn "^import\|^from" |
3:   awk -F ":" '{print $2}' |
4:   awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
5:   sed 's/,\| /\n/g' |
6:   sort |
7:   uniq > /tmp/pythonpkgs.txt

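Line 3: you don't need the first awk split/print. Just leave `-n` off the grep so that line numbers are not included in the output in the first place.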
time find ./<<my_large_project>> -name '*.py' |
  xargs grep -h "^import\|^from" |
  awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
  sed 's/,\| /\n/g' |
  sort |
  uniq
~~snip~~
real    0m0.492s
user    0m0.208s
sys     0m0.116s

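Note that this will miss multi-line imports. Supporting them would make your regex search noticeably more complicated, since you would have to look for the optional parentheses and keep track of the preceding whitespace.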
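
To make the multi-line limitation concrete, here is a small sketch of what a line-anchored match sees. The sample source and the names `SOURCE` and `LINE_RE` are made up for illustration; the regex only approximates the `grep "^import\|^from"` step.

```python
import re

# Hypothetical sample: one parenthesized import and one backslash continuation.
SOURCE = """\
from collections import (
    OrderedDict,
    defaultdict,
)
import os, \\
    sys
"""

# Roughly what the grep step does: keep lines that start with a keyword.
LINE_RE = re.compile(r"^(?:import|from)\b.*$", re.MULTILINE)

matched = LINE_RE.findall(SOURCE)
for line in matched:
    print(repr(line))

# The continuation lines never match, so "sys" is not seen at all.
print(any("sys" in line for line in matched))
```

Only the first line of each statement is matched. The module `collections` is still found, but `sys`, which sits on a continuation line, is lost.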
Grepping source code for specific constructs is very fragile and can fail in many situations. For example, consider:
import foo ; print 123

and so on.
If you are interested in a more robust approach, this is how to reliably parse the imported names using Python's own compiler:
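import ast

def list_imports(source):
    """Yield the module name of every import statement found in source."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a, b.c" yields "a" and "b.c"
            for name in node.names:
                yield name.name
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from foo import bar" yields "foo";
            # node.module is None for "from . import x", so that form is skipped
            yield node.module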

Usage:
for name in sorted(set(list_imports(some_source))):
    print(name)
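
To cover a whole checkout rather than a single source string, the same AST walk can be applied to every `.py` file under a directory. The sketch below is one possible way to do that, not part of the answer above: the function names `imports_in_source` and `imports_in_tree` are made up here, and files that the running interpreter cannot parse are simply reported and skipped.

```python
import ast
import os


def imports_in_source(source):
    """Return the set of module names imported by one piece of source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


def imports_in_tree(root):
    """Return a sorted list of module names imported by .py files under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            # Read bytes so that ast.parse can honor encoding declarations.
            with open(path, "rb") as handle:
                source = handle.read()
            try:
                found |= imports_in_source(source)
            except (SyntaxError, ValueError):
                # Assumption: unparsable files are skipped rather than fatal.
                print("skipped (could not parse): " + path)
    return sorted(found)
```

Calling `imports_in_tree(os.path.expanduser("~/svn"))` would then give the de-duplicated, sorted list that the shell pipeline writes to /tmp/pythonpkgs.txt, provided the files parse under the Python version running the script.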

Here is my consolidated awk:
/^[ \t]*import .* as/  {
    sub("^[ \t]+","");          # remove leading whitespace
    sub("[ \t]*#.*","");        # remove comments
    print $2;
    next;
}
/^[ \t]*from (.*) import/ {
    sub("^[ \t]+","");          # remove leading whitespace
    sub("[ \t]*#.*","");        # remove comments
    print $2;
    next;
}
/^[ \t]*import (.*)/  {
    sub("^[ \t]+","");          # remove leading whitespace
    sub("[ \t]*#.*","");        # remove comments
    n = split(substr($0,7),a,",");  # split on commas; n is the number of items
    for (i=1;i<=n;i++) {        # length(array) is not portable to every awk
        gsub("[ \t]+","",a[i]); # trim whitespace
        print a[i];
    }
    next;
}

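As noted, it does not account for a few potential cases, such as lines joined with ';', split with '\', or grouped with '()', but it will cover the vast majority of Python code.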
Show an example of the data you are parsing, but from the looks of it, everything except the `find` could be done in a single awk script.
Note that this will fail for imports that span multiple lines, though perhaps those don't occur in your code. I assume you have run the above code safely.
@JID I can post an example if you like, but I want to detect pygal.style in a line that imports from pygal.style, or time in a line that imports time.
@Evert - You're right. I did manually audit a sample, and I believe I don't have to worry about that case.
Based on ~/svn, I assumed he is looking through code that is both in use and no longer in use. I agree this is the most "correct" approach, though.