Shell: partial URL matching

I have two files: file1

file2:

www.neo.com/1/2/3/names.html
http://abc.gov.cn/script.aspx
http://example.com/abc/abc.html

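file2 contains the search URLs to be partially matched against column 2 of file1. If there is a partial match, the column 1 URL and the partially matching URLs from column 2 of file1 must be returned, as shown below:

Expected output: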

http://www.hello.com    http://neo.com/peace/development.html, http://example.com/abc/abc.html
http://news.net
http://example2.com     http://abc.gov.cn/department/1.html

I tried this script, but it only returns URLs in column 2 that match a pattern exactly:

awk -F '[ \t,]' '
FNR == NR {
    a[$1]
    next
}
{    o = $1
    c = 0
    for(i = 2; i <= NF; i++)
        if($i in a)
            o = o (c++ ? ", " : "\t") $i
    print o
}' file2 file1

Any suggestions for fixing this?

Here is an executable awk script:

#!/usr/bin/awk -f

function getHost( url,        host, c, uarr, j ) {   # names after the gap are locals
    c = split( url, uarr, /[/]|:/ )
    for(j=1;j<=c;j++ ) {
        if( index( uarr[j], "." ) ) { host=uarr[j]; break }
    }
    return( host )
}

FNR==NR { host=getHost($1); if( host!="" ) hosts[host]; next }

# expected arguments: file2 FS="[[:space:]]|," file1
{
    end=""
    start = $1 "\t"
    for(i=2;i<=NF;i++) {
        f1h=getHost( $i )

        for( f2h in hosts ) {
            if( length( f2h ) > length( f1h ) )
                { long_host=f2h; short_host=f1h }
            else
                { long_host=f1h; short_host=f2h }

            if( short_host!="" && index( long_host, short_host ) ) {
                if( end!="" ) end = end ", "
                end = end $i
                break
            }
        }
    }
    print start (end!="" ? end : "")
}

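It makes at least the following assumptions:

- What you really want to match is the host element of the URL
- Every valid host contains at least one `.`
- `\t` is the output separator between the first field and any matched fields that are found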

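Breakdown:

- The `getHost()` function returns the first element that contains a `.` as the host
- The `file2` hosts are loaded into the `hosts` array
- Externally, `FS="[[:space:]]|,"` is set before `file1` is processed
- While `file1` is parsed, the first field is stored in `start`, and any subsequent field whose host matches is appended to `end`
- Before two hosts are compared, the longer one is assigned to `long_host` and the shorter one to `short_host`
- If `short_host` is found inside `long_host`, the field is appended to `end`, preceded by a comma when `end` is not empty
- Finally, `start` and any `end` built up from the matches on each line of `file1` are printed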

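Running it produces the desired output, with the caveat that it currently always appends `\t` to `$1` in the output (even when there is no match).

Output:
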
http://www.hello.com    http://neo.com/peace/development.html, http://example.com/abc/abc.html
http://news.net
http://example2.com     http://abc.gov.cn/department/1.html
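The script relies on `FS` being assigned between the two file arguments. As a rough illustration of that mechanism only, the sketch below counts fields in two identical throwaway files; the file names `first.txt` and `second.txt` and their contents are assumptions made up for this demonstration.

```shell
#!/bin/sh
# Hypothetical sample files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'one two,three\n' > "$tmpdir/first.txt"
printf 'one two,three\n' > "$tmpdir/second.txt"

# FS=... is an operand, not an option: awk applies the assignment when it
# reaches that position in the argument list, so first.txt is split with the
# default FS and second.txt with the new one.
result=$(awk '{ print FILENAME ": " NF }' \
    "$tmpdir/first.txt" FS='[[:space:]]|,' "$tmpdir/second.txt" |
    sed "s|$tmpdir/||")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With these inputs the first file yields 2 fields per line and the second yields 3, which is why the hosts in file2 can be read with default splitting while the comma-separated URLs in file1 are split individually.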

You can also make the output uniform:

awk -f script.awk file2 file1 | column -t -s $'\t' -o '    '

See man column.

Script version with column:

#!/bin/sh    
awk -- '
    function gethostname(url) {
        sub(/^[a-z]+:\/+/, "", url)
        sub(/^www[.]/, "", url)
        sub(/\/.*$/, "", url)
        return url
    }
    BEGIN { FS = "[ ,\t\r]+" }   # + rather than *: avoids a separator that can match the empty string
    NR == FNR {
        a[gethostname($1)]++
        next
    }
    {
        t = ""
        for (i = 2; i <= NF; ++i) {
            if (gethostname($i) in a) {
                t = length(t) ? t ", " $i : $i
            }
        }
        print length(t) ? $1 "|" t : $1
    }
' "$@" | column -t -s '|' -o '    '
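To see which key `gethostname()` derives from a URL, the function can be called on single values. This is a small sketch under stated assumptions: `normalize_host` is a hypothetical wrapper name, and `http://sub.neo.com/page.html` is an invented URL added to show a case that does not reduce to `neo.com`.

```shell
#!/bin/sh
# normalize_host URL (hypothetical wrapper): prints the key that
# gethostname() from the script computes for one URL.
normalize_host() {
    awk -v url="$1" '
        function gethostname(url) {
            sub(/^[a-z]+:\/+/, "", url)   # drop a scheme such as http://
            sub(/^www[.]/, "", url)       # drop a leading www.
            sub(/\/.*$/, "", url)         # drop the path
            return url
        }
        BEGIN { print gethostname(url) }'
}

normalize_host www.neo.com/1/2/3/names.html            # neo.com
normalize_host http://neo.com/peace/development.html   # neo.com
normalize_host http://abc.gov.cn/script.aspx           # abc.gov.cn
normalize_host http://sub.neo.com/page.html            # sub.neo.com
```

Because the normalized hosts are used as exact array keys, this approach treats `neo.com` and `www.neo.com` as the same host but would not match other subdomains such as `sub.neo.com`.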

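I think `grep -f file2 file1` should work for most cases, except for returning column 1 when there are no matches.

Why is http://abc.gov.cn/department/1.html shown in the output when it does not appear in file2?

I don't think I understand what you are doing here. Are you searching file1 using file2, returning the whole line if a match is found and only the first column if no match is found?

@skamazin If there is a partial match with a URL in file2, the URL from column 2 of file1 should be printed. `http://abc.gov.cn/department/1.html` is printed because it partially matches http://abc.gov.cn/ in file2. That is how I interpret it.

OK, so it only needs to match the first part of the URL... that's hard...
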
The plain awk version of the script (script.awk):

#!/usr/bin/awk -f
function gethostname(url) {
    sub(/^[a-z]+:\/+/, "", url)
    sub(/^www[.]/, "", url)
    sub(/\/.*$/, "", url)
    return url
}
BEGIN { FS = "[ ,\t\r]+" }   # + rather than *: avoids a separator that can match the empty string
NR == FNR {
    a[gethostname($1)]++
    next
}
{
    t = ""
    for (i = 2; i <= NF; ++i) {
        if (gethostname($i) in a) {
            t = length(t) ? t ", " $i : $i
        }
    }
    print length(t) ? $1 "\t" t : $1
}

Run it as:

awk -f script.awk file2 file1

Output:

http://www.hello.com    http://neo.com/peace/development.html, http://example.com/abc/abc.html
http://news.net
http://example2.com     http://abc.gov.cn/department/1.html

The shell version (script.sh) is run as:

sh script.sh file2 file1