Awk word occurrence script

I'm writing a script that counts the occurrences of words in a text document.

BEGIN { printf "%-20s %-6s\n", "Word", "Count" }
{
        $0 = tolower($0)
        for (i = 1; i <= NF; i++)
                freq[$i]++
}
END {
        sort = "sort -k 2nr"
        for (word in freq)
                printf "%-20s %-6s\n", word, freq[word] | sort
        close(sort)
}
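You don't need to write your own loop to scan the fields. Just set RS so that each word becomes its own record. For example, RS="[^A-Za-z]+" treats every run of characters that are not upper- or lowercase letters as the record separator: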

$ echo 'Hello world! I am happy123...' | awk 'BEGIN{RS="[^A-Za-z]+"}$0'
Hello
world
I
am
happy
The lone $0 pattern matches only non-empty records.

Maybe you want to allow digits in words. Just adjust RS to your needs.
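
As a minimal sketch of that adjustment, the variant below adds digits to the set of word characters, so happy123 survives as one word. The pattern is changed from a bare $0 to length($0) > 0. That is my own substitution, not part of the answer, made because a record consisting only of 0 would evaluate as false and be dropped. Note also that gawk and mawk treat a multi-character RS as a regular expression, while POSIX leaves that behaviour unspecified. The variable name words is just for illustration.

```shell
# Same input as above, but digits now count as word characters.
# length($0) > 0 keeps every non-empty record, including a lone "0".
words=$(echo 'Hello world! I am happy123...' |
    awk 'BEGIN { RS = "[^A-Za-z0-9]+" } length($0) > 0')
printf '%s\n' "$words"
```

This prints Hello, world, I, am and happy123, one per line.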

What's left

Convert to lowercase, count, and print the sorted results.

File wfreq.awk

BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
        printf "%-20s %6s\n", "Word", "Count"
        sort = "sort -k 2nr"
        for(word in counts)
                printf "%-20s %6s\n",word,counts[word] | sort
        close(sort)
}
Sample run (only the first 10 lines of output, so as not to spam the answer):

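And now for something not completely different

To sort by a different field, just adjust the options in sort = "sort ...".

I don't use asort() because not every awk has this extension.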
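
To illustrate adjusting the sort options, here is a hedged sketch that orders rows by count (descending) and breaks ties alphabetically, using the standard sort key syntax -k2,2nr -k1,1. The sample text and the rows variable are made up for the illustration, and the header is left out so that only the sorted rows are shown.

```shell
# Hypothetical sample text; any input works the same way.
rows=$(printf '%s\n' 'The cat saw the dog. The dog ran!' |
    awk 'BEGIN { RS = "[^A-Za-z]+" }
         $0 { counts[tolower($0)]++ }
         END {
             # -k2,2nr: count, numeric, descending; -k1,1: word, ascending
             sort = "sort -k2,2nr -k1,1"
             for (word in counts)
                 printf "%-20s %6s\n", word, counts[word] | sort
             close(sort)
         }')
printf '%s\n' "$rows"
```

With this input the rows come out as the (3), dog (2), then cat, ran and saw (1 each). Restricting each key with the N,N form keeps sort from comparing everything from that field to the end of the line.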

File wfreq2.awk

BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
        printf "%-20s %6s\n", "Word", "Count"
        sort = "sort -k 1"
        for(word in counts)
                printf "%-20s %6s\n",word,counts[word] | sort
        close(sort)
}
Sample run (only the first 10 lines of output, so as not to spam the answer):


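Can you give some example input and output? PS: end and end. will be listed as two separate words. You should strip ?, the comma, the double quote, and so on.

This is very helpful, yeti. If I want to sort the counter array by index number (rather than by count), how would I do that? The output would show the index, the word, and then the count.

I don't understand "sort the counter array by index number", because the counter array is indexed by the lowercase words. So what do you expect to get from sorting by index number? Sorting in the order in which the words first occur?

I didn't explain that well. What if I want to sort the array by the word itself (alphabetically)? For example: Index 1, Word a, Count 5. The header should appear only on the first line. I can sort the array by index number and by frequency, but not by the word itself.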
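
One possible way to get the output described in the last comment (a running index, the word, then the count, ordered alphabetically, with the header only on the first line) is to let sort order the rows and number them in a second awk pass. This is only a sketch under assumptions: the sample text, the column widths and the table variable are invented for illustration, and LC_ALL=C is set so that the ordering is plain byte order.

```shell
# Hypothetical sample text standing in for the real document.
table=$(printf '%s\n' 'b a c a. B a, c a a' |
    awk 'BEGIN { RS = "[^A-Za-z]+" }
         $0 { counts[tolower($0)]++ }
         END { for (word in counts) print word, counts[word] }' |
    LC_ALL=C sort -k1,1 |
    awk 'BEGIN { printf "%-6s %-20s %6s\n", "Index", "Word", "Count" }
         { printf "%-6d %-20s %6d\n", NR, $1, $2 }')
printf '%s\n' "$table"
```

With this input the first data row is index 1, word a, count 5, followed by b and c with a count of 2 each. Because the header is printed by the last stage of the pipeline, it always appears before the sorted rows and is never passed through sort.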