Bash: how do I create a frequency list of every word in a file?

Tags: bash, file-io, sed, grep

I have a file like this:

This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I would like to produce a two-column list. The first column shows each word that appears; the second column shows how often that word appears, for example:

this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1 
  • To make this simpler, I will remove all punctuation and change all of the text to lower-case letters before processing the list.
  • Unless there is a simple solution, words and word can count as two separate words.
So far I have this:

sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
     count="$(grep -c $line file1.txt)"
     echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, all this does is print "0" after each word.


How can I produce a list of every word that appears in a file, along with how often each one appears?

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

This might work for you:

tr '[:upper:]' '[:lower:]' <file |
tr -d '[:punct:]' |
tr -s ' ' '\n' | 
sort |
uniq -c |
sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'
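
Run against the sample file above, this pipeline should yield the requested word@count format; the counts below were tallied by hand from the sample, so treat them as illustrative:

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once@1
one@1
only@1
some@2
than@1
the@2
this@1
time@1
with@1
words@3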

The sorting requires GNU awk (gawk). If you have a different awk without asort(), it can easily be adjusted and then piped to sort:

awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile
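
If your awk lacks asort(), a minimal sketch of that adjustment (inputfile is a placeholder name): build the same counts, print them unsorted, and let an external sort order the output:

awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) count[tolower($i)]++}
     END {for (w in count) print w "@" count[w]}' inputfile | sort
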
Contents of the input file:

$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
Using sed | sort | uniq:

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
      1 a
      2 appear
      1 file
      1 is
      1 many
      1 more
      2 of
      1 once
      1 one
      1 only
      2 some
      1 than
      2 the
      1 this
      1 time
      1 with
      3 words

uniq -ic would count while ignoring case, but the result list would then contain This instead of this.
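
As a quick illustration with a hypothetical two-line input, uniq -ic folds case when counting but reports whichever spelling comes first in the sorted input:

$ printf 'This\nthis\n' | sort -f | uniq -ic
      2 This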


uniq -c already does what you want; just sort the input first:

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c
Output:

  6 a
  7 d
  7 s

Let's do it in awk! This function lists the frequency of each word occurring in the provided input, in descending order:

wordfrequency() {
  awk '
     BEGIN { FS="[^a-zA-Z]+" } {
         # count every alphabetic token, case-folded
         for (i = 1; i <= NF; i++) {
             word = tolower($i)
             words[word]++
         }
     }
     END {
         for (w in words)
             printf("%3d %s\n", words[w], w)
     }' | sort -rn
}

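Usage is a matter of piping text into it (file.txt is a placeholder; the function reads standard input):

cat file.txt | wordfrequency
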
Let's do it in Python 3!

"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""

# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/

import sys

# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
    lines = sys.stdin
else:
    lines = open(sys.argv[1])

D = {}
for line in lines:
    for word in line.split():
        word = ''.join(list(filter(
            lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
            word)))
        word = word.lower()
        if word in D:
            D[word] += 1
        else:
            D[word] = 1

for word in sorted(D, key=D.get, reverse=True):
    print(word + ' ' + str(D[word]))
Now, to find the frequency of words in the file 'content.txt':
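
The invocation below assumes the script has been saved as an executable named freq on your PATH (the name used by the pipe examples that follow):

freq content.txt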

You can also pipe text into it:

cat content.txt | freq
Or even analyze text from several files at once:

cat content.txt story.txt article.txt | freq

If you're using Python 2, just replace:

  • ''.join(list(filter(args...))) with filter(args...)
  • python3 with python
  • print(whatever) with print whatever

#!/usr/bin/env bash
declare -A map
words="$1"

[[ -f $1 ]] || { echo "usage: $(basename $0) wordfile"; exit 1; }

while read line; do
  for word in $line; do
    ((map[$word]++))
  done
done < "$words"

# output step (a reconstruction, an assumption): print each word with its count
for word in "${!map[@]}"; do
  echo "$word@${map[$word]}"
done | sort
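
A usage sketch, assuming the script was saved under the hypothetical name wordfreq.sh:

chmod +x wordfreq.sh
./wordfreq.sh file1.txt
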
You can do this using just tr ('\12' is the octal escape for a newline); simply run:

      tr ' ' '\12' <NAME_OF_FILE| sort | uniq -c | sort -nr > result.txt
      

      
awk '
BEGIN { word[""] = 0; }
{
    # count every whitespace-separated field
    for (el = 1; el <= NF; el++) { word[$el]++ }
}
END {
    # print step is a reconstruction (an assumption): emit word@count pairs
    for (w in word) if (w != "") print w "@" word[w]
}' file1.txt

If I have the following text in my file.txt:

      This is line number one
      This is Line Number Tow
      this is Line Number tow
      
I can find the frequency of each word using the following command:

       cat file.txt | tr ' ' '\n' | sort | uniq -c
      
Output:

        3 is
        1 line
        2 Line
        1 number
        2 Number
        1 one
        1 this
        2 This
        1 tow
        1 Tow
      


This is a great solution. One thing you may want to do, though, is provide a way to remove trailing periods; perhaps insert sed -e 's/\.$//g' between tr and sort in your pipeline.

I considered that, but the original post says punctuation will be removed before this step.

OK, then just a modification to your solution, in case punctuation and capital letters are not removed; this also deletes unnecessary whitespace, squeezes extra spaces, and prints the highest-frequency words first:

cat file.txt | tr '[:punct:]' ' ' | tr 'A-Z' 'a-z' | tr -s ' ' '\n' | sort | uniq -c | sort -rn

This is nice. Is there a way to make it skip comments? For example, \comment.
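
One possible way to skip such comments (a sketch assuming comment lines begin with a backslash, as in the \comment example): filter them out with grep -v before counting:

grep -v '^\\' file.txt | tr '[:punct:]' ' ' | tr 'A-Z' 'a-z' | tr -s ' ' '\n' | sort | uniq -c | sort -rn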