Linux: How do I find the counts of multiple words in a text file?

linux shell

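I can find the number of times a single word occurs in a text file on Linux with

cat filename|grep -c tom

My question is: how do I find the counts of multiple words, such as "tom" and "joe", in a text file?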

OK, first split the file into words, then run sort and uniq:

tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
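The pipeline above counts every word in the file. To keep only the names from the question, one option is to filter the word list before sorting. This is a minimal sketch rather than a definitive answer; the sample text and the variable names sample and counts are made up for illustration:

```shell
# Sample input, written to a temporary file (illustrative data only).
sample=$(mktemp)
printf 'tom is really really cool!  joe for the win!\ntom is actually lame.\n' > "$sample"

# Split into one word per line, keep only lines that are exactly "tom" or "joe"
# (-F: fixed strings, -x: whole-line match), then count each distinct word.
counts=$(tr -cs '[:alnum:]' '\n' < "$sample" | grep -Fx -e tom -e joe | sort | uniq -c)
printf '%s\n' "$counts"
```

Because each word sits on its own line before grep runs, "atom" or "tomorrow" would not be counted as "tom".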

sort filename | uniq -c

Using awk:

{for (i=1;i<=NF;i++)
    count[$i]++
}
END {
    for (i in count)
        print count[i], i
}
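The awk program above counts every whitespace-separated field, so "cool!" and "cool" would be tallied separately. A possible variation, sketched here under the assumption that you only want a fixed list of words, strips punctuation and reports only those words. The variable name wanted and the sample text are illustrative, not part of the original answer:

```shell
# Sample input, written to a temporary file (illustrative data only).
sample=$(mktemp)
printf 'tom is really really cool!  joe for the win!\ntom is actually lame.\n' > "$sample"

counts=$(awk -v wanted='tom joe' '
BEGIN {
    # Turn the space-separated list into a lookup table.
    n = split(wanted, w, " ")
    for (k = 1; k <= n; k++) want[w[k]] = 1
}
{
    for (i = 1; i <= NF; i++) {
        word = $i
        gsub(/[^A-Za-z0-9]/, "", word)   # drop punctuation such as "!" or "."
        if (word in want) count[word]++
    }
}
END {
    # Print in the order the words were given; "+ 0" shows 0 for absent words.
    for (k = 1; k <= n; k++) print count[w[k]] + 0, w[k]
}' "$sample")
printf '%s\n' "$counts"
rm -f "$sample"
```

For the sample text this prints "2 tom" followed by "1 joe".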

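By the way, you do not need cat in your example. Most programs that act as filters can take a file name as an argument, so it is better to use

grep -c tom filename

If you don't, there is a good chance people will start throwing things at you ;-)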

Here is one:

cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c

Update

Shell script solution:

#!/bin/bash

file_name="$2"
string="$1"

if [ $# -ne 2 ]
  then
   echo "Usage: $0 <pattern to search> <file_name>"
   exit 1
fi

if [ ! -f "$file_name" ]
 then
  echo "file \"$file_name\" does not exist, or is not a regular file"
  exit 2
fi

line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0

# line_no_list stores pairs: index k holds a line number and
# index k+1 holds how many times the string occurs on that line
while IFS= read -r line
 do
  flag=0
  while [[ "$line" == *"$string"* ]]
   do
    flag=1
    line_no_list[line_no_indx]=$curr_line_indx
    line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
    total_occurance=$((total_occurance+1))
# replace the first occurrence of "$string" with nothing and recheck
    line=${line/"$string"/}
  done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
  if (( flag == 1 ))
   then
    line_no_indx=$((line_no_indx+2))
  fi
  curr_line_indx=$((curr_line_indx+1))
done < "$file_name"


echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line Number : Number of Occurrences in this line]: "

for ((i=0; i<line_no_indx; i=i+2))
 do
  echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done

echo

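The example you gave does not search for the word "tom". It will also count "atom", "bottom", and more.
Grep searches for regular expressions. The regular expression that matches the word "tom" or "joe" is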
\

You can use a regexp:
 cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"

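Since you have two names, the regular expression is an alternation of the two. At first I thought this was as simple as a grep count on a regex for joe or tom, but that does not account for tom and joe appearing on the same line (or tom and tom, for that matter).
test.txt: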
tom is really really cool!  joe for the win!
tom is actually lame.


$ grep -c '\<\(tom\|joe\)\>' test.txt
2

3 ... the right answer! Hope this helps.
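The count of 2 shown above is a count of matching lines. One way to count individual matches instead is to print each match on its own line and count those lines. This is a sketch that assumes GNU grep (for -o and the \< \> \| operators); the variable names are illustrative:

```shell
# Recreate the test.txt contents from the answer in a temporary file.
sample=$(mktemp)
printf 'tom is really really cool!  joe for the win!\ntom is actually lame.\n' > "$sample"

# -c counts lines that contain at least one match.
lines=$(grep -c '\<\(tom\|joe\)\>' "$sample")

# -o prints every match on its own line; wc -l then counts the matches.
hits=$(grep -o '\<\(tom\|joe\)\>' "$sample" | wc -l | tr -d ' ')

echo "matching lines: $lines, total matches: $hits"
rm -f "$sample"
```

On the two-line sample this reports 2 matching lines and 3 total matches.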
I completely forgot about grep -f:
cat filename | grep -c -f names
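Note that -f takes the pattern file as its argument, so -c has to be given separately, and that -c still counts lines rather than occurrences. If a per-name tally is wanted, a possible sketch (assuming GNU grep and a names file with one name per line; the file contents here are made up) looks like this:

```shell
# Hypothetical pattern file and sample text, both in temporary files.
names=$(mktemp)
sample=$(mktemp)
printf 'tom\njoe\n' > "$names"
printf 'tom is really really cool!  joe for the win!\ntom is actually lame.\n' > "$sample"

# -f: read patterns from the names file, -F: treat them as fixed strings,
# -w: match whole words only, -o: print each match on its own line.
counts=$(grep -owF -f "$names" "$sample" | sort | uniq -c)
printf '%s\n' "$counts"
rm -f "$names" "$sample"
```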

AWK solution:
Assuming the names are in a file called names:
cat filename | awk 'NR==FNR {h[NR] = $1; ct[NR] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -

You need grep -w
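A small illustration of what -w changes, using made-up input lines. Without -w, grep matches "tom" anywhere in a line; with -w, the match must be a whole word:

```shell
# Three illustrative lines: only the last one contains "tom" as a whole word.
plain=$(printf 'tomorrow\natom\ntom\n' | grep -c tom)    # substring match
whole=$(printf 'tomorrow\natom\ntom\n' | grep -cw tom)   # whole-word match
echo "without -w: $plain, with -w: $whole"
```

This prints "without -w: 3, with -w: 1". Both numbers are still line counts, since -c counts lines.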

gawk -vRS='[^[:alpha:]]+' '{print}' | grep -cE '^(tom|joe|bob|sue)$'

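The gawk program sets the record separator to any run of non-letters, so each word ends up on a line of its own. grep then counts the lines that exactly match one of the words you want.
We use gawk because POSIX awk does not allow a regular-expression record separator.
For brevity, you can replace '{print}' with 1; either way it is an Awk program that simply prints every input record ("Is 1 true? It is? Then perform the default action, which is {print}.").
To find all hits on all lines:
echo "tom is really really cool!  joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'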
3

This will count "tomtom" as two hits.
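The one-liner above prints a single combined total. If a separate count per word is wanted, one possible variation is to call gsub once per word. This is only a sketch: the variable names t, j, and counts are illustrative, and "&" as the replacement puts each match back unchanged so the line is not modified between the two calls:

```shell
counts=$(printf 'tom is really really cool!  joe for the win!\ntom is actually lame.\n' |
    awk '{
        t += gsub(/tom/, "&")   # gsub returns the number of substitutions made
        j += gsub(/joe/, "&")
    }
    END {
        print "tom", t + 0
        print "joe", j + 0
    }')
printf '%s\n' "$counts"
```

For the sample text this prints "tom 2" and "joe 1". Like the original, it matches substrings, so "tomtom" still counts as two.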
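Whoops. Next time I'll read the question properly, shall I? *facepalm*
This approach (split into words, select, count) works if you replace every non-[:alnum:] character with \n.
You may need to watch out for language differences, as in cat Castilian/*.txt | tr A-Z a-z | tr -cs '[a-záóúíñ]' '\n' | sort | uniq -c | sort -n
"Most programs that act as filters can take a file name as an argument" ... and even where they cannot, you can still use input redirection (as in grep -c mom).
grep -c does not look for words, so you have to search for them yourself.
Your solution even accounts for joe and tom being on the same line. Nice!
@Travis: However, it wrongly counts tomtom only once, even though my grandfather could see that there are two toms in there.
grep counts lines, not words. Does a line with tomtom on it count as one or two?
What exactly do you want? Multiple counts, one for each word you specify? The sum of the counts of all the words you specify? What is a "word"-
As tchrist already mentioned, your example counts the number of lines that match the regexp, not the number of words.
I tweaked the regex slightly to handle the tomtom case. Nice test case ... thanks for pointing it out.
The really hard test cases would involve overlapping matches of the original words. :) For example, if the words you want to count are cure, core, rely, lysis, island, land, and dish, then you would get 2 hits on insecurely and outlandish, and 3 hits on islandish and corelysis. A naive approach would count each of them as one. It is not fun with a single regex, but it is very simple with N regexes, one per word.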
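The "N regexes, one per word" idea from the last comment can be sketched as a loop that runs grep once per word, so a match for one word never hides an overlapping match for another. The input string and the variable names text, word, n, and total are illustrative assumptions:

```shell
# One line containing the four test words from the comment above.
text='insecurely outlandish islandish corelysis'
total=0
for word in cure core rely lysis island land dish; do
    # -o prints each match of this one word on its own line; wc -l counts them.
    n=$(printf '%s\n' "$text" | grep -o "$word" | wc -l | tr -d ' ')
    echo "$word: $n"
    total=$((total + n))
done
echo "total: $total"
```

The total comes to 10, which agrees with 2 + 2 + 3 + 3 hits for the four input words. Overlaps of a word with itself are still not counted, because grep -o reports non-overlapping matches for a single pattern.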
$ echo tomorrow | grep -c tom
1
