Bash: count the number of similar lines in a file


I have a thread on a forum where people post their top ten songs. I want to count how many times each song is listed. Lines must be compared case-insensitively.

Example file structure:

Join Date: Apr 2005
Location: bama via new orleans
Age: 48
Posts: 2,369
Re: Top 10 Songs Jethro Tull
oh dearrrr. the only way for all kaths to keep their last shred of sanity: fly through this list as quickly as possible, without stopping to think for a microsecond...
velvet green
dun ringill
skating away on the thin ice of a new day
sossity yer a woman
fat man
life's a long song
jack-a-lynn
teacher
mother goose
elegy

 03-10-2010, 02:29 AM      #5 (permalink)
Sox
Avoiding The Swan Song



Join Date: Jan 2010
Location: Derbyshire, England
Age: 43
Posts: 5,991
 Re: Top 10 Songs Jethro Tull
Wow !!!! Where do I start ?
Dun Ringill
Aqualung
With You There To Help Me
Jack Frost And The Hooded Crow
We Used To Know
Witch's Promise
Pussy Willow
Heavy Horses
My Sunday Feeling
Locomotive Breath

Join Date: Nov 2009
Posts: 1,418
 Re: Top 10 Songs Jethro Tull
Too bad they all can't make the list, but here's ten I never get tired of listening to:

Christmas Song
Witches Promise
Life's A Long Song
Living In The Past
Rainbow Blues
Sweet Dream
Minstral In The Gallery
Cup of Wonder
Rover
Something's On the Move

Example output:

life's a long song 3
aqualung 1
...

This command lists the repeated lines and how many times each one occurs:

sort nameFile | uniq -c 
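
On its own, sort | uniq -c treats "Dun Ringill" and "dun ringill" as different lines. A minimal sketch of one way to make the count case-insensitive and show the most frequent lines first follows; the function name count_lines is a hypothetical helper, not something from the original post.

```shell
#!/usr/bin/env bash
# count_lines: hypothetical helper name. Reads lines on stdin and prints
# "count line" pairs, most frequent first, ignoring case.
count_lines() {
    tr '[:upper:]' '[:lower:]' |   # fold case so "Aqualung" equals "aqualung"
        sort |                     # uniq only merges adjacent duplicates
        uniq -c |                  # prefix each distinct line with its count
        sort -k1,1nr -k2           # highest count first, then alphabetical
}

# Small demonstration on three made-up input lines.
printf '%s\n' 'Dun Ringill' 'Aqualung' 'dun ringill' | count_lines
```

The width of the leading padding that uniq -c prints varies between implementations, so scripts that parse this output should strip leading spaces first.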
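
Note that the "structure" of the file is somewhat lacking, so you will have to put up with some errors in the results.

Assuming you have all of that in a file named input, try: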

tr '[A-Z]' '[a-z]' < input | \
     egrep -v "^ *(join date|age|posts|location|re):" | \
     sort | \
     uniq -c

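The first line lowercases everything, the second strips the email-header-like lines in the sample, and the last two sort the remaining lines and count the unique ones.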
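
The tr step can also be dropped by asking each tool to ignore case itself. The sketch below assumes a uniq that supports -i (GNU and BSD do; POSIX does not require it), and count_songs_nocase is a hypothetical helper name. Like the pipeline above, it does not filter out blank lines.

```shell
#!/usr/bin/env bash
# count_songs_nocase: hypothetical helper. Takes the input file name as $1.
count_songs_nocase() {
    # Drop the header-like lines, matching them in any letter case.
    grep -E -i -v '^ *(join date|age|posts|location|re):' "$1" |
        sort -f |      # sort with lowercase folded to uppercase
        uniq -c -i     # count adjacent lines, ignoring case (GNU/BSD option)
}

# Usage (assuming the data is in a file named input):
#   count_songs_nocase input
```

With this variant each song is printed in the spelling of whichever occurrence sorted first, so the letter case of the output can differ from run to run across locales.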

How about using awk?

awk '
/:/||/^$/{next}{a[toupper($0)]++}
END{for(i in a) print i,a[i]}' INPUT_FILE
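
Explanation: First, we identify the lines that contain a colon (:) or are empty and skip them. Every other line is converted to uppercase and used as a key in an array whose value counts how often that line occurs. In the END block we print every key in the array along with the number of times it was found.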
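
Skipping every line that contains a colon would also drop a song title with a colon in it. A hedged variant is sketched below: it skips only the known header fields, trims surrounding whitespace, and sorts by count. The name tally_songs is a hypothetical helper, and free-text lines such as usernames are still counted, just as in the answer above.

```shell
#!/usr/bin/env bash
# tally_songs: hypothetical helper. Takes the input file name as $1 and
# prints "count<TAB>line", most frequent first.
tally_songs() {
    awk '
        # Skip blank lines and the known header fields (anchored, any case).
        /^[ \t]*$/ { next }
        tolower($0) ~ /^ *(join date|location|age|posts|re):/ { next }
        {
            line = tolower($0)
            gsub(/^[ \t]+|[ \t]+$/, "", line)   # trim both ends
            count[line]++
        }
        END { for (line in count) print count[line] "\t" line }
    ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Usage (assuming the data is in a file named input):
#   tally_songs input
```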

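Comments:

You can pass -i to egrep and skip the call to tr. You would also have to pass -f to sort, though I'm not sure about its behavior.

Nice. I added | sort at the end to see the sorted results. Thank you.

@Mat - good point. -i ignores case in the search, and -f folds everything to lowercase, which is basically what tr does for you.

@jordanm Yes, I fixed it. Thanks.

When populating the array, why not skip tr and just use awk's toupper function?

Very nice. Sorry, I hadn't thought of that. I've updated the answer.

Test:
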
awk '
/:/||/^$/{next}{a[toupper($0)]++}
END{for(i in a) print i,a[i]}' file1
SOX 1
CHRISTMAS SONG 1
CUP OF WONDER 1
SOSSITY YER A WOMAN 1
FAT MAN 1
PUSSY WILLOW 1
VELVET GREEN 1
WITH YOU THERE TO HELP ME 1
ELEGY 1
WE USED TO KNOW 1
TEACHER 1
MY SUNDAY FEELING 1
SWEET DREAM 1
JACK-A-LYNN 1
SOMETHING'S ON THE MOVE 1
ROVER 1
DUN RINGILL 2
AVOIDING THE SWAN SONG 1
JACK FROST AND THE HOODED CROW 1
WITCHES PROMISE 1
LIFE'S A LONG SONG 2
LIVING IN THE PAST 1
WITCH'S PROMISE 1
WOW !!!! WHERE DO I START ? 1
SKATING AWAY ON THE THIN ICE OF A NEW DAY 1
MINSTRAL IN THE GALLERY 1
RAINBOW BLUES 1
MOTHER GOOSE 1
HEAVY HORSES 1
AQUALUNG 1
LOCOMOTIVE BREATH 1