Finding the frequency of each item in a column using the shell


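I'm very inexperienced with the shell and the Mac terminal, so any help or advice would be greatly appreciated.

I have a very large, tab-delimited data set. Here is a sample of the data: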

0001    User1    Tweet1
0002    User2    Tweet2
0003    User3    Tweet3
0004    User2    Tweet4
0005    User2    Tweet5

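I've been trying to export, as a CSV, a list of each unique user along with the number of times they appear (that is, how many tweets they posted).

Here is my attempt at the code so far: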

cut -f 2 Twitter_Data_1 |sort | uniq -c | wc -l > TweetFreq.csv

Ideally, I'd like to export a CSV that looks like this:

User1    1
User2    3
User3    1

Not the cleanest, but it works:

#!/bin/bash
mkdir -p tmptweet # create the temp directory
# Split each line on tabs; -r keeps any backslashes in the data intact
while IFS=$'\t' read -r id user tweet; do
    echo "$id" >> "tmptweet/$user" # append one line to this user's counter file
done < Twitter_Data_1

for file in tmptweet/*; do
    i=$(wc -l < "$file") # count the lines recorded for each user ...
    # macOS wc pads the count with spaces; $((i)) strips the padding
    echo "${file##*/} $((i))" >> TweetFreq.csv # ... and put this into the final file
done
rm -rf tmptweet # remove the temp directory
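The temp-directory script appends to a file once per input line, which gets slow on large inputs. As a possible alternative, here is a minimal sketch of the same per-user count done in a single awk pass. It assumes the username is always the second tab-separated field; sample.tsv and awk_freq.csv are hypothetical names used only for this illustration, standing in for Twitter_Data_1 and TweetFreq.csv.

```shell
#!/bin/sh
# Hypothetical sample file with the same layout as the data in the question.
printf '0001\tUser1\tTweet1\n0002\tUser2\tTweet2\n0003\tUser3\tTweet3\n0004\tUser2\tTweet4\n0005\tUser2\tTweet5\n' > sample.tsv

# -F '\t' splits on tabs only, so tweets containing spaces stay in field 3.
# count[$2]++ tallies every username; the END block prints "user,count".
# awk's for-in order is unspecified, so sort makes the result stable.
awk -F '\t' '{ count[$2]++ } END { for (u in count) print u "," count[u] }' sample.tsv \
    | sort > awk_freq.csv

cat awk_freq.csv
```

Because awk keeps the tallies in memory, no temporary files are needed and nothing has to be cleaned up afterwards.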


Output:

  1 User1
  3 User2
  1 User3
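Output in this count-then-name shape is what sort | uniq -c prints. Piping that result through wc -l, as in the question's attempt, collapses it to a single number, which would explain getting only one value instead of a column. The following is a rough sketch of one way to turn the counts into CSV rows instead. It assumes usernames contain no whitespace; sample.tsv and uniq_freq.csv are hypothetical file names chosen for the example.

```shell
#!/bin/sh
# Hypothetical sample file with the same layout as the data in the question.
printf '0001\tUser1\tTweet1\n0002\tUser2\tTweet2\n0003\tUser3\tTweet3\n0004\tUser2\tTweet4\n0005\tUser2\tTweet5\n' > sample.tsv

# cut -f 2   : keep only the username column (cut splits on tabs by default)
# sort       : group identical names together, which uniq requires
# uniq -c    : collapse each group to "<count> <name>"
# awk        : swap the two fields and join them with a comma
cut -f 2 sample.tsv | sort | uniq -c | awk '{ print $2 "," $1 }' > uniq_freq.csv

cat uniq_freq.csv
```

For the real data, the same pipeline would read Twitter_Data_1 and write to TweetFreq.csv.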


You're already using uniq to do the counting. What is the purpose of wc?

Good point, but even when I remove it I only get 1 output rather than a whole column.

Update your question to show your current code and output.

I ran it again and it seems to have worked. The file was so large that it had only partially loaded.

Running the code now, and it's taking a long time. It's a lot of data, and it's a Mac...

You can cd into tmptweet to check the files being created there, and you can also use tail -f TweetFreq.csv to watch a live feed during the second part :-)

TweetFreq.csv wasn't created for me, so I stopped the code.

It's only created at the end of the script, that's why. How many entries are in your Twitter_Data_1 file?

Something like 10,000+ entries. Is there a way to test on just the head of the file?

Since my third column isn't as uniform as in my example, how do I replace the word tweet?

tweet is the file name; in your case it should be Twitter_Data_1.
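On the question of testing against only the head of the file: one option is to put head -n in front of the pipeline so that only the first few lines are processed. This is a small sketch under that assumption; the line count of 3 and the names sample.tsv and head_freq.csv are hypothetical, picked to match the five-line sample from the question.

```shell
#!/bin/sh
# Hypothetical sample file with the same layout as the data in the question.
printf '0001\tUser1\tTweet1\n0002\tUser2\tTweet2\n0003\tUser3\tTweet3\n0004\tUser2\tTweet4\n0005\tUser2\tTweet5\n' > sample.tsv

# head -n 3 passes along only the first three lines, so the rest of the
# pipeline finishes quickly no matter how large the full file is.
head -n 3 sample.tsv | cut -f 2 | sort | uniq -c | awk '{ print $2 "," $1 }' > head_freq.csv

cat head_freq.csv
```

Once the result looks right on the small slice, dropping the head stage runs the same commands over the whole file.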