Linux shell program: determining the average word length in a file


linux shell


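I am trying to write a shell program that determines the average word length in a file. I assume I need to use wc and expr somehow. Guidance in the right direction would be great.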

Assuming your file is ASCII and wc can indeed read it:

chars=$(cat inputfile | wc -c)
words=$(cat inputfile | wc -w)

Then a simple

avg_word_size=$(( ${chars} / ${words} ))

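will compute an integer (shell integer division truncates the result). But this is "more wrong" than a mere rounding error: all the whitespace characters are included in the average word size. I assume you want to be more precise.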

The following improves the precision by computing the integer result from the character count multiplied by 100:

_100x_avg_word_size=$(( $((${chars} * 100)) / ${words} ))

Now we can use it to tell the world:

 echo "Average word size is: ${avg_word_size}.${_100x_avg_word_size: -2:2}"
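The same scaling trick can be wrapped in a small function. This is only a sketch: the name fixed2 is hypothetical (it is not part of the original answer), it assumes non-negative integer arguments and a non-zero divisor, and it uses printf's %02d to zero-pad the fractional part instead of extracting a substring.

```shell
# Hypothetical helper (not from the original answer): print NUM / DEN with
# two decimal places using integer arithmetic only. The result is truncated,
# not rounded. Assumes NUM >= 0 and DEN > 0.
fixed2() {
    fixed2_scaled=$(( $1 * 100 / $2 ))    # 100 times the quotient, truncated
    # Integer part, then the remainder zero-padded to two digits.
    printf '%d.%02d\n' $(( fixed2_scaled / 100 )) $(( fixed2_scaled % 100 ))
}

fixed2 23 5     # prints 4.60
fixed2 21 20    # prints 1.05
```

With chars=23 and words=5 this prints 4.60, the same digits that the echo line above assembles from avg_word_size and _100x_avg_word_size.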

To refine this further, we can assume that exactly 1 whitespace character separates consecutive words:

 chars=$(cat inputfile | wc -c)
 words=$(cat inputfile | wc -w)

 # Assume exactly one separator character between consecutive words,
 # and subtract those separators before computing both values.
 letters=$(( chars - (words - 1) ))
 avg_word_size=$(( letters / words ))
 _100x_avg_word_size=$(( letters * 100 / words ))

 echo "Average word size is: ${avg_word_size}.${_100x_avg_word_size: -2:2}"

Now it is your job to try to include the concept of "lines" in your calculation... :-)
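One possible way to take lines into account, sketched here under assumptions rather than taken from the answer: delete every whitespace byte (spaces, tabs and newlines) with tr before counting, so no guess about the number of separators is needed. The function name avg_word_len is hypothetical, words are whatever wc -w counts, and the count is in bytes, so it is only meaningful for ASCII input.

```shell
# Hypothetical helper (an assumption, not the answerer's code): average word
# length of a file, counting only non-whitespace bytes.
avg_word_len() {
    # Wrapping wc in $(( )) strips any padding some wc implementations add.
    awl_words=$(( $(wc -w < "$1") ))
    awl_letters=$(( $(tr -d '[:space:]' < "$1" | wc -c) ))
    if [ "$awl_words" -eq 0 ]; then
        echo "no words in $1" >&2
        return 1
    fi
    awl_scaled=$(( awl_letters * 100 / awl_words ))   # truncated, not rounded
    printf '%d.%02d\n' $(( awl_scaled / 100 )) $(( awl_scaled % 100 ))
}

# Usage: two spaces and a line break between words do not distort the result.
tmp=$(mktemp)
printf 'the quick  brown\nfox\n' > "$tmp"
avg_word_len "$tmp"    # 16 letters / 4 words, prints 4.00
rm -f "$tmp"
```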

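Update: to (hopefully) show clearly the difference between wc and this method; also fixed the "too many newlines" bug; also added better handling of apostrophes at the ends of words.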

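If you want to treat a word as a bash word, then using wc alone is fine. However, if you want to treat a word as a word in a spoken/written language, then you cannot use wc for word parsing. For example, wc considers the following to contain 1 word (average size = 112.00), whereas the script below shows that it contains 19 words (average size = 4.58).

Using Kurt's script, the following line is shown to contain 7 words (average size = 8.14), whereas the script below shows that it contains 7 words (average size = 4.43)...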
बे = 2 characters

"बे = {Platts} ... —be-ḵẖẉabī, s.f. Sleeplessness:"

So, if wc suits your taste, fine; if not, something like this might suit:

# Cater for special situation words: eg 's and 't
# Convert each group of anything which isn't a "character" (including '_') into a newline.
# Then, convert each CHARACTER which isn't a newline into a BYTE (not character!).
# This leaves one 'word' per line, each 'word' being made up of the same BYTE ('x').
#
# Without any options, wc prints newline, word, and byte counts (in that order),
# so we can capture all 3 values in a bash array
#
# Use `awk` as a floating point calculator (bash can only do integer arithmetic)

count=($(sed "s/\>'s\([[:punct:]]\|$\)/\1/g   # ignore apostrophe-s ('s) word endings
              s/'t\>/xt/g   # consider words ending in apostrophe-t ('t) as base word + 2 characters
              s/[_[:digit:][:blank:][:punct:][:cntrl:]]\+/\n/g
              s/^\n*//; s/\n*$//; s/[^\n]/x/g" "$file" | wc))
echo "chars / word average:" \
     $(awk -v nl="${count[0]}" -v ch="${count[2]}" 'BEGIN{ printf( "%.2f\n", (ch-nl)/nl ) }')

A small change to the above, for a shell program: use sed 's/\s//g' to strip the whitespace from each line, then subtract wc -l to account for the newlines. I.e. l=$(wc -l file); w=$(wc -w file); allc=$(sed 's/\s//g' file | wc -c); c=$((allc-l)); echo $c

@Kevin: l=$(wc -l file) will not work, because the content of l is not a number but a string containing a number, the file name and some spaces... Yes, I used cat on purpose... I skipped the topic of "lines" because the question only asked for some guidance in the right direction...

Thanks. I ended up going a completely different route, but I still appreciate your help.

A character, when talking about words, means a printable, visible character as used in someone's native alphabet. For example: ळ is one character in the Devanagari script; it uses 3 bytes in UTF-8 (the standard unix/linux encoding is UTF-8). ā is one character in the extended Latin script; it uses 2 bytes in UTF-8. wc does count characters correctly, but (by the way) tr has no concept of characters; tr only understands bytes. Multi-byte characters are a surprisingly big issue, especially when converting between systems such as UTF-8 and UTF-16.

Why should digits not be part of a word in your book? Why is every digit replaced by a newline?!

You are free to tweak it! That is one of the reasons I posted this answer: it gives you control. It depends on whether you mean words in a dictionary, or the jargon of computer programming, and so on.

OK, for one of my sample files, your method results in "chars / word average: 29.8769". My method results in "Average word size is: 11.62". Having looked at my original file, I trust my result more... ;-)

I would love to see that file, because for a 7842-word file I get very similar results. Your version: 4.62. My version: 4.0776. The difference is probably due to the different parsing methods.
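The point raised in the comments about l=$(wc -l file) can be illustrated with a short sketch. It assumes that redirecting the file into wc (so that no file name is printed) is an acceptable alternative to cat; the function name letters_and_words is hypothetical, and [[:space:]] is used in place of \s because \s is a GNU sed extension.

```shell
# Hypothetical helper sketching the whitespace-stripping idea from the comments:
# remove whitespace inside each line with sed, then subtract one byte per
# newline. Prints "<letters> <words>".
letters_and_words() {
    # "wc -l FILE" would print "<count> FILE"; reading from stdin prints only
    # the count, and $(( )) strips any padding around it.
    lw_lines=$(( $(wc -l < "$1") ))
    lw_words=$(( $(wc -w < "$1") ))
    lw_all=$(( $(sed 's/[[:space:]]//g' "$1" | wc -c) ))
    echo "$(( lw_all - lw_lines )) $lw_words"
}

tmp=$(mktemp)
printf 'one two\nthree\n' > "$tmp"
letters_and_words "$tmp"    # prints: 11 3
rm -f "$tmp"
```

The subtraction assumes every line ends with a newline; a file whose last line lacks one would be undercounted by one byte.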