Pig script for counting characters with a regex
I am trying to write a Pig script that counts all characters (special characters as well as letters) and gives the count of each character separately. I have been trying the script below, but it only counts letters and leaves out special characters such as ? and :. Please help.
A = load 'pigfiles/p.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';
Just use '(.+)' instead of '\\w+'; it will give you the count of every punctuation mark and letter in the file.
For example:
File: [cat a.txt]
"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!
Code: the script from the question, with '\\w+' replaced by '(.+)'.
Output: [cat part-r-00000]
4 !
1 ;
3 ?
2 H
1 I
2 L
1 W
1 a
1 c
1 d
3 e
1 g
2 h
3 i
1 j
1 m
3 n
4 o
1 p
1 r
7 s
7 t
4 u
1 w
2 y
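To see why swapping the pattern changes the counts, here is a minimal Python sketch of the first two steps of the script. It is only a model, not Pig itself: `tokenize` and `keep` are hypothetical helpers, the delimiter set is the one described in this thread (space, double quote, comma, parentheses, star), and `re.fullmatch` stands in for Pig's `matches`, which has to match the whole token.

```python
import re

# Assumed default TOKENIZE delimiters, as described in this thread:
# space, double quote, comma, parentheses, star.
DEFAULT_DELIMS = ' ",()*'

def tokenize(line, delims=DEFAULT_DELIMS):
    """Rough model of Pig's TOKENIZE: split on any delimiter, drop empty tokens."""
    pattern = "[" + re.escape(delims) + "]+"
    return [tok for tok in re.split(pattern, line) if tok]

def keep(tokens, regex):
    """Rough model of `filter ... by word matches regex` (whole-token match)."""
    return [tok for tok in tokens if re.fullmatch(regex, tok)]

line = "Lets try using some punctuations!? How? Why!?"
tokens = tokenize(line)
print(keep(tokens, r"\w+"))   # ['Lets', 'try', 'using', 'some']
print(keep(tokens, r"(.+)"))  # all seven tokens, punctuation included
```

With '\\w+' a token such as "How?" is rejected as a whole, so its letters are lost along with the question mark. With '(.+)' every token survives, but the comma and the double quotes are already gone before the filter runs.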
You are not getting some special characters because TOKENIZE uses space, double quote ("), comma (,), parentheses (()), and star (*) as its default delimiters. So when you call TOKENIZE on (chararray)$0, those delimiter characters are discarded and are never counted. Using Ani Menon's sample data, here is a script that splits each line into single characters without the default tokenization.
Input
"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!
PigScript
A = LOAD 'test5.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE((chararray)$0,'','|'), '|')) AS letter;
C = FILTER B BY letter != ' ';
D = GROUP C BY letter;
E = FOREACH D GENERATE COUNT(C.letter), group;
DUMP E;
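As a rough cross-check of what this script computes, the same steps can be modelled in Python. This is a sketch under assumptions, not Pig: `count_characters` is a hypothetical helper, `re.sub("", "|", line)` stands in for REPLACE(line,'','|'), and splitting on '|' while dropping empty pieces stands in for TOKENIZE(..., '|').

```python
import re
from collections import Counter

def count_characters(lines):
    """Rough model of the script: split each line into single characters,
    drop spaces, then group and count."""
    counts = Counter()
    for line in lines:
        # REPLACE(line, '', '|') puts a '|' at every character boundary.
        piped = re.sub("", "|", line)
        # TOKENIZE(piped, '|') splits on '|' and drops empty tokens.
        # Caveat of this model: a literal '|' in the input would be lost.
        letters = [tok for tok in piped.split("|") if tok]
        # FILTER ... BY letter != ' '
        counts.update(ch for ch in letters if ch != " ")
    return counts

sample = [
    '"HI"',
    "Lets try using some punctuations!? How? Why!?",
    "Lets, just; do this!!",
]
counts = count_characters(sample)
print(counts[","], counts['"'], counts["!"])  # 1 2 4
```

In this model the comma and the double quotes are counted because the only delimiter is the '|' inserted between characters, so none of the default delimiters is ever stripped.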
Here is a solution:
lines = LOAD 'p.txt' AS (line: chararray);
characters = FOREACH lines GENERATE FLATTEN(STRSPLITTOBAG(line, '')) AS character;
charGroups = GROUP characters BY character;
result = FOREACH charGroups GENERATE group, COUNT($1);
store result into 'charcount.txt';
It produces one row per distinct character, holding the character and its count.
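Thanks, Ani. The only problem is that I am not getting a count for ",". What should I do to count "," too?
@user5355171 When using (.+) you should get the "," as well. Or do you only want to find letters and commas?
In your output I can also see the "," count; it comes after "Let".
@user5355171 I don't know why this happens. I tried other expressions such as ^[a-zA-Z0-9,.!?]*$ as well, but it still does not match ",". So I asked a separate question specifically about this.
@user5355171 Because no regex in Pig matches ",". I think your question is answered except for the "," part, which I believe is a bug in Pig.
Please see my answer; I have explained why some characters are not counted and how to count them.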