Pig script for counting characters

Tags: regex, count, apache-pig

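I am trying to write a Pig script that counts all characters (special characters as well as letters) and gives the count for each character separately. I have been trying the script below, but it only counts letters and leaves out special characters such as ? and :. Please help.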

A = load 'pigfiles/p.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';
Just use '(.+)' instead of '\\w+'; it will give you the counts of all the punctuation marks and letters in the file.

For example:

File (cat a.txt):

"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!
Code:

Output (cat part-r-00000):

4       !
1       ;
3       ?
2       H
1       I
2       L
1       W
1       a
1       c
1       d
3       e
1       g
2       h
3       i
1       j
1       m
3       n
4       o
1       p
1       r
7       s
7       t
4       u
1       w
2       y
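
The difference between the two patterns can be checked outside Pig. The following is a minimal Python sketch, not Pig code: the token list is made up for illustration, and `re.fullmatch` stands in for Pig's `matches`, which succeeds only when the whole string matches the pattern.

```python
import re

# Illustrative tokens, roughly what whitespace tokenization would produce.
tokens = ["Lets", "try", "punctuations!?", "How?", "just;"]


def keep(pattern, items):
    # Keep only the items that match the pattern in full,
    # mirroring the whole-string semantics of Pig's `matches`.
    return [t for t in items if re.fullmatch(pattern, t)]


word_only = keep(r"\w+", tokens)  # tokens made of word characters only
anything = keep(r".+", tokens)    # any non-empty token

print(word_only)
print(anything)
```

With '\w+', a token such as "How?" is rejected as a whole, so its letters are dropped along with the punctuation; with '.+' every non-empty token passes the filter.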

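The reason you are not getting some of the special characters is that TOKENIZE uses space, double quote ("), comma (,), parentheses (()) and star (*) as delimiters.

So when you use TOKENIZE on (chararray)$0, the delimiter characters are discarded and never get counted.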
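
To see the effect, here is a rough Python sketch (an emulation, not Pig itself). It assumes the delimiter set is exactly the one listed above and applies it to the sample data used in this thread; `tokenized_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

SAMPLE = [
    '"HI"',
    "Lets try using some punctuations!? How? Why!?",
    "Lets, just; do this!!",
]

# Assumption: space, double quote, comma, parentheses and star are the delimiters.
DELIMITERS = r'[ ",()*]+'


def tokenized_counts(lines):
    """Count characters after splitting each line on the delimiters."""
    counts = Counter()
    for line in lines:
        for token in re.split(DELIMITERS, line):
            counts.update(token)  # adds one per character; empty tokens add nothing
    return counts


counts = tokenized_counts(SAMPLE)
print(counts["!"], counts["?"], counts[","], counts['"'])
```

The delimiter characters never reach the counter, so the comma and the double quotes end up with a count of zero even though they are in the input.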

So, using Ani Menon's sample data, here are the script and the output.

Input

"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!

PigScript

A = LOAD 'test5.txt';
-- Put a '|' around every character, then split on '|' to get one character per row
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE((chararray)$0,'','|'), '|')) AS letter;
-- Drop spaces so they are not counted
C = FILTER B BY letter != ' ';
D = GROUP C BY letter;
E = FOREACH D GENERATE COUNT(C.letter), group;
DUMP E;

Output
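
As a cross-check on what this approach should count for the sample input, here is a minimal Python sketch. It is an emulation of the REPLACE / TOKENIZE / FILTER steps, not Pig code and not the answer's own output listing; `char_counts` is a hypothetical helper name.

```python
from collections import Counter

SAMPLE = [
    '"HI"',
    "Lets try using some punctuations!? How? Why!?",
    "Lets, just; do this!!",
]


def char_counts(lines):
    """Count every character except spaces, one line at a time."""
    counts = Counter()
    for line in lines:
        # Emulates REPLACE(line, '', '|'): a separator around every character.
        marked = "|" + "|".join(line) + "|"
        # Emulates TOKENIZE(..., '|'): split on the separator, skip empty tokens.
        letters = [tok for tok in marked.split("|") if tok]
        # Emulates FILTER ... BY letter != ' '.
        counts.update(tok for tok in letters if tok != " ")
    return counts


counts = char_counts(SAMPLE)
print(counts[","], counts['"'], counts[";"], counts["!"])
```

Because '|' is the only separator here, the comma and the double quotes are counted. One caveat of the sketch is that a literal '|' in the input would be swallowed as a separator; the Pig script presumably shares that limitation.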
Here is a solution:

lines = LOAD 'p.txt' AS (line: chararray);

characters = FOREACH lines GENERATE FLATTEN(STRSPLITTOBAG(line, '')) AS character;

charGroups = GROUP characters BY character;

result = FOREACH charGroups GENERATE group, COUNT($1);

store result into 'charcount.txt';
It will produce output like the following:


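Thanks, Ani. The only problem is that I am not getting a count for ",". What should I do to count "," too?

@user5355171 When using (.+) you should get that as well. Or do you only want to find the letters and the comma?

In your output I can also see the "," count; it comes after "Lets".

@user5355171 I don't know why this happens. I tried other expressions such as ^[a-zA-Z0-9,.!?]*$ as well, but it still does not match ",". So I have posted a separate question specifically about this.

@user5355171 Because no regular expression in PIG matches ",". I think your question is answered except for the "," part, which I believe is a bug in PIG.

Please see my answer; I have explained why some characters are not counted and how to count them.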