Java: counting the words and characters in a string
I'm trying to write a super-efficient method that runs in two "modes" (WORD and CHARACTER). It takes a string and tells me either the number of words in it (separated by one or more whitespace characters) or the number of characters (non-whitespace characters).
I know I can handle the WORD mode with a StringTokenizer:
StringTokenizer tokenizer = new StringTokenizer(toExamine, " ");
int count = tokenizer.countTokens();
But for the CHARACTER mode (the number of non-whitespace characters) I don't have a clue. I'm sure I could use something crude, like:
for (int i = 0; i < toExamine.length(); i++)
    if (!Character.isSpace(toExamine.charAt(i)))
        count++;
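But that's a bit ugly and probably not the most efficient approach (the same goes for the StringTokenizer snippet). Could a regular expression be used here, or some other Java String/Character wizardry, to get what I need in a super-efficient way? I'm looking at tens of millions of strings. Thanks in advance.

Convert the string to a character array and iterate over it with a for loop (the snippet below reads the characters with charAt instead, which avoids copying the string):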
int charCount = 0;
for (int i = 0; i < sentence.length(); i++) {
    if (!Character.isWhitespace(sentence.charAt(i))) {
        charCount++;
    }
}
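The two modes described in the question can be folded into one method that makes a single pass over the string. What follows is only a minimal sketch: the `TextCounter` class, the `Mode` enum and the `count` method are hypothetical names that do not appear in the original post, and "whitespace" is assumed to mean whatever `Character.isWhitespace` reports.

```java
// Hypothetical helper, not from the original post: one pass, two modes.
final class TextCounter {
    enum Mode { WORD, CHARACTER }

    static int count(String text, Mode mode) {
        int words = 0;
        int chars = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                inWord = false;          // a whitespace character ends the current word
            } else {
                chars++;                 // every non-whitespace character counts
                if (!inWord) {           // first character of a new word
                    inWord = true;
                    words++;
                }
            }
        }
        return mode == Mode.WORD ? words : chars;
    }

    public static void main(String[] args) {
        String sample = "  to be   or not ";
        System.out.println(count(sample, Mode.WORD));      // 4
        System.out.println(count(sample, Mode.CHARACTER)); // 9
    }
}
```

Tracking both counters in the same loop keeps the code short; if only one mode is ever needed, a dedicated loop per mode does slightly less work per character.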
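This is no faster than the for loop, but if you do want to use a regular expression, you can try the following:
int noSpaces = toExamine.split("\\s+").length - 1;
The character count is then:
int noChar = toExamine.length() - noSpaces;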
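It is worth checking the split-based formula on a few inputs, because `split("\\s+").length - 1` counts the gaps between tokens rather than the whitespace characters themselves. The sketch below compares it with a plain loop; the class and method names (`SplitPitfall`, `splitBasedCount`, `loopCount`) are made up for this illustration.

```java
final class SplitPitfall {
    // The formula from the regex suggestion, with length() written as a method call.
    static int splitBasedCount(String s) {
        int noSpaces = s.split("\\s+").length - 1;
        return s.length() - noSpaces;
    }

    // Reference implementation: count non-whitespace characters one by one.
    static int loopCount(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String[] samples = {"noSpaceString", "a b", "a   b", "a b "};
        for (String s : samples) {
            System.out.println("\"" + s + "\": split-based=" + splitBasedCount(s)
                    + ", loop=" + loopCount(s));
        }
    }
}
```

The two agree on "noSpaceString" (13) and "a b" (2). They differ on "a   b" (4 versus 2), where one gap holds three whitespace characters, and on "a b " (3 versus 2), where the trailing whitespace produces no extra token.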
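The test program below produces the following results. The program prints 5 sets of results like this one, but I show only one set here. Lines starting with // are my comments, not program output.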
// Percentage of non-space over space is approximately 0.857
// Length of the full string generated is 1 075 662
0.857 1075662
// Name_of_method (Result): 15_Runs_In_Microseconds | Average_In_Microseconds
countWords_1 (131489): 20465 20240 21045 20193 20000 19972 20551 39489 19859 19971 19889 19877 20049 19900 19949 | 21429
countWords_2 (131489): 255500 258723 254543 255956 253606 263549 254096 254402 254191 254296 253752 261501 260788 261574 254178 | 256710
countWords_3 (131489): 26225 25022 24830 24829 24545 24819 25459 24625 25628 24700 24936 24794 24794 24849 25026 | 25005
countWords_4 (131489): 24537 24169 25283 24862 23863 23902 24068 23906 51472 23731 23889 23844 23832 24275 23896 | 25968
countWords_5 (131489): 81087 112095 80008 81290 81472 80581 80717 80460 79870 80557 80694 80923 145686 80564 80849 | 87123
countWords_6 (131489): 114391 114146 111946 111873 112331 167207 134117 118217 112843 112804 113533 111834 112830 112392 118181 | 118576
countChars_1 (922546): 150507 109102 150453 111352 149753 108099 153842 109034 150817 117258 149219 108194 152839 110340 149524 | 132022
countChars_2 (922546): 28779 29473 52499 27182 26519 27743 26717 27161 26451 27060 26307 27309 26350 62824 33134 | 31700
countChars_3 (922546): 25408 25127 24980 24832 24624 24671 24848 24712 24634 24622 24607 24613 24661 24765 24883 | 24799
countChars_4 (922546): 81489 82246 80906 80718 80803 81147 81113 81798 81030 81024 108508 80768 80780 80671 80753 | 82916
countChars_5 (922546): 26086 25546 24846 43734 25016 25083 24894 25530 25031 25041 25114 24935 25358 24895 43498 | 27640
countChars_6 (922546): 102559 102257 101381 101589 103432 101739 102794 129472 101305 101834 103124 101486 101254 102874 101481 | 103905
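Several rows above contain a single run that is roughly twice as slow as the rest (for example 39489 in the countWords_1 row), and such a run pulls the average up. As a rough sketch, the following parses one result line and reports the median next to the mean; `RunStats` and its methods are hypothetical helpers and not part of the test program.

```java
import java.util.Arrays;

final class RunStats {
    // Extracts the run times that sit between ':' and '|' in one result line.
    static long[] parseRuns(String line) {
        String runs = line.substring(line.indexOf(':') + 1, line.indexOf('|')).trim();
        return Arrays.stream(runs.split("\\s+")).mapToLong(Long::parseLong).toArray();
    }

    // Integer division, matching how the test program prints its average.
    static long mean(long[] runs) {
        return Arrays.stream(runs).sum() / runs.length;
    }

    static long median(long[] runs) {
        long[] sorted = runs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    public static void main(String[] args) {
        String line = "countWords_1 (131489): 20465 20240 21045 20193 20000 19972 20551 "
                + "39489 19859 19971 19889 19877 20049 19900 19949 | 21429";
        long[] runs = parseRuns(line);
        System.out.println("mean=" + mean(runs) + " median=" + median(runs));
        // prints: mean=21429 median=20000
    }
}
```

For this row the median (20000) sits about 7% below the reported average (21429), which gives a sense of how much one slow run can shift the last column.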
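Why do you think that about StringTokenizer?
I think your character method is already good enough, as long as you don't run it over too much data at once. (I'm not sure whether a regex would be faster on large amounts of data, though.)
Note that the loop and the StringTokenizer approach don't do exactly the same thing.
The isSpace method is deprecated…
@supersam654: indexOf is not as scalable as StringTokenizer if we use the default settings (space, tab, etc.).
The second option replaces all the whitespace to build a single sentence and then counts the length of all the characters in that sentence.
If you want to count words, your noSpaces fails for "noSpaceString".
I can't understand what you are trying to do with the code.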
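The remark that the loop and StringTokenizer do not do exactly the same thing comes down to three different definitions of whitespace. With its default delimiters, StringTokenizer splits on space, tab, newline, carriage return and form feed. The regex class `\s` (without Unicode flags) adds the vertical tab. `Character.isWhitespace`, the replacement for the deprecated `Character.isSpace`, also accepts Unicode space separators such as U+2003. A small sketch of the difference, using the hypothetical class name `WhitespaceDefinitions`:

```java
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhitespaceDefinitions {
    private static final Pattern NON_SPACE_RUN = Pattern.compile("\\S+");

    // Default delimiters: space, tab, newline, carriage return, form feed.
    static int byTokenizer(String s) {
        return new StringTokenizer(s).countTokens();
    }

    // \S is the complement of \s, which by default is [ \t\n\x0B\f\r].
    static int byRegex(String s) {
        Matcher m = NON_SPACE_RUN.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Character.isWhitespace also treats most Unicode space separators as whitespace.
    static int byIsWhitespace(String s) {
        int n = 0;
        boolean inWord = false;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                inWord = false;
            } else if (!inWord) {
                inWord = true;
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String verticalTab = "a\u000Bb"; // U+000B VERTICAL TABULATION
        String emSpace = "a\u2003b";     // U+2003 EM SPACE
        System.out.println(byTokenizer(verticalTab) + " " + byRegex(verticalTab)
                + " " + byIsWhitespace(verticalTab)); // 1 2 2
        System.out.println(byTokenizer(emSpace) + " " + byRegex(emSpace)
                + " " + byIsWhitespace(emSpace));     // 1 1 2
    }
}
```

On plain ASCII text separated by spaces, tabs and line breaks all three agree, so the distinction only matters if the input may contain other separator characters.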
The full test program:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.Random;
import java.util.StringTokenizer;
import java.lang.reflect.Method;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.annotation.ElementType;
class TestStringProcessing_15028652 {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    private @interface Test {}

    // From 0.80 - 0.90 (4:1 to 9:1 non-space:space characters ratio)
    private static final double NON_SPACE_RATIO = 0.85;
    private static final double NON_SPACE_RATIO_FLUCTUATION = 0.05;

    // With the way the test is written, it is not going to work well with small input (1000 is NOT enough)
    // Currently set to 700 000 - 1 300 000 characters
    private static final int NUM_CHARS = 1000000;
    private static final int NUM_CHARS_FLUCTUATION = 300000;

    // Some whitespace characters
    private static final char[] WHITESPACES = {' ', '\t', '\r', '\n'};

    // Number of times to run all methods
    private static final int NUM_OUTER = 5;
    // Number of times to run each method
    private static final int NUM_REPEAT = 15;

    static {
        for (int i = 0; i < WHITESPACES.length; i++) {
            assert Character.isWhitespace(WHITESPACES[i]);
        }
    }

    private static Random random = new Random();

    private static String generateInput() {
        double nonSpaceRatio = NON_SPACE_RATIO + random.nextDouble() * 2 * NON_SPACE_RATIO_FLUCTUATION - NON_SPACE_RATIO_FLUCTUATION;
        int numChars = NUM_CHARS + random.nextInt(2 * NUM_CHARS_FLUCTUATION) - NUM_CHARS_FLUCTUATION;
        System.out.printf("%.3f %d\n", nonSpaceRatio, numChars);
        StringBuilder output = new StringBuilder();
        for (int i = 0; i < numChars; i++) {
            if (random.nextDouble() < nonSpaceRatio) {
                output.append((char) (random.nextInt(64) + '0'));
            } else {
                output.append(WHITESPACES[random.nextInt(WHITESPACES.length)]);
            }
        }
        return output.toString();
    }

    private static ArrayList<Method> getTestMethods() {
        Class<?> klass = null;
        try {
            klass = Class.forName(Thread.currentThread().getStackTrace()[1].getClassName());
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println("Something really bad happened. Bailing out...");
            System.exit(1);
        }
        Method[] methods = klass.getMethods();
        // System.out.println(klass);
        // System.out.println(Arrays.toString(methods));
        ArrayList<Method> testMethods = new ArrayList<Method>();
        for (Method method: methods) {
            if (method.isAnnotationPresent(Test.class)) {
                testMethods.add(method);
            }
        }
        return testMethods;
    }

    public static void runTestReflection() {
        ArrayList<Method> methods = getTestMethods();
        for (int t = 0; t < NUM_OUTER; t++) {
            String input = generateInput();
            for (Method method: methods) {
                try {
                    System.out.print(method.getName() + " (" + method.invoke(null, input) + "): ");
                } catch (Exception e) {
                    e.printStackTrace();
                }
                long sum = 0;
                for (int i = 0; i < NUM_REPEAT; i++) {
                    long start, end;
                    Object result;
                    try {
                        start = System.nanoTime();
                        result = method.invoke(null, input);
                        end = System.nanoTime();
                        System.out.print((end - start) / 1000 + " ");
                        sum += (end - start) / 1000;
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
                System.out.println("| " + sum / NUM_REPEAT);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        runTestReflection();
    }

    @Test
    public static int countWords_1(String input) {
        // WARNING: This is NOT the same as isWhitespace, since isWhitespace
        // also considers Unicode characters.
        return new StringTokenizer(input).countTokens();
    }

    @Test
    public static int countWords_2(String input) {
        // Appends one space after every word; the growth in length is the word count.
        return input.replaceAll("\\S+", "$0 ").length() - input.length();
    }

    @Test
    public static int countWords_3(String input) {
        int count = 0;
        boolean in = false;
        for (int i = 0; i < input.length(); i++) {
            if (!Character.isWhitespace(input.charAt(i))) {
                if (!in) {
                    in = true;
                    count++;
                }
            } else {
                in = false;
            }
        }
        return count;
    }

    @Test
    public static int countWords_4(String input) {
        int count = 0;
        for (int i = 0; i < input.length(); i++) {
            if (!Character.isWhitespace(input.charAt(i))) {
                do {
                    i++;
                } while (i < input.length() && !Character.isWhitespace(input.charAt(i)));
                count++;
            }
        }
        return count;
    }

    @Test
    public static int countWords_5(String input) {
        int count = 0;
        // p is the precompiled \S+ pattern declared further down in this class.
        Matcher m = p.matcher(input);
        while (m.find()) {
            count++;
        }
        return count;
    }

    @Test
    public static int countWords_6(String input) {
        // Collapses every word, together with its surrounding whitespace, into one space.
        return input.replaceAll("\\s*+\\S++\\s*+", " ").length();
    }

    @Test
    public static int countChars_1(String input) {
        return input.replaceAll("\\s+", "").length();
    }

    @Test
    public static int countChars_2(String input) {
        int count = 0;
        for (char c: input.toCharArray()) {
            if (!Character.isWhitespace(c)) {
                count++;
            }
        }
        return count;
    }

    @Test
    public static int countChars_3(String input) {
        int count = 0;
        for (int i = 0; i < input.length(); i++) {
            if (!Character.isWhitespace(input.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    private static Pattern p = Pattern.compile("\\S+");

    @Test
    public static int countChars_4(String input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count += m.end() - m.start();
        }
        return count;
    }

    @Test
    public static int countChars_5(String input) {
        int count = input.length();
        for (int i = 0; i < input.length(); i++) {
            if (Character.isWhitespace(input.charAt(i))) {
                count--;
            }
        }
        return count;
    }

    @Test
    public static int countChars_6(String input) {
        return input.length() - input.replaceAll("\\S+", "").length();
    }
}
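The program above starts timing each method almost immediately and calls it through reflection, so the earliest runs may include JIT compilation and reflection overhead. One possible refinement is to run a few untimed warm-up calls first and to invoke the method directly. The sketch below assumes hypothetical names (`WarmedUpTimer`, `averageMicros`) and arbitrary run counts; for serious measurements a dedicated harness such as JMH is generally a better choice than hand-written timing.

```java
import java.util.function.ToIntFunction;

final class WarmedUpTimer {
    // Hypothetical run counts, chosen only for illustration.
    private static final int WARMUP_RUNS = 20;
    private static final int MEASURED_RUNS = 15;

    // Same logic as countChars_3 in the test program.
    static int countChars(String input) {
        int count = 0;
        for (int i = 0; i < input.length(); i++) {
            if (!Character.isWhitespace(input.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Average time per call in microseconds, measured after an untimed warm-up phase.
    static long averageMicros(ToIntFunction<String> method, String input, int expected) {
        for (int i = 0; i < WARMUP_RUNS; i++) {
            check(method.applyAsInt(input), expected);
        }
        long totalNanos = 0;
        for (int i = 0; i < MEASURED_RUNS; i++) {
            long start = System.nanoTime();
            int result = method.applyAsInt(input);
            totalNanos += System.nanoTime() - start;
            check(result, expected); // uses the result, so the call cannot be treated as dead code
        }
        return totalNanos / MEASURED_RUNS / 1000;
    }

    private static void check(int actual, int expected) {
        if (actual != expected) {
            throw new IllegalStateException("expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        String input = "the quick  brown\tfox";
        int expected = countChars(input);
        long micros = averageMicros(WarmedUpTimer::countChars, input, expected);
        System.out.println(micros + " us on average");
    }
}
```

Passing the method as a `ToIntFunction<String>` keeps the timing loop generic without going through `Method.invoke`.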