Java: extracting words from a text file


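Suppose you have a text file like this one:

Does anyone have a good algorithm, or open-source code, for extracting words from a text file? How can I get all the words while avoiding special characters, and keeping things like "it's"?

I'm working in Java.

Thanks.

You can try a regex with a pattern you create, and run a count of the number of times that pattern is found.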

The pseudocode would look like this:

create words, a list of words, by splitting the input by whitespace
for every word, strip out whitespace and punctuation on the left and the right

The Python code would look like this:

words = input.split()
words = [word.strip(PUNCTUATION) for word in words]

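where PUNCTUATION is a string containing the punctuation characters to strip, or any other characters you want to remove.

I believe Java has equivalent functions in the String class: .split()

Output of running this code on the text provided in the link: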

>>> print words[:100]
['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis', 
'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for', 
'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 
'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 
'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 
... etc etc.
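For comparison, the split-and-strip idea can be sketched in Java as well. This is only an illustration under stated assumptions: the class name `SplitAndStrip` is hypothetical, and instead of a fixed PUNCTUATION string it strips every leading or trailing character that is not an ASCII letter or digit.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitAndStrip {
    // Assumption: anything that is not an ASCII letter or digit counts as
    // punctuation when it sits at the very start or end of a token.
    private static final String EDGE_PUNCTUATION = "^[^A-Za-z0-9]+|[^A-Za-z0-9]+$";

    static List<String> words(String input) {
        List<String> words = new ArrayList<>();
        // Split on runs of whitespace, similar to Python's input.split().
        for (String token : input.trim().split("\\s+")) {
            // Strip only the left and right edges; inner characters such as
            // the apostrophe in "it's" or the hyphen in "re-use" are kept.
            String word = token.replaceAll(EDGE_PUNCTUATION, "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Well, it's (rather) short."));
    }
}
```

Running `main` prints `[Well, it's, rather, short]`. Tokens that consist of punctuation only are dropped, because nothing is left after stripping.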

Basically, you want to match

([A-Za-z])+('([A-Za-z])*)?

right?
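A minimal sketch of how a letters-only pattern of this kind behaves in Java, assuming the apostrophe part is meant to be optional. The class name `LetterWords` is hypothetical, and the capturing groups around the letter classes are left out because only the whole match is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LetterWords {
    // One or more ASCII letters, optionally followed by a single apostrophe
    // and zero or more further letters.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+('[A-Za-z]*)?");

    static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group()); // the complete matched word
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(words("It's 2 o'clock; the dogs' bowls are empty."));
    }
}
```

This prints `[It's, o'clock, the, dogs', bowls, are, empty]`: digits are skipped, and a trailing apostrophe is kept because the letters after it may be empty.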

This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:

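import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
// One or more word characters (letters, digits, underscore) or apostrophes
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);

while (m.find()) {
    // group() returns the text of the current match
    System.out.println(m.group());
}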

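The pattern

[\w']+

matches all word characters and apostrophes, one or more times. The example string is printed word by word. See the documentation of the Pattern class to learn more.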
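Since the question is about a text file rather than a string, the same pattern can be applied line by line while reading. The following is a sketch, not a definitive implementation: the class name `FileWords`, the helper `collect`, and the UTF-8 encoding are assumptions, and the file path is taken from the command line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FileWords {
    // Word characters (letters, digits, underscore) and apostrophes.
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Appends every match found in one line of text to the given list.
    static void collect(String line, List<String> words) {
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
    }

    // Reads the file line by line. A match never contains a line break,
    // so this finds the same words as matching the whole text at once.
    static List<String> wordsIn(Path file) throws IOException {
        List<String> words = new ArrayList<>();
        // Assumption: the file is encoded as UTF-8.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                collect(line, words);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FileWords <text-file>");
            return;
        }
        List<String> words = wordsIn(Paths.get(args[0]));
        System.out.println(words.size() + " words found");
    }
}
```

The size of the returned list is the count of pattern matches, which is the number mentioned in the first answer.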

Here is a good way to solve the problem. This function takes text as input and returns a list of all the words in the given text:

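import java.util.ArrayList;

// Collects every maximal run of letters and digits in SInput as one word.
private ArrayList<String> get_Words(String SInput) {
    ArrayList<String> all_Words_List = new ArrayList<String>();
    StringBuilder word = new StringBuilder();

    for (int i = 0; i < SInput.length(); i++) {
        char c = SInput.charAt(i);
        if (Character.isAlphabetic(c) || Character.isDigit(c)) {
            word.append(c);
        } else if (word.length() > 0) {
            // Any other character, including an apostrophe, ends the word.
            all_Words_List.add(word.toString());
            word.setLength(0);
        }
    }

    // Flush the last word when the input ends with a letter or digit.
    if (word.length() > 0) {
        all_Words_List.add(word.toString());
    }
    return all_Words_List;
}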
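The loop above treats every character that is not a letter or digit as a separator, so "it's" comes out as two words. A possible single-pass variation that keeps an apostrophe when it sits between word characters is sketched below. The class name `OnePassWords` is hypothetical, and only the ASCII apostrophe `'` is handled, which is an assumption; a typographic apostrophe would need an extra check.

```java
import java.util.ArrayList;
import java.util.List;

final class OnePassWords {
    static List<String> words(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // An apostrophe belongs to the word only if a word is already in
            // progress and the next character is a letter or digit.
            boolean innerApostrophe = c == '\''
                    && word.length() > 0
                    && i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || innerApostrophe) {
                word.append(c);
            } else if (word.length() > 0) {
                words.add(word.toString());
                word.setLength(0);
            }
        }
        // Flush the final word if the text does not end with a separator.
        if (word.length() > 0) {
            words.add(word.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("'Tis it's Bob's 'quoted' end"));
    }
}
```

This prints `[Tis, it's, Bob's, quoted, end]`: quotes at the start or end of a word are dropped, while apostrophes inside a word are kept.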

Compared with a regular expression, the advantage of this code is that it does the job in a single, simple pass.

Yes, Java has a "split" method, but it has no equivalent of the "strip" method.

I had to change the regexp slightly so that it excludes digits, underscores, and words starting with a quote, but other than that, great!

I had to escape the pattern like this: Pattern.compile("[\\w']+");
This is a bit off-topic, but how do I exclude single quotes at the start or end of a word?

@ScrollerBlaster You can use word boundaries for that: Pattern.compile("\\b[\\w']+\\b");
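To see what the word boundaries change, here is a small sketch that runs the boundary version of the pattern on a sentence with quoted words. The class name `BoundaryWords` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryWords {
    // \b forces each match to start and end on a word character, so leading
    // and trailing apostrophes are cut off while inner ones are kept.
    private static final Pattern WORD = Pattern.compile("\\b[\\w']+\\b");

    static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(words("He said 'hello' and it's the dogs' turn."));
    }
}
```

This prints `[He, said, hello, and, it's, the, dogs, turn]`. Note that the possessive apostrophe in "dogs'" is removed as well, since it is at the end of the word.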
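Pattern.compile("\\w[\\w-]+('\\w+)?") supports hyphenated words, even ones with several hyphens (sous-vide, mise-en-scène), as well as apostrophes, though not at the start of a word, and an apostrophe must be followed by more word characters (I've, sous-vide'n). To also allow a trailing apostrophe (that is, to include plural possessives), use Pattern.compile("\\w[\\w-]+('\\w*)?") instead.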
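The two hyphen-aware patterns can be compared side by side with a sketch like the one below. It is a variation on the patterns in the comment, not a copy of them, and the differences are assumptions of this example: `[\w-]*` is used instead of `[\w-]+` so that a one-letter stem such as the "I" in "I've" also matches, and `Pattern.UNICODE_CHARACTER_CLASS` is set so that `\w` also covers letters such as "è". The class name `HyphenWords` and the names `STRICT` and `LENIENT` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HyphenWords {
    // An apostrophe must be followed by at least one word character.
    static final Pattern STRICT =
            Pattern.compile("\\w[\\w-]*('\\w+)?", Pattern.UNICODE_CHARACTER_CLASS);
    // Also accepts a trailing apostrophe, as in plural possessives.
    static final Pattern LENIENT =
            Pattern.compile("\\w[\\w-]*('\\w*)?", Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> words(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // \u00e8 is the letter "è", written as an escape to avoid
        // depending on the source file encoding.
        String text = "I've tried sous-vide and mise-en-sc\u00e8ne at the chefs' table.";
        System.out.println(words(STRICT, text));
        System.out.println(words(LENIENT, text));
    }
}
```

The strict pattern yields "chefs" for the possessive, while the lenient one yields "chefs'"; all other words are the same in both lists. Both variants also keep a hyphen at the end of a word, which may or may not be wanted.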