Java: extracting non-repeating words from a PDF file


I wrote a simple Java program that uses PDFBox to extract words from a PDF file. It reads the text from the PDF and extracts it word by word.

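// Imports assume PDFBox 2.x, where PDFTextStripper lives in org.apache.pdfbox.text
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the document when the block exits
        try (PDDocument document = PDDocument.load(new File("C:\\my.pdf"))) {

            if (!document.isEncrypted()) {

                // Extract the whole document as plain text
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                // Split the text into lines and print each one
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }

}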

Is there a way to extract the words without duplicates?

Split each line on spaces with line.split(" "). Maintain a HashSet to hold the words and keep adding every word to it; a HashSet ignores duplicates by nature.

HashSet<String> uniqueWords = new HashSet<>();

for (String line : lines) {
  String[] words = line.split(" ");

  for (String word : words) {
    uniqueWords.add(word);
  }
}
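
One reason duplicates can still seem to appear in the set is that tokens differing only in case or attached punctuation ("Cat," versus "cat") are distinct strings, and split(" ") also yields empty tokens when several spaces occur in a row. The following is a minimal sketch of one way to normalize tokens before adding them. The class name UniqueWordsSketch, the method uniqueWords, and the choice to lower-case and split on non-alphanumeric characters are assumptions made for illustration; they are not part of the original answer or of PDFBox.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class UniqueWordsSketch {

    // Hypothetical helper: collects distinct, normalized words from the given lines.
    static Set<String> uniqueWords(String[] lines) {
        Set<String> unique = new TreeSet<>(); // sorted and duplicate-free
        for (String line : lines) {
            // Split on any run of characters that are neither letters nor digits,
            // so "word," and "word" produce the same token.
            for (String token : line.split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) { // a leading separator yields one empty token
                    unique.add(token.toLowerCase(Locale.ROOT));
                }
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        String[] lines = { "The cat sat.", "the  Cat, again" };
        System.out.println(uniqueWords(lines)); // prints [again, cat, sat, the]
    }
}
```

Splitting on non-alphanumeric characters is a trade-off: it also breaks contractions such as "don't" into two tokens, so a different pattern may suit text where that matters.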


If your goal is to remove duplicates, you can do that by adding the array to a java.util.Set. So all you need to do now is the following:

Set<String> noDuplicates = new HashSet<>( Arrays.asList( lines ) );


There are no more duplicates.
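
Because lines holds whole lines of text, building the set directly from it removes duplicate lines rather than duplicate words. A sketch of the same idea applied to words, assuming the extracted text is available as a single String and that words are separated by whitespace; the class NoDuplicateWordsSketch and the method noDuplicateWords are hypothetical names used only for this example:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

class NoDuplicateWordsSketch {

    // Hypothetical helper: de-duplicates words instead of lines.
    static Set<String> noDuplicateWords(String text) {
        return Arrays.stream(text.split("\\s+"))  // any whitespace run, including line breaks
                .filter(word -> !word.isEmpty())   // leading whitespace yields one empty token
                .collect(Collectors.toSet());      // the set drops repeated words
    }

    public static void main(String[] args) {
        String text = "one two\ntwo three\r\none";
        Set<String> words = noDuplicateWords(text);
        System.out.println(words.size()); // prints 3
    }
}
```

With this approach the string returned by tStripper.getText(document) could be passed in directly, without splitting it into lines first.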


So I need to create one? Then how do I extract the words into the HashSet? When I try to print uniqueWords, I can still see duplicate words in each key. After storing them in the HashSet, is it possible to store these "words" in a database such as MySQL for full-text indexing?

Normally you can use a Set for this, something like: Set<String> words = new HashSet<>(); then you add each word to the set with words.add(word), which ignores duplicate words, and afterwards you can iterate over the set to get all the non-repeating words.

@NoEm What would that look like in code?

// Holds all non-repeating words
Set<String> uniqueWords = new HashSet<>();
for (String line : lines) {
    String[] words = line.split(" ");
    for (String word : words) {
        uniqueWords.add(word.trim());
    }
}
// Print all non-repeating words
System.out.println("Non-repeating words:");
Iterator<String> it = uniqueWords.iterator();
while (it.hasNext()) {
    System.out.println(it.next());
}

You could post that as an answer instead.

How do I store the words from the HashSet in a MySQL table?

That is a different question.
