Java: finding headings/paragraphs in a Word document (Apache POI)


I am trying to detect paragraphs/headings in a Word document.
I am using Apache POI for this.
One example I am working with is:

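    import java.io.FileInputStream;
    import java.util.ArrayList;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.hwpf.usermodel.CharacterRun;
    import org.apache.poi.hwpf.usermodel.Range;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(filesname));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    ArrayList<String> titles = new ArrayList<>();

    // Extract the text once instead of on every loop iteration.
    int textLength = we.getText().length();

    try {
        for (int i = 0; i < textLength - 1; i++) {
            int startIndex = i;
            int endIndex = i + 1;
            // Look at one character at a time and read its formatting.
            Range range = new Range(startIndex, endIndex, doc);
            CharacterRun cr = range.getCharacterRun(0);

            if (cr.isBold() || cr.isItalic() || cr.getUnderlineCode() != 0) {
                // Keep advancing while the inspected character is still emphasized.
                while (cr.isBold() || cr.isItalic() || cr.getUnderlineCode() != 0) {
                    i++;
                    endIndex += 1;
                    range = new Range(endIndex, endIndex + 1, doc);
                    cr = range.getCharacterRun(0);
                }
                // Store the emphasized span as a candidate title.
                range = new Range(startIndex, endIndex - 1, doc);
                titles.add(range.text());
            }
        }
    } catch (IndexOutOfBoundsException iobe) {
        // Sometimes this happens; I still have to find out why.
    }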

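This works for all text that is bold, italic, or underlined.

But what I actually want is to find the most commonly used font, and then detect the text that deviates from that font style. Does anyone have any ideas?

Well, here are a few things to try: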

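cr.getFontSize() can be used at the start of a paragraph to see whether the range changes font size. Combined with bold, italic, or underline, that would make a good identifier.
cr.getFontName() can also be used to determine when and where the font changes within a given range.
cr.getColor() would be another way to help identify whether the user applied a different color to the font.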
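
One way to turn these properties into "the most commonly used font" is to tally how many characters each formatting combination covers, then flag the runs that differ from the winner. The following is only a minimal sketch of that idea: `RunInfo`, `DominantStyle`, and their methods are hypothetical names, and `RunInfo` is a plain stand-in for values you would read from a POI `CharacterRun` yourself, so the sketch does not depend on POI.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the values read from one POI CharacterRun. */
final class RunInfo {
    final String text;
    final String fontName; // e.g. taken from cr.getFontName()
    final int fontSize;    // e.g. taken from cr.getFontSize()
    final boolean bold;    // e.g. taken from cr.isBold()

    RunInfo(String text, String fontName, int fontSize, boolean bold) {
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
        this.bold = bold;
    }

    /** Key that groups runs sharing the same formatting. */
    String styleKey() {
        return fontName + "|" + fontSize + "|" + (bold ? "bold" : "regular");
    }
}

/** Hypothetical helper: finds the dominant formatting and the runs that deviate from it. */
final class DominantStyle {

    /** Returns the style key covering the most characters, or null if there are no runs. */
    static String dominantKey(List<RunInfo> runs) {
        Map<String, Integer> charsPerStyle = new HashMap<>();
        for (RunInfo run : runs) {
            // Weight each style by text length so long body text outweighs short headings.
            charsPerStyle.merge(run.styleKey(), run.text.length(), Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : charsPerStyle.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Returns the text of every run whose formatting differs from the dominant style. */
    static List<String> deviatingText(List<RunInfo> runs) {
        String dominant = dominantKey(runs);
        List<String> result = new ArrayList<>();
        for (RunInfo run : runs) {
            if (!run.styleKey().equals(dominant)) {
                result.add(run.text);
            }
        }
        return result;
    }
}
```

Weighting by character count rather than by number of runs is a design choice: a document with many short headings would otherwise risk having the heading style counted as the dominant one. Color or italics could be added to `styleKey()` in the same way.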

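I suppose I would iterate over the range and create a separate CharacterRun item each time the text characteristics change. Then I would evaluate each item based on its position in the paragraph and on all of the characteristics above (size, color, name, bold, italic, and so on). Perhaps you could create some sort of weighted scale based on the most common values. It might also be valuable to create a Title object and store the values of each set of characteristics, to help refine the search for later character runs in the same document.

You might want to take a look at the buildParagraphTagAndStyle method in Tika's WordExtractor: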

For HWPF (.doc), you would call it with something like:

      StyleDescription style = 
         document.getStyleSheet().getStyleDescription(p.getStyleIndex());
      TagAndStyle tas = buildParagraphTagAndStyle(
            style.getName(), (parentTableLevel>0)
      );

For XWPF (.docx), you would do something like:

      XWPFStyle style = styles.getStyle(paragraph.getStyleID());

      TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(
            style.getName(), paragraph.getPartType() == BodyType.TABLECELL
      );
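
If you only need the heading level and would rather not depend on Tika, the style name can be inspected directly. The sketch below is not Tika's implementation; `HeadingStyles.headingLevel` is a hypothetical helper, and it assumes the document uses the English built-in style names "Heading N" and "Title" (localized or custom templates may use other names).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that maps a paragraph style name to a heading level. */
final class HeadingStyles {

    // Matches names such as "Heading 1" or "heading2", ignoring case.
    private static final Pattern HEADING =
            Pattern.compile("(?i)^heading\\s*(\\d{1,2})$");

    /**
     * Returns the heading level (1-based) for a style name,
     * or 0 when the name does not look like a heading style.
     * The "Title" style is treated as level 1, which is an assumption of this sketch.
     */
    static int headingLevel(String styleName) {
        if (styleName == null) {
            return 0;
        }
        String name = styleName.trim();
        if (name.equalsIgnoreCase("Title")) {
            return 1;
        }
        Matcher matcher = HEADING.matcher(name);
        return matcher.matches() ? Integer.parseInt(matcher.group(1)) : 0;
    }
}
```

You would pass it the same `style.getName()` value used in the calls above, for example `HeadingStyles.headingLevel(style.getName())`.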

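It will be easier if you convert the data into paragraphs before processing it:

 WordExtractor we = new WordExtractor(doc);
 String[] para = we.getParagraphText();

Then work through the paragraphs in order. If your code cannot already recognize the headings, you can check each paragraph for bold and underlined text.
The paragraphs can be accessed like this:

 for (int i = 0; i < para.length; i++) {
     System.out.println("Length of paragraph " + (i + 1) + ": " + para[i].length());
     System.out.println(para[i]);
 }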
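
The bold/underline check per paragraph could look roughly like the sketch below. `ParagraphRun` and `EmphasisCheck` are hypothetical names; `ParagraphRun` stands in for the text and flags you would read from each character run of a paragraph, so the logic can be shown without a POI dependency. It treats a paragraph as a heading candidate only when all of its visible text is bold or underlined, which is an assumption of this sketch rather than a rule from POI.

```java
import java.util.List;

/** Hypothetical stand-in for one character run inside a paragraph. */
final class ParagraphRun {
    final String text;
    final boolean bold;
    final boolean underlined;

    ParagraphRun(String text, boolean bold, boolean underlined) {
        this.text = text;
        this.bold = bold;
        this.underlined = underlined;
    }
}

/** Hypothetical helper that decides whether a paragraph is emphasized as a whole. */
final class EmphasisCheck {

    /**
     * Returns true when the paragraph contains visible text and every
     * non-blank run is bold or underlined.
     */
    static boolean isFullyEmphasized(List<ParagraphRun> runs) {
        boolean sawText = false;
        for (ParagraphRun run : runs) {
            if (run.text.trim().isEmpty()) {
                continue; // ignore paragraph marks and whitespace-only runs
            }
            sawText = true;
            if (!run.bold && !run.underlined) {
                return false; // plain text found, so this is ordinary body text
            }
        }
        return sawText;
    }
}
```

Requiring the whole paragraph to be emphasized avoids flagging body paragraphs that merely contain one bold word.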
