Parsing a text file in Java to get a HashMap of fields

I am trying to parse multiple files and split each of them into a set of fields stored in a HashMap. Here is a sample file:

COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS

    ROTTERDAM, March 18 - Contract terms for trade in coconut
oil are to be changed from long tons to tonnes with effect from
the Aug/Sep contract onwards, Dutch vegetable oil traders said.
    Operators have already started to take account of the
expected change and reported at least one trade in tonnes for
Aug/Sept shipment yesterday.

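I need the program to parse this document into the fields of a custom Document class, which has a key, file name, file title, place, date, author, content and category.

This is what I have tried: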

public static Document parse(String filename) throws IOException {

        File f = new File(filename);

        if (f.isFile()) {

            // File id: the file name without its extension
            String fileId = filename;
            if (filename.indexOf(".") > 0) {
                fileId = filename.substring(0, filename.lastIndexOf("."));
            }
            // Category: the parent directory of the file
            String category = f.getParent();

            InputStream in = new FileInputStream(f);

            byte buf[] = new byte[1024];
            int len = in.read(buf);
            while(len > 0){
               ..........
            }
            in.close();
        }

        return null;
    }

The following code may help you:

// Needs java.io.BufferedReader and java.io.FileReader
try (BufferedReader br = new BufferedReader(new FileReader("myFile.txt"))) {
        StringBuilder contentBuffer = new StringBuilder();
        String line;
        boolean foundTitle = false;
        boolean foundPlaceAndDate = false;
        String date = "";
        while ((line = br.readLine()) != null) {
            if (line.matches("^[a-zA-Z0-9].*") && !foundTitle) {
                // If the line starts with a letter or digit and we have no title yet, it is the title
                System.out.println("Title: " + line);
                foundTitle = true;
            } else if (line.matches("^[ \\t].*") && !foundPlaceAndDate) {
                // If the line starts with a space or tab and it is our first paragraph, it holds the place and date
                String trimmed = line.trim();
                System.out.println("Place: " + trimmed.substring(0, trimmed.indexOf(",")));
                date = trimmed.substring(trimmed.indexOf(",") + 1, trimmed.indexOf("-")).trim();
                System.out.println("Date: " + date);
                foundPlaceAndDate = true;
            }
            // Add a space so that words at the end of one line do not run into the next line
            contentBuffer.append(line).append(' ');
        }

        // The content is everything after the date and the " -" separator that follows it
        String all = contentBuffer.toString();
        String content = all.substring(all.indexOf(date) + date.length() + 2).trim();
        System.out.println("Content: " + content);
    } catch (Exception e) {
        System.err.println("Oh no! I got the following error: " + e.getMessage());
    }

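The output will be:

Title: COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS

Place: ROTTERDAM

Date: March 18

Content: Contract terms for trade in coconut oil are to be changed from long tons to tonnes with effect from the Aug/Sep contract onwards, Dutch vegetable oil traders said. Operators have already started to take account of the expected change and reported at least one trade in tonnes for Aug/Sept shipment yesterday.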
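If you would rather get the fields back as a map than print them, the same idea could be sketched roughly like this. It is only a sketch under assumptions: the class name ArticleParser, the map keys ("title", "place", "date", "content"), the UTF-8 charset and the dateline pattern are all made up for illustration, and it assumes every file looks like the sample (a title line first, then a body that opens with "PLACE, date - ").

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ArticleParser {
    // Assumed dateline layout at the start of the body, e.g. "ROTTERDAM, March 18 - ".
    private static final Pattern DATELINE =
            Pattern.compile("^\\s*([^,\\n]+),\\s*(.+?)\\s+-\\s+");

    /** Splits raw article text into title, place, date and content. */
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        String trimmed = text.trim();
        // The title is the first non-blank line; everything after it is the body.
        int firstBreak = trimmed.indexOf('\n');
        String title = firstBreak < 0 ? trimmed : trimmed.substring(0, firstBreak).trim();
        String body = firstBreak < 0 ? "" : trimmed.substring(firstBreak + 1);
        fields.put("title", title);

        // Place and date are only set when the body really starts with a dateline.
        Matcher m = DATELINE.matcher(body);
        if (m.find()) {
            fields.put("place", m.group(1).trim());
            fields.put("date", m.group(2).trim());
            body = body.substring(m.end());
        }
        // Collapse line breaks and paragraph indentation into single spaces.
        fields.put("content", body.trim().replaceAll("\\s+", " "));
        return fields;
    }

    /** Reads the whole file into a String first, then parses it. */
    static Map<String, String> parseFile(Path file) throws IOException {
        // The charset is an assumption; adjust it to match the actual files.
        return parse(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String sample = "COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS\n\n"
                + "    ROTTERDAM, March 18 - Contract terms for trade in coconut\n"
                + "oil are to be changed from long tons to tonnes.\n";
        parse(sample).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Keeping parse(String) separate from the file reading makes it easy to test the splitting logic on a literal string before pointing it at real files.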

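Sorry, what are you trying to achieve here?

Oh well, this is a start, but it will be hard to continue in the same way. If I were you, I would stop writing code right now and first work out which high-level steps are needed. Write the steps down on a piece of paper: 1. Read the file completely into a String. 2. Extract the file title ... and so on. Then you can start coding them one step at a time, testing the result after each step.

This did get me started, but I need to parse the file into a Document class like the following:

    public class Document {
        private HashMap<FieldNames, String[]> map;
        public Document() { map = new HashMap<FieldNames, String[]>(); }
        public void setField(FieldNames fn, String... o) { map.put(fn, o); }
        public String[] getField(FieldNames fn) { return map.get(fn); }
    }

Now all you need to do is populate the fields of the Document class. For example:

    Document document = new Document();
    document.setField("title", title);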
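One possible way to do that populating is sketched below. Note the assumptions: the FieldNames type is never shown in this thread, so the enum constants here are hypothetical, as is the DocumentBuilder helper. Taking the key from the file name without its extension and the category from the name of the parent directory is only a guess at what the question intends. The author field is left unset because the sample file has none.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical field names; the real FieldNames type is not shown in the thread.
enum FieldNames { KEY, FILENAME, TITLE, PLACE, DATE, AUTHOR, CONTENT, CATEGORY }

final class Document {
    private final Map<FieldNames, String[]> map = new EnumMap<>(FieldNames.class);

    void setField(FieldNames fn, String... values) { map.put(fn, values); }

    String[] getField(FieldNames fn) { return map.get(fn); }
}

final class DocumentBuilder {
    /** Fills a Document from already-extracted values plus the file's path. */
    static Document build(String filename, String title, String place,
                          String date, String content) {
        Path path = Paths.get(filename);
        String name = path.getFileName().toString();

        // Key: the file name without its extension (assumption).
        int dot = name.lastIndexOf('.');
        String key = dot > 0 ? name.substring(0, dot) : name;

        // Category: the name of the parent directory, or "" if there is none (assumption).
        Path parent = path.getParent();
        String category = (parent == null || parent.getFileName() == null)
                ? "" : parent.getFileName().toString();

        Document doc = new Document();
        doc.setField(FieldNames.KEY, key);
        doc.setField(FieldNames.FILENAME, name);
        doc.setField(FieldNames.CATEGORY, category);
        doc.setField(FieldNames.TITLE, title);
        doc.setField(FieldNames.PLACE, place);
        doc.setField(FieldNames.DATE, date);
        doc.setField(FieldNames.CONTENT, content);
        return doc;
    }
}
```

A call such as DocumentBuilder.build("data/coconut-oil/0001.txt", title, place, date, content) would then give a Document whose fields can be read back with getField, each one as a String array because setField takes varargs.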