Java: pattern matching with multi-byte characters


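I am reading a Shift-JIS encoded XML file into a ByteBuffer, converting it to a String, and using Pattern and Matcher to find where the document starts and ends. With those two positions I then try to write that part of the buffer to a file. This works when the file contains no multi-byte characters. If there is a multi-byte character, some text is missing at the end, because the value of end is slightly off.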

static final Pattern startPattern = Pattern.compile("<\\?xml ");
static final Pattern endPattern = Pattern.compile("</doc>\n");

 public static void main(String[] args) throws Exception {
    File f = new File("20121114000606JA.xml");
    FileInputStream fis = new FileInputStream(f);
    FileChannel fci = fis.getChannel();
    ByteBuffer data_buffer = ByteBuffer.allocate(65536);
    while (true) {
      int read = fci.read(data_buffer);
      if (read == -1)
        break;
    }

    ByteBuffer cbytes = data_buffer.duplicate();
    cbytes.flip();
    Charset data_charset = Charset.forName("UTF-8");
    String request = data_charset.decode(cbytes).toString();

    Matcher start = startPattern.matcher(request);
    if (start.find()) {
      Matcher end = endPattern.matcher(request);

      if (end.find()) {

        int i0 = start.start();
        int i1 = end.end();

        String str = request.substring(i0, i1);

        String filename = "test.xml";
        FileChannel fc = new FileOutputStream(new File(filename), false).getChannel();

        data_buffer.position(i0);
        data_buffer.limit(i1 - i0);

        long offset = fc.position();
        long sz = fc.write(data_buffer);

        fc.close();
      }
    }
    System.out.println("OK");
  }

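Using the String indices i0 and i1 as byte positions in

data_buffer.position(i0);
data_buffer.limit(i1 - i0);

is wrong. A char index in the decoded String is not a byte offset in the buffer as soon as any character occupies more than one byte. (Note also that limit is an absolute position, so i1 - i0 is only right when i0 is 0.) On top of that, Unicode text does not have a unique representation: ĉ can be stored as a single character or as two characters, c followed by the combining circumflex ^. Converting back and forth between chars and bytes is therefore not only costly but also error-prone for particular data.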
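The mismatch between char indices and byte offsets can be illustrated with a small sketch. The class and helper names are hypothetical, and UTF-8 is used only because every JVM supports it; the same effect occurs with Shift-JIS, where Japanese characters take two bytes. The helper assumes the index does not fall inside a surrogate pair.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharVsByteOffsets {

    /**
     * Hypothetical helper: number of bytes that the first charIndex chars of
     * text occupy in the given charset. Assumes charIndex does not split a
     * surrogate pair.
     */
    static int byteOffset(String text, int charIndex, Charset charset) {
        return text.substring(0, charIndex).getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Three Japanese characters (written as Unicode escapes so the source
        // stays ASCII); each of them takes 3 bytes in UTF-8.
        String text = "<a>\u65E5\u672C\u8A9E</a>";

        int charEnd = text.length();
        int byteEnd = byteOffset(text, charEnd, StandardCharsets.UTF_8);

        // prints "chars=10 bytes=16"
        System.out.println("chars=" + charEnd + " bytes=" + byteEnd);
    }
}
```

Passing the char index 10 to ByteBuffer.limit here would cut off the last 6 bytes, which matches the symptom of text missing at the end.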

Either write the extracted String with a Writer, or use a CharBuffer, which implements CharSequence.

Instead of writing to the FileChannel fc, do this:

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(
        new File(filename)), "UTF-8"));
try {
    out.write(str);
} finally {
    out.close();
}

The CharBuffer version would need more rewriting, including the pattern matching.
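A rough sketch of what the CharBuffer variant might look like follows. It is not the answerer's code: the class name, the extract helper, and the sample input are assumptions made for illustration, and UTF-8 is used so the example runs on any JVM. The idea is to decode once, match directly on the CharBuffer, and hand the matched sub-sequence to a Writer so that no char index is ever used as a byte offset.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharBufferExtract {
    static final Pattern START = Pattern.compile("<\\?xml ");
    static final Pattern END = Pattern.compile("</doc>\n");

    /**
     * Hypothetical helper: decodes the bytes and returns the chars from the
     * XML declaration up to and including the closing tag, or null if either
     * marker is missing.
     */
    static CharBuffer extract(ByteBuffer bytes, Charset charset) {
        // CharBuffer implements CharSequence, so it can be matched directly.
        CharBuffer chars = charset.decode(bytes);

        Matcher start = START.matcher(chars);
        if (!start.find()) {
            return null;
        }
        // Search for the end marker only after the start marker.
        Matcher end = END.matcher(chars);
        if (!end.find(start.end())) {
            return null;
        }
        return chars.subSequence(start.start(), end.end());
    }

    public static void main(String[] args) throws IOException {
        String text = "junk<?xml version=\"1.0\"?>\n<doc>\u65E5\u672C\u8A9E</doc>\ntrailing";
        ByteBuffer bytes = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));

        CharBuffer doc = extract(bytes, StandardCharsets.UTF_8);

        // The Writer encodes the chars; System.out stands in for a file here.
        Writer out = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        out.append(doc);
        out.flush(); // flush only, so System.out stays open
    }
}
```

In a real program the Writer would wrap a FileOutputStream, as in the BufferedWriter snippet above.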

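Your problem here seems to be the decoding of the byte buffer. You are decoding a Shift-JIS ByteBuffer with the UTF-8 charset; you need to decode it with the Shift-JIS charset instead.

I do not have a Shift-JIS file to test with, but you should try changing the Charset.forName line to: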

Charset data_charset = Charset.forName("Shift_JIS");
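
Also, your regex logic is slightly off. I would not use a second matcher, because that restarts the search from the beginning and can produce an inverted range (an end position before the start position). Instead, take the position of the current match and then change the pattern the matcher uses: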

Matcher matcher = startPattern.matcher(request);
if (matcher.find()) {
  int i0 = matcher.start();
  matcher.usePattern(endPattern);

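  if (matcher.find()) {

    int i1 = matcher.end();
    // ... write request.substring(i0, i1) as before ...
  }
}

Once the bytes are decoded with the right charset, Shift-JIS content should map cleanly onto Java characters (Java strings are UTF-16 internally, not UTF-8), so you can simply use groups to get the data.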
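Putting both suggestions together (decode as Shift_JIS, reuse one matcher via usePattern) might look roughly like the sketch below. The class and method names are hypothetical, the sample document is made up, and the code assumes the running JVM provides the Shift_JIS charset, which the Java specification does not guarantee.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShiftJisExtract {
    static final Pattern START = Pattern.compile("<\\?xml ");
    static final Pattern END = Pattern.compile("</doc>\n");

    // Assumption: this JVM supports Shift_JIS (otherwise forName throws).
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /**
     * Hypothetical helper: returns the text from the XML declaration up to
     * and including the closing tag, or null if either marker is missing.
     */
    static String extractDocument(ByteBuffer bytes) {
        String request = SHIFT_JIS.decode(bytes).toString();

        Matcher matcher = START.matcher(request);
        if (!matcher.find()) {
            return null;
        }
        int i0 = matcher.start();

        // usePattern keeps the matcher's position, so the next find()
        // continues after the start marker instead of from index 0.
        matcher.usePattern(END);
        if (!matcher.find()) {
            return null;
        }
        return request.substring(i0, matcher.end());
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\" encoding=\"Shift_JIS\"?>\n"
                + "<doc>\u65E5\u672C\u8A9E</doc>\n";
        // A stray end marker before the document must not invert the range.
        byte[] raw = ("</doc>\n" + doc + "trailing").getBytes(SHIFT_JIS);

        String found = extractDocument(ByteBuffer.wrap(raw));
        System.out.println(doc.equals(found)); // prints "true"
    }
}
```

The extracted String should then be written through a Writer with an explicit charset rather than by slicing the original ByteBuffer.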

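To transcode this file correctly, you should use Java's XML APIs. There are several ways to do this; here is a solution that uses the javax.xml.transform package. First, we really do need the djnml-1.0b.dtd file referenced in the document, in case it contains entity references. Since that file is missing, this solution uses a DTD generated from the provided input.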


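After writing that DTD to "djnml-1.0b.dtd", we need an identity transform written in XSLT. You could use the newTransformer() method on TransformerFactory for this, but the result of that transform is not well specified. Using XSLT gives cleaner results. We will use this file as the identity transform: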

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" omit-xml-declaration="no"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Save the XSLT file above as "identity.xsl". Now that we have the DTD and the identity transform, we can transcode the file with the following code:

import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

...

File inFile = new File("20121114000606JA.xml");
File outputFile = new File("test.xml");
final File dtdFile = new File("djnml-1.0b.dtd");
File identityFile = new File("identity.xsl");

final List<Closeable> closeables = new ArrayList<Closeable>();
try {
  // We are going to use a SAXSource for input, so that we can specify the
  // location of the DTD with an EntityResolver.
  InputStream in = new FileInputStream(inFile);
  closeables.add(in);
  InputSource fileSource = new InputSource();
  fileSource.setByteStream(in);
  fileSource.setSystemId(inFile.toURI().toString());

  SAXSource source = new SAXSource();
  XMLReader reader = XMLReaderFactory.createXMLReader();
  reader.setEntityResolver(new EntityResolver() {
    public InputSource resolveEntity(String publicId, String systemId)
        throws SAXException, IOException {
      if (systemId != null && systemId.endsWith("/djnml-1.0b.dtd")) {
        InputStream dtdIn = new FileInputStream(dtdFile);
        closeables.add(dtdIn);

        InputSource inputSource = new InputSource();
        inputSource.setByteStream(dtdIn);
        inputSource.setEncoding("UTF-8");

        return inputSource;
      }
      return null;
    }
  });

  source.setXMLReader(reader);
  source.setInputSource(fileSource);

  // Now we need to create a StreamResult.
  OutputStream out = new FileOutputStream(outputFile);
  closeables.add(out);
  StreamResult result = new StreamResult();
  result.setOutputStream(out);
  result.setSystemId(outputFile);

  // Create a templates object for the identity transform.  If you are going
  // to transform a lot of documents, you should do this once and
  // reuse the Templates object.
  InputStream identityIn = new FileInputStream(identityFile);
  closeables.add(identityIn);
  StreamSource identitySource = new StreamSource();
  identitySource.setSystemId(identityFile);
  identitySource.setInputStream(identityIn);
  TransformerFactory factory = TransformerFactory.newInstance();
  Templates templates = factory.newTemplates(identitySource);

  // Finally we need to create the transformer and do the transformation.
  Transformer transformer = templates.newTransformer();
  transformer.transform(source, result);

} finally {
  // Some older XML processors are bad at cleaning up input and output streams,
  // so we will do this manually.
  for (Closeable closeable : closeables) {
    if (closeable != null) {
      try {
        closeable.close();
      } catch (Exception e) {
      }
    }
  }
}
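
For comparison, the plain newTransformer() identity transform mentioned earlier can be sketched as follows. This is only an illustration under stated assumptions, not the method recommended above: the class and method names are hypothetical, the input is a small in-memory document without an external DTD, and the JVM and its XML parser are assumed to support Shift_JIS. The exact form of the serialized output is left to the implementation, which is the drawback noted earlier.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class IdentityTranscodeSketch {

    /**
     * Hypothetical helper: parses the XML bytes using the encoding declared
     * in the document itself and serializes the result as UTF-8.
     */
    static byte[] transcodeToUtf8(byte[] xmlBytes) {
        try {
            // newTransformer() without arguments yields an identity transform.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(
                    new StreamSource(new ByteArrayInputStream(xmlBytes)),
                    new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transcoding failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"Shift_JIS\"?>"
                + "<doc>\u65E5\u672C\u8A9E</doc>";
        byte[] shiftJisBytes = xml.getBytes(Charset.forName("Shift_JIS"));

        byte[] utf8Bytes = transcodeToUtf8(shiftJisBytes);
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```

Because the parser reads the declared encoding and the serializer writes the requested one, no char or byte offsets need to be computed by hand.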