Java: pattern matching with multi-byte characters
I am reading a Shift-JIS encoded XML file into a ByteBuffer, converting it to a String, and using Pattern and Matcher to find the start and the end of the document. From those two positions I try to write the buffer to a file. This works when there are no multi-byte characters. If there is a multi-byte character, I lose some text at the end, because the value of end is slightly off.
static final Pattern startPattern = Pattern.compile("<\\?xml ");
static final Pattern endPattern = Pattern.compile("</doc>\n");
public static void main(String[] args) throws Exception {
File f = new File("20121114000606JA.xml");
FileInputStream fis = new FileInputStream(f);
FileChannel fci = fis.getChannel();
ByteBuffer data_buffer = ByteBuffer.allocate(65536);
while (true) {
int read = fci.read(data_buffer);
if (read == -1)
break;
}
ByteBuffer cbytes = data_buffer.duplicate();
cbytes.flip();
Charset data_charset = Charset.forName("UTF-8");
String request = data_charset.decode(cbytes).toString();
Matcher start = startPattern.matcher(request);
if (start.find()) {
Matcher end = endPattern.matcher(request);
if (end.find()) {
int i0 = start.start();
int i1 = end.end();
String str = request.substring(i0, i1);
String filename = "test.xml";
FileChannel fc = new FileOutputStream(new File(filename), false).getChannel();
data_buffer.position(i0);
data_buffer.limit(i1 - i0);
long offset = fc.position();
long sz = fc.write(data_buffer);
fc.close();
}
}
System.out.println("OK");
}
Using the String indices i0 and i1 as byte positions:
data_buffer.position(i0);
data_buffer.limit(i1 - i0);
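is wrong. i0 and i1 count chars in the decoded String, while position() and limit() count bytes in the buffer, so the two drift apart as soon as one character takes more than one byte. In addition, limit() expects an absolute end position, not a length. Since text does not have a unique encoding either (ĉ can also be written as two characters, c + the combining diacritical mark ^), converting back and forth between chars and bytes is not only costly but also error-prone for particular data.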
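To make the mismatch concrete, here is a small sketch that compares the char index of a match with the byte offset of the same spot once the text is encoded. The class and method names are made up for illustration, and the Shift_JIS branch only runs when the JDK at hand ships that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharVsByteOffset {
    /**
     * Number of bytes occupied by the first charIndex chars of text when encoded with cs.
     * Assumes charIndex does not split a surrogate pair (true for regex match
     * boundaries of plain ASCII patterns such as "</doc>").
     */
    static int byteOffset(String text, int charIndex, Charset cs) {
        return text.substring(0, charIndex).getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Three kanji (U+65E5 U+672C U+8A9E) followed by the end tag.
        String text = "\u65E5\u672C\u8A9E</doc>\n";
        int charIndex = text.indexOf("</doc>");

        System.out.println("char index:        " + charIndex); // 3
        System.out.println("UTF-8 byte offset: "
                + byteOffset(text, charIndex, StandardCharsets.UTF_8)); // 9

        // Shift_JIS is not one of the charsets every JVM must provide, so check first.
        if (Charset.isSupported("Shift_JIS")) {
            System.out.println("Shift_JIS byte offset: "
                    + byteOffset(text, charIndex, Charset.forName("Shift_JIS"))); // 6
        }
    }
}
```

With pure ASCII input all three numbers coincide, which is why the original code appears to work until the first multi-byte character shows up.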
Or use a CharBuffer, which implements CharSequence.
Instead of writing to the FileChannel fc, write the String through a Writer:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(
new File(filename)), "UTF-8"));
try {
out.write(str);
} finally {
out.close();
}
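As a rough end-to-end illustration of staying in characters the whole way, the sketch below reads the file, decodes it once, matches on the String, and writes the matched text through a Writer. It swaps the two separate patterns for a single reluctant pattern, which is an assumption made for brevity; the class name, method names, and command-line arguments are hypothetical. It writes with the same charset it read with, because writing in a different one would contradict the encoding named in the XML declaration.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractDoc {
    // From the XML declaration up to and including the first "</doc>\n" after it.
    static final Pattern DOC = Pattern.compile("<\\?xml .*?</doc>\n", Pattern.DOTALL);

    /** Returns the first document found in text, or null if there is none. */
    static String firstDocument(String text) {
        Matcher m = DOC.matcher(text);
        return m.find() ? m.group() : null;
    }

    /** Copies the first document from in to out, decoding and encoding with cs. */
    static boolean copyFirstDocument(Path in, Path out, Charset cs) throws IOException {
        String text = new String(Files.readAllBytes(in), cs);
        String doc = firstDocument(text);
        if (doc == null) {
            return false;
        }
        try (Writer w = Files.newBufferedWriter(out, cs)) {
            w.write(doc); // no byte offsets involved, so multi-byte characters are safe
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ExtractDoc <in.xml> <out.xml>");
            return;
        }
        // Assumption: the input really is Shift-JIS, as stated in the question.
        Charset sjis = Charset.forName("Shift_JIS");
        boolean found = copyFirstDocument(Paths.get(args[0]), Paths.get(args[1]), sjis);
        System.out.println(found ? "OK" : "no document found");
    }
}
```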
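The CharBuffer version would need more rewriting, touching the pattern matching as well.
Your problem here seems to be the decoding of the byte buffer. You are decoding a Shift-JIS ByteBuffer with the UTF-8 charset. You need to decode it with the Shift-JIS charset instead.
I don't have a Shift-JIS file to test with, but you should try changing the Charset.forName line to: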
Charset data_charset = Charset.forName("Shift_JIS");
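One way to notice a wrong charset early, rather than silently getting replacement characters, is to decode with a decoder configured to report errors. The sketch below is only an illustration: the class and method names are invented, and the demo in main is skipped when the running JDK does not provide Shift_JIS.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    /** Decodes bytes with cs and throws on bad input instead of substituting U+FFFD. */
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        if (!Charset.isSupported("Shift_JIS")) {
            System.out.println("Shift_JIS is not available in this JVM");
            return;
        }
        Charset sjis = Charset.forName("Shift_JIS");
        byte[] bytes = "\u65E5\u672C\u8A9E".getBytes(sjis); // three kanji as Shift-JIS bytes

        // The right charset round-trips the text.
        System.out.println(decodeStrict(bytes, sjis).equals("\u65E5\u672C\u8A9E"));

        // The wrong charset is reported instead of producing garbage.
        try {
            decodeStrict(bytes, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

Charset.decode(ByteBuffer), as used in the question, always replaces malformed input, so the mistake only shows up later as missing or mangled text.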
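Also, your regex logic is slightly off. I would not use a second matcher, because that makes the search start over from the beginning and can give you an inverted range. Instead, take the position of the current match and then change the pattern that the matcher uses: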
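Matcher matcher = startPattern.matcher(request);
if (matcher.find()) {
int i0 = matcher.start();
// usePattern keeps the current position, so the next find() continues after the start match
matcher.usePattern(endPattern);
if (matcher.find()) {
int i1 = matcher.end();
// ... use request.substring(i0, i1) here
}
}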
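A self-contained version of that idea, for experimenting, might look like the following. The class and method names are invented for this sketch; the two patterns are the ones from the question. The sample input deliberately starts with a stray end tag, which is the case where two independent matchers would produce an inverted range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartEndMatch {
    static final Pattern START = Pattern.compile("<\\?xml ");
    static final Pattern END = Pattern.compile("</doc>\n");

    /** Returns the text from the first START match through the next END match, or null. */
    static String extract(String request) {
        Matcher matcher = START.matcher(request);
        if (!matcher.find()) {
            return null;
        }
        int i0 = matcher.start();
        // usePattern keeps the matcher's position in the input, so the next
        // find() resumes after the START match instead of at index 0.
        matcher.usePattern(END);
        if (!matcher.find()) {
            return null;
        }
        return request.substring(i0, matcher.end());
    }

    public static void main(String[] args) {
        String request = "</doc>\n<?xml version=\"1.0\"?>\n<doc>\u65E5\u672C\u8A9E</doc>\ntrailing";
        System.out.print(extract(request));
    }
}
```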
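Once it is decoded with the right charset, the Shift-JIS text should map cleanly onto Java's chars (Java Strings are UTF-16 internally, not UTF-8).
Just use groups to get the data.
To transcode this file correctly, you should use Java's XML APIs. There are several ways to do this; here is a solution that uses the javax.xml.transform package. First, we really do need the djnml-1.0b.dtd file referenced in your document (in case it contains entity references). Since that file is missing, this solution uses a DTD generated from the supplied input.
After writing that DTD to "djnml-1.0b.dtd", we need to create an identity transform using XSLT. You could use the newTransformer() method on TransformerFactory for this, but the result of that transform is not well specified. Using XSLT produces cleaner results. We will use this file as the identity transform: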
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" omit-xml-declaration="no"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Save the above XSLT file as "identity.xsl". Now that we have the DTD and the identity transform, we can transcode the file with the following code:
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
...
File inFile = new File("20121114000606JA.xml");
File outputFile = new File("test.xml");
final File dtdFile = new File("djnml-1.0b.dtd");
File identityFile = new File("identity.xsl");
final List<Closeable> closeables = new ArrayList<Closeable>();
try {
// We are going to use a SAXSource for input, so that we can specify the
// location of the DTD with an EntityResolver.
InputStream in = new FileInputStream(inFile);
closeables.add(in);
InputSource fileSource = new InputSource();
fileSource.setByteStream(in);
fileSource.setSystemId(inFile.toURI().toString());
SAXSource source = new SAXSource();
XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(new EntityResolver() {
public InputSource resolveEntity(String publicId, String systemId)
throws SAXException, IOException {
if (systemId != null && systemId.endsWith("/djnml-1.0b.dtd")) {
InputStream dtdIn = new FileInputStream(dtdFile);
closeables.add(dtdIn);
InputSource inputSource = new InputSource();
inputSource.setByteStream(dtdIn);
inputSource.setEncoding("UTF-8");
return inputSource;
}
return null;
}
});
source.setXMLReader(reader);
source.setInputSource(fileSource);
// Now we need to create a StreamResult.
OutputStream out = new FileOutputStream(outputFile);
closeables.add(out);
StreamResult result = new StreamResult();
result.setOutputStream(out);
result.setSystemId(outputFile);
// Create a templates object for the identity transform. If you are going
// to transform a lot of documents, you should do this once and
// reuse the Templates object.
InputStream identityIn = new FileInputStream(identityFile);
closeables.add(identityIn);
StreamSource identitySource = new StreamSource();
identitySource.setSystemId(identityFile);
identitySource.setInputStream(identityIn);
TransformerFactory factory = TransformerFactory.newInstance();
Templates templates = factory.newTemplates(identitySource);
// Finally we need to create the transformer and do the transformation.
Transformer transformer = templates.newTransformer();
transformer.transform(source, result);
} finally {
// Some older XML processors are bad at cleaning up input and output streams,
// so we will do this manually.
for (Closeable closeable : closeables) {
if (closeable != null) {
try {
closeable.close();
} catch (Exception e) {
}
}
}
}
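For comparison, the stylesheet-free variant mentioned above (newTransformer() on TransformerFactory) can be sketched as follows. This is a minimal sketch under assumptions, not a replacement for the code above: the class and method names are invented, it installs no EntityResolver, and it therefore only suits documents whose DTD is either absent or resolvable from the document's own location.

```java
import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class IdentityTranscode {
    /** Copies source to result unchanged, declaring and using the given output encoding. */
    static void transcode(Source source, Result result, String encoding)
            throws TransformerException {
        // newTransformer() without a stylesheet performs a plain copy of the input.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        transformer.transform(source, result);
    }

    public static void main(String[] args) throws TransformerException {
        if (args.length != 2) {
            System.err.println("usage: IdentityTranscode <in.xml> <out.xml>");
            return;
        }
        // With a File source, the parser takes the input encoding from the
        // XML declaration, so Shift-JIS does not have to be named here.
        transcode(new StreamSource(new File(args[0])),
                  new StreamResult(new File(args[1])), "UTF-8");
    }
}
```

Because the parser and serializer handle the encodings, the XML declaration in the output agrees with the bytes actually written, which plain text copying does not guarantee.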