Java: converting a large XML file from ISO-8859-1 to UTF-8 using entities from an external DTD

I have:

A 2.2 GB uncompressed XML file in ISO-8859-1

The corresponding DTD, which defines the named entities used in the XML

A machine that cannot hold the parsed XML in RAM

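I want to import the XML into Apache Solr, which is already set up and running. Solr/Java (rightly) complains about too many entity expansions. I can raise the allowed number by setting -DentityExpansionLimit=2000000 for the JVM, but then I would have to edit the importer so that it raises the limit with System::setProperty.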

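I tried xmllint:

xmllint --stream --loaddtd --encode utf8 --output dblp.utf8.xml dblp-2018-07-01.xml

Without --stream, the process is killed by the kernel because it tries to parse the whole structure into memory. With --stream, it does not write to the output file; I suspect it only validates the XML against the DTD.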

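Editing the XML, Python

I do not know how to load the DTD in Python and use it with the parser, so I pasted the entity definitions into the XML by hand and then ran:

import xml.etree.ElementTree
# Binary mode lets the parser apply the encoding declared in the XML itself
f = open('dblp-2018-07-01.xml', 'rb')
out = open('dblp.utf8.xml', 'wb')
xml.etree.ElementTree.parse(f).write(out, encoding='UTF-8')

This consumes about 11 GiB of memory and works for me, but it does not meet the requirements below.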

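Details

I would like others to be able to reproduce what I am doing, so what I am looking for is:

No manual editing of the XML to insert the entities
A script or compiled program that converts the encoding
As little memory use as possible, preferably staying below 6 GiB
As a bonus, reading and writing gzip files to save space, though this is not required

I would prefer Java for a programmatic solution (so that I can integrate the import process into Solr), but I am happy with any other solution (I would like to avoid JavaScript).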
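
For orientation, the memory and gzip requirements listed above can be met for the encoding change alone with a chunked copy. The following is only a minimal Python sketch under stated assumptions: the function name `transcode_gzip` and the `chunk_chars` parameter are made up for illustration, and the sketch neither expands DTD entities nor rewrites the XML declaration, so it is not a complete solution to the question.

```python
import gzip


def transcode_gzip(src_path, dst_path, chunk_chars=1 << 20):
    """Copy a gzip-compressed ISO-8859-1 text file to gzip-compressed UTF-8.

    Hypothetical helper: only the byte encoding changes. Entities and the
    XML declaration are left exactly as they are in the source.
    """
    # newline="" disables newline translation, so line endings pass through unchanged.
    with gzip.open(src_path, mode="rt", encoding="ISO-8859-1", newline="") as src, \
         gzip.open(dst_path, mode="wt", encoding="UTF-8", newline="") as dst:
        while True:
            # At most chunk_chars characters are held in memory at a time.
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
```

A call such as transcode_gzip('dblp-2018-08-01.xml.gz', 'out.xml.gz') would keep memory use roughly proportional to the chunk size rather than to the size of the file.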

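If you want to work with the XML yourself, the files are located at:

(with the latest DTD)
(for more information)
(for the license)

The gzip file is about 430 MiB and expands to 2.2 GiB of XML.

Thanks, everyone!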

I found a solution myself. It is a bit slow (~11-12 minutes), but I am fine with that:

import javax.xml.stream.*;
import java.io.*;
import java.util.zip.*;

public class ConvertToUtf8 {

  public static void main(String[] args) {
    System.setProperty("entityExpansionLimit", "10000000");
    XMLInputFactory inputFactory = XMLInputFactory.newFactory();
    XMLOutputFactory outputFactory = XMLOutputFactory.newFactory();

    try (
        FileInputStream ifs = new FileInputStream("dblp-2018-08-01.xml.gz"); 
        GZIPInputStream gzIn = new GZIPInputStream(ifs);
        FileOutputStream ofs = new FileOutputStream("dblp_utf8.xml.gz");
        GZIPOutputStream gzOut = new GZIPOutputStream(ofs, true);
        ) 
    {
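      XMLEventReader inEvt = inputFactory.createXMLEventReader(gzIn);
      XMLEventWriter outEvt = outputFactory.createXMLEventWriter(gzOut, "UTF-8");
      // Copy all events from the reader to the writer
      outEvt.add(inEvt);
      // Flush buffered events before try-with-resources closes the streams
      outEvt.flush();
      outEvt.close();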
    } catch (IOException | XMLStreamException e) {
      e.printStackTrace();
    }
  }
}

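Using GZIP for both input and output speeds the process up significantly (6x faster on my machine), because reading from disk is otherwise the bottleneck for the rest of the system. If you want to reproduce this, make sure the DTD is in the working directory, otherwise the entities will not be replaced. Java will then insert a comment into the XML saying that it could not find the DTD.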
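
Since a missing DTD leaves the entities unreplaced, it may be worth checking the converted file before importing it. The following is a rough Python sketch of such a check, not part of the solution above: the function name `unresolved_entities`, the `limit` parameter, and the idea of scanning with a regular expression are assumptions made for illustration. It also fails with a UnicodeDecodeError if the file is not valid UTF-8.

```python
import gzip
import re

# The five entities that XML predefines; these are expected to remain in the output.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY_RE = re.compile(r"&(\w+);")


def unresolved_entities(path, limit=10):
    """Return up to `limit` named entity references found in a gzip UTF-8 file.

    Numeric references such as &#246; are ignored because '#' is not matched by \\w.
    """
    found = []
    with gzip.open(path, mode="rt", encoding="UTF-8", newline="") as f:
        for line in f:
            for name in NAMED_ENTITY_RE.findall(line):
                if name not in PREDEFINED:
                    found.append(name)
                    if len(found) >= limit:
                        return found
    return found
```

An empty list for the converted file would suggest that the DTD was found and the named entities were replaced.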

Based on @janbrohl's answer:

#! python3
import re
import gzip
from lxml import etree

# read the DTD with the lxml parser
dtd = etree.DTD('dblp-2017-08-29.dtd')
# build a dict with it for lookup
replacements = {x.name: x.content for x in dtd.entities()}

# raw string, so that \w is not treated as a string escape
entity_re=re.compile(r'&(\w+);')

def resolve_entity(m):
    """
    This will replace the defined entities with their expansions from the DTD:
    '&Ouml;' will be replaced with '&#214;'.
    The entities that are already escaped with '&#[0-9]+;' should not be expanded,
    Ex: if some of the escapes produced the character '<', the XML would no longer be well formed.

    If the matched entity is not in the replacements, use the match as default
    """
    return replacements.get(m.group(1),f'&{m.group(1)};')

def expand_line(line):
    return entity_re.sub(resolve_entity,line)

def recode_file(src,dst):
    with gzip.open(src,mode='rt', encoding='ISO-8859-1', newline='\n') as src_file:
        # discard first line with wrong encoding
        print('discard: ' + src_file.readline())  
        with gzip.open(dst, mode='wt', encoding='UTF-8', newline='\n') as dst_file:
            # replace with correct encoding statement
            dst_file.write('<?xml version="1.0"?>\n')  
            for line in src_file:
                dst_file.write(expand_line(line))

recode_file('dblp-2018-08-01.xml.gz','dblp_recode.xml.gz')
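
If lxml is not available, the lookup table built from the DTD above could be approximated with the standard library. The following is a minimal sketch, assuming that the DTD declares its general entities in the simple form `<!ENTITY name "value">` with double-quoted values; the names `ENTITY_DECL_RE` and `read_entities` are hypothetical, and parameter entities or single-quoted values are deliberately not handled.

```python
import re

# Matches declarations such as: <!ENTITY Ouml "&#214;" >
# Parameter entities (<!ENTITY % name ...>) do not match because '%' is not a word character.
ENTITY_DECL_RE = re.compile(r'<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>')


def read_entities(dtd_text):
    """Return a dict mapping entity names to their declared replacement text."""
    return {name: value for name, value in ENTITY_DECL_RE.findall(dtd_text)}


sample_dtd = '''
<!ENTITY Ouml "&#214;" ><!-- capital O, dieresis -->
<!ENTITY auml   "&#228;">
'''
replacements = read_entities(sample_dtd)
```

The resulting dict has the same shape as the `replacements` lookup used in the script above, so the same `resolve_entity` logic could work with it.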

Runtime: about 3 minutes on my machine.
#! python3
import re
import gzip
import html.entities

entities={k:v for k,v in html.entities.entitydefs.items() if v not in "&'\"<>"}

entity_re=re.compile("&([^;]+);")    

def resolve_entity(m):
    try:
        return entities[m.group(1)]
    except KeyError:
        return m.group(0)    

def expand_line(line):
    return entity_re.sub(resolve_entity,line)

def recode_file(src,dst):
    with gzip.open(src,mode="rt", encoding="ISO-8859-1", newline="\n") as src_file:
        with gzip.open(dst, mode="wt", encoding="UTF-8", newline="\n") as dst_file:
            first_line=src_file.readline()
            recoded_first_line=first_line.replace("ISO-8859-1","UTF-8")
            if first_line==recoded_first_line:
                raise ValueError("Source file seems to not be encoded in ISO-8859-1") 
            dst_file.write(recoded_first_line)
            for line in src_file:
                dst_file.write(expand_line(line))


recode_file("D:/Downloads/dblp-2018-08-01.xml.gz","D:/Downloads/dblp.xml.gz")
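
To illustrate why the script above filters out the characters in "&'\"<>", here is a small self-contained sketch that applies the same lookup and pattern to a made-up sample string and then parses the result. The sample element and the helper name `expand` are assumptions for illustration only.

```python
import html.entities
import re
import xml.etree.ElementTree as ET

# Same idea as above: drop entities whose expansion is an XML-special character,
# so that &amp;, &lt;, &gt; and &quot; stay escaped in the output.
XML_SPECIAL = "&'\"<>"
entities = {k: v for k, v in html.entities.entitydefs.items() if v not in XML_SPECIAL}
entity_re = re.compile("&([^;]+);")


def expand(text):
    """Replace known named entities; leave everything else untouched."""
    return entity_re.sub(lambda m: entities.get(m.group(1), m.group(0)), text)


sample = "<title>M&uuml;ller &amp; S&oslash;n &lt;eds&gt;</title>"
expanded = expand(sample)
# The expanded text is still well-formed, so a parser can read it.
root = ET.fromstring(expanded)
```

Here `expanded` keeps `&amp;`, `&lt;` and `&gt;` escaped, while `&uuml;` and `&oslash;` become literal characters. Had `&lt;` been expanded too, the text would no longer be well-formed XML.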

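Have you tried reading, decoding to unicode, replacing the entities with a regular expression, encoding and writing?

I thought about reading and writing line by line, but I did not try it, mainly because I was not sure I would get the decoding and encoding right. XML is very picky and I did not want to deal with that. I am also not sure whether using regular expressions is "the right thing" here; I have a hunch that it is not.

If you use the "\n" newline as the line terminator and write it back unchanged, there is no reason to be afraid of lines, although the lines may be quite long. If the entity expansion itself is implemented correctly, the parsed result will stay the same. Using a full-blown XML parser is the safe choice.

I have proposed some edits so that the entities defined in the DTD are replaced as well. Otherwise &ouml; and friends would remain. I also replaced the first line so that it declares the correct encoding; XML readers enforce the declared encoding. I will try importing the resulting XML into Solr to see whether there are any differences. I will report back.

I had completely misread the DTD before. Numbered entities, such as ... and #100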
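
The concern about very long lines raised in these comments could be addressed by working on fixed-size chunks instead of lines. The catch is that an entity reference may be split across two chunks. The following is a minimal sketch of one way to handle that, assuming well-formed input in which every "&" is eventually followed by ";"; the function name `expand_chunks` and the sample `replacements` dict are hypothetical.

```python
import re

ENTITY_RE = re.compile(r"&(\w+);")


def expand_chunks(chunks, replacements):
    """Yield text chunks with named entities expanded.

    A trailing '&...' without a closing ';' is held back and prepended to
    the next chunk, so a reference split across chunks is still matched.
    """
    carry = ""
    for chunk in chunks:
        text = carry + chunk
        cut = text.rfind("&")
        if cut != -1 and ";" not in text[cut:]:
            # Possibly incomplete reference: keep it for the next round.
            text, carry = text[:cut], text[cut:]
        else:
            carry = ""
        # Unknown names are written back unchanged.
        yield ENTITY_RE.sub(lambda m: replacements.get(m.group(1), m.group(0)), text)
    if carry:
        # Leftover text at end of input is passed through as-is.
        yield carry


pieces = ["K&ou", "ml;ln &#60; &amp", ";"]
result = "".join(expand_chunks(pieces, {"ouml": "&#246;"}))
```

In this example the reference `&ouml;` is split between the first two chunks and is still replaced, while `&#60;` and `&amp;` are left untouched.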