C#: How can I write a "filter" stream wrapper for XML?
I have some large XML feed files that contain illegal characters (0x1 and so on). The files come from a third party, and I cannot change the process that writes them.

I would like to process these files with an XmlReader, but it blows up on these illegal characters.
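I could read the files, filter out the bad characters, save them, and then process them, but that is a lot of I/O and it seems unnecessary.

What I would like to do is something like this: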
using (var origStream = File.OpenRead(fileName))
using (var cleanStream = new CleansedXmlStream(origStream))
using (var streamReader = new StreamReader(cleanStream))
using (var xmlReader = XmlReader.Create(streamReader))
{
    // do stuff with reader
}
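I tried inheriting from Stream, but when I started implementing Read(byte[] buffer, int offset, int count) I lost some confidence. After all, I plan to remove characters, so the count would seemingly be off; I would have to translate each byte to a char, which seems expensive (especially on large files); and I am not clear on how this would work with a Unicode encoding. The answers to these questions are not obvious to me.

Googling "c# stream wrapper" or "c# filter stream" has not given me satisfactory results. It's possible I'm using the wrong words or describing the wrong concept, so I'm hoping the SO community can set me straight.

Using the example above, what would CleansedXmlStream look like?

Here is my first attempt: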
public class CleansedXmlStream : Stream
{
    private readonly Stream _baseStream;

    public CleansedXmlStream(Stream stream)
    {
        this._baseStream = stream;
    }

    // Override Dispose(bool) rather than hiding Dispose() with "new",
    // so that disposal through a Stream reference also reaches the wrapped stream.
    protected override void Dispose(bool disposing)
    {
        if (disposing && this._baseStream != null)
        {
            this._baseStream.Dispose();
        }
        base.Dispose(disposing);
    }

    public override bool CanRead
    {
        get { return this._baseStream.CanRead; }
    }

    public override bool CanSeek
    {
        get { return this._baseStream.CanSeek; }
    }

    public override bool CanWrite
    {
        get { return this._baseStream.CanWrite; }
    }

    public override long Length
    {
        get { return this._baseStream.Length; }
    }

    public override long Position
    {
        get { return this._baseStream.Position; }
        set { this._baseStream.Position = value; }
    }

    public override void Flush()
    {
        this._baseStream.Flush();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        //what does this look like?
        throw new NotImplementedException();
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        return this._baseStream.Seek(offset, origin);
    }

    public override void SetLength(long value)
    {
        this._baseStream.SetLength(value);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }
}
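One way the unimplemented Read above might be filled in, under a narrow assumption: if the feed is UTF-8 (or another ASCII-compatible encoding), every byte below 0x80 is a complete character on its own, so illegal control bytes can be overwritten with a space without decoding anything. Because each byte is replaced rather than removed, the count returned by the inner stream stays valid. The class name ControlByteFilterStream is hypothetical, the sketch assumes C# 7 or later, and it does not catch U+FFFE, U+FFFF, or anything in a UTF-16 file.

```csharp
using System;
using System.IO;

// Hypothetical read-only wrapper; assumes UTF-8 or another ASCII-compatible encoding.
public sealed class ControlByteFilterStream : Stream
{
    private readonly Stream _inner;

    public ControlByteFilterStream(Stream inner)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();

    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int read = _inner.Read(buffer, offset, count);
        for (int i = offset; i < offset + read; i++)
        {
            byte b = buffer[i];
            // Below 0x20, XML 1.0 allows only tab, line feed and carriage return.
            if (b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D)
            {
                buffer[i] = 0x20; // replace, do not remove, so the byte count is unchanged
            }
        }
        return read;
    }

    public override void Flush() { }

    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();

    public override void SetLength(long value) => throw new NotSupportedException();

    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _inner.Dispose();
        }
        base.Dispose(disposing);
    }
}
```

It would take the place of CleansedXmlStream in the pipeline shown earlier, ahead of the StreamReader.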
Inspired by @CharlesMager's comment, I ended up not making a Stream but rather a StreamReader, like so:
// Requires: using System; using System.IO; using System.Text; using System.Xml;
// ILog and LogManager come from the logging library in use, which the source does not name.
public class CleanTextReader : StreamReader
{
    private readonly ILog _logger;

    public CleanTextReader(Stream stream, ILog logger) : base(stream)
    {
        this._logger = logger;
    }

    public CleanTextReader(Stream stream) : this(stream, LogManager.GetLogger<CleanTextReader>())
    {
        //nothing to do here.
    }

    /// <summary>
    /// Reads a specified maximum of characters from the current stream into a buffer, beginning at the specified index.
    /// </summary>
    /// <returns>
    /// The number of characters that have been read, or 0 if at the end of the stream and no data was read. The number
    /// will be less than or equal to the <paramref name="count" /> parameter, depending on whether the data is available
    /// within the stream.
    /// </returns>
    /// <param name="buffer">
    /// When this method returns, contains the specified character array with the values between
    /// <paramref name="index" /> and (<paramref name="index" /> + <paramref name="count" /> - 1) replaced by the
    /// characters read from the current source.
    /// </param>
    /// <param name="index">The index of <paramref name="buffer" /> at which to begin writing. </param>
    /// <param name="count">The maximum number of characters to read. </param>
    /// <exception cref="T:System.ArgumentException">
    /// The buffer length minus <paramref name="index" /> is less than
    /// <paramref name="count" />.
    /// </exception>
    /// <exception cref="T:System.ArgumentNullException"><paramref name="buffer" /> is null. </exception>
    /// <exception cref="T:System.ArgumentOutOfRangeException">
    /// <paramref name="index" /> or <paramref name="count" /> is
    /// negative.
    /// </exception>
    /// <exception cref="T:System.IO.IOException">An I/O error occurs, such as the stream is closed. </exception>
    public override int Read(char[] buffer, int index, int count)
    {
        try
        {
            var rVal = base.Read(buffer, index, count);
            // Filter only the characters that were just read, in place.
            // (Buffer.BlockCopy counts bytes, not chars, and the old copy ignored index.)
            for (var i = index; i < index + rVal; i++)
            {
                if (!XmlConvert.IsXmlChar(buffer[i]))
                {
                    buffer[i] = ' ';
                }
            }
            return rVal;
        }
        catch (Exception ex)
        {
            this._logger.Error("Read(char[], int, int)", ex);
            throw;
        }
    }

    /// <summary>
    /// Reads a maximum of <paramref name="count" /> characters from the current stream, and writes the data to
    /// <paramref name="buffer" />, beginning at <paramref name="index" />.
    /// </summary>
    /// <returns>
    /// The position of the underlying stream is advanced by the number of characters that were read into
    /// <paramref name="buffer" />. The number of characters that have been read. The number will be less than or equal to
    /// <paramref name="count" />, depending on whether all input characters have been read.
    /// </returns>
    /// <param name="buffer">
    /// When this method returns, this parameter contains the specified character array with the values
    /// between <paramref name="index" /> and (<paramref name="index" /> + <paramref name="count" /> - 1) replaced by the
    /// characters read from the current source.
    /// </param>
    /// <param name="index">The position in <paramref name="buffer" /> at which to begin writing.</param>
    /// <param name="count">The maximum number of characters to read. </param>
    /// <exception cref="T:System.ArgumentNullException"><paramref name="buffer" /> is null. </exception>
    /// <exception cref="T:System.ArgumentException">
    /// The buffer length minus <paramref name="index" /> is less than
    /// <paramref name="count" />.
    /// </exception>
    /// <exception cref="T:System.ArgumentOutOfRangeException">
    /// <paramref name="index" /> or <paramref name="count" /> is
    /// negative.
    /// </exception>
    /// <exception cref="T:System.ObjectDisposedException">The <see cref="T:System.IO.TextReader" /> is closed. </exception>
    /// <exception cref="T:System.IO.IOException">An I/O error occurs. </exception>
    public override int ReadBlock(char[] buffer, int index, int count)
    {
        try
        {
            var rVal = base.ReadBlock(buffer, index, count);
            // Same in-place filtering as in Read, limited to the characters actually read.
            for (var i = index; i < index + rVal; i++)
            {
                if (!XmlConvert.IsXmlChar(buffer[i]))
                {
                    buffer[i] = ' ';
                }
            }
            return rVal;
        }
        catch (Exception ex)
        {
            this._logger.Error("ReadBlock(char[], int, int)", ex);
            throw;
        }
    }

    /// <summary>
    /// Reads the stream from the current position to the end of the stream.
    /// </summary>
    /// <returns>
    /// The rest of the stream as a string, from the current position to the end. If the current position is at the end of
    /// the stream, returns an empty string ("").
    /// </returns>
    /// <exception cref="T:System.OutOfMemoryException">
    /// There is insufficient memory to allocate a buffer for the returned
    /// string.
    /// </exception>
    /// <exception cref="T:System.IO.IOException">An I/O error occurs. </exception>
    public override string ReadToEnd()
    {
        // Route everything through the filtering Read override above.
        var chars = new char[4096];
        int len;
        var sb = new StringBuilder(4096);
        while ((len = Read(chars, 0, chars.Length)) != 0)
        {
            sb.Append(chars, 0, len);
        }
        return sb.ToString();
    }
}
The unit test and the usage look like this:
[TestMethod]
public void CleanTextReaderCleans()
{
    //arrange
    var originalString = "The quick brown fox jumped over the lazy dog.";
    var badChars = new string(new[] {(char) 0x1});
    var concatenated = string.Concat(badChars, originalString);

    //act
    using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(concatenated)))
    {
        using (var reader = new CleanTextReader(stream))
        {
            var newString = reader.ReadToEnd().Trim();

            //assert
            Assert.IsTrue(originalString.Equals(newString));
        }
    }
}

using (var origStream = File.OpenRead(fileName))
using (var streamReader = new CleanTextReader(origStream))
using (var xmlReader = XmlReader.Create(streamReader))
{
    // do stuff with reader
}
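If anyone has suggestions for improvement, I would be happy to hear them.

I tried @JeremyHolovacs's stream implementation, but it was still not sufficient for my use case: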
using (var fstream = File.OpenRead(dlpath))
{
    using (var zstream = new GZipStream(fstream, CompressionMode.Decompress))
    {
        using (var xstream = new CleanTextReader(zstream))
        {
            var ser = new XmlSerializer(typeof(MyType));
            prods = ser.Deserialize(XmlReader.Create(xstream, new XmlReaderSettings() { CheckCharacters = false })) as MyType;
        }
    }
}
Somehow, not all of the relevant overloads had been implemented.

I adjusted the class as follows, and it works as expected:
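public class CleanTextReader : StreamReader
{
    public CleanTextReader(Stream stream) : base(stream)
    {
    }

    public override int Read()
    {
        var val = base.Read();
        // Pass the end-of-stream marker (-1) through unchanged; otherwise it would be
        // cast to U+FFFF, fail the check, and come back as a space forever.
        return (val < 0 || XmlConvert.IsXmlChar((char)val)) ? val : ' ';
    }

    public override int Read(char[] buffer, int index, int count)
    {
        var ret = base.Read(buffer, index, count);
        for (int i = 0; i …

0x01 is SOH. The stream class defaults to ASCII encoding. I would set your stream class to UTF8. Try the following: StreamReader stream = new StreamReader(filename, Encoding.UTF8);

@jdweng according to the documentation, new StreamReader(Stream) defaults to UTF8, so this makes no difference.

Perhaps you need to work at a higher level of abstraction. A Stream is binary data, whereas the invalid characters are a result of decoding that binary data. Perhaps you need a decorating TextReader rather than a decorating Stream?

@CharlesMager perhaps. I'll look into that idea.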
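The adjusted CleanTextReader in the second answer breaks off inside its for loop. A minimal sketch of what a complete version could look like follows. The loop body is an assumption (it replaces each invalid character that was just read with a space, as the first answer does), not the answerer's recovered code. The end-of-stream check in Read() keeps the -1 marker from being turned into a space.

```csharp
using System.IO;
using System.Xml;

// Sketch only: the loop body below is assumed, since the original listing is truncated.
public class CleanTextReader : StreamReader
{
    public CleanTextReader(Stream stream) : base(stream)
    {
    }

    public override int Read()
    {
        int val = base.Read();
        if (val < 0)
        {
            return val; // end of stream: keep the -1 marker
        }
        return XmlConvert.IsXmlChar((char)val) ? val : ' ';
    }

    public override int Read(char[] buffer, int index, int count)
    {
        int ret = base.Read(buffer, index, count);
        // Only the range [index, index + ret) holds freshly read characters.
        for (int i = index; i < index + ret; i++)
        {
            if (!XmlConvert.IsXmlChar(buffer[i]))
            {
                buffer[i] = ' ';
            }
        }
        return ret;
    }
}
```

Characters outside the Basic Multilingual Plane arrive as surrogate pairs, so it is worth checking how XmlConvert.IsXmlChar treats the individual halves before relying on this for feeds that contain them.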