C# - Reading from a StreamReader in batches
While trying to load an 800MB text file into a DataTable via StreamReader, I hit an OutOfMemory exception. I'm wondering if there is a way to load the DataTable from the stream in batches, i.e. read the first 10,000 lines of the text file from the StreamReader, create the DataTable, do something with it, then load the next 10,000 lines, and so on. My Googling hasn't been much help here, but it seems like there should be a simple way to do this. Ultimately I will be writing the DataTable to an MS SQL db using SqlBulkCopy, so if there is an easier approach than what I've described, I'd appreciate a quick pointer in the right direction.

Edit - here is the code I'm running:
public static DataTable PopulateDataTableFromText(DataTable dt, string txtSource)
{
    StreamReader sr = new StreamReader(txtSource);
    DataRow dr;
    int dtCount = dt.Columns.Count;
    string input;
    int i = 0;
    while ((input = sr.ReadLine()) != null)
    {
        try
        {
            string[] stringRows = input.Split(new char[] { '\t' });
            dr = dt.NewRow();
            for (int a = 0; a < dtCount; a++)
            {
                string dataType = dt.Columns[a].DataType.ToString();
                if (stringRows[a] == "" && (dataType == "System.Int32" || dataType == "System.Int64"))
                {
                    stringRows[a] = "0";
                }
                dr[a] = Convert.ChangeType(stringRows[a], dt.Columns[a].DataType);
            }
            dt.Rows.Add(dr);
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
        i++;
    }
    return dt;
}
And here is the error that is returned:

"System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.String.Split(Char[] separator, Int32 count, StringSplitOptions options)
at System.String.Split(Char[] separator)
at Harvester.Config.PopulateDataTableFromText(DataTable dt, String txtSource) in C:..."

Regarding the suggestions to load the data directly into SQL - I'm a little lost when it comes to C#, but isn't that essentially what I'm doing? SqlBulkCopy.WriteToServer takes the DataTable I created from the text file and imports it into SQL. Is there an easier way to do this that I'm missing?
Edit: Oh, I forgot to mention - this code won't run on the same server as the SQL Server. The data text file is on Server B and needs to be written to a table on Server A. Does that rule out using bcp?

Have you considered loading the data directly into SQL Server and then manipulating it in the database? The database engine is already designed to perform manipulation of large volumes of data in an efficient manner. This may yield better results overall, and allows you to leverage the capabilities of the database and the SQL language to do the heavy lifting. It's the old "work smarter, not harder" principle. There are a number of different ways to load data into SQL Server, so you may want to check those out to see if any are a good fit. If you are using SQL Server 2005 or later and you really do need to do some manipulation on the data in C#, you can always do that work inside the database via a CLR stored procedure (SQL Server 2005 introduced CLR integration).

Something worth noting here is that the OutOfMemoryException is a bit misleading. What you're really running out of is addressable memory. That's a very different thing.
When you load a large file into memory and transform it into a DataTable, it can easily take far more than 800MB to represent the same data. Since 32-bit .NET processes are limited to just under 2GB of addressable memory, you may never be able to handle this quantity of data in a single batch. What you probably need to do is work with the data in a streaming fashion. In other words, don't try to load it all into a DataTable and then bulk insert into SQL Server. Rather, process the file in chunks, clearing out the prior set of rows once you're done with them.
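A minimal sketch of that chunked approach might look like the following. To be clear, `ProcessFileInBatches`, `processBatch`, and the batch size are all placeholders of my own, not part of the original answer; the tab-splitting mirrors the question's code:

```csharp
using System;
using System.Data;
using System.IO;

public static class BatchedLoader
{
    // Sketch only: read the file in batches of `batchSize` rows, hand each
    // full DataTable to `processBatch` (e.g. a SqlBulkCopy call), then clear
    // the rows so memory use stays bounded. `dt` must already have its
    // columns defined.
    public static void ProcessFileInBatches(
        DataTable dt, string txtSource, int batchSize, Action<DataTable> processBatch)
    {
        using (StreamReader sr = new StreamReader(txtSource))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                // Rows.Add(params object[]) converts each field to the column type.
                dt.Rows.Add(line.Split('\t'));
                if (dt.Rows.Count >= batchSize)
                {
                    processBatch(dt);  // e.g. sbc.WriteToServer(dt)
                    dt.Clear();        // release the processed rows
                }
            }
            if (dt.Rows.Count > 0)
                processBatch(dt);      // flush the final partial batch
        }
    }
}
```

Calling it with `processBatch = t => sqlBulkCopy.WriteToServer(t)` would give the batch-by-batch bulk copy the question describes, while only ever holding one batch in memory.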
Now, if you have access to a 64-bit machine with lots of memory (to avoid VM thrashing), and a copy of the 64-bit .NET runtime, you could probably get away with running the code unchanged. But I would suggest making the necessary changes anyway, since it will likely improve the performance of this operation even in that environment.

SqlBulkCopy.WriteToServer has an overload that accepts an IDataReader. You could implement your own IDataReader as a wrapper around the StreamReader, where the Read() method would consume a line from the StreamReader. That way the data would "stream" into the database rather than first trying to build it up in memory as a DataTable.
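A sketch of that IDataReader wrapper, assuming tab-delimited input. Only the members SqlBulkCopy needs for a positional (ordinal-mapped) copy - Read, FieldCount, GetValue - are really implemented; the rest are stubbed, and the class and field names are my own invention:

```csharp
using System;
using System.Data;
using System.IO;

// Sketch: stream a tab-delimited file into SqlBulkCopy.WriteToServer(IDataReader)
// one line at a time, so no DataTable is ever built up in memory.
public class TabDelimitedDataReader : IDataReader
{
    private readonly StreamReader _reader;
    private readonly int _fieldCount;   // must match the target table's column count
    private string[] _current;
    private bool _closed;

    public TabDelimitedDataReader(string path, int fieldCount)
    {
        _reader = new StreamReader(path);
        _fieldCount = fieldCount;
    }

    // Advance to the next line; SqlBulkCopy calls this until it returns false.
    public bool Read()
    {
        string line = _reader.ReadLine();
        if (line == null) return false;
        _current = line.Split('\t');
        return true;
    }

    public int FieldCount { get { return _fieldCount; } }

    // Map empty strings to DBNull so numeric columns don't fail conversion.
    public object GetValue(int i)
    {
        string s = _current[i];
        return s.Length == 0 ? (object)DBNull.Value : s;
    }

    public void Close() { _closed = true; _reader.Close(); }
    public bool IsClosed { get { return _closed; } }
    public void Dispose() { Close(); }

    // The members below aren't needed for a simple ordinal-mapped bulk copy;
    // stub them out rather than guess at behavior.
    public int Depth { get { return 0; } }
    public int RecordsAffected { get { return -1; } }
    public bool NextResult() { return false; }
    public DataTable GetSchemaTable() { throw new NotSupportedException(); }
    public string GetName(int i) { throw new NotSupportedException(); }
    public int GetOrdinal(string name) { throw new NotSupportedException(); }
    public string GetDataTypeName(int i) { throw new NotSupportedException(); }
    public Type GetFieldType(int i) { return typeof(string); }
    public int GetValues(object[] values) { throw new NotSupportedException(); }
    public bool IsDBNull(int i) { return GetValue(i) == DBNull.Value; }
    public object this[int i] { get { return GetValue(i); } }
    public object this[string name] { get { throw new NotSupportedException(); } }
    public bool GetBoolean(int i) { throw new NotSupportedException(); }
    public byte GetByte(int i) { throw new NotSupportedException(); }
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) { throw new NotSupportedException(); }
    public char GetChar(int i) { throw new NotSupportedException(); }
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) { throw new NotSupportedException(); }
    public IDataReader GetData(int i) { throw new NotSupportedException(); }
    public DateTime GetDateTime(int i) { throw new NotSupportedException(); }
    public decimal GetDecimal(int i) { throw new NotSupportedException(); }
    public double GetDouble(int i) { throw new NotSupportedException(); }
    public float GetFloat(int i) { throw new NotSupportedException(); }
    public Guid GetGuid(int i) { throw new NotSupportedException(); }
    public short GetInt16(int i) { throw new NotSupportedException(); }
    public int GetInt32(int i) { throw new NotSupportedException(); }
    public long GetInt64(int i) { throw new NotSupportedException(); }
    public string GetString(int i) { return (string)GetValue(i); }
}
```

Then `bulkCopy.WriteToServer(new TabDelimitedDataReader(path, columnCount))` streams the file without ever holding more than one line in memory.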
Hope that helps.

Do you really need to process the data in batches of rows? Or could you process it row by row? In the latter case, I think Linq could be very helpful here, because it makes it easy to stream data across a "pipeline" of methods. That way you don't need to load a lot of data at once, only one row at a time.

First, you need to make your StreamReader enumerable. This is easily done with an extension method:
public static class TextReaderExtensions
{
    public static IEnumerable<string> Lines(this TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
Using that, you can easily project each line of the file to a DataRow, and do whatever you need with it:
using (var reader = new StreamReader(fileName))
{
    var rows = reader.Lines().Select(ParseDataRow);
    foreach (DataRow row in rows)
    {
        // Do something with the DataRow
    }
}
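`ParseDataRow` is left undefined in the snippet above; a possible implementation (hypothetical - it reuses the tab-split and blank-to-zero logic from the question's code, and takes the DataTable as an extra parameter for the schema) might be:

```csharp
using System;
using System.Data;

public static class RowParser
{
    // Hypothetical ParseDataRow: split a tab-delimited line and convert each
    // field to the corresponding column's type, treating blank integer fields
    // as 0 (mirroring the question's original conversion logic).
    public static DataRow ParseDataRow(string line, DataTable table)
    {
        string[] fields = line.Split('\t');
        DataRow row = table.NewRow();
        for (int i = 0; i < table.Columns.Count; i++)
        {
            Type colType = table.Columns[i].DataType;
            string value = fields[i];
            if (value.Length == 0 && (colType == typeof(int) || colType == typeof(long)))
                value = "0";
            row[i] = Convert.ChangeType(value, colType);
        }
        return row;
    }
}
```

With the extra schema parameter, the projection becomes `reader.Lines().Select(line => ParseDataRow(line, dt))`.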
(Note that you could do something similar with a simple loop, without using Linq, but I think Linq makes the code more readable...)

As an update to the other answers, I was looking into this as well, and came across a few things.

The key to the code is in this loop:
//Of note: it's faster to read all the lines we are going to act on and
//then process them in parallel instead of reading and processing line by line.
//Code source: http://cc.davelozinski.com/code/c-sharp-code/read-lines-in-batches-process-in-parallel
while (blnFileHasMoreLines)
{
    batchStartTime = DateTime.Now;  //Reset the timer

    //Read in all the lines up to the BatchCopy size or
    //until there's no more lines in the file
    while (intLineReadCounter < BatchSize && !tfp.EndOfData)
    {
        CurrentLines[intLineReadCounter] = tfp.ReadFields();
        intLineReadCounter += 1;
        BatchCount += 1;
        RecordCount += 1;
    }

    batchEndTime = DateTime.Now;  //record the end time of the current batch
    batchTimeSpan = batchEndTime - batchStartTime;  //get the timespan for stats

    //Now process each line in parallel.
    Parallel.For(0, intLineReadCounter, x =>
    //for (int x = 0; x < intLineReadCounter; x++) //Or the slower single-threaded version for debugging
    {
        List<object> values = null;  //so each thread gets its own copy.

        if (tfp.TextFieldType == FieldType.Delimited)
        {
            if (CurrentLines[x].Length != CurrentRecords.Columns.Count)
            {
                //Do what you need to if the number of columns in the current line
                //don't match the number of expected columns
                return;  //stop now and don't add this record to the current collection of valid records.
            }

            //Number of columns match so copy over the values into the datatable
            //for later upload into a database
            values = new List<object>(CurrentRecords.Columns.Count);
            for (int i = 0; i < CurrentLines[x].Length; i++)
                values.Add(CurrentLines[x][i].ToString());

            //OR do your own custom processing here if not using a database.
        }
        else if (tfp.TextFieldType == FieldType.FixedWidth)
        {
            //Implement your own processing if the file columns are fixed width.
        }

        if (values == null)
            return;  //nothing produced for this line (e.g. fixed-width path not implemented)

        //Now lock the data table before saving the results so there's no thread bashing on the datatable
        lock (oSyncLock)
        {
            CurrentRecords.LoadDataRow(values.ToArray(), true);
        }

        values.Clear();
    });  //Parallel.For

    //If you're not using a database, you obviously won't need this next piece of code.
    if (BatchCount >= BatchSize)
    {
        //Do the SQL bulk copy and save the info into the database
        sbc.BatchSize = CurrentRecords.Rows.Count;
        sbc.WriteToServer(CurrentRecords);

        BatchCount = 0;  //Reset these values
        CurrentRecords.Clear();  // "
    }

    if (CurrentLines[intLineReadCounter] == null)
        blnFileHasMoreLines = false;  //we're all done, so signal while loop to stop

    intLineReadCounter = 0;  //reset for next pass
    Array.Clear(CurrentLines, 0, CurrentLines.Length);
}  //while blnFileHasMoreLines
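The loop relies on a number of variables declared earlier in the linked article; roughly, the setup looks like the following sketch. The file path, connection string, and table name are placeholders of my own, and the exact declarations in the original article may differ:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using Microsoft.VisualBasic.FileIO;  // TextFieldParser (reference Microsoft.VisualBasic.dll)

int BatchSize = 10000;                           // rows per bulk-copy batch
// One extra slot so CurrentLines[intLineReadCounter] is a valid (null) index
// when a full batch of BatchSize lines has been read.
string[][] CurrentLines = new string[BatchSize + 1][];
DataTable CurrentRecords = new DataTable();      // define columns to match the file here
object oSyncLock = new object();                 // guards the DataTable inside Parallel.For
int intLineReadCounter = 0, BatchCount = 0, RecordCount = 0;
bool blnFileHasMoreLines = true;
DateTime batchStartTime, batchEndTime;
TimeSpan batchTimeSpan;

TextFieldParser tfp = new TextFieldParser(@"C:\data\input.txt");  // placeholder path
tfp.TextFieldType = FieldType.Delimited;
tfp.SetDelimiters("\t");

SqlBulkCopy sbc = new SqlBulkCopy("Server=A;Database=db;Integrated Security=true");  // placeholder
sbc.DestinationTableName = "dbo.TargetTable";  // placeholder
```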