C#: Splitting a large text file (5 million records) into smaller files in parallel using threads


c# multithreading parallel-processing


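I have a large text file containing 5 million records (5 columns and 5 million rows). (Screenshot of the file omitted.)

To split it, I used threads. I created 10 threads to split the large file, and while reading the large file I store the lines in a string array. The code is shown below: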

using System.IO;
using System.Threading;

class Program
{
    const string sourceFileName = @"C:\Users\Public\TestFolder\ThreadingExp\NewMarketData.txt";
    const string destinationFileName = @"C:\Users\Public\TestFolder\ThreadingExp\NewMarketData-Part-{0}.txt";

    static void Main(string[] args)
    {
        int[] index = new int[20];
        index[0] = 0;
        for(int i=1;i<11;i++)
        {
            index[i] = index[i-1]+500000;
        }

        //Reading Part: the whole file is loaded into memory before any writing starts
        string[] ListLines = new string[5000000];
        using (var sourceFile = new StreamReader(sourceFileName))
        {
            for (int i = 0; i < ListLines.Length; i++)
            {
                ListLines[i] = sourceFile.ReadLine();
            }
        }

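        //Creating array of threads
        Thread[] ArrayofThreads = new Thread[10];
        for (int i = 0; i < ArrayofThreads.Length; i++)
        {
            // Copy the loop variable: the lambda captures the variable itself, not its
            // value, so using i directly could hand a thread the wrong index range.
            int part = i;
            ArrayofThreads[i] = new Thread(() => Writing(ListLines, index[part], index[part + 1]));
            ArrayofThreads[i].Start();
        }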

        for (int i = 0; i < ArrayofThreads.Length; i++)
        {
            ArrayofThreads[i].Join();
        }
    }
    static void Writing(string[] array, int a, int b)
    {
        //Getting the thread number
        int id= Thread.CurrentThread.ManagedThreadId;

        var destinationFile = new StreamWriter(string.Format(destinationFileName,id));

        string line;
        for (int i = a; i< b;i++ )
        {
            line = array[i];
            destinationFile.WriteLine(line);
        }

        destinationFile.Close();         
    }

}



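The code works fine. Writing to the different files happens in parallel. For reading, however, I store the entire content in an array and then have the different threads write their parts using index ranges. I would like to perform both tasks (reading the large file and writing the smaller files) in parallel using threads.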
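You are almost certainly better off using a single thread.
First, a text file has to be read sequentially. There is no shortcut that lets you skip ahead to line 500,000 without first reading the preceding 499,999 lines.
Second, even if you could do that, the disk drive can only service one request at a time. It cannot read from two places at once. So while you are reading one part of the file, a thread that wants to read another part just sits there waiting for the disk drive.
Finally, unless your output files are on a different drive, you have the same problem as with reading: the disk drive can only do one thing at a time.
So you are better off starting with something simple: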
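// Path taken from the question.
string inputFile = @"C:\Users\Public\TestFolder\ThreadingExp\NewMarketData.txt";

// 500,000 lines per file gives 10 output files for a 5,000,000-line input.
const int maxLinesPerFile = 500000;
int fileNumber = 0;
var destinationFile = File.CreateText("outputFile"+fileNumber);

int linesRead = 0;
foreach (var line in File.ReadLines(inputFile))
{
    ++linesRead;
    if (linesRead > maxLinesPerFile)
    {
        // The current file is full: close it and start the next one.
        destinationFile.Close();
        ++fileNumber;
        destinationFile = File.CreateText("outputFile"+fileNumber);
        linesRead = 1; // the current line is the first line of the new file
    }
    destinationFile.WriteLine(line);
}
destinationFile.Close();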

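If your input and output files are on different drives, you may be able to save a little time by using two threads: one for input and one for output. They would communicate through a BlockingCollection. Basically, the input thread puts lines on the queue, and the output thread takes them off the queue and writes the files. In theory this overlaps the reading time with the writing time, but in practice the queue fills up and the reader ends up waiting for the writer thread. You get some performance improvement, but far less than you might expect.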
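The two-thread arrangement described above could look roughly like the following. This is a minimal sketch rather than code from the original answer: the class name PipelinedSplitter, the Split method, its parameters, and the default queue capacity are all hypothetical, and the output pattern is assumed to contain a {0} placeholder as in the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: one task reads lines, another writes them to numbered parts.
public static class PipelinedSplitter
{
    // Returns the number of part files written.
    public static int Split(string inputPath, string outputPattern,
                            int maxLinesPerFile, int queueCapacity = 10000)
    {
        if (maxLinesPerFile <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxLinesPerFile));

        using (var cts = new CancellationTokenSource())
        // Bounded queue: the reader blocks in Add when the writer falls behind.
        using (var queue = new BlockingCollection<string>(queueCapacity))
        {
            var reader = Task.Run(() =>
            {
                try
                {
                    foreach (var line in File.ReadLines(inputPath))
                        queue.Add(line, cts.Token);
                }
                finally
                {
                    // Always signal completion so the writer loop can end.
                    queue.CompleteAdding();
                }
            });

            var writer = Task.Run(() =>
            {
                int fileNumber = 0;
                int linesInFile = 0;
                StreamWriter output = null;
                try
                {
                    foreach (var line in queue.GetConsumingEnumerable())
                    {
                        if (output == null || linesInFile == maxLinesPerFile)
                        {
                            // Current part is full (or none is open yet): start the next one.
                            output?.Dispose();
                            ++fileNumber;
                            output = File.CreateText(string.Format(outputPattern, fileNumber));
                            linesInFile = 0;
                        }
                        output.WriteLine(line);
                        ++linesInFile;
                    }
                }
                catch
                {
                    // Unblock a reader that may be waiting on a full queue.
                    cts.Cancel();
                    throw;
                }
                finally
                {
                    output?.Dispose();
                }
                return fileNumber;
            });

            Task.WaitAll(reader, writer);
            return writer.Result;
        }
    }
}
```

A call such as PipelinedSplitter.Split(sourceFileName, destinationFileName, 500000) would produce parts numbered from 1. Whether this beats the single-threaded loop depends on the hardware, as noted above, so it is worth timing both.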
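I have already tried this approach and it works well. It uses a single thread and a loop, so it does not perform the tasks in parallel. However, I want the reading and writing to happen at the same time, so I was wondering whether I could use threads here. Thanks.
@PraveenKumar: Did you read the part where I pointed out that multithreading probably won't make it any faster, and explained why?
You could use tasks instead of threads and let the TPL determine the best execution plan (this is done dynamically, so it will perform well on different machines). Let me know if you want an example.
Yes. It would be great if you could provide some examples for better understanding.
Please see the example below.
If you are not reading from one physical disk and writing to another, running the read and write jobs in parallel is likely to reduce performance because of the higher IO load on the single disk.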
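For readers who want to see what the task-based suggestion in the comments might look like, here is one possible sketch. It is not the commenter's code: the class name ParallelChunkWriter and its Split method are hypothetical. Like the question's code, it reads the whole file into memory first, then lets Parallel.For decide how many part files to write concurrently. The caveat about a single physical disk still applies, so this may well be slower than the plain loop.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper: loads all lines, then writes the parts via the TPL.
public static class ParallelChunkWriter
{
    // Returns the number of part files written.
    public static int Split(string inputPath, string outputPattern, int linesPerFile)
    {
        if (linesPerFile <= 0)
            throw new ArgumentOutOfRangeException(nameof(linesPerFile));

        // Reading stays sequential; only the writing is parallelized.
        string[] lines = File.ReadAllLines(inputPath);
        int partCount = (lines.Length + linesPerFile - 1) / linesPerFile;

        // The TPL chooses the degree of parallelism; each iteration gets its
        // own 'part' value, so there is no shared loop variable to capture.
        Parallel.For(0, partCount, part =>
        {
            int start = part * linesPerFile;
            int count = Math.Min(linesPerFile, lines.Length - start);
            string path = string.Format(outputPattern, part + 1);
            using (var output = File.CreateText(path))
            {
                for (int i = start; i < start + count; i++)
                    output.WriteLine(lines[i]);
            }
        });

        return partCount;
    }
}
```

Naming the parts by index rather than by managed thread id keeps the output file names predictable from run to run.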