C# 如何拆分和合并此数据流管道?

C# 如何拆分和合并此数据流管道?,c#,asynchronous,async-await,task-parallel-library,tpl-dataflow,C#,Asynchronous,Async Await,Task Parallel Library,Tpl Dataflow,我正在尝试使用tpl以以下形式创建数据流: -> LoadDataBlock1 -> ProcessDataBlock1 -> GetInputPathsBlock -> LoadDataBlock2 -> ProcessDataBlock2 -> MergeDataBlock -> SaveDataBlock -> LoadDataBlock3 -> Pr

我正在尝试使用tpl以以下形式创建数据流:

                    -> LoadDataBlock1 -> ProcessDataBlock1 ->  
GetInputPathsBlock  -> LoadDataBlock2 -> ProcessDataBlock2 -> MergeDataBlock -> SaveDataBlock
                    -> LoadDataBlock3 -> ProcessDataBlock3 ->
                    ...                             
                    -> LoadDataBlockN -> ProcessDataBlockN ->
其思想是,
GetInputPathsBlock
是一个块,它找到要加载的输入数据的路径,然后将路径发送到每个
LoadDataBlock
。LoadDataBlock都是相同的(除了它们各自从GetInputPath接收到一个唯一的inputPath字符串)。然后将加载的数据发送到执行一些简单处理的
ProcessDataBlock
。然后将来自每个
ProcessDataBlock
的数据发送到
MergeDataBlock
,后者将其合并并发送到
SaveDataBlock
,然后将其保存到文件中

将其视为每个月都需要运行的数据流。首先,找到每天数据的路径。每天的数据都会被加载和处理,然后在整个月份合并并保存。每个月都可以并行运行,一个月中每天的数据可以并行加载和并行处理(在加载单个日期的数据之后),并且一旦加载和处理了该月的所有内容,就可以合并和保存

我尝试的

据我所知,
TransformManyBlock
可以用来进行拆分(
GetInputPathsBlock
),并且可以链接到一个普通的
TransformBlock
LoadDataBlock
),然后从那里链接到另一个
TransformBlock
ProcessDataBlock
),但我不知道如何将它合并回单个块

我看到的

我发现,它使用
TransformManyBlock
IEnumerable
转到
,但我不完全理解它,而且我无法将
TransformBlock
ProcessDataBlock
)链接到
TransformBlock,ProcessedData>
,所以我不知道如何使用它

我也看到了答案,它建议使用
JoinBlock
,但是输入文件的数量N不同,并且文件都以相同的方式加载

还有,这似乎是我想要的,但我不完全理解它,我不知道字典的设置将如何转移到我的案例中

如何拆分和合并数据流?

  • 有没有我缺少的块类型
  • 我可以使用
    TransformManyBlock
    两次吗
  • tpl对拆分/合并有意义吗?还是有更简单的异步/等待方式

我会使用嵌套块来避免拆分每月数据,然后再次合并它们。下面是两个嵌套的
TransformBlock
s的示例,它们处理2020年的所有日期:

var monthlyBlock = new TransformBlock<int, List<string>>(async (month) =>
{
    var dailyBlock = new TransformBlock<int, string>(async (day) =>
    {
        await Task.Delay(100); // Simulate async work
        return day.ToString();
    }, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 4 });

    foreach (var day in Enumerable.Range(1, DateTime.DaysInMonth(2020, month)))
        await dailyBlock.SendAsync(day);
    dailyBlock.Complete();

    var dailyResults = await dailyBlock.ToListAsync();
    return dailyResults;
}, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 1 });

foreach (var month in Enumerable.Range(1, 12))
    await monthlyBlock.SendAsync(month);
monthlyBlock.Complete();

你的问题的答案是:不,你不需要另一个块类型,是的,你可以使用TransformManyBlock两次,是的,它确实有意义。我写了一些代码来证明它,在底部,还有一些关于它如何工作的注释,在那之后

正如您所描述的,该代码使用先拆分后合并的管道。至于您正在努力解决的问题:将单个文件的数据重新合并到一起可以通过在列表中添加已处理的项来完成。然后,如果列表具有预期的最终项数,我们只会将列表传递到下一个块。这是可以完成的使用相当简单的TransformMany块返回零或一项。此块无法并行化,因为列表不是线程安全的

一旦你有了这样一个管道,你就可以通过使用传递给块的选项来测试并行化和排序。下面的代码将每个块的并行化设置为无界,并让数据流代码对其进行排序。在我的机器上,它最大化了所有核心/逻辑处理器,并且CPU受限,这就是我们想要的。排序是启用的,但关闭它并没有多大区别:同样,我们是CPU受限的

最后,我不得不说这是一项非常酷的技术,但实际上,使用PLINQ可以更简单地解决这个问题,只需几行代码就能以同样快的速度获得某些东西。最大的缺点是,如果这样做,您无法轻松地将快速到达的消息增量添加到管道中:PLINQ更适合于一个大批量过程。不过,PLINQ可能是更适合您的用例的解决方案

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks.Dataflow;

namespace ParallelDataFlow
{
    class Program
    {
        static void Main(string[] args)
        {
            new Program().Run();
            Console.ReadLine();
        }

        private void Run()
        {
            Stopwatch s = new Stopwatch();
            s.Start();

            // Can  experiment with parallelization of blocks by changing MaxDegreeOfParallelism
            var options = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded };
            var getInputPathsBlock = new TransformManyBlock<(int, int), WorkItem>(date => GetWorkItemWithInputPath(date), options);
            var loadDataBlock = new TransformBlock<WorkItem, WorkItem>(workItem => LoadDataIntoWorkItem(workItem), options);
            var processDataBlock = new TransformBlock<WorkItem, WorkItem>(workItem => ProcessDataForWorkItem(workItem), options);
            var waitForProcessedDataBlock = new TransformManyBlock<WorkItem, List<WorkItem>>(workItem => WaitForWorkItems(workItem));  // Can't parallelize this block
            var mergeDataBlock = new TransformBlock<List<WorkItem>, List<WorkItem>>(list => MergeWorkItemData(list), options);
            var saveDataBlock = new ActionBlock<List<WorkItem>>(list => SaveWorkItemData(list), options);

            var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
            getInputPathsBlock.LinkTo(loadDataBlock, linkOptions);
            loadDataBlock.LinkTo(processDataBlock, linkOptions);
            processDataBlock.LinkTo(waitForProcessedDataBlock, linkOptions);
            waitForProcessedDataBlock.LinkTo(mergeDataBlock, linkOptions);
            mergeDataBlock.LinkTo(saveDataBlock, linkOptions);

            // We post individual tuples of (year, month) to our pipeline, as many as we want
            getInputPathsBlock.Post((1903, 2));  // Post one month and date
            var dates = from y in Enumerable.Range(2015, 5) from m in Enumerable.Range(1, 12) select (y, m);
            foreach (var date in dates) getInputPathsBlock.Post(date);  // Post a big sequence         

            getInputPathsBlock.Complete();
            saveDataBlock.Completion.Wait();
            s.Stop();
            Console.WriteLine($"Completed in {s.ElapsedMilliseconds}ms on {ThreadAndTime()}");
        }

        private IEnumerable<WorkItem> GetWorkItemWithInputPath((int year, int month) date)
        {
            List<WorkItem> processedWorkItems = new List<WorkItem>();  // Will store merged results
            return GetInputPaths(date.year, date.month).Select(
                path => new WorkItem
                {
                    Year = date.year,
                    Month = date.month,
                    FilePath = path,
                    ProcessedWorkItems = processedWorkItems
                });
        }

        // Get filepaths of form e.g. Files/20191101.txt  These aren't real files, they just show how it could work.
        private IEnumerable<string> GetInputPaths(int year, int month) =>
            Enumerable.Range(0, GetNumberOfFiles(year, month)).Select(i => $@"Files/{year}{Pad(month)}{Pad(i + 1)}.txt");

        private int GetNumberOfFiles(int year, int month) => DateTime.DaysInMonth(year, month);

        private WorkItem LoadDataIntoWorkItem(WorkItem workItem) {
            workItem.RawData = LoadData(workItem.FilePath);
            return workItem;
        }

        // Simulate loading by just concatenating to path: in real code this could open a real file and return the contents
        private string LoadData(string path) => "This is content from file " + path;

        private WorkItem ProcessDataForWorkItem(WorkItem workItem)
        {
            workItem.ProcessedData = ProcessData(workItem.RawData);
            return workItem;
        }

        private string ProcessData(string contents)
        {
            Thread.SpinWait(11000000); // Use 11,000,000 for ~50ms on Windows .NET Framework.  1,100,000 on Windows .NET Core.
            return $"Results of processing file with contents '{contents}' on {ThreadAndTime()}";
        }

        // Adds a processed WorkItem to its ProcessedWorkItems list.  Then checks if the list has as many processed WorkItems as we 
        // expect to see overall.  If so the list is returned to the next block, if not we return an empty array, which passes nothing on.
        // This isn't threadsafe for the list, so has to be called with MaxDegreeOfParallelization = 1
        private IEnumerable<List<WorkItem>> WaitForWorkItems(WorkItem workItem)
        {
            List<WorkItem> itemList = workItem.ProcessedWorkItems;
            itemList.Add(workItem);
            return itemList.Count == GetNumberOfFiles(workItem.Year, workItem.Month) ? new[] { itemList } : new List<WorkItem>[0];
        }

        private List<WorkItem> MergeWorkItemData(List<WorkItem> processedWorkItems)
        {
            string finalContents = "";
            foreach (WorkItem workItem in processedWorkItems)
            {
                finalContents = MergeData(finalContents, workItem.ProcessedData);
            }
            // Should really create a new data structure and return that, but let's cheat a bit
            processedWorkItems[0].MergedData = finalContents;
            return processedWorkItems;
        }

        // Just concatenate the output strings, separated by newlines, to merge our data
        private string MergeData(string output1, string output2) => output1 != "" ? output1 + "\n" + output2 : output2;

        private void SaveWorkItemData(List<WorkItem> workItems)
        {
            WorkItem result = workItems[0];
            SaveData(result.MergedData, result.Year, result.Month);
            // Code to show it's worked...
            Console.WriteLine($"Saved data block for {DateToString((result.Year, result.Month))} on {ThreadAndTime()}." +
                              $"  File contents:\n{result.MergedData}\n");
        }
        private void SaveData(string finalContents, int year, int month)
        {
            // Actually save, although don't really need to in this test code
            new DirectoryInfo("Results").Create();
            File.WriteAllText(Path.Combine("Results", $"results{year}{Pad(month)}.txt"), finalContents);
        }

        // Helper methods
        private string DateToString((int year, int month) date) => date.year + Pad(date.month);
        private string Pad(int number) => number < 10 ? "0" + number : number.ToString();
        private string ThreadAndTime() => $"thread {Pad(Thread.CurrentThread.ManagedThreadId)} at {DateTime.Now.ToString("hh:mm:ss.fff")}";
    }

    public class WorkItem
    {
        public int Year { get; set; }
        public int Month { get; set; }
        public string FilePath { get; set; }
        public string RawData { get; set; }
        public string ProcessedData { get; set; }
        public List<WorkItem> ProcessedWorkItems { get; set; }
        public string MergedData { get; set; }
    }
}
使用系统;
使用System.Collections.Generic;
使用系统诊断;
使用System.IO;
使用System.Linq;
使用系统线程;
使用System.Threading.Tasks.Dataflow;
命名空间并行数据流
{
班级计划
{
静态void Main(字符串[]参数)
{
新程序().Run();
Console.ReadLine();
}
私家车
{
秒表s=新秒表();
s、 Start();
//可以通过更改MaxDegreeOfParallelism来试验块的并行化
var options=new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism=DataflowBlockOptions.Unbounded};
var getInputPathsBlock=new TransformManyBlock(日期=>GetWorkItemWithInputPath(日期),选项);
var loadDataBlock=新转换块(工作项=>LoadDataIntoWorkItem(工作项),选项);
var processDataBlock=newtransformblock(workItem=>ProcessDataForWorkItem(workItem),选项);
var waitForProcessedDataBlock=new TransformManyBlock(workItem=>WaitForWorkItems(workItem));//无法并行化此块
var mergeDataBlock=newtransformblock(列表=>MergeWorkItemData(列表),选项);
var saveDataBlock=newactionblock(列表=>SaveWorkItemData(列表),选项);
var linkOptions=newdataflowlinkoptions{PropagateCompletion=true};
LinkTo(loadDataBlock,linkOptions);
LinkTo(processDataBlock,linkOptions);
processDataBlock.LinkTo(waitForProcessedDat
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks.Dataflow;

namespace ParallelDataFlow
{
    class Program
    {
        static void Main(string[] args)
        {
            new Program().Run();
            Console.ReadLine();
        }

        private void Run()
        {
            Stopwatch s = new Stopwatch();
            s.Start();

            // Can  experiment with parallelization of blocks by changing MaxDegreeOfParallelism
            var options = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded };
            var getInputPathsBlock = new TransformManyBlock<(int, int), WorkItem>(date => GetWorkItemWithInputPath(date), options);
            var loadDataBlock = new TransformBlock<WorkItem, WorkItem>(workItem => LoadDataIntoWorkItem(workItem), options);
            var processDataBlock = new TransformBlock<WorkItem, WorkItem>(workItem => ProcessDataForWorkItem(workItem), options);
            var waitForProcessedDataBlock = new TransformManyBlock<WorkItem, List<WorkItem>>(workItem => WaitForWorkItems(workItem));  // Can't parallelize this block
            var mergeDataBlock = new TransformBlock<List<WorkItem>, List<WorkItem>>(list => MergeWorkItemData(list), options);
            var saveDataBlock = new ActionBlock<List<WorkItem>>(list => SaveWorkItemData(list), options);

            var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
            getInputPathsBlock.LinkTo(loadDataBlock, linkOptions);
            loadDataBlock.LinkTo(processDataBlock, linkOptions);
            processDataBlock.LinkTo(waitForProcessedDataBlock, linkOptions);
            waitForProcessedDataBlock.LinkTo(mergeDataBlock, linkOptions);
            mergeDataBlock.LinkTo(saveDataBlock, linkOptions);

            // We post individual tuples of (year, month) to our pipeline, as many as we want
            getInputPathsBlock.Post((1903, 2));  // Post one month and date
            var dates = from y in Enumerable.Range(2015, 5) from m in Enumerable.Range(1, 12) select (y, m);
            foreach (var date in dates) getInputPathsBlock.Post(date);  // Post a big sequence         

            getInputPathsBlock.Complete();
            saveDataBlock.Completion.Wait();
            s.Stop();
            Console.WriteLine($"Completed in {s.ElapsedMilliseconds}ms on {ThreadAndTime()}");
        }

        private IEnumerable<WorkItem> GetWorkItemWithInputPath((int year, int month) date)
        {
            List<WorkItem> processedWorkItems = new List<WorkItem>();  // Will store merged results
            return GetInputPaths(date.year, date.month).Select(
                path => new WorkItem
                {
                    Year = date.year,
                    Month = date.month,
                    FilePath = path,
                    ProcessedWorkItems = processedWorkItems
                });
        }

        // Get filepaths of form e.g. Files/20191101.txt  These aren't real files, they just show how it could work.
        private IEnumerable<string> GetInputPaths(int year, int month) =>
            Enumerable.Range(0, GetNumberOfFiles(year, month)).Select(i => $@"Files/{year}{Pad(month)}{Pad(i + 1)}.txt");

        private int GetNumberOfFiles(int year, int month) => DateTime.DaysInMonth(year, month);

        private WorkItem LoadDataIntoWorkItem(WorkItem workItem) {
            workItem.RawData = LoadData(workItem.FilePath);
            return workItem;
        }

        // Simulate loading by just concatenating to path: in real code this could open a real file and return the contents
        private string LoadData(string path) => "This is content from file " + path;

        private WorkItem ProcessDataForWorkItem(WorkItem workItem)
        {
            workItem.ProcessedData = ProcessData(workItem.RawData);
            return workItem;
        }

        private string ProcessData(string contents)
        {
            Thread.SpinWait(11000000); // Use 11,000,000 for ~50ms on Windows .NET Framework.  1,100,000 on Windows .NET Core.
            return $"Results of processing file with contents '{contents}' on {ThreadAndTime()}";
        }

        // Adds a processed WorkItem to its ProcessedWorkItems list.  Then checks if the list has as many processed WorkItems as we 
        // expect to see overall.  If so the list is returned to the next block, if not we return an empty array, which passes nothing on.
        // This isn't threadsafe for the list, so has to be called with MaxDegreeOfParallelization = 1
        private IEnumerable<List<WorkItem>> WaitForWorkItems(WorkItem workItem)
        {
            List<WorkItem> itemList = workItem.ProcessedWorkItems;
            itemList.Add(workItem);
            return itemList.Count == GetNumberOfFiles(workItem.Year, workItem.Month) ? new[] { itemList } : new List<WorkItem>[0];
        }

        private List<WorkItem> MergeWorkItemData(List<WorkItem> processedWorkItems)
        {
            string finalContents = "";
            foreach (WorkItem workItem in processedWorkItems)
            {
                finalContents = MergeData(finalContents, workItem.ProcessedData);
            }
            // Should really create a new data structure and return that, but let's cheat a bit
            processedWorkItems[0].MergedData = finalContents;
            return processedWorkItems;
        }

        // Just concatenate the output strings, separated by newlines, to merge our data
        private string MergeData(string output1, string output2) => output1 != "" ? output1 + "\n" + output2 : output2;

        private void SaveWorkItemData(List<WorkItem> workItems)
        {
            WorkItem result = workItems[0];
            SaveData(result.MergedData, result.Year, result.Month);
            // Code to show it's worked...
            Console.WriteLine($"Saved data block for {DateToString((result.Year, result.Month))} on {ThreadAndTime()}." +
                              $"  File contents:\n{result.MergedData}\n");
        }
        private void SaveData(string finalContents, int year, int month)
        {
            // Actually save, although don't really need to in this test code
            new DirectoryInfo("Results").Create();
            File.WriteAllText(Path.Combine("Results", $"results{year}{Pad(month)}.txt"), finalContents);
        }

        // Helper methods
        private string DateToString((int year, int month) date) => date.year + Pad(date.month);
        private string Pad(int number) => number < 10 ? "0" + number : number.ToString();
        private string ThreadAndTime() => $"thread {Pad(Thread.CurrentThread.ManagedThreadId)} at {DateTime.Now.ToString("hh:mm:ss.fff")}";
    }

    public class WorkItem
    {
        public int Year { get; set; }
        public int Month { get; set; }
        public string FilePath { get; set; }
        public string RawData { get; set; }
        public string ProcessedData { get; set; }
        public List<WorkItem> ProcessedWorkItems { get; set; }
        public string MergedData { get; set; }
    }
}