C#: Finding repetitions in a large dataset (C#, Algorithm, MATLAB, Sequence, Data Mining)


I have a dataset of control system failures. The data has the following structure:

TYPE OF FAILURE (string), START DATE (dd/mm/yyyy), START TIME (hh/mm/ss), DURATION (ss), LOCALIZATION (string), WORKING TEAM (A,B,C), SHIFT (morning, afternoon, night)
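The table containing the data has 555000 rows. First, I would like to analyse whether there are recurring sequences of failures with respect to the START DATE parameter. Basically, I would like to find something like this:

Failure 1 appears on March 10 and failure 2 on March 15, five days apart. Then failure 1 appears on April 10 and failure 2 on April 15, again five days apart. The same happens on May 10 and May 15. Failure 1 may also appear on other dates, but what interests me is whether there is a higher probability that failure 2 occurs 5 days after failure 1, and that this pair of events (F1 -> F2) recurs about one month apart.

I don't know whether my explanation is clear enough. I am looking for a suitable method or algorithm that would let me extract such sequences from data of the kind described above. Could you point me to some approaches? Or let's simply brainstorm together :). Thank you for your help.

PS: I plan to implement this in C# or MATLAB (depending on which suits the method).
Thanks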

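Your file looks like a large CSV, and MATLAB has a good implementation for that: the datastore, one of its tools for working with large files.

In your case, you could do something like this:

Example file airlinesmall.csv (123524 rows)

With a datastore you can work with the data as a table and select the variables you need. For example, to get the mean arrival delay: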

>> ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
>> ds.MissingValue = 0;
>> ds.SelectedVariableNames = 'ArrDelay';
>> data = preview(ds)

data = 

    ArrDelay
    ________

     8      
     8      
    21      
    13      
     4      
    59      
     3      
    11      

>> data % this is a table

data = 

    ArrDelay
    ________

     8      
     8      
    21      
    13      
     4      
    59      
     3      
    11      

>> sums = [];
counts = [];
while hasdata(ds)
    T = read(ds); % T is a table holding only the current chunk, not the whole file

    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end

>> avgArrivalDelay = sum(sums)/sum(counts)

avgArrivalDelay =

    6.9670
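
The loop above divides the sum of the per-chunk sums by the sum of the per-chunk counts instead of averaging the per-chunk means. Here is a minimal Python sketch of why that matters, using made-up numbers and a hypothetical helper `chunks` (neither is part of the answer or of MATLAB's API): when the last chunk is shorter than the others, the mean of chunk means is biased, while sums over counts reproduces the global mean.

```python
from statistics import mean

def chunks(values, size):
    """Yield consecutive slices of `values` with at most `size` items (hypothetical helper)."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

# Made-up delays; with a chunk size of 3 the last chunk holds a single value.
delays = [8, 8, 21, 13, 4, 59, 3]

sums, counts, chunk_means = [], [], []
for chunk in chunks(delays, 3):
    sums.append(sum(chunk))
    counts.append(len(chunk))
    chunk_means.append(mean(chunk))

streamed_mean = sum(sums) / sum(counts)  # same scheme as the MATLAB loop
naive_mean = mean(chunk_means)           # mean of chunk means, biased here

print(streamed_mean, mean(delays), naive_mean)
```

The first two printed values agree; the third differs because the one-element chunk is weighted as heavily as the full chunks.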
Now let's look at your case with a sample. Consider this file:

sample.csv

TYPE OF FAILURE, START DATE, START TIME, DURATION, LOCALIZATION, WORKING TEAM, SHIFT
failure 1, 06/01/2017, 12/13/20, 300,  Area 1, A, morning
failure 2, 06/01/2017, 12/13/20, 300,  Area 1, A, night
failure 3, 06/01/2017, 12/13/20, 400,  Area 1, A, afternoon
failure 1, 08/01/2017, 12/13/20, 300,  Area 1, A, morning
failure 2, 09/01/2017, 12/13/20, 300,  Area 1, A, morning
failure 3, 09/01/2017, 12/13/20, 300,  Area 1, A, night
failure 3, 09/01/2017, 14/13/20, 200,  Area 1, A, morning
failure 1, 10/01/2017, 12/13/20, 300,  Area 1, A, morning
failure 1, 12/01/2017, 12/13/20, 300,  Area 1, A, afternoon
failure 2, 12/01/2017, 12/13/20, 500,  Area 1, A, morning
failure 1, 14/01/2017, 12/13/20, 300,  Area 1, A, night
You can see that failure 1 occurs every two days. Let's check:

>> ds = tabularTextDatastore('sample.csv')
Warning: Variable names were modified to make them valid MATLAB identifiers. 

ds = 

  TabularTextDatastore with properties:

                      Files: {
                             '/home/anquegi/learn/matlab/stackoverflow/sample.csv'
                             }
               FileEncoding: 'UTF-8'
          ReadVariableNames: true
              VariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%q', '%q', '%q' ... and 4 more}
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
            SelectedFormats: {'%q', '%q', '%q' ... and 4 more}
                   ReadSize: 20000 rows

>> ds.SelectedVariableNames = {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME', 'DURATION', 'LOCALIZATION', 'WORKINGTEAM', 'SHIFT'}

ds = 

  TabularTextDatastore with properties:

                      Files: {
                             '/home/anquegi/learn/matlab/stackoverflow/sample.csv'
                             }
               FileEncoding: 'UTF-8'
          ReadVariableNames: true
              VariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%q', '%q', '%q' ... and 4 more}
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
            SelectedFormats: {'%q', '%q', '%q' ... and 4 more}
                   ReadSize: 20000 rows

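>> reset(ds)
accum = datetime.empty(0,1);  % start dates of 'failure 1', collected across all chunks
while hasdata(ds)
    T = read(ds);             % next chunk (up to ReadSize rows) as a table
    isF1 = strcmp(T.TYPEOFFAILURE, 'failure 1');
    accum = [accum; datetime(T.STARTDATE(isF1), 'InputFormat', 'dd/MM/yyyy')];
end
mean(diff(accum))             % mean gap between occurrences (file assumed sorted by date)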

ans = 

   48:00:00

% Every 48 hours. From here you can try whatever analysis you want.
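
For the algorithmic side of the question (does failure B tend to follow failure A after a fixed number of days?), one simple option is to count, for every ordered pair of failure types, the day lag from each occurrence of A to the next occurrence of B, and then look for a lag that dominates. The following is a minimal Python sketch under that assumption, not a method taken from the answer. The function name `lag_histogram`, the `max_lag_days` cutoff and the toy events are all hypothetical, and the same logic could be ported to MATLAB or C#.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
from datetime import datetime

def lag_histogram(events, max_lag_days=31):
    """Count day lags from each occurrence of type A to the next occurrence of type B.

    `events` is an iterable of (failure_type, 'dd/mm/yyyy') pairs.
    Returns {(A, B): Counter({lag_in_days: occurrences})} for A != B.
    """
    days_by_type = defaultdict(set)
    for failure_type, date_text in events:
        day = datetime.strptime(date_text.strip(), "%d/%m/%Y").date()
        days_by_type[failure_type.strip()].add(day.toordinal())
    sorted_days = {t: sorted(d) for t, d in days_by_type.items()}

    histogram = defaultdict(Counter)
    for a, a_days in sorted_days.items():
        for b, b_days in sorted_days.items():
            if a == b:
                continue
            for day in a_days:
                # Index of the first B day strictly after this A day (binary search).
                i = bisect_right(b_days, day)
                if i < len(b_days) and b_days[i] - day <= max_lag_days:
                    histogram[(a, b)][b_days[i] - day] += 1
    return histogram

# Toy data mirroring the question: failure 2 follows failure 1 after 5 days, once a month.
events = [
    ("failure 1", "10/03/2017"), ("failure 2", "15/03/2017"),
    ("failure 1", "10/04/2017"), ("failure 2", "15/04/2017"),
    ("failure 1", "10/05/2017"), ("failure 2", "15/05/2017"),
    ("failure 1", "22/05/2017"),
]
hist = lag_histogram(events)
print(hist[("failure 1", "failure 2")].most_common(1))  # prints [(5, 3)]
```

A lag whose count is close to the number of occurrences of A suggests a regular A -> B pattern. The monthly repetition could then be checked by looking at the gaps between the A dates that produced that lag, much like the `mean(diff(accum))` check above.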


Thanks for the tip :). I will take a closer look at MATLAB's datastore. However, I am also looking for the algorithmic part of the solution. Can you tell me something about that? :)

Sure, I'll try to paste a sample file of 5-6 rows and an example applied to it.

Edited to use sample data. If it helps, please don't forget to upvote or mark the answer as correct.