Python: splitting a dataset

python pandas numpy

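Given a CSV dataset with dates and values, I want to create a new CSV dataset whose rows are the points where the graph changed: it increased, decreased, or did not change at all. Below is a sample of the data and the desired output. (The CSV goes back to 1999.)

The output should be: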

StartDate   EndDate   StartValue   EndValue
03/04/2014  07/04/2014  137876      137209
31/03/2014  03/04/2014  137589      137876
27/03/2014  31/03/2014  138114      137589
26/03/2014  27/03/2014  138129      138114
25/03/2014  26/03/2014  137945      138129

I tried to solve this with a self-written Stretch class that manages the splitting of the data as it is added:

from enum import Enum

class Direction(Enum):
    NA = None 
    Up = 1 
    Stagnant = 0 
    Down = -1

    @staticmethod
    def getDir(a,b):
        """Gets two numbers and returns a Direction result by comparing them."""
        if a < b:   return Direction.Up
        elif a > b: return Direction.Down
        else:       return Direction.Stagnant

class Stretch:
    """Accepts tuples of (insignificant, float). Adds tuples to internal data struct
    while they have the same trend (down, up, stagnant). See add() for details."""

    def __init__(self,dp=None):
        self.data = []
        if dp:
            self.data.append(dp)
        self.dir = Direction.NA  


    def add(self,dp):
        """Adds dp to self if it follows a given trend (or it holds fewer than 2 datapoints).
        Returns (True,None) if the datapoint was added to this Stretch instance,
        returns (False, new_stretch) if it broke the trend. The new_stretch
        contains the new last value of the self.data as well as the new dp."""
        if not self.data:
            self.data.append(dp)
            return True, None
        if len(self.data) == 1:
            self.dir = Direction.getDir(self.data[-1][1],dp[1]) 
            self.data.append(dp)
            return True, None
        if Direction.getDir(self.data[-1][1],dp[1]) == self.dir:
            self.data.append(dp)
            return True, None
        else:
            k = Stretch(self.data[-1])
            k.add(dp)
            return False, k

Usage:

data_stretches = []

with open("d.txt") as r:
    S = Stretch()
    for line in r:
        try:
            date,value = line.strip().split()
            value = float(value)
        except (IndexError, ValueError) as e:
            print("Illegal line: '{}'".format(line))
            continue

        b, newstretch = S.add( (date,value) )
        if not b:
            data_stretches.append(S)
            S = newstretch
data_stretches.append(S)

for s in data_stretches:
    data = s.data
    print(data[0][0], data[-1][0], data[0][1], data[-1][1], s.dir)

Output:

# EndDate  StartDate  EndV     StartV   (reversed b/c I inverted dates)  
07/04/2014 03/04/2014 137209.0 137876.0 Direction.Up
03/04/2014 31/03/2014 137876.0 137589.0 Direction.Down
31/03/2014 26/03/2014 137589.0 138129.0 Direction.Up
26/03/2014 25/03/2014 138129.0 137945.0 Direction.Down

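Apart from the confusion about direction, which depends on whether the data is read "from when to when", my output differs from yours, because you split one uniform stretch (26/03/2014 to 31/03/2014, falling from 138129 to 137589) into two parts for no apparent reason.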

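You can use sign from numpy and apply it to the diff of the "Value" column to see where the trend of the graph changes, then use shift and cumsum to create an incrementing value for each group of trend: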

import numpy as np

ser_sign = np.sign(df.Value.diff(-1).ffill())
ser_gr = (ser_sign.shift() != ser_sign).cumsum()
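
To make the grouping step concrete, here is a small sketch on six rows. The rows are an assumption: the original sample data is not shown here, so they are reconstructed from the desired-output table in the question and kept newest first.

```python
import numpy as np
import pandas as pd

# Six sample rows reconstructed from the question's desired output (assumed input).
df = pd.DataFrame({
    "Date": ["07/04/2014", "03/04/2014", "31/03/2014",
             "27/03/2014", "26/03/2014", "25/03/2014"],
    "Value": [137209.0, 137876.0, 137589.0, 138114.0, 138129.0, 137945.0],
})

# diff(-1) subtracts the next (older) row from each row; the last row has no
# successor, so ffill() copies the previous difference into it.
ser_sign = np.sign(df["Value"].diff(-1).ffill())

# A new group starts wherever the sign differs from the row above.
ser_gr = (ser_sign.shift() != ser_sign).cumsum()

print(ser_sign.tolist())  # [-1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
print(ser_gr.tolist())    # [1, 2, 3, 3, 4, 4]
```

Rows 31/03 and 27/03 share label 3, which is why they end up in the same stretch.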

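Now that you know the groups, to get the start and end of each one you can use groupby on ser_gr and join the last row (taken after shifting the values in ser_gr, because the last row of each group is also the first row of the next one) with the first row:
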
df_new = (df.groupby(ser_gr.shift().bfill(),as_index=False).last()
            .join(df.groupby(ser_gr,as_index=False).first(),lsuffix='_start',rsuffix='_end'))

print (df_new)
   Date_start  Value_start    Date_end  Value_end
0  03/04/2014     137876.0  07/04/2014   137209.0
1  31/03/2014     137589.0  03/04/2014   137876.0
2  26/03/2014     138129.0  31/03/2014   137589.0
3  25/03/2014     137945.0  26/03/2014   138129.0
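
The shift trick may be easier to see on concrete rows. The sketch below again uses six rows reconstructed from the question's desired output (an assumption) and prints both label columns side by side. Passing the group keys as NumPy arrays and resetting the index is a simplification of mine, not the exact code from the answer.

```python
import numpy as np
import pandas as pd

# Six sample rows reconstructed from the question's desired output (assumed input).
df = pd.DataFrame({
    "Date": ["07/04/2014", "03/04/2014", "31/03/2014",
             "27/03/2014", "26/03/2014", "25/03/2014"],
    "Value": [137209.0, 137876.0, 137589.0, 138114.0, 138129.0, 137945.0],
})
ser_sign = np.sign(df["Value"].diff(-1).ffill())
ser_gr = (ser_sign.shift() != ser_sign).cumsum()

# Shifting the labels down one row hands each group's first row to the previous
# group as well; bfill() fills the gap the shift leaves in the first row.
ser_gr_shifted = ser_gr.shift().bfill()

print(df.assign(group=ser_gr, shifted_group=ser_gr_shifted.astype(int)))

# Older end of each stretch: last row under the shifted labels.
starts = df.groupby(ser_gr_shifted.to_numpy()).last().reset_index(drop=True)
# Newer end of each stretch: first row under the original labels.
ends = df.groupby(ser_gr.to_numpy()).first().reset_index(drop=True)

print(starts["Date"].tolist())
print(ends["Date"].tolist())
```

Each date in `ends` (except the newest) reappears as the next entry in `starts`, which is the boundary sharing the explanation above relies on.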

Now, if you need to reorder and rename the columns, you can do it like this:

df_new.columns = ['StartDate', 'StartValue', 'EndDate', 'EndValue']
df_new = df_new[['StartDate','EndDate','StartValue','EndValue']]

print (df_new)
    StartDate     EndDate  StartValue  EndValue
0  03/04/2014  07/04/2014    137876.0  137209.0
1  31/03/2014  03/04/2014    137589.0  137876.0
2  26/03/2014  31/03/2014    138129.0  137589.0
3  25/03/2014  26/03/2014    137945.0  138129.0

Both operations could also be done while creating df_new, by using rename.

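
One way the rename could be folded into the construction is sketched below. It is a variant, not the answer's exact code: each half is renamed before the join instead of using lsuffix/rsuffix, the group keys are passed as NumPy arrays, and the six input rows are reconstructed from the question's desired output (an assumption).

```python
import numpy as np
import pandas as pd

# Six sample rows reconstructed from the question's desired output (assumed input).
df = pd.DataFrame({
    "Date": ["07/04/2014", "03/04/2014", "31/03/2014",
             "27/03/2014", "26/03/2014", "25/03/2014"],
    "Value": [137209.0, 137876.0, 137589.0, 138114.0, 138129.0, 137945.0],
})
ser_sign = np.sign(df["Value"].diff(-1).ffill())
ser_gr = (ser_sign.shift() != ser_sign).cumsum()

# Older end of each stretch, renamed straight away.
starts = (df.groupby(ser_gr.shift().bfill().to_numpy()).last()
            .reset_index(drop=True)
            .rename(columns={"Date": "StartDate", "Value": "StartValue"}))
# Newer end of each stretch, renamed straight away.
ends = (df.groupby(ser_gr.to_numpy()).first()
          .reset_index(drop=True)
          .rename(columns={"Date": "EndDate", "Value": "EndValue"}))

# Join on the shared 0..n-1 index and put the columns in the requested order.
df_new = starts.join(ends)[["StartDate", "EndDate", "StartValue", "EndValue"]]

# Without a path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = df_new.to_csv(index=False)
print(csv_text)
```
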
How did you code your solution? What are you grouping by: pure increase, pure decrease, or pure stagnation? Is your CSV data sorted? Do you have duplicate data points (for example, 3/4 occurs twice, as the start of one pure increasing/decreasing stretch and the end of another)?

I did a cleanup, removed the empty rows, and created an empty new CSV with the 4 required columns. Yes, I am grouping pure increases and decreases. The CSV input is sorted by the date column, as in the sample above, and it goes back to 1999.

Interesting puzzle. There is probably something in pandas or numpy that can handle this in 4 lines; unfortunately you did not use those tags, so people who are good at them will not see this. If I were you I would remove the data-* tags, they do you no good, and add numpy/pandas instead.

Yes, thanks for letting me know. I mainly use pandas and numpy.

As I thought... not quite a 4-liner but 6. Neat. At least you got the same result as me ;) regarding the grouping.

@PatrickArtner Indeed, a few lines are enough :) Thanks. Actually I am glad you answered before me: since I got the same grouping, I am more confident in the result.

Thank you very much! I think numpy's sign is exactly what I needed, and it was the missing step in my process. Thank you @PatrickArtner for your thorough approach.

Thank you very much! This is a very interesting way to solve my problem, and I learned a lot of new things.