Advice on defining data structures in Haskell

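I'm having trouble modeling a data structure in Haskell. Suppose I run an animal research facility and I want to keep track of my rats. I want to track which cage and which experiment each rat is assigned to. I also want to record the weight of my rats and the volume of my cages, and keep notes on my experiments.

In SQL, I might do the following: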

create table cages (id integer primary key, volume double);
create table experiments (id integer primary key, notes text);
create table rats (
    id integer primary key, -- needed by the rats.id lookups in the queries further down
    weight double,
    cage_id integer references cages (id),
    experiment_id integer references experiments (id)
);

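(I realize this allows me to put two rats from different experiments in the same cage. That is intended. I don't actually run an animal research facility.)

Two operations must be possible: (1) given a rat, find the volume of its cage, and (2) given a rat, get the notes for the experiment it belongs to.

In SQL, these would be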

select cages.volume from rats
  inner join cages on cages.id = rats.cage_id
  where rats.id = ...; -- (1)
select experiments.notes from rats
  inner join experiments on experiments.id = rats.experiment_id
  where rats.id = ...; -- (2)

How might I model this data structure in Haskell?

One way to do it is

type Weight = Double
type Volume = Double

data Rat = Rat Cage Experiment Weight
data Cage = Cage Volume
data Experiment = Experiment String

data ResearchFacility = ResearchFacility [Rat]

ratCageVolume :: Rat -> Volume
ratCageVolume (Rat (Cage volume) _ _) = volume

ratExperimentNotes :: Rat -> String
ratExperimentNotes (Rat _ (Experiment notes) _) = notes

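But won't this representation introduce a bunch of copies of Cage and Experiment values? Or should I just not worry about it and hope the optimizer takes care of that?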

A more natural Haskell representation of your model is for the cages to contain the actual rat objects rather than their IDs:

data Rat = Rat RatId Weight
data Cage = Cage [Rat] Volume
data Experiment = Experiment [Rat] String

You would then create ResearchFacility objects with a smart constructor to make sure they follow the rules. It could look like this:

research_facility :: [Rat] -> Map Rat Cage -> Map Rat Experiment -> ResearchFacility
research_facility rats cage_assign experiment_assign = ...

where cage_assign and experiment_assign are maps containing the same information as the cage_id and experiment_id foreign keys in SQL.
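As a rough illustration of what such a smart constructor might check, here is a minimal sketch. It is not the answerer's implementation: the ID-keyed maps, the Either String error type, the shape of ResearchFacility, and the helper names groupRats and checkKnown are all assumptions made for this example. The rule enforced is that every rat is assigned to exactly one known cage and one known experiment.

```haskell
import qualified Data.Map.Strict as Map
import           Data.Map.Strict (Map)

type RatId        = Int
type CageId       = Int
type ExperimentId = Int
type Weight       = Double
type Volume       = Double

data Rat        = Rat RatId Weight        deriving (Eq, Show)
data Cage       = Cage [Rat] Volume       deriving (Eq, Show)
data Experiment = Experiment [Rat] String deriving (Eq, Show)

-- Assumed shape: the facility is just its cages and experiments.
data ResearchFacility = ResearchFacility [Cage] [Experiment]
  deriving (Eq, Show)

-- Group rats by the key they are assigned to; fail if a rat is unassigned.
groupRats :: String -> Map RatId Int -> [Rat] -> Either String (Map Int [Rat])
groupRats what assign = foldr step (Right Map.empty)
  where
    step rat@(Rat rid _) acc = do
      groups <- acc
      key    <- maybe (Left ("rat " ++ show rid ++ " has no " ++ what))
                      Right
                      (Map.lookup rid assign)
      Right (Map.insertWith (++) key [rat] groups)

-- Reject assignments that point at a cage or experiment that does not exist.
checkKnown :: String -> Map Int a -> Map Int b -> Either String ()
checkKnown what used known =
  case Map.keys (Map.difference used known) of
    []      -> Right ()
    missing -> Left ("unknown " ++ what ++ " ids: " ++ show missing)

-- Hypothetical smart constructor: the only way to build a ResearchFacility.
researchFacility
  :: [Rat]
  -> Map CageId Volume
  -> Map ExperimentId String
  -> Map RatId CageId
  -> Map RatId ExperimentId
  -> Either String ResearchFacility
researchFacility rats cages experiments cageAssign experimentAssign = do
  byCage <- groupRats "cage" cageAssign rats
  byExp  <- groupRats "experiment" experimentAssign rats
  checkKnown "cage" byCage cages
  checkKnown "experiment" byExp experiments
  Right $ ResearchFacility
    [ Cage (Map.findWithDefault [] cid byCage) vol
    | (cid, vol) <- Map.toList cages ]
    [ Experiment (Map.findWithDefault [] eid byExp) notes
    | (eid, notes) <- Map.toList experiments ]
```

In a real module you would export researchFacility but not the ResearchFacility data constructor, so callers cannot bypass the checks.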

Here is a short file I used for testing:

type Weight = Double
type Volume = Double

data Rat = Rat Cage Experiment Weight deriving (Eq, Ord, Show, Read)
data Cage = Cage Volume               deriving (Eq, Ord, Show, Read)
data Experiment = Experiment String   deriving (Eq, Ord, Show, Read)

volume     = 30
name       = "foo"
weight     = 15
cage       = Cage volume
experiment = Experiment name
rat        = Rat cage experiment weight

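Then I fired up ghci and imported System.Vacuum.Cairo, which comes from a delightful package, to draw a picture of these values.

(I'm not quite sure why there are doubled arrows in this picture, but you can ignore/collapse them.)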

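As discussed above, the rule of thumb is that a new object is created whenever you call a constructor; if you merely name an object that already exists, no new object is created. This is safe in Haskell because values are immutable.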
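To make the rule of thumb concrete, here is a small sketch using the question's types. The names sharedCage, ratA, ratB, ratC, and ratCage are made up for illustration. Sharing cannot be observed from pure code, so the comments describe what GHC typically does; optimizations may change the details.

```haskell
type Weight = Double
type Volume = Double

data Cage       = Cage Volume                deriving (Eq, Show)
data Experiment = Experiment String          deriving (Eq, Show)
data Rat        = Rat Cage Experiment Weight deriving (Eq, Show)

-- One constructor call, so one Cage object.
sharedCage :: Cage
sharedCage = Cage 30

experiment :: Experiment
experiment = Experiment "foo"

-- Both rats merely name sharedCage, so they hold pointers to the same
-- Cage object. No copy is made.
ratA, ratB :: Rat
ratA = Rat sharedCage experiment 15
ratB = Rat sharedCage experiment 17

-- Here the Cage constructor is called again, so this rat typically gets
-- its own Cage object, even though it is equal to sharedCage.
ratC :: Rat
ratC = Rat (Cage 30) experiment 19

ratCage :: Rat -> Cage
ratCage (Rat cage _ _) = cage
```

All three cages compare equal with (==); the only difference is how many heap objects exist, which does not affect the meaning of the program.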

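I use Haskell for most of my day job, and I have run into this problem. My experience is that it is less a question of how many copies of a data structure get created and more a question of data dependencies. We use similar data structures to interface with a relational database that stores the actual data. That means we have queries like these: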

getCageById       :: IdType -> IO (Maybe Cage)
getRatById        :: IdType -> IO (Maybe Rat)
getExperimentById :: IdType -> IO (Maybe Experiment)

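We started out building data structures much like yours, with the linked data structures nested inside. This turned out to be a huge mistake. The problem is that if you use the following definition for Rat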

data Rat = Rat Cage Experiment Weight

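...then the getRatById function has to run three database queries to return its result. At first this seemed like a convenient approach, but it ended up causing huge performance problems, especially when we wanted a query to return a list of results. The data structure forced us to do the joins even when all we needed was the row from the rat table. The problem is the extra database queries, not the extra objects that might exist in RAM.

Our policy now is that when we create data structures that correspond to database tables, we always keep them flat like the tables, referring to related rows by ID instead of embedding them. So your example would become something like this: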

type IdType = Int
type Weight = Double
type Volume = Double

data Rat = Rat
    { ratId        :: IdType
    , cageId       :: IdType
    , experimentId :: IdType
    , weight       :: Weight
    }
data Cage = Cage IdType Volume
data Experiment = Experiment IdType String

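(You might even want to use newtypes to distinguish the different kinds of ID.) Getting the whole structure takes more work, but it lets you fetch parts of the structure efficiently. Of course, if you never need to fetch individual parts of the structure, my advice may not apply. But in my experience partial queries are very common, and I don't want to make them artificially slow. If you want a convenience function that does the joins for you, you can certainly write one. Just don't use a data model that locks you into that usage pattern.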
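Here is a minimal sketch of the newtype idea together with one possible join helper. The Store type is a hypothetical in-memory stand-in for the database (the real functions above run in IO), and the field names cageKey, experimentKey, ratTable, cageTable, and experimentTable are invented for this example.

```haskell
import qualified Data.Map.Strict as Map
import           Data.Map.Strict (Map)

type Weight = Double
type Volume = Double

-- Distinct newtypes stop a CageId from being passed where a RatId is expected.
newtype RatId        = RatId Int        deriving (Eq, Ord, Show)
newtype CageId       = CageId Int       deriving (Eq, Ord, Show)
newtype ExperimentId = ExperimentId Int deriving (Eq, Ord, Show)

data Rat = Rat
    { ratId        :: RatId
    , cageId       :: CageId
    , experimentId :: ExperimentId
    , weight       :: Weight
    } deriving (Eq, Show)

data Cage = Cage { cageKey :: CageId, volume :: Volume }
    deriving (Eq, Show)

data Experiment = Experiment { experimentKey :: ExperimentId, notes :: String }
    deriving (Eq, Show)

-- Hypothetical in-memory replacement for the three tables.
data Store = Store
    { ratTable        :: Map RatId Rat
    , cageTable       :: Map CageId Cage
    , experimentTable :: Map ExperimentId Experiment
    }

-- Fetching a rat touches only the rat table.
getRatById :: Store -> RatId -> Maybe Rat
getRatById store rid = Map.lookup rid (ratTable store)

-- Convenience "joins", paid for only by callers that need them.
ratCageVolume :: Store -> RatId -> Maybe Volume
ratCageVolume store rid = do
  rat  <- getRatById store rid
  cage <- Map.lookup (cageId rat) (cageTable store)
  pure (volume cage)

ratExperimentNotes :: Store -> RatId -> Maybe String
ratExperimentNotes store rid = do
  rat <- getRatById store rid
  ex  <- Map.lookup (experimentId rat) (experimentTable store)
  pure (notes ex)
```

The point of the design is that getRatById never has to look at cages or experiments; only the two join helpers do.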

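First observation: you should learn to use records. Record field names in Haskell are treated as functions, so these definitions will at least save you some typing: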

data Rat = Rat { getCage       :: Cage
               , getExperiment :: Experiment
               , getWeight     :: Weight }

data Cage = Cage { getVolume :: Volume }

-- Now this function is so trivial to define that you might actually not bother:
ratCageVolume :: Rat -> Volume
ratCageVolume = getVolume . getCage

As for the data representation, I would probably go along these lines:

type Weight = Double
type Volume = Double

-- Rats and Cages have identity that goes beyond their properties;
-- two distinct rats of the same weight can be in the same cage, and
-- two cages can have same volume.
-- 
-- So should we give each Rat and Cage an additional field to
-- represent its key?  We could do that, or we could abstract that out
-- into this:

data Identity i a = Identity { getId  :: i
                             , getVal :: a }
            deriving Show

instance Eq i => Eq (Identity i a) where
    a == b = getId a == getId b

instance Ord i => Ord (Identity i a) where
    a `compare` b = getId a `compare` getId b


-- And to simplify a common case:
type Id a = Identity Int a


-- Rats' only real intrinsic property is their weight.  Cage and Experiment?
-- Situational, I say.
data Rat = Rat { getWeight :: Weight  }

data Cage = Cage { getVolume :: Volume }

data Experiment = Experiment { getNotes :: String }
                  deriving (Eq, Show)

-- The data that you're manipulating is really this:
type RatData = (Id Rat, Id Cage, Id Experiment)

type ResearchFacility = [RatData]
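With this representation, the two operations from the question could be written roughly as follows. This is a sketch only: lookupRat and the Int-keyed signatures are assumptions added for illustration, and the Ord instance is left out because it is not needed here.

```haskell
import Data.List (find)

type Weight = Double
type Volume = Double

data Identity i a = Identity { getId  :: i
                             , getVal :: a }
            deriving Show

-- Equality looks only at the key, as in the answer above.
instance Eq i => Eq (Identity i a) where
    a == b = getId a == getId b

type Id a = Identity Int a

data Rat = Rat { getWeight :: Weight }
data Cage = Cage { getVolume :: Volume }
data Experiment = Experiment { getNotes :: String }
                  deriving (Eq, Show)

type RatData = (Id Rat, Id Cage, Id Experiment)
type ResearchFacility = [RatData]

-- Hypothetical helper: find the row for a rat by its key (linear scan).
lookupRat :: Int -> ResearchFacility -> Maybe RatData
lookupRat rid = find (\(rat, _, _) -> getId rat == rid)

-- (1) Given a rat's key, find the volume of its cage.
ratCageVolume :: Int -> ResearchFacility -> Maybe Volume
ratCageVolume rid facility = do
  (_, cage, _) <- lookupRat rid facility
  pure (getVolume (getVal cage))

-- (2) Given a rat's key, get the notes of its experiment.
ratExperimentNotes :: Int -> ResearchFacility -> Maybe String
ratExperimentNotes rid facility = do
  (_, _, experiment) <- lookupRat rid facility
  pure (getNotes (getVal experiment))
```

A list makes every lookup linear in the number of rats; if that matters, a Data.Map keyed by the rat's ID would be a natural replacement.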

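What kind of experiments are you going to run on your rats? You need to know what you are going to do with the data.

@KarolisJuodelė: Okay.

In GHC, if you write foo = 3; bar = 3 you will likely have two pointers to two copies of the number 3, whereas if you write foo = 3; bar = foo you will have two pointers to a shared 3. Is that the kind of information you are looking for? If so, I can turn it into a concrete answer about rats and experiments.

@DanielWagner: That is exactly what I wanted, but I didn't realize it when I asked the question.

(Note: I edited the Haskell code a bit after Daniel answered. He was answering revision 1.)

Will this introduce a bunch of copies of the rats, or should I trust the optimizer?

That is a different question now. To answer it: there will be no extra copies, and rats in the same cage will share the cage object.

Sorry for changing so much. The question is still the same ("How might I model this data structure in Haskell?"), but I completely changed the Haskell code. I probably should have left out my attempt at a solution in the first place.

Yes, my answer was more about satisfying the constraints you get from using SQL.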
