使用xml管道如何解析mediawiki转储文件

使用xml管道如何解析mediawiki转储文件,xml,haskell,streaming,xml-conduit,Xml,Haskell,Streaming,Xml Conduit,我正在尝试使用xml管道解析mediawiki转储文件。 有两个我感兴趣的标签,SiteInfo和Page。 下面是一个示例xml: 我有零碎的东西,但不知道如何把它们联系在一起,得到想要的结果。 我不知道如何获取名称空间tag:Media中的标记示例中是否有多个属性 我希望最后的结果同时包含siteinfo和wikidoc。您应该使用应用程序模式来组合via、parserOne和parserTwo。这是我昨天写的一个完整的例子: {-# LANGUAGE OverloadedStrings #

我正在尝试使用xml管道解析mediawiki转储文件。 有两个我感兴趣的标签,SiteInfo和Page。 下面是一个示例xml:

我有零碎的东西,但不知道如何把它们联系在一起,得到想要的结果。 我不知道如何获取名称空间tag:Media中的标记示例中是否有多个属性


我希望最后的结果同时包含siteinfo和wikidoc。

您应该使用应用程序模式来组合via、parserOne和parserTwo。这是我昨天写的一个完整的例子:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes #-}
{-# LANGUAGE RecordWildCards #-}
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}

module Main where

--import Text.Groom
import Control.Monad.Trans.Resource
import Data.Conduit (ConduitT, (.|), runConduit)
import Data.Foldable
import qualified Data.HashMap.Strict as H
import Data.HashMap.Strict (HashMap)
import qualified Data.HashSet as HS
import Data.Hashable
import Data.Maybe
import Data.Semigroup
import qualified Data.String.Class as S
import Data.Text (Text)
import Data.XML.Types (Event)
import GHC.Generics (Generic)
import Safe
import System.Environment
import Text.XML
import Text.XML.Stream.Parse

data Mistake = Mistake
  { startPar :: Int
  , startOff :: Int
  , endPar :: Int
  , endOff :: Int
  , mistakeType :: Text
  , correction :: Maybe Text
  } deriving (Show, Eq, Generic, Hashable)

type TeacherId = Int

data Annotation = Annotation
  { teacherId :: TeacherId
  , mistakes :: [Mistake]
  } deriving (Show, Eq)

type Nid = Text

data Doc = Doc
  { nid :: Nid
  , annotations :: [Annotation]
  } deriving (Show, Eq)

-- | We compare mistakes per documents
type PerTeacherMistakes = HashMap TeacherId (HS.HashSet (Nid, Mistake))

parseMistake :: MonadThrow m => ConduitT Event o m (Maybe Mistake)
parseMistake =
  tag'
    "MISTAKE"
    ((,,,) <$> requireAttr "start_par" <*> requireAttr "start_off" <*>
     requireAttr "end_par" <*>
     requireAttr "end_off") $ \(sp, so, ep, eo) -> do
    startPar <- pread sp
    startOff <- pread so
    endPar <- pread ep
    endOff <- pread eo
    mistakeType <- fromMaybe "" <$> tagNoAttr "TYPE" content
    correction <- fromMaybe Nothing <$> tagNoAttr "CORRECTION" contentMaybe
    _comment <- ignoreTreeContent "COMMENT"
    return (Mistake {..})

pread :: (Read a, S.ConvString s, Monad m) => s -> m a
pread x =
  maybe
    (fail ("Couldn't read: " ++ (S.toString x)))
    return
    (readMay (S.toString x))

parseAnnotation :: MonadThrow m => ConduitT Event o m (Maybe Annotation)
parseAnnotation =
  tag' "ANNOTATION" (requireAttr "teacher_id") $ \ti -> do
    teacherId <- pread ti
    mistakes <- many parseMistake
    return $ Annotation {..}

parseDoc :: MonadThrow m => ConduitT Event o m (Maybe Doc)
parseDoc =
  tag' "DOC" (requireAttr "nid") $ \nid -> do
    _ <- ignoreTreeContent "TEXT"
    anns <- many parseAnnotation
    return (Doc nid anns)

-- parseDocs :: MonadThrow m => ConduitT Event o m (Maybe [Doc])
-- parseDocs = tagNoAttr "DOCUMENTS" $ many parseDoc

perTeacherMistakes :: [Doc] -> PerTeacherMistakes
perTeacherMistakes docs = H.fromListWith (<>) (concatMap fromDoc docs)
  where
    fromDoc Doc {..} = concatMap (fromAnnotation nid) annotations
    fromAnnotation nid Annotation {..} =
      map (fromMistake teacherId nid) mistakes
    fromMistake ::
         TeacherId -> Nid -> Mistake -> (TeacherId, HS.HashSet (Nid, Mistake))
    fromMistake teacherId nid mistake = (teacherId, HS.singleton (nid, mistake))
  --   fromAnnotation Doc{..} Annotation {..} = (teacherId, HS.fromList mistakes)

main :: IO ()
main = do
  args <- getArgs
  case args of
    [fpath] -> do
      docs <-
        runResourceT $
        runConduit $ parseFile def fpath .| force "documents required" (fmap Just (many parseDoc))
      print (length docs)
      let mistakes = perTeacherMistakes docs
      -- putStrLn $ groom mistakes
      let keys = H.keys mistakes
      putStrLn $ "Found teachers: " <> show (length keys)
      putStrLn $
        "Found overall mistakes: " <>
        show (length (concatMap HS.toList mistakes))
      let mistakesIntersection = fold (H.elems mistakes)
      putStrLn $
        "Mistake annotations which intersect upon all teachers: " ++
        show (length mistakesIntersection)
    _ ->
      putStrLn
        "Usage: stack exec -- extract-conll2014 /path/to/official-2014.1.sgml"
下面是一个xml示例:

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENTS>
   <DOC nid="1">
      <TEXT>
         <TITLE>Keeping the Secret of Genetic Testing</TITLE>
         <P>What is genetic risk? Genetic risk refers more to your chance of inheriting a disorder or disease. People get certain disease because of genetic changes. How much a genetic change tells us about your chance of developing a disorder is not always clear. If your genetic results indicate that you have gene changes associated with an increased risk of heart disease, it does not mean that you definitely will develop heart disease. The opposite is also true. If your genetic results show that you do not have changes associated with an increased risk of heart disease, it is still possible that you develop heart disease. However for some rare diseases, people who have certain gene changes are guaranteed to develop the disease. When we are diagonosed out with certain genetic disease, are we suppose to disclose this result to our relatives? My answer is no.</P>
         <P>On one hand, we do not want this potential danger causing firghtenning affects in our families' later lives. When people around us know that we got certain disease, their altitudes will be easily changed, whether caring us too much or keeping away from us. And both are not what we want since most of us just want to live as normal people. Surrounded by such concerns, it is very likely that we are distracted to worry about these problems. It is a concern that will be with us during our whole life, because we will never know when the ''potential bomb'' will explode.</P>
         <P>On the other hand, if there are ways can help us to control or cure the disease, we can going through thses process from the scope of the whole family. For an example, if exercising is helpful for family potential disease, we can always look for more chances for the family to go exercise. And we keep track of all family members health conditions. At the same time, we are prepared to know when there are other members got this disease.</P>
         <P>Here I want to share Forest'view on this issue. Although some people feel that an individual who is found to carry a dominant gene for Huntington's disease has an ethical obligation to disclose that fact to his or her siblings, there currently is no legal requirement to do so. In fact, requiring someone to communicate his or her own genetic risk to family members who are therefore also at risk is considered by many to be ethically dubious."</P>
         <P>Nothing is absolute right or wrong. If certain disease genetic test is very accurate and it is unavoidable and necessary to get treatment and known by others, it is OK to disclose the result. Above all, life is more important than secret.</P>
      </TEXT>
      <ANNOTATION teacher_id="8">
         <MISTAKE start_par="1" start_off="42" end_par="1" end_off="46">
            <TYPE>ArtOrDet</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="1" start_off="118" end_par="1" end_off="125">
            <TYPE>Nn</TYPE>
            <CORRECTION>diseases</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="1" start_off="620" end_par="1" end_off="627">
            <TYPE>Trans</TYPE>
            <CORRECTION>However,</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="1" start_off="740" end_par="1" end_off="751">
            <TYPE>Mec</TYPE>
            <CORRECTION>diagnosed</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="1" start_off="751" end_par="1" end_off="754">
            <TYPE>Prep</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="1" start_off="776" end_par="1" end_off="783">
            <TYPE>Nn</TYPE>
            <CORRECTION>diseases</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="50" end_par="2" end_off="58">
            <TYPE>Wci</TYPE>
            <CORRECTION>having</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="58" end_par="2" end_off="70">
            <TYPE>Mec</TYPE>
            <CORRECTION>frightening</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="71" end_par="2" end_off="78">
            <TYPE>Wform</TYPE>
            <CORRECTION>effects</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="144" end_par="2" end_off="147">
            <TYPE>Wci</TYPE>
            <CORRECTION>have</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="156" end_par="2" end_off="163">
            <TYPE>Nn</TYPE>
            <CORRECTION>diseases</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="171" end_par="2" end_off="180">
            <TYPE>Mec</TYPE>
            <CORRECTION>attitude</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="186" end_par="2" end_off="203">
            <TYPE>Vform</TYPE>
            <CORRECTION>easily change</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="213" end_par="2" end_off="219">
            <TYPE>Prep</TYPE>
            <CORRECTION>caring for</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="410" end_par="2" end_off="412">
            <TYPE>Trans</TYPE>
            <CORRECTION>and</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="512" end_par="2" end_off="516">
            <TYPE>Vt</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="3" start_off="32" end_par="3" end_off="36">
            <TYPE>Ssub</TYPE>
            <CORRECTION>ways that</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="88" end_par="3" end_off="93">
            <TYPE>Vform</TYPE>
            <CORRECTION>go</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="102" end_par="3" end_off="107">
            <TYPE>Mec</TYPE>
            <CORRECTION>these</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="108" end_par="3" end_off="115">
            <TYPE>Nn</TYPE>
            <CORRECTION>processes</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="156" end_par="3" end_off="158">
            <TYPE>Rloc-</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="3" start_off="193" end_par="3" end_off="196">
            <TYPE>V0</TYPE>
            <CORRECTION>reducing</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="277" end_par="3" end_off="279">
            <TYPE>Wci</TYPE>
            <CORRECTION>do</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="288" end_par="3" end_off="290">
            <TYPE>Rloc-</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="3" start_off="290" end_par="3" end_off="293">
            <TYPE>Trans</TYPE>
            <CORRECTION>so</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="420" end_par="3" end_off="423">
            <TYPE>Ssub</TYPE>
            <CORRECTION>who have got</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="4" start_off="21" end_par="4" end_off="28">
            <TYPE>Npos</TYPE>
            <CORRECTION>Forests's</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="5" start_off="11" end_par="5" end_off="19">
            <TYPE>Wform</TYPE>
            <CORRECTION>absolutely</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="5" start_off="39" end_par="5" end_off="46">
            <TYPE>ArtOrDet</TYPE>
            <CORRECTION>a certain</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="5" start_off="47" end_par="5" end_off="54">
            <TYPE>Rloc-</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="5" start_off="142" end_par="5" end_off="147">
            <TYPE>Wci</TYPE>
            <CORRECTION>tell</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="5" start_off="148" end_par="5" end_off="150">
            <TYPE>Prep</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="5" start_off="231" end_par="5" end_off="237">
            <TYPE>Nn</TYPE>
            <CORRECTION>secrets</CORRECTION>
         </MISTAKE>
      </ANNOTATION>
   </DOC>
</DOCUMENTS>
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENTS>
   <DOC nid="1">
      <TEXT>
         <TITLE>Keeping the Secret of Genetic Testing</TITLE>
         <P>What is genetic risk? Genetic risk refers more to your chance of inheriting a disorder or disease. People get certain disease because of genetic changes. How much a genetic change tells us about your chance of developing a disorder is not always clear. If your genetic results indicate that you have gene changes associated with an increased risk of heart disease, it does not mean that you definitely will develop heart disease. The opposite is also true. If your genetic results show that you do not have changes associated with an increased risk of heart disease, it is still possible that you develop heart disease. However for some rare diseases, people who have certain gene changes are guaranteed to develop the disease. When we are diagonosed out with certain genetic disease, are we suppose to disclose this result to our relatives? My answer is no.</P>
         <P>On one hand, we do not want this potential danger causing firghtenning affects in our families' later lives. When people around us know that we got certain disease, their altitudes will be easily changed, whether caring us too much or keeping away from us. And both are not what we want since most of us just want to live as normal people. Surrounded by such concerns, it is very likely that we are distracted to worry about these problems. It is a concern that will be with us during our whole life, because we will never know when the ''potential bomb'' will explode.</P>
         <P>On the other hand, if there are ways can help us to control or cure the disease, we can going through thses process from the scope of the whole family. For an example, if exercising is helpful for family potential disease, we can always look for more chances for the family to go exercise. And we keep track of all family members health conditions. At the same time, we are prepared to know when there are other members got this disease.</P>
         <P>Here I want to share Forest'view on this issue. Although some people feel that an individual who is found to carry a dominant gene for Huntington's disease has an ethical obligation to disclose that fact to his or her siblings, there currently is no legal requirement to do so. In fact, requiring someone to communicate his or her own genetic risk to family members who are therefore also at risk is considered by many to be ethically dubious."</P>
         <P>Nothing is absolute right or wrong. If certain disease genetic test is very accurate and it is unavoidable and necessary to get treatment and known by others, it is OK to disclose the result. Above all, life is more important than secret.</P>
      </TEXT>
      <ANNOTATION teacher_id="8">
         <MISTAKE start_par="1" start_off="42" end_par="1" end_off="46">
            <TYPE>ArtOrDet</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="1" start_off="118" end_par="1" end_off="125">
            <TYPE>Nn</TYPE>
            <CORRECTION>diseases</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="1" start_off="620" end_par="1" end_off="627">
            <TYPE>Trans</TYPE>
            <CORRECTION>However,</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="1" start_off="740" end_par="1" end_off="751">
            <TYPE>Mec</TYPE>
            <CORRECTION>diagnosed</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="1" start_off="751" end_par="1" end_off="754">
            <TYPE>Prep</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="1" start_off="776" end_par="1" end_off="783">
            <TYPE>Nn</TYPE>
            <CORRECTION>diseases</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="50" end_par="2" end_off="58">
            <TYPE>Wci</TYPE>
            <CORRECTION>having</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="58" end_par="2" end_off="70">
            <TYPE>Mec</TYPE>
            <CORRECTION>frightening</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="71" end_par="2" end_off="78">
            <TYPE>Wform</TYPE>
            <CORRECTION>effects</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="144" end_par="2" end_off="147">
            <TYPE>Wci</TYPE>
            <CORRECTION>have</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="156" end_par="2" end_off="163">
            <TYPE>Nn</TYPE>
            <CORRECTION>diseases</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="171" end_par="2" end_off="180">
            <TYPE>Mec</TYPE>
            <CORRECTION>attitude</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="186" end_par="2" end_off="203">
            <TYPE>Vform</TYPE>
            <CORRECTION>easily change</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="213" end_par="2" end_off="219">
            <TYPE>Prep</TYPE>
            <CORRECTION>caring for</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="410" end_par="2" end_off="412">
            <TYPE>Trans</TYPE>
            <CORRECTION>and</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="2" start_off="512" end_par="2" end_off="516">
            <TYPE>Vt</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="3" start_off="32" end_par="3" end_off="36">
            <TYPE>Ssub</TYPE>
            <CORRECTION>ways that</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="88" end_par="3" end_off="93">
            <TYPE>Vform</TYPE>
            <CORRECTION>go</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="102" end_par="3" end_off="107">
            <TYPE>Mec</TYPE>
            <CORRECTION>these</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="108" end_par="3" end_off="115">
            <TYPE>Nn</TYPE>
            <CORRECTION>processes</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="156" end_par="3" end_off="158">
            <TYPE>Rloc-</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="3" start_off="193" end_par="3" end_off="196">
            <TYPE>V0</TYPE>
            <CORRECTION>reducing</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="277" end_par="3" end_off="279">
            <TYPE>Wci</TYPE>
            <CORRECTION>do</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="288" end_par="3" end_off="290">
            <TYPE>Rloc-</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="3" start_off="290" end_par="3" end_off="293">
            <TYPE>Trans</TYPE>
            <CORRECTION>so</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="3" start_off="420" end_par="3" end_off="423">
            <TYPE>Ssub</TYPE>
            <CORRECTION>who have got</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="4" start_off="21" end_par="4" end_off="28">
            <TYPE>Npos</TYPE>
            <CORRECTION>Forests's</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="5" start_off="11" end_par="5" end_off="19">
            <TYPE>Wform</TYPE>
            <CORRECTION>absolutely</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="5" start_off="39" end_par="5" end_off="46">
            <TYPE>ArtOrDet</TYPE>
            <CORRECTION>a certain</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="5" start_off="47" end_par="5" end_off="54">
            <TYPE>Rloc-</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="5" start_off="142" end_par="5" end_off="147">
            <TYPE>Wci</TYPE>
            <CORRECTION>tell</CORRECTION>
         </MISTAKE>
         <MISTAKE start_par="5" start_off="148" end_par="5" end_off="150">
            <TYPE>Prep</TYPE>
            <CORRECTION />
         </MISTAKE>
         <MISTAKE start_par="5" start_off="231" end_par="5" end_off="237">
            <TYPE>Nn</TYPE>
            <CORRECTION>secrets</CORRECTION>
         </MISTAKE>
      </ANNOTATION>
   </DOC>
</DOCUMENTS>