Parsing: implementing a recursive parser without try


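I'm trying to parse a Wikipedia XML dump to find certain links on each page, using the Haskell Parsec library. Links are denoted by double brackets: `texttext[[link]]texttext`. To simplify the scenario as much as possible, let's say I'm looking for the first link that is not enclosed in double curly braces (which can be nested): `{{ {{ [[Wrong Link]] }} [[Wrong Link]] }} [[Right Link]]`. I've written a parser that discards links enclosed in non-nested double braces: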

import Text.Parsec

getLink :: String -> Either ParseError String
getLink = parse linkParser "Links"

linkParser = do
    beforeLink
    link <- many $ noneOf "]"
    string "]]"
    return link

beforeLink = manyTill (many notLink) (try $ string "[[")

notLink = try doubleCurlyBrac <|> (many1 normalText)

normalText = noneOf "[{"
           <|> notFollowedByItself '['
           <|> notFollowedByItself '{'

notFollowedByItself c = try ( do x <- char c
                                 notFollowedBy $ char c
                                 return x)

doubleCurlyBrac = between (string "{{") (string "}}") (many $ noneOf "}")

getLinkTest = fmap getLink testList
    where testList = ["   [[rightLink]]   "                            --Correct link is found
                     , "  {{    [[Wrong_Link]]    }}  [[rightLink]]"   --Correct link is found
                     , "  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]" ] --Wrong link is found 

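In the nested example, this parser stops consuming input after the first `}}` rather than the last one. Is there an elegant way to write a recursive parser that (in this case) correctly ignores links inside nested double curly braces? Also, can this be done without `try`? I've found that because `try` does not consume input, it often causes the parser to hang on unexpected, malformed input.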

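My solution does not use `try`, but it is relatively complicated: I used your question as an excuse to learn how to create a lexer :D I avoid `try` because the only lookahead happens in the lexer (`tokenize`), where the various bracket pairs are recognized.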
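
The lexer code itself is not reproduced in this text. As a rough sketch of the idea only, and not the answer's actual implementation, the following uses plain Haskell lists instead of Parsec. A hypothetical `tokenize` recognizes the four bracket pairs by pattern matching on two characters at a time, and a small recursive-descent function builds a tree over the resulting tokens without any backtracking. The names `Token`, `Chr`, `parseTokens`, `firstLink`, and the `String` error messages are assumptions made for illustration.

```haskell
import Data.List (find)

-- Tokens: the four bracket pairs, plus any other single character.
data Token = LLink | RLink | LBrace | RBrace | Chr Char
  deriving (Eq, Show)

-- All lookahead lives here: two characters are inspected at a time.
tokenize :: String -> [Token]
tokenize ('[':'[':rest) = LLink  : tokenize rest
tokenize (']':']':rest) = RLink  : tokenize rest
tokenize ('{':'{':rest) = LBrace : tokenize rest
tokenize ('}':'}':rest) = RBrace : tokenize rest
tokenize (c:rest)       = Chr c  : tokenize rest
tokenize []             = []

-- Link, braces, or plain text.
data Node = L [Node] | B [Node] | S String
  deriving (Eq, Show)

-- Parse nodes until a closing token or the end of input; return the
-- nodes together with the tokens that were not consumed.
parseTokens :: [Token] -> Either String ([Node], [Token])
parseTokens ts = case ts of
  LLink  : rest -> group L RLink  rest
  LBrace : rest -> group B RBrace rest
  Chr _  : _    -> let (cs, rest) = span isChr ts
                   in  prepend (S [c | Chr c <- cs]) (parseTokens rest)
  _             -> Right ([], ts)   -- closing token or end of input
  where
    -- Parse the body of a bracketed group, then require its closer.
    group wrap closer rest = do
      (inner, rest') <- parseTokens rest
      case rest' of
        t : rest'' | t == closer -> prepend (wrap inner) (parseTokens rest'')
        _                        -> Left ("expecting " ++ show closer)
    prepend n = fmap (\(ns, rest) -> (n : ns, rest))
    isChr (Chr _) = True
    isChr _       = False

-- Parse a whole string; leftover tokens mean an unmatched closer.
parseNodes :: String -> Either String [Node]
parseNodes s = do
  (ns, leftover) <- parseTokens (tokenize s)
  case leftover of
    []    -> Right ns
    t : _ -> Left ("unexpected " ++ show t)

-- First top-level link, if any.
firstLink :: [Node] -> Maybe Node
firstLink = find isLink
  where
    isLink (L _) = True
    isLink _     = False
```

With these definitions, `fmap firstLink (parseNodes "  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]")` evaluates to `Right (Just (L [S "rightLink"]))`, and an input with an unmatched `{{` produces a `Left`. Because the tokenizer decides between single and double brackets before parsing starts, the tree-building step never needs to undo a choice.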

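The high-level idea is that we treat `{{`, `}}`, `[[`, and `]]` as special tokens and parse the input into an AST. You did not specify the grammar very precisely, so I chose a simple one that generates your examples: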

node ::= '{{' node* '}}'
       | '[[' node* ']]'
       | string
string ::= <non-empty string without '{{', '}}', '[[', or ']]'>
becomes

link = L <$> between llink  rlink  string
The output of `main` is:

Test: ^   [[rightLink]]   $
Nodes: [S "   ",L [S "rightLink"],S "   "]
Link: Just (L [S "rightLink"])

Test: ^  {{    [[Wrong_Link]]    }}  [[rightLink]]$
Nodes: [S "  ",B [S "    ",L [S "Wrong_Link"],S "    "],S "  ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])

Test: ^  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S "  ",B [S "  ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])

Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
Link: Just (L [B [L [S "someLink"]]])

Test: ^{{ab}cd}}$
Nodes: [B [S "ab}cd"]]
Link: Nothing

Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
Link: Just (L [S "[asdfa"])

Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
Link: Nothing

Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting }}

Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 24):
unexpected }}
expecting end of input

Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting ]]

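Here is a more direct version that does not use a custom lexer. It does use `try`, though, and I don't see how to avoid it here. The problem is that we seem to need non-committing lookahead to distinguish double brackets from single brackets, and `try` is what provides non-committing lookahead.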
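
To make the role of `try` concrete, here is a small sketch, separate from the answer's code, that contrasts a committed choice with a backtracking one. The parser names `committed` and `backtracking`, the helper `accepts`, and the second alternative `"[x"` are made up for illustration. In Parsec, `string` consumes input as soon as its first character matches, so when it fails partway through, `<|>` does not try the next alternative unless the failing parser is wrapped in `try`.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Without try: string "[[" consumes the first '[' before failing on
-- the second character, so the alternative is never attempted.
committed :: Parser String
committed = string "[[" <|> string "[x"

-- With try: a failure inside string "[[" rewinds the input, so the
-- alternative gets to see the original '['.
backtracking :: Parser String
backtracking = try (string "[[") <|> string "[x"

-- True when the parser accepts the whole input.
accepts :: Parser String -> String -> Bool
accepts p = either (const False) (const True) . parse (p <* eof) ""
```

Here `accepts committed "[x"` is `False` while `accepts backtracking "[x"` is `True`; both parsers accept `"[["`.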

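The high-level approach is the same as in the lexer-based answer. I've been careful to make the three node parsers commute, which makes the code more robust to change, by using both `try` and `notFollowedBy`: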

{-# LANGUAGE TupleSections #-}
import Text.Parsec hiding (string)
import qualified Text.Parsec
import Control.Applicative ((<$>) , (<*) , (<*>))
import Control.Monad (forM_)
import Data.List (find)

import Debug.Trace

----------------------------------------------------------------------
-- Token parsers.

llink , rlink , lbrace , rbrace :: Parsec String u String
[llink , rlink , lbrace , rbrace] = reserved
reserved = map (try . Text.Parsec.string) ["[[" , "]]" , "{{" , "}}"]

----------------------------------------------------------------------
-- Node parsers.

-- Link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show

nodes :: Parsec String u [Node]
nodes = many node

node :: Parsec String u Node
node = link <|> braces <|> string

link , braces , string :: Parsec String u Node
link   = L <$> between llink  rlink  nodes
braces = B <$> between lbrace rbrace nodes
string = S <$> many1 (notFollowedBy (choice reserved) >> anyChar)

----------------------------------------------------------------------

parseNodes :: String -> Either ParseError [Node]
parseNodes = parse (nodes <* eof) "<no file>"

----------------------------------------------------------------------
-- Tests.

getLink :: [Node] -> Maybe Node
getLink = find isLink where
  isLink (L _) = True
  isLink _     = False

parseLink :: String -> Either ParseError (Maybe Node)
parseLink = either Left (Right . getLink) . parseNodes

testList = [ "   [[rightLink]]   "
           , "  {{    [[Wrong_Link]]    }}  [[rightLink]]"
           , "  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]"
           , " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
           -- Pathological example from comments.
           , "{{ab}cd}}"
           -- A more pathological example.
           , "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf" 
           -- No top level link.
           , "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
           -- Too many '{{'.
           , "{{ {{ {{ [[ asdf ]] }} }}"
           -- Too many '}}'.
           , "{{ {{ [[ asdf ]] }} }} }}"
           -- Too many '[['.
           , "[[ {{ [[{{[[asdf]]}}]]}}"
           ]

main =
  forM_ testList $ \ t -> do
  putStrLn $ "Test: ^" ++ t ++ "$"
  let parses = ( , ) <$> parseNodes t <*> parseLink t
      printParses (n , l) = do
        putStrLn $ "Nodes: " ++ show n
        putStrLn $ "Link: " ++ show l
      printError = putStrLn . show
  either printError printParses parses
  putStrLn ""

However, the parse error messages are less helpful in the case of unmatched openings:

Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"

Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 26):
unexpected "}}"

Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"

I don't know how to fix them.

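How do you want `"{{ab}cd}}"` to be parsed? A more detailed description of the grammar would help.

@KarolisJuodelė In that example, the parser should pick `ab}cd`.

@JohnG It should, but `noneOf "}"` will stop after `ab`.

@KarolisJuodelė I tried to simplify as much as possible for the clarity of the question. My actual code is something like `many (noneOf "}" <|> notFollowedByItself '}')`.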