Parsing Wiki markup with PHP

I have a text file containing Wiki markup. For example:

[[April]]

April is the fourth month of the year. It has 30 days. The name April comes from that Latin word aperire which means "to open". This probably refers to growing plants in spring. April begins on the same day of week as July in all years and also January in leap years.

April's flower is the Sweet Pea. Its birthstone is the diamond. The meaning of the diamond is innocence.

== April in poetry ==

Poets use April to mean the end of winter. For example: April showers bring May flowers.

== Events in April ==

[[August]]

August is the eighth month of the year in the Gregorian calendar, coming between July and September. It has 31 days, the same number of days as the previous month, July, and is named after Roman Emperor Augustus Caesar.

== The Month ==

This month was first called Sextilis in Latin, because it was the sixth month in the old Roman calendar. The Roman calendar began in March about 735 BC with Romulus. October was the eighth month. August was the eighth month when January or February were added to the start of the year by King Numa Pompilius about 700 BC. Or, when those two months were moved from the end to the beginning of the year by the decemvirs about 450 BC (Roman writers disagree). In 153 BC January 1 was determined as the beginning of the year.

August is named for Augustus Caesar who became Roman consul in this month.  The month has 31 days because Julius Caesar added two days when he created the Julian calendar in 45 BC. August is after July and before September.

August, in either hemisphere, is the seasonal equivalent of February in the other. In the Northern hemisphere it is a summer month and it is a winter month in the Southern hemisphere. In a common year, no other month begins on the same day of the week as August, though in leap years, February starts on the same day as August. August always ends on the same day of the week as November.

August's flower is the Gladiolus with the birthstone being peridot. The astrological signs for August are Leo (July 24 - August 22) and Virgo (August 23 - September 23).

== August observances ==

=== Fixed observances and events ===

=== Moveable and Monthlong events ===

== Selection of Historical Events ==

== References ==

April and August are both wiki articles. I managed to extract the titles with the following code:

$fh = fopen("wiki2.txt", "r");
if ($fh) {
    while (($line = fgets($fh)) !== false) {
        preg_match_all('#\\[\\[(.*?)\\]\\]#',$line,$matches,PREG_SET_ORDER);
        foreach($matches as $m) {
            echo $m[0]."<br />";
        }
    }
    fclose($fh);
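}

However, I would also like to be able to pull out the text of each article. Does anyone have an idea how I could use a regex, or some other approach, to extract the article data?

Thanks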

I think you are overthinking this. Besides, wiki markup, like HTML, is not a good fit for regular expressions.

Why not simply do:

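$fh = fopen("wiki2.txt", "r");
if ($fh) {
    $HeaderNumber = 0;
    // Entry 0 collects any text that appears before the first [[Title]] line.
    $Document[$HeaderNumber] = array('Title' => "Default", 'Lines' => array());
    while (($line = fgets($fh)) !== false) {
        // strpos() takes the haystack first and returns false on a miss;
        // a match at offset 0 is a valid hit, so compare against false with !==.
        if (strpos($line, '[[') !== false && strpos($line, ']]') !== false) {
            // A new title line closes the article collected so far.
            $Document[$HeaderNumber]['Text'] = implode("\n", $Document[$HeaderNumber]['Lines']);
            unset($Document[$HeaderNumber]['Lines']);
            $HeaderNumber++;
            $line = trim(str_replace(array("[[", "]]"), "", $line));
            $Document[$HeaderNumber] = array('Title' => $line, 'Lines' => array());
            continue;
        }

        // fgets() keeps the line ending, so strip it before storing the line.
        $Document[$HeaderNumber]['Lines'][] = rtrim($line, "\r\n");
    }
    // The last article has no following title line, so close it here.
    $Document[$HeaderNumber]['Text'] = implode("\n", $Document[$HeaderNumber]['Lines']);
    unset($Document[$HeaderNumber]['Lines']);
    fclose($fh);
}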

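This creates a numerically indexed array in which each entry has a Title field and a Text field, containing exactly what the names suggest. You can then use a PEAR library to process the text further into HTML.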
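As a rough sketch of the same line-by-line idea packaged as a function: the helper name splitWikiArticles, the sample string, and the assumption that a title line contains nothing but [[Name]] are all illustrative and do not come from the answer above. It swaps strpos for a line-anchored preg_match so that an inline [[link]] inside article text would not start a new article.

```php
<?php
// Hypothetical helper (not from the original answer): returns an array
// mapping each article title to its text.
function splitWikiArticles(string $text): array
{
    $articles = [];
    $title = null;
    // \R matches any line-ending sequence (\n, \r\n, \r).
    foreach (preg_split('/\R/', $text) as $line) {
        // Assumption: a title line is exactly "[[Name]]" and nothing else.
        if (preg_match('/^\[\[([^\[\]]+)\]\]\s*$/', $line, $m)) {
            $title = $m[1];
            $articles[$title] = [];
            continue;
        }
        // Lines that appear before the first title are ignored.
        if ($title !== null) {
            $articles[$title][] = $line;
        }
    }
    // Join each article's lines and trim the surrounding blank lines.
    return array_map(function (array $lines): string {
        return trim(implode("\n", $lines));
    }, $articles);
}

// Made-up sample input in the same shape as the file in the question.
$sample = "[[April]]\n\nApril is the fourth month of the year.\n\n"
        . "== April in poetry ==\n\n[[August]]\n\n"
        . "August is the eighth month of the year.\n";

$articles = splitWikiArticles($sample);
foreach ($articles as $title => $body) {
    echo $title, "\n", $body, "\n\n";
}
```

Inside the foreach loop the title and the body are separate variables, which is the shape needed for binding them to a prepared INSERT statement. Articles that share a title would overwrite one another in this sketch.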

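What do you mean by pulling out the article data? Isn't the text all there already? What is your actual goal? If you want to convert Wiki markup to HTML, there are existing solutions, and this topic has been covered here as well.

I want to separate the article titles stored in $m[0] from the article text, but I have not found a way to match the text yet. In the end, I want the article titles and the article contents in separate variables that I can insert into a MySQL database.

MediaWiki, the software that runs Wikipedia, is written in PHP and is open source. You could take the parser from its code.

Thanks, this seems to be in line with what I am looking for. However, the code inside the if statement never seems to execute. I checked the output of strpos('[[', $line) and got nothing. Could the problem be in what $line contains (the trailing \n)?

I managed to get it working with different code, but I would still like to use your strpos code. The problem is that strpos only checks a single character, not two. Is there a way to check for both brackets?

Hmm, if it is not working, it may be a type coercion problem; unfortunately, strpos has a history of that. It is definitely checking for both: note that I test each one separately in the if.

Aha, yes, of course it did not work. strpos returns 0 for the '[[' test because the characters are at position 0. I have edited the example, so it should work now. I also fixed the missing bracket.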
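The position-0 pitfall discussed in these comments can be reproduced in a few lines. This is a minimal sketch; the variable names and the sample line are made up for illustration. It also shows that strpos matches the whole needle string, so '[[' is checked as two characters, not one.

```php
<?php
// Sample title line as fgets() would return it, including the line ending.
$line = "[[April]]\n";

// strpos() takes the haystack first and returns the offset of the first
// match. The two-character needle '[[' is found at offset 0 here.
$pos = strpos($line, '[[');
var_dump($pos); // int(0)

// A loose truthiness check treats offset 0 the same as "not found".
$looseFound = (bool) $pos;        // false, although the marker is present

// Comparing against false with !== tells offset 0 apart from a miss.
$strictFound = ($pos !== false);  // true

// Requiring both brackets: test each needle separately.
$isTitle = strpos($line, '[[') !== false && strpos($line, ']]') !== false;

// On a line without the marker, strpos() returns false.
$missing = strpos("plain text\n", '[[');
```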