Scraping text from a file to view - using non-standard formatting - PHP

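OK, I have a text file that changes regularly. I need to scrape it for display on screen and possibly insert the data into a database. The text is formatted like this: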

"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products
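I only need the song title (the text between the quotes), who wrote it, and who performed it. As you can see, the "Written by" entry can span multiple lines.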

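I have searched through the existing questions and found one that is similar. I was able to modify its solution (below) so that it at least finds the text between the quotes and puts it into an array. However, even assuming I have the right regular expressions, I don't know where to put the next preg_match statements for "Written by" and "Performed by" so that they add the correct information to the array. Here is the modified code: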

<?php
$in_name = 'in.txt';
$in = fopen($in_name, 'r') or die();

function dump_record($r) {
    print_r($r);
}
$current = array();
/* Read the file line by line from the handle opened above */
while (($line = fgets($in)) !== false) {

    /* Skip empty lines (any amount of whitespace counts as 'empty') */
    if (preg_match('/^\s*$/', $line)) continue;

    /* Search for 'things between quotes' stanzas */
    if (preg_match('/(?<=\")(.*?)(?=\")/', $line, $start)) {
        /* If we already parsed a record, this is the time to dump it */
        if (!empty($current)) dump_record($current);

        /* Let's start the new record */
        $current = array( 'id' => $start[1] );
    }
    else if (preg_match('/^(.*):\s+(.*)\s*/', $line, $keyval)) {
        /* Otherwise parse a plain 'key: value' stanza */
        $current[ $keyval[1] ] = $keyval[2];
    }
    else {
        error_log("parsing error: '$line'");
    }
}
/* Don't forget to dump the last parsed record, situation
 * we only detect at EOF (end of file) */
if (!empty($current)) dump_record($current);

fclose($in);

Here is a regular expression that solves the problem. Keep in mind that a regex isn't really needed here; see the second option below.

<?php

$string = '"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte \'59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products';

// Titles delimit a record
$title_pattern = '#"(?<title>[^\n]+)"\n(?<meta>.*?)(?=\n"|$)#s';
// From the meta section we need these tokens
$meta_keys = array(
    'Written by ' => 'written',
    'Performed by ' => 'performed',
    'Courtesy of ' => 'courtesy',
    "By Arrangement with\n" => 'arranged',
);
// implode() takes the separator first; the reversed argument order was removed in PHP 8
$meta_pattern = '#(?<key>' . implode('|', array_keys($meta_keys)) . ')(?<value>[^\n$]+)(?:\n|$)#ims';


$songs = array();
if (preg_match_all($title_pattern, $string, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $t = array(
            'title' => $match['title'],
        );

        if (preg_match_all($meta_pattern, $match['meta'], $_matches, PREG_SET_ORDER)) {
            foreach ($_matches as $_match) {
                $k = $meta_keys[$_match['key']];
                $t[$k] = $_match['value'];
            }
        }

        $songs[] = $t;
    }
}
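
Note that the meta pattern above keeps only the first line of each value, so "and Rod Price" is dropped from the second record. The following is a minimal sketch of one way to let a value run across line breaks until the next known key. The helper name `parse_meta` is hypothetical (it is not part of the answer above), and the sketch assumes every key starts at the beginning of a line.

```php
<?php
// Hypothetical helper: parse one record's meta block into key => value pairs,
// letting each value run across line breaks until the next known key.
function parse_meta($meta)
{
    $meta_keys = array(
        'Written by '           => 'written',
        'Performed by '         => 'performed',
        'Courtesy of '          => 'courtesy',
        "By Arrangement with\n" => 'arranged',
    );

    // Escape the keys and join them into one alternation.
    $quoted = array();
    foreach (array_keys($meta_keys) as $key) {
        $quoted[] = preg_quote($key, '#');
    }
    $alt = implode('|', $quoted);

    // m: ^ matches at the start of every line; s: . also matches newlines.
    // The lazy value stops right before the next line that starts with a key,
    // or at the very end of the block (\z).
    $pattern = '#^(?<key>' . $alt . ')(?<value>.*?)(?=^(?:' . $alt . ')|\z)#ms';

    $result = array();
    if (preg_match_all($pattern, $meta, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Collapse line breaks and runs of spaces into single spaces.
            $value = trim(preg_replace('/\s+/', ' ', $m['value']));
            $result[$meta_keys[$m['key']]] = $value;
        }
    }
    return $result;
}

$meta = "Written by David Peverett\nand Rod Price\nPerformed by Foghat\n"
      . "Courtesy of Rhino Entertainment\nCompany and Bearsville Records\n"
      . "By Arrangement with\nWarner Special Products";

print_r(parse_meta($meta));
```

With the sample block, the `written` entry comes out as "David Peverett and Rod Price". A wrapped line that itself begins with one of the keys would still be misread as a new field, so this is only as reliable as the input format.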
A solution without regular expressions is also possible, though it is a bit more verbose:

<?php

$string = '"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte \'59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products';

$songs = array();
$current = array();
$lines = explode("\n", $string);
// can't use foreach if we want to extract "By Arrangement"
// cause it spans two lines
for ($i = 0, $_length = count($lines); $i < $_length; $i++) {
    $line = $lines[$i];
    $length = strlen($line); // might want to use mb_strlen()

    // if line is enclosed in " it's a title
    if ($length > 1 && $line[0] == '"' && $line[$length - 1] == '"') {
        if ($current) {
            $songs[] = $current;
        }

        $current = array(
            'title' => substr($line, 1, $length - 2),
        );

        continue;
    }

    $meta_keys = array(
        'By Arrangement with' => 'arranged', 
    );

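    foreach ($meta_keys as $key => $k) {
        if ($key == $line && isset($lines[$i + 1])) {
            // the value sits on the following line, so consume that line too
            $i++;
            $current[$k] = $lines[$i];
            // skip to the next line of the outer for loop
            continue 2;
        }
    }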

    $meta_keys = array(
        'Written by ' => 'written', 
        'Performed by ' => 'performed', 
        'Courtesy of ' => 'courtesy',
    );

    foreach ($meta_keys as $key => $k) {
        if (strpos($line, $key) === 0) {
            $current[$k] = substr($line, strlen($key));
            continue 2;
        }
    }    
}

if ($current) {
    $songs[] = $current;
}
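
The loop above stores only the first line of each field, so a wrapped value such as "and Rod Price" is lost. As a rough sketch of how continuation lines could be handled without regular expressions for the fields themselves, the variant below remembers the field seen most recently and appends any line that starts with no known key to it. The function name `parse_songs` is hypothetical, and the sketch assumes a wrapped line never begins with one of the keys.

```php
<?php
// Hypothetical helper: line-by-line parser that appends continuation lines
// (lines that start with no known key) to the field seen most recently.
function parse_songs($text)
{
    $meta_keys = array(
        'Written by '         => 'written',
        'Performed by '       => 'performed',
        'Courtesy of '        => 'courtesy',
        'By Arrangement with' => 'arranged',
    );

    $songs = array();
    $current = null; // record being built
    $field = null;   // field that continuation lines belong to

    // \R matches any line-break sequence (\n, \r\n, ...)
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }

        // A line wrapped in double quotes starts a new record.
        if (strlen($line) > 1 && $line[0] === '"' && substr($line, -1) === '"') {
            if ($current !== null) {
                $songs[] = $current;
            }
            $current = array('title' => substr($line, 1, -1));
            $field = null;
            continue;
        }

        if ($current === null) {
            continue; // text before the first title is ignored
        }

        // A line that starts with a known key opens that field.
        foreach ($meta_keys as $key => $name) {
            if (strpos($line, $key) === 0) {
                $field = $name;
                $current[$field] = trim(substr($line, strlen($key)));
                continue 2;
            }
        }

        // Anything else continues the previous field.
        if ($field !== null) {
            $current[$field] = trim($current[$field] . ' ' . $line);
        }
    }

    if ($current !== null) {
        $songs[] = $current;
    }
    return $songs;
}

$sample = "\"Chateau Lafltte '59 Boogie\"\nWritten by David Peverett\nand Rod Price\n"
        . "Performed by Foghat\nBy Arrangement with\nWarner Special Products";
print_r(parse_songs($sample));
```

Because "By Arrangement with" is treated like any other key with an empty first line, the two-line case needs no index juggling, and a plain `foreach` is enough.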
How about:

$str =<<<EOD
"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products

EOD;

preg_match_all('/"([^"]+)".*?Written by (.*?)Performed by (.*?)Courtesy/s', $str, $m, PREG_SET_ORDER);
print_r($m);

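If the format of the file isn't going to change anytime soon, I'd start with a solution without any regular expressions and only use them when absolutely necessary.

Is there a rule behind it? "Rhino Entertainment Company" is split across two lines. And what if a company name contains the words "Courtesy" or "Written"?

The problem with the information I'm using is that it is converted from screen grabs to text by OCR software, so there is no hard and fast way it shows up. The most prevalent format is a "Courtesy of" line, a "Performed by" line and a "Written by" line.

Thanks for the reply. I edited it a bit, of course, but this seems to be what I wanted.

Thanks for the reply. I didn't use this to solve the problem above, but it was useful for other problems I ran into in the same project. Thanks for the information.
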
Output of the preg_match_all call:

Array
(
    [0] => Array
        (
            [0] => "Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy
            [1] => Stranglehold
            [2] => Ted Nugent

            [3] => Ted Nugent

        )

    [1] => Array
        (
            [0] => "Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy
            [1] => Chateau Lafltte '59 Boogie
            [2] => David Peverett
and Rod Price

            [3] => Foghat

        )

)
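
The raw captures in the output above still carry trailing line breaks, and the keywords can match anywhere in the text, which is the concern raised in the comments about company names containing "Courtesy" or "Written". The following is a minimal sketch, not a definitive fix: it assumes every keyword starts at the beginning of a line and that each record has a "Courtesy of" line, anchors the keywords with `^` under the `m` flag, and normalises whitespace in the captures. The `$songs` array layout is an illustrative choice.

```php
<?php
$str = <<<EOD
"Stranglehold"
Written by Ted Nugent
Performed by Ted Nugent
Courtesy of Epic Records
By Arrangement with
Sony Music Licensing
"Chateau Lafltte '59 Boogie"
Written by David Peverett
and Rod Price
Performed by Foghat
Courtesy of Rhino Entertainment
Company and Bearsville Records
By Arrangement with
Warner Special Products

EOD;

// m: ^ matches at the start of every line, so a keyword inside a company name
// is only treated as a delimiter when it begins a line.
// s: . also matches newlines, so values may wrap onto following lines.
$pattern = '/^"([^"\n]+)"\s*^Written by (.*?)^Performed by (.*?)^Courtesy of /ms';

$songs = array();
if (preg_match_all($pattern, $str, $m, PREG_SET_ORDER)) {
    foreach ($m as $match) {
        $songs[] = array(
            'title'     => $match[1],
            // collapse line breaks and repeated spaces, then trim the ends
            'written'   => trim(preg_replace('/\s+/', ' ', $match[2])),
            'performed' => trim(preg_replace('/\s+/', ' ', $match[3])),
        );
    }
}

print_r($songs);
```

For the sample text this yields two records, the second with `written` set to "David Peverett and Rod Price". A record that lacks a "Courtesy of" line, which can happen with OCR input, would not match this pattern at all.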