PHP - Fill arrays sequentially until a maximum length is reached
I need to use PHP to extract data from a text file formatted like this:
BEGIN
#1
#2
#3
#4
#5
#6
1 2015-05-31 2001-11-24 'Name Surname' ID_1 0
2 2011-04-01 ? ? ID_2 1
2 2013-02-24 ? ? ID_3 1
2 2014-02-28 ? 'Name Surname' ID_4 2
END
The information should be organized logically into arrays like these:
Array ( [#1] => 1 [#2] => 2015-05-31 [#3] => 2001-11-24 [#4] => 'Name Surname' [#5] => ID_1 [#6] => 0 )
Array ( [#1] => 2 [#2] => 2011-04-01 [#3] => ? [#4] => ? [#5] => ID_2 [#6] => 1 )
Array ( [#1] => 2 [#2] => 2013-02-24 [#3] => ? [#4] => ? [#5] => ID_3 [#6] => 1 )
Array ( [#1] => 2 [#2] => 2014-02-28 [#3] => ? [#4] => 'Name Surname' [#5] => ID_4 [#6] => 2 )
I am looking for a way to obtain this result. I am using the following code:
<?php
//ini_set('max_execution_time', 300); //300 seconds = 5 minutes
function startsWith($str, $char){
    return $str[0] === $char;
}
$txt_path = "./test.txt";
$txt_data = @file_get_contents($txt_path) or die("Could not access file: $txt_path");
//echo $txt_data;
$loop_pattern = "/BEGIN(.*?)END/s";
preg_match_all($loop_pattern, $txt_data, $matches);
$loops = $matches[0];
//print_r($loops);
$loops_count = count($loops);
//echo $loops_count; // number of loops into the file
foreach ($loops as $key => $value) {
    $value = trim($value);
    $pattern = array("/[[:blank:]]+/", "/BEGIN(.*)/", "/END(.*)/");
    $replacement = array(" ", "", "");
    $value = preg_replace($pattern, $replacement, $value);
    //print_r($value);
    //echo "<br><br>";
    $value_array = explode("\n", $value);
    $value_array_clean = array_filter($value_array, 'strlen');
    $value_array_clean_reindex = array_values($value_array_clean);
    //print_r($value_array_clean_reindex);
    //echo "<br><br>";
    $keys = array();
    $values = array();
    foreach ($value_array_clean_reindex as $key => $value) {
        $value = trim($value);
        if ( startsWith($value, "#") ) {
            array_push($keys, $value);
            $keys_count = count($keys);
        } else {
            array_push($values, $value);
            $values_count = count($values);
            $loop_dic = array();
            foreach ($values as $key => $value) {
                $value = trim($value);
                preg_match_all("/'(?:.|[^'])*'|\S+/", $value, $matches);
                //print_r($matches[0]);
                $loop_dic = array_combine($keys, $matches[0]);
            }
            print_r($loop_dic);
            echo "<br><br>";
        }
    }
}
?>
But sometimes a problem occurs at this line:
$loop_dic = array_combine($keys, $matches[0]);
I found out that the original text file contains very long lines that get broken, producing new lines; instead of:
2 2014-02-28 ? 'Name Surname' ID_4 2
the line is broken like this:
2 2014-02-28 ? 'Name Surname'
ID_4 2
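So, when I explode the string on \n, the two arrays that I then combine end up with different lengths, which causes the error.
I would like to ask for an alternative way to solve this, that is, to get arrays of equal length even when such breaks occur in the original file.
Searching the web, I found that perhaps, if I know (via count) the number of keys in the array for each loop ([#1], …, [#6]), I could loop and fill the value arrays, adding values in order until each array reaches its maximum length.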
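The idea in the previous paragraph can be sketched as follows. This is only a minimal illustration, assuming that every value is either a single-quoted run or a run of non-space characters; the helper name fillRows() is hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper: fills one row per count($keys) tokens, in order.
// Assumption: a value is either '...' (single-quoted) or contains no whitespace.
function fillRows(array $keys, string $body): array
{
    // Newlines are treated like any other whitespace, so a record that is
    // broken across two lines still yields its tokens in the right order.
    preg_match_all("/'[^']*'|\S+/", $body, $m);

    $rows = [];
    foreach (array_chunk($m[0], count($keys)) as $chunk) {
        if (count($chunk) === count($keys)) { // skip a trailing partial record
            $rows[] = array_combine($keys, $chunk);
        }
    }
    return $rows;
}

$keys = ['#1', '#2', '#3', '#4', '#5', '#6'];
$body = "1 2015-05-31 2001-11-24 'Name Surname' ID_1 0\n"
      . "2 2014-02-28 ? 'Name Surname'\n"   // record broken across two lines
      . "ID_4 2\n";

$rows = fillRows($keys, $body);
print_r($rows[1]);
```

Because the tokens are chunked by the number of keys, the position of the line breaks no longer matters; only the token order does.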
Thank you for your attention and help.
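EDIT #1
Thanks to @fusion3k for the solution!
Checking the behavior on some input files revealed two further problems:
1) Analyzing some errors, I found that input files sometimes use double quotes (instead of single quotes), and that there are also multi-line text blocks between semicolons, like this: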
;This is some text
in multiline with "double
quotes" too
;
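Such a block needs to be treated as a single value for the given key, and the value needs to be put on one line, as @fusion3k's code does, by replacing \n with a space. I am trying to merge @fusion3k's working code with code written to handle this behavior. The file structure can look like this: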
BEGIN
#1
#2
#3
#4
#5
#6
1 2015-05-31 2001-11-24 "Name Surname" ID_1 0
2 2011-04-01 ? ? ID_2 1
2 2013-02-24 ? ? ID_3 1
2 2014-02-28 ? "Name Surname" ID_4 2
;This is some text
in multiline with "double
quotes" too
;
2016-01-22 ? "Name Surname" ID_5 2
END
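It should generate something similar to the working code above, but taking into account that different text-block delimiters exist, such as semicolons (;), single quotes (') or double quotes ("), which delimit text blocks that must be treated as a single value for a key, as in this array for the text file content above: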
Array ( [#1] => Array ( [0] => 1 [1] => 2 [2] => 2 [3] => 2 [4] => This is some text in multiline with "double quotes" too ) [#2] => Array ( [0] => 2015-05-31 [1] => 2011-04-01 [2] => 2013-02-24 [3] => 2014-02-28 [4] => 2016-01-22 ) [#3] => Array ( [0] => 2001-11-24 [1] => ? [2] => ? [3] => ? [4] => ? ) [#4] => Array ( [0] => Name Surname [1] => ? [2] => ? [3] => Name Surname [4] => Name Surname ) [#5] => Array ( [0] => ID_1 [1] => ID_2 [2] => ID_3 [3] => ID_4 [4] => ID_5 ) [#6] => Array ( [0] => 0 [1] => 1 [2] => 1 [3] => 2 [4] => 2 ) )
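One possible way to treat all three delimiters uniformly is to tokenize the data part first and only then distribute the tokens over the keys. The sketch below is an illustration under stated assumptions, not a tested solution for real files: a semicolon block opens with ";" at the start of a line and closes with a ";" alone on a line, quoted values do not contain their own quote character, and lines end with \n. The names tokenizeValues() and rowsFromTokens() are hypothetical. It builds row arrays rather than the column arrays printed above.

```php
<?php
// Hypothetical helper: returns all values of one loop body, in order.
function tokenizeValues(string $body): array
{
    // Branch reset (?|...) makes group 1 hold the value in every alternative:
    //   ;...;  multi-line block   "..."  double-quoted   '...'  single-quoted
    //   \S+    any other run of non-space characters
    $pattern = '/(?|^;(.*?)^;\h*$|"([^"]*)"|\'([^\']*)\'|(\S+))/ms';
    preg_match_all($pattern, $body, $m);

    // Put every value on one line: newlines and repeated blanks become one space.
    return array_map(
        static function ($v) {
            return preg_replace('/\s+/', ' ', trim($v));
        },
        $m[1]
    );
}

// Hypothetical helper: one associative row per count($keys) tokens.
function rowsFromTokens(array $keys, array $tokens): array
{
    $rows = [];
    foreach (array_chunk($tokens, count($keys)) as $chunk) {
        if (count($chunk) === count($keys)) {
            $rows[] = array_combine($keys, $chunk);
        }
    }
    return $rows;
}

$keys = ['#1', '#2', '#3', '#4', '#5', '#6'];
$body = <<<'TXT'
1 2015-05-31 2001-11-24 "Name Surname" ID_1 0
2 2011-04-01 ? ? ID_2 1
2 2013-02-24 ? ? ID_3 1
2 2014-02-28 ? "Name Surname" ID_4 2
;This is some text
in multiline with "double
quotes" too
;
2016-01-22 ? "Name Surname" ID_5 2
TXT;

$rows = rowsFromTokens($keys, tokenizeValues($body));
print_r($rows[4]);
```

The semicolon alternative is listed first so that quotes inside a semicolon block are consumed as part of the block instead of starting a quoted value of their own.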
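I worked on a simple string to find a "working" regex that takes into account semicolons together with single or double quotes. So far I have not found a file that uses all three delimiters to separate text blocks, but semicolon + single quotes, semicolon + double quotes, single quotes only, and double quotes only all seem to occur; a solution that handles all three delimiters in the same text file would be best: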
$string = 'something here
;and there
;
oh, "that\'s all!"';
$string = str_replace( "\n", " ", $string );
$origin = array("/[[:blank:]]+/", "/\"/", "/;/");
$replacement = array(" ", "\" ", "; ");
$string = preg_replace($origin, $replacement, $string);
$pattern = '/([;"])\s+/';
print_r(array_filter(preg_split( $pattern, $string ), 'strlen'));
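This is the output (as desired):
Note that a text block between semicolons always starts on a new line beginning with a semicolon, and ends with a semicolon on a line of its own, after which another new line starts.
I don't know whether it could be written in a better or faster way. I then tried to merge it with @fusion3k's code to process the text file content above, but without success. I tried an if/elseif/else construct like the following: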
if ( preg_match('/;(.*?);|\'(.*?)\'/', $value, $matches) ) {// semicolon with single quotes in the $value string
    $value = str_replace( "\n", " ", $value );
    $origin = array("/[[:blank:]]+/", "/'/", "/;/");
    $replacement = array(" ", "' ", "; ");
    $value = preg_replace($origin, $replacement, $value);
    $pattern = '/'.str_repeat( "([;'])\s+", count( $keys ) ).'/';
    print_r(array_filter(preg_split( $pattern, $value ), 'strlen')); // I would have an array of values of the same length of the array for the keys
    echo "<br><br>";
} elseif ( preg_match('/;(.*?);|"(.*?)"/', $value, $matches) ) {// semicolon with double quotes in the $value string
    $value = str_replace( "\n", " ", $value );
    $origin = array("/[[:blank:]]+/", "/\"/", "/;/");
    $replacement = array(" ", "\" ", "; ");
    $value = preg_replace($origin, $replacement, $value);
    $pattern = '/'.str_repeat( "([;\"])\s+", count( $keys ) ).'/';
    print_r(array_filter(preg_split( $pattern, $value ), 'strlen')); // I would have an array of values of the same length of the array for the keys
    echo "<br><br>";
} else {// neither single quotes (or double quotes) nor semicolon in the $value string
    $pattern = '/'.str_repeat( "(\S+)\s+", count( $keys ) ).'/';
    preg_match_all( $pattern, $value, $matches );
    //print_r($matches);
    //echo "<br><br>";
    $loop_dic = array_combine( $keys, array_slice( $matches, 1 ) );
    print_r( $loop_dic ); // this is good...maybe in a better way?
    echo "<br><br>";
}
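2) With large files, the BEGIN…END pattern ($loop_pattern, matched with preg_match_all()) does not pick up all the loops in the file.
I think the cause may be the execution-time and PCRE backtracking limits, because setting: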
ini_set('max_execution_time', 300); // 300 seconds = 5 minutes
ini_set("pcre.backtrack_limit", "100000000"); // default 100k = "100000"
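seems to solve the problem, but I don't know whether it is the only way. In fact, if the file is large (17 MB or more), the browser is unresponsive for a moment before the page finishes loading (tested on the latest version of Firefox). Parsing the whole file in chunks might be a good idea, but how can that be done?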
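On parsing in chunks: one option is to read the file line by line and hand over one BEGIN…END block at a time, so that no regex ever runs over the whole file. The sketch below is a minimal illustration, assuming that BEGIN and END each sit on a line of their own as in the samples above; the generator name readLoops() is hypothetical, and a php://memory stream stands in for the real file.

```php
<?php
// Hypothetical generator: yields the body of each BEGIN...END block in turn.
// Assumption: BEGIN and END each appear alone on a line.
function readLoops($handle): Generator
{
    $buffer = null; // null means "currently outside a block"
    while (($line = fgets($handle)) !== false) {
        $trimmed = trim($line);
        if ($trimmed === 'BEGIN') {
            $buffer = '';               // start collecting a new block
        } elseif ($trimmed === 'END') {
            if ($buffer !== null) {
                yield $buffer;          // hand the finished block to the caller
            }
            $buffer = null;
        } elseif ($buffer !== null) {
            $buffer .= $line;           // keep key lines and data lines as they are
        }
    }
}

// Usage with an in-memory stream; with a real file this would be fopen($txt_path, 'r').
$handle = fopen('php://memory', 'w+');
fwrite($handle, "BEGIN\n#1\n#2\n1 a\nEND\nBEGIN\n#1\n2\nEND\n");
rewind($handle);

foreach (readLoops($handle) as $index => $body) {
    echo "Loop $index has ", strlen($body), " bytes\n";
}
fclose($handle);
```

Each yielded body can then be processed on its own, which keeps memory use and regex backtracking bounded by the size of a single loop rather than by the size of the file.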
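Thank you very much for your attention and help.

To solve your problem, the usual approach is to count the retrieved matches and, if there are fewer of them than keys, continue the loop without re-initializing $loop_dic.
I suggest the reverse approach: instead of exploding the string line by line before retrieving the values, replace the newlines with spaces. Your string structure is solid enough for this, and you know the number of fields, so this approach should work.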
The code outside the main foreach loop does not change. Likewise, the code that retrieves the text wrapped by BEGIN…END does not change:
foreach( $loops as $key => $value )
{
$value = trim( $value );
$pattern = array( "/[[:blank:]]+/", "/BEGIN(.*)/", "/END(.*)/" );
$replacement = array( " ", "", "" );
$value = preg_replace( $pattern, $replacement, $value );
To retrieve the keys we use preg_match_all(), and then remove the matching lines with preg_replace().
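The code for this step did not survive in this copy of the answer. A minimal sketch of what such a step could look like, assuming every key sits alone on a line that starts with "#" (the two patterns are my assumption, not the answer's original code):

```php
<?php
// Sample loop body after BEGIN/END have been stripped (assumed input).
$value = "#1\n#2\n#3\n1 2015-05-31 2001-11-24\n2 2011-04-01 ?\n";

// Collect every line-initial "#..." token as a key.
preg_match_all('/^#\S+/m', $value, $matches);
$keys = $matches[0];

// Remove the key lines (including their line break) so only data rows remain.
$value = trim(preg_replace('/^#\S+\h*\R?/m', '', $value));

print_r($keys);
echo $value, "\n";
```

A data value that happens to start a line with "#" would be mistaken for a key here, so the patterns may need tightening for real files.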
Now $value contains only the data rows. We replace all newlines with spaces:
$value = str_replace( "\n", " ", $value );
Then we build the row pattern by repeating the field pattern once per key, and retrieve all rows with preg_match_all().
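The code for this step is also missing from this copy. As a hedged sketch: the field pattern below (a single-quoted run or a run of non-space characters, followed by whitespace) is an assumption modelled on the str_repeat() call quoted in the question, not necessarily the answer's exact pattern.

```php
<?php
$keys  = ['#1', '#2', '#3', '#4', '#5', '#6'];
// Data rows already joined into one line (newlines replaced by spaces).
$value = "1 2015-05-31 2001-11-24 'Name Surname' ID_1 0 "
       . "2 2014-02-28 ? 'Name Surname' ID_4 2";

// One capture group per key: a quoted run or a run of non-space characters.
$field   = "('[^']*'|\S+)\s+";
$pattern = '/' . str_repeat($field, count($keys)) . '/';

// The appended space lets the last field of the last row satisfy "\s+".
preg_match_all($pattern, $value . ' ', $matches);

// $matches[0] holds whole rows; $matches[1]..$matches[6] hold one column each.
print_r($matches[5]);
```

Each full match consumes exactly count($keys) fields, so rows are delimited by field count rather than by line breaks.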
Finally, we use array_slice() to drop the global match and combine the result with $keys, which gives the expected result. The foreach loop can then be closed:
$values = array_combine( $keys, array_slice( $matches, 1 ) );
}
The main difference between my $values and your $loop_dic is that the main $values array holds columns; if you prefer an array organized by rows, you can easily transpose it.
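For reference, transposing the column-oriented array into rows could look like the sketch below; the helper name columnsToRows() is hypothetical.

```php
<?php
// Hypothetical helper: turns ['#1' => [a, b], '#2' => [c, d]]
// into [['#1' => a, '#2' => c], ['#1' => b, '#2' => d]].
function columnsToRows(array $columns): array
{
    $rows = [];
    foreach ($columns as $key => $column) {
        foreach ($column as $i => $cell) {
            $rows[$i][$key] = $cell;
        }
    }
    return $rows;
}

$values = [
    '#1' => ['1', '2'],
    '#2' => ['2015-05-31', '2011-04-01'],
    '#5' => ['ID_1', 'ID_2'],
];
print_r(columnsToRows($values));
```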
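I have tested the code with many different "broken lines" and it works. I suggest you test it carefully with different strings to check that it works in every case. If the lines are always broken just before ID_, you can pre-process the string instead:
$txt_data = str_replace( "\nID_", " ID_", $txt_data );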
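Thanks @fusion3k. No, there is no rule for where the lines break. I only know that this is the cause of the problem: exploding on \n fails because it produces arrays of different lengths.
Fantastic!!! I understand your approach: I will check it with some input files tomorrow. Thank you very much for your time!! Regards
I couldn't wait: I tested it on different files and it looks perfect and very fast!!! For a file of 170925 lines and 17569148 characters, it takes 0.2635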