PHP: Regexp for filtering a table
OK, I have a table that is output by some open-source software, but it is not output in an actual table format, e.g.
<table>
  <thead>
    <td>Heading</td>
  </thead>
  <tbody>
    <tr>
      <td>Content</td>
    </tr>
  </tbody>
</table>
So I can't build a web scraper to get the data, or rather I'm not sure whether I can build a scraper for it, because it is all wrapped in a …
text.lines.to_a.each do |line|
  line.sub(/^\| |^\+*-*\+*\-*/) do |match|
    puts "Regexp Match: " << match
  end
  STDIN.getc
  puts "New Line "<< line
end
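For example, the output for the first line would just be +--------------+------------
The result is going into CSV format, so I will use gsub to replace the remaining | with ,
I can use PHP or Ruby, so any answers are very welcome.
For the main work of getting the fields out of the table, use split with the pattern /\s*\|\s*/ on each line: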
$table = '+------------+-------------+-------+-------------+------------+---------------+----------+
| HEADING 1 | HEADING 2 | ETC | ANOTHER | HEADING3 | HEADING4 | SML |
+------------+-------------+-------+-------------+------------+---------------+----------+
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
+------------+-------------+-------+-------------+------------+--------------+----------+
| TOTALS AGENTS:21 | total| total| total| total| total|
+------------+-------------+-------+-------------+------------+--------------+----------+';
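$lines = preg_split('/\r\n|\r|\n/', $table);
$array = array();
foreach ($lines as $line) {
    // Skip the +-----+ border lines.
    if (!preg_match('/\+-+\+/', $line)) {
        // Strip the outer pipes and spaces, then split on the inner pipes.
        $array[] = preg_split('/\s*\|\s*/', trim($line, '| '));
    }
}
print_r($array);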
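This splits each line into an array on every | together with any surrounding whitespace. Without the trim($line, '| ') call, the first and last elements of the array would be empty and would have to be discarded, because the pattern also matches the leading and trailing |.
Output: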
Array
(
[0] => Array
(
[0] => HEADING 1
[1] => HEADING 2
[2] => ETC
[3] => ANOTHER
[4] => HEADING3
[5] => HEADING4
[6] => SML
)
[1] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[2] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[3] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[4] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[5] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[6] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[7] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[8] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[9] => Array
(
[0] => TOTALS AGENTS:21
[1] => total
[2] => total
[3] => total
[4] => total
[5] => total
)
)
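The remark about empty first and last elements can be checked with a small sketch. It is written in Ruby rather than PHP, and the sample line and the helper names split_raw and split_trimmed are made up for illustration. One difference to keep in mind: Ruby's String#split drops trailing empty fields unless a negative limit is passed, so by default only the leading empty field shows up.

```ruby
PATTERN = /\s*\|\s*/

# Split the raw line; the leading "|" produces an empty first field.
# A limit of 0 is Ruby's default and suppresses trailing empty fields.
def split_raw(line, limit = 0)
  line.split(PATTERN, limit)
end

# Strip the outer pipes and whitespace first, then split on the inner pipes.
def split_trimmed(line)
  line.gsub(/\A[|\s]+|[|\s]+\z/, "").split(PATTERN)
end

line = "| content | more content | litl |"
p split_raw(line)       # empty field at the front only
p split_raw(line, -1)   # negative limit keeps the trailing empty field too
p split_trimmed(line)   # no empty fields
```

Trimming before splitting keeps the row arrays free of placeholder elements, which is the same effect the trim call has in the PHP code.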
A Ruby version that converts the text table into an HTML table with the builder gem:
require 'builder'
table = '+------------+-------------+-------+-------------+------------+---------------+----------+
| HEADING 1 | HEADING 2 | ETC | ANOTHER | HEADING3 | HEADING4 | SML |
+------------+-------------+-------+-------------+------------+---------------+----------+
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
+------------+-------------+-------+-------------+------------+--------------+----------+
| TOTALS AGENTS:21 | total| total| total| total| total|
+------------+-------------+-------+-------------+------------+--------------+----------+';
def parse_table(table)
  rows = []
  table.each_line do |line|
    # Skip the +-----+ border lines.
    next if line.match(/^\+/)
    rows << line.split(/\s*\|\s*/).reject(&:empty?)
  end
  rows
end

def html_row(xml, columns)
  xml.tr do
    columns.each do |column|
      xml.td column
    end
  end
end

def html_table(rows)
  head_row = rows.first
  body_rows = rows[1..-1]
  xml = Builder::XmlMarkup.new :indent => 2
  xml.table do
    xml.thead do
      html_row xml, head_row
    end
    xml.tbody do
      body_rows.each do |body_row|
        html_row xml, body_row
      end
    end
  end.to_s
end

rows = parse_table(table)
html = html_table(rows)
puts html
@text = <<END
+------------+-------------+-------+-------------+------------+---------------+----------+
| HEADING 1 | HEADING 2 | ETC | ANOTHER | HEADING3 | HEADING4 | SML |
+------------+-------------+-------+-------------+------------+---------------+----------+
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
+------------+-------------+-------+-------------+------------+--------------+----------+
| TOTALS AGENTS:21 | total| total| total| total| total|
+------------+-------------+-------+-------------+------------+--------------+----------+
END
s = @text.scan(/^[|]\W(.*)[|]$/)
puts s
arr = []
arr2 = []
s.each do |o|
  a = o.to_s.split('|')
  a.each do |oo|
    arr2 << oo.to_s.gsub('["','').gsub('"]','').gsub(/\s+/, "")
  end
  arr << arr2
  arr2 = []
end
arr.each do |i|
  puts i
end
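Hope this helps :) The builder-based script above is a complete Ruby solution. You do need to manually add a | on the last line, though. Its output: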
<table>
<thead>
<tr>
<td>HEADING 1</td>
<td>HEADING 2</td>
<td>ETC</td>
<td>ANOTHER</td>
<td>HEADING3</td>
<td>HEADING4</td>
<td>SML</td>
</tr>
</thead>
<tbody>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>TOTALS AGENTS:21</td>
<td>total</td>
<td>total</td>
<td>total</td>
<td>total</td>
<td>total</td>
</tr>
</tbody>
</table>
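Since the question asks for CSV rather than HTML, the same row-splitting idea can feed Ruby's standard CSV library instead of builder. The following is only a sketch under that assumption: table_rows, table_to_csv and the small sample table are hypothetical names and data, not part of the answers above. Like parse_table, it drops empty cells, which would shift columns if a cell were genuinely blank.

```ruby
require 'csv'

# Same parsing idea as parse_table: skip border lines, split on "|".
def table_rows(table)
  table.each_line.map(&:strip)
       .reject { |line| line.empty? || line.start_with?('+') }
       .map { |line| line.split(/\s*\|\s*/).reject(&:empty?) }
end

# Build one CSV string; the CSV library quotes cells that contain commas.
def table_to_csv(table)
  CSV.generate do |csv|
    table_rows(table).each { |row| csv << row }
  end
end

sample = <<~TABLE
  +-----------+-----------+
  | HEADING 1 | HEADING 2 |
  +-----------+-----------+
  | content   | a, b      |
  +-----------+-----------+
TABLE

puts table_to_csv(sample)
```

Letting the CSV library do the writing avoids the quoting problems that a plain gsub from | to , would run into when a cell itself contains a comma.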
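Output of the scan-based Ruby script:
HEADING1
HEADING2
ETC
ANOTHER
HEADING3
HEADING4
SML
content
morecontent
cont
Moremore
content
content2.0
litl
content
morecontent
cont
Moremore
content
content2.0
litl
content
morecontent
cont
Moremore
content
content2.0
litl
content
morecontent
cont
Moremore
content
content2.0
litl
content
morecontent
cont
Moremore
content
content2.0
litl
content
morecontent
cont
Moremore
content
content2.0
litl
content
morecontent
cont
Moremore
content
content2.0
litl
content
morecontent
cont
Moremore
content
content2.0
litl
TOTALSAGENTS:21
total
total
total
total
total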
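The scan-based Ruby script above may not be as clean as it could be, but it works for this example :)
Use an HTML parser to select the text in the pre tag, then use substring to extract the data (I assume the columns are at fixed positions). If the column widths are fixed within one table but differ between tables, you can analyze the header to work out the width of each column.
@nhahtdh The widths of the columns are not fixed, I wish they were.
If | does not appear in the content, you can split by |. Fixed width means that the width of each column is fixed (different columns may have different widths, but all rows of one column must have the same width).
What's with all the global variables? What is the point of using them here?
@paile Wow, that's great. So then I just need to build something like a mini scraper to get the data from the local file and export it to CSV? Or is there something nicer?
So your desired output is CSV? Have a look at the CSV class in the Ruby standard library.
@paddle Yes. Thanks for the help; as it's my first time I looked into FasterCSV, but it seems to be deprecated?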
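The fixed-width suggestion from the comments could look roughly like the sketch below. It assumes that every data line is padded so that its cells line up with the + signs of the border line, which the asker says is not the case for this particular output; column_ranges, slice_row and the two sample lines are hypothetical.

```ruby
# Derive the column ranges from a border line such as "+----+------+".
# Each column spans the characters between two consecutive "+" signs.
def column_ranges(border)
  plus_positions = (0...border.length).select { |i| border[i] == '+' }
  plus_positions.each_cons(2).map { |left, right| (left + 1)...right }
end

# Cut a data line at those fixed positions and trim the padding.
# Lines shorter than the border simply yield empty cells.
def slice_row(line, ranges)
  ranges.map { |range| (line[range] || '').strip }
end

border = "+" + "-" * 9 + "+" + "-" * 8 + "+"
# The first cell deliberately contains a "|" to show the difference to split.
row = "|" + " a | b".ljust(9) + "|" + " c".ljust(8) + "|"

p slice_row(row, column_ranges(border))
```

Cutting by position keeps a | inside a cell intact, which splitting on | cannot do, but it only works when the alignment assumption holds.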