PHP/Perl code to create a MySQL table from an arbitrary TSV file
Can anyone point me to some PHP or Perl code that will create a MySQL table from an arbitrary TSV file? Based on the data it finds and a few parameters, it would use its own logic to work out an appropriate field type for each field, create the database table, and upload the data. (That is, the table structure is not known in advance.)
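(Alternatively, I can imagine it creating an initial table with generic text types, then running SQL queries to analyze the data, and then altering the table structure to match the data.)
I'd suggest replacing all the tabs with commas and then using a CSV importer; there is plenty of PHP sample code for that on the Internet.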
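One caveat with a plain tab-to-comma substitution: a field that already contains a comma or a double quote would split or corrupt the row. Below is a minimal Perl sketch of a safer conversion. It assumes fields never contain embedded tabs or newlines, and the helper name `tsv_line_to_csv` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: convert one tab-delimited line to one CSV line.
# Assumes fields contain no embedded tabs or newlines.
sub tsv_line_to_csv
{
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                 # drop the line terminator, if any
    my @fields = split (/\t/, $line, -1); # -1 keeps trailing empty fields
    for my $f (@fields)
    {
        # quote only when needed, doubling any embedded double quotes
        if ($f =~ /[",\r\n]/)
        {
            $f =~ s/"/""/g;
            $f = "\"$f\"";
        }
    }
    return join (",", @fields);
}

# prints: 1,"Smith, John","""quoted""",
print tsv_line_to_csv ("1\tSmith, John\t\"quoted\"\t"), "\n";
```

Fields without special characters pass through unquoted, so ordinary numeric and text columns look the same as with the simple substitution.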
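The only solution I found is the one from the MySQL Cookbook code samples. This is the first part: analyzing the TSV data and generating a CREATE TABLE statement that fits it. The second part, loading the TSV into that table structure, is simple and common.
guess_table.pl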
#!/usr/bin/perl
# guess_table.pl - characterize the contents of a data file and use the
# information to guess a CREATE TABLE statement for the file
# Usage: guess_table.pl [options] [data_file]
# To do:
# - Use value range information for something. It's collected but not yet
# used. For example, suggest better INT types.
# - Get rid of nonnegative attribute; it can be assessed now from the range.
# Load a data file and read column names and data values.
# Guess the declaration for each of the columns based on what the data
# values look like, and then generate an SQL CREATE TABLE statement for the
# table. Because the column declarations are just guesses, you'll likely
# want to edit the output, for example, to change a data type or
# length. You may also want to add indexes. Nevertheless, using this
# script can be easier than writing the CREATE TABLE statement by hand.
# Some assumptions:
# - Lines are tab-delimited, linefeed-terminated
# - Dates consist of 3 numeric parts, separated by - or /, in y/m/d order
# Here are some ways that guess_table.pl could be improved. Each of
# them would make it smarter, albeit at the cost of increased processing
# requirements. Some of the suggestions are likely impractical for really
# huge files.
# - For numeric columns, use min/max values to better guess the type.
# - Keep track of the number of unique values in a column. If there
# aren't many, the column might be a good candidate for being an ENUM.
# Testing should not be case sensitive, because ENUM columns are not
# case sensitive.
# - Make the date guessing code smarter. Have it recognize non-ISO format
# and attempt to make suggestions that a column needs to be reformatted.
# (This actually needs to see entire column, because that would help
# it distinguish U.S. from British formats WRT order of month and day.)
# This would need to track min/max for each of the three date parts.
# - If all values in a column are unique, suggest that it should be a PRIMARY
# KEY or a UNIQUE index.
# - For DATETIME columns, allow some times to be missing without flagging
# column as neither DATE nor TIME.
# Paul DuBois
# paul@kitebird.com
# 2002-01-31
# 2002-01-31
# - Created.
# 2002-02-19
# - Add code to track ranges for numeric columns and for the three date
# subparts of columns that look like they contain dates.
# 2002-02-20
# - Added --lower and --upper options to force column labels to lowercase
# or uppercase.
# 2002-03-01
# - For character columns longer than 255 characters, choose TEXT type based
# on maximum length.
# 2002-04-04
# - Add --quote-names option to quote table and column names `like this`.
# The resulting statement requires MySQL 3.23.6 or higher.
# 2002-07-16
# - Fix "uninitialized value" warnings resulting from missing columns in
# data lines.
# - Don't attempt to assess date characteristics for columns that are always
# empty.
# 2005-12-28
# - Make --quote-names the default, add --skip-quote-names option so that
# identifier quoting can be turned off.
# - Default data type now is VARCHAR, not CHAR.
# 2006-06-10
# - Emit UNSIGNED for double/decimal columns if they're unsigned.
use strict;
use warnings;
use Getopt::Long;
$Getopt::Long::ignorecase = 0; # options are case sensitive
$Getopt::Long::bundling = 1; # allow short options to be bundled
# ----------------------------------------------------------------------
# Create information structures to use for characterizing each column in
# the data file. We need to know whether any nonnumeric values are
# found, whether numeric values are always integers, and the maximum length
# of column values.
# Argument is the array of column labels.
# Creates an array of hash references and returns a reference to that array.
sub init_col_info
{
my @labels = @_;
my @col_info;
for my $i (0 .. @labels - 1)
{
my $info = { };
$info->{label} = $labels[$i];
$info->{max_length} = 0;
# these can be tested directly, so they're set false until found
# to be true
$info->{hasempty} = 0; # has empty values
$info->{hasnonempty} = 0; # has nonempty values
# these can be assessed only by seeing all the values in the
# column, so they're set true until discovered by counterexample
# to be false
$info->{numeric} = 1; # used to detect general numeric types
$info->{integer} = 1; # used to detect INT
$info->{nonnegative} = 1; # used to detect UNSIGNED
$info->{temporal} = 1; # used to detect general temporal types
$info->{date} = 1; # used to detect DATE
$info->{datetime} = 1; # used to detect DATETIME
$info->{time} = 1; # used to detect TIME
# track min/max value for numeric columns
$info->{min_val} = undef;
$info->{max_val} = undef;
# track min/max for each of three date parts
$info->{date_range} = [ undef, undef, undef];
push (@col_info, $info);
}
return (\@col_info);
}
sub print_create_table
{
my ($tbl_name, $col_info_list, $quote) = @_;
my $ncols = @{$col_info_list};
my $s;
my $extra = "";
$quote = ($quote ? "`" : ""); # quote names?
for my $i (0 .. $ncols - 1)
{
my $info = $col_info_list->[$i];
$s .= ",\n" if $i > 0;
$s .= $extra if $extra ne "";
$extra = "";
$s .= " $quote$info->{label}$quote ";
if (!$info->{hasnonempty}) # column is always empty, make wild guess
{
$s .= "CHAR(10) /* NOTE: column is always empty */";
next;
}
# if the column has nonempty values but one of
# these hasn't been ruled out, that's a problem
if ($info->{numeric} && $info->{temporal})
{
die "Logic error: $info->{label} was characterized as both"
. " numeric and temporal\n";
}
if ($info->{numeric})
{
if ($info->{integer})
{
$s .= "INT";
## TO DO: use range to make guess about type
# Print "might be YEAR" if in range...(0, 1901-2155)
}
else
{
$s .= "DOUBLE";
}
$s .= " UNSIGNED" if $info->{nonnegative};
}
elsif ($info->{temporal})
{
# if a date column looks more like a U.S. or British
# date, add some comments to that effect
if (exists ($info->{date_type}))
{
my $ref = $info->{date_type};
$extra .= " # $info->{label} might be a U.S. date\n"
if $ref->{us};
$extra .= " # $info->{label} might be a British date\n"
if $ref->{br};
}
if ($info->{date})
{
$s .= "DATE";
}
elsif ($info->{datetime})
{
$s .= "DATETIME";
}
elsif ($info->{time})
{
$s .= "TIME";
}
else
{
die "Logic error: $info->{label} is flagged as temporal, but"
. " not as any of the temporal types\n";
}
}
else
{
if ($info->{max_length} < 256)
{
$s .= "VARCHAR($info->{max_length})";
}
elsif ($info->{max_length} < 65536)
{
$s .= "TEXT";
}
elsif ($info->{max_length} < 16777216)
{
$s .= "MEDIUMTEXT";
}
else
{
$s .= "LONGTEXT";
}
}
# if a column doesn't have empty values, guess that it cannot be NULL
$s .= " " . ($info->{hasempty} ? "NULL" : "NOT NULL");
}
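# a date-format note for the last column is still pending at this point,
# because notes are otherwise emitted at the start of the next iteration
if ($extra ne "")
{
chomp ($extra);
$s .= "\n$extra";
}
$s = "CREATE TABLE $quote$tbl_name$quote\n(\n$s\n);\n";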
print $s;
}
sub print_report
{
my $col_info_list = shift;
my $ncols = @{$col_info_list};
my $s;
for my $i (0 .. $ncols - 1)
{
my $info = $col_info_list->[$i];
printf "Column %d: %s\n", $i+1, $info->{label};
if (!$info->{hasnonempty}) # column is always empty
{
print " column is always empty\n";
next;
}
# if the column has nonempty values but one of
# these hasn't been ruled out, that's a problem
if ($info->{numeric} && $info->{temporal})
{
die "Logic error: $info->{label} was characterized as both"
. " numeric and temporal\n";
}
print " column has empty values: "
. ($info->{hasempty} ? "yes" : "no") . "\n";
printf " column value maximum length = %d\n", $info->{max_length};
if ($info->{numeric})
{
printf " column is numeric (range: %g - %g)\n",
$info->{min_val}, $info->{max_val};
if ($info->{integer})
{
print " column is integer\n";
if ($info->{nonnegative})
{
print " column is nonnegative\n";
}
}
}
elsif ($info->{temporal})
{
if ($info->{date})
{
my $ref = $info->{date_range};
print " column contains date values";
printf " (part ranges: %d - %d, %d - %d, %d - %d)\n",
$ref->[0]->{min}, $ref->[0]->{max},
$ref->[1]->{min}, $ref->[1]->{max},
$ref->[2]->{min}, $ref->[2]->{max};
$ref = $info->{date_type};
printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n",
($ref->{iso} ? "yes" : "no"),
($ref->{us} ? "yes" : "no"),
($ref->{br} ? "yes" : "no");
}
elsif ($info->{datetime})
{
my $ref = $info->{date_range};
print " column contains date+time values";
printf " (part ranges: %d - %d, %d - %d, %d - %d)\n",
$ref->[0]->{min}, $ref->[0]->{max},
$ref->[1]->{min}, $ref->[1]->{max},
$ref->[2]->{min}, $ref->[2]->{max};
$ref = $info->{date_type};
printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n",
($ref->{iso} ? "yes" : "no"),
($ref->{us} ? "yes" : "no"),
($ref->{br} ? "yes" : "no");
}
elsif ($info->{time})
{
print " column contains time values\n";
}
else
{
die "Logic error: $info->{label} is flagged as temporal, but"
. " not as any of the temporal types\n";
}
}
else
{
print " column appears to be a string"
. " (cannot further narrow the type)\n";
}
}
}
# ----------------------------------------------------------------------
my $prog = "guess_table.pl";
my $usage = <<EOF;
Usage: $prog [options] [data_file]
Options:
--help
Print this message
--labels, -l
Interpret first input line as row of table column labels
(default = c1, c2, ...)
--lower, --upper
Force column labels to be in lowercase or uppercase
--quote-names, --skip-quote-names
Quote or do not quote table and column identifiers with `` characters
in case they are reserved words (default = quote identifiers)
--report, -r
Report mode; print findings rather than generating a CREATE
TABLE statement
--table=tbl_name, -t tbl_name
Specify table name (default = t)
EOF
my $help;
my $labels; # expect a line of column labels?
my $tbl_name = "t"; # table name (default: t)
my $report;
my $lower;
my $upper;
my $quote_names = 1;
my $skip_quote_names;
GetOptions (
# =s means a string value is required after the option
"help" => \$help, # print help message
"labels|l" => \$labels, # expect row of column labels
"table|t=s" => \$tbl_name, # table name
"report|r" => \$report, # report mode
"lower" => \$lower, # lowercase labels
"upper" => \$upper, # uppercase labels
"quote-names" => \$quote_names, # quote identifiers
"skip-quote-names" => \$skip_quote_names # don't quote identifiers
) or die "$usage\n";
die "$usage\n" if defined $help;
$report = defined ($report); # convert defined/undefined to boolean
$lower = defined ($lower);
$upper = defined ($upper);
$quote_names = defined ($quote_names);
$quote_names = 0 if defined ($skip_quote_names);
die "--lower and --upper were both specified; that makes no sense\n"
if $lower && $upper;
my $line;
my $line_count = 0;
my @labels; # column labels
my $ncols; # number of columns
my $col_info_list;
# If labels are expected, read the first line to get them
if ($labels)
{
defined ($line = <>) or die "Input contained no label line\n";
chomp ($line);
@labels = split (/\t/, $line);
}
# Arrays to hold line numbers of lines with too many/too few fields.
# The first line in the file is assumed to be representative. The
# number of fields it contains becomes the norm against which any following
# lines are assessed.
my @excess_fields;
my @too_few_fields;
while (<>)
{
chomp ($line = $_);
++$line_count;
if (!defined ($ncols)) # don't know this until first data line read
{
# determine number of columns (assume no more than 10,000)
my @val = split (/\t/, $line, 10000);
$ncols = @val;
if (@labels) # label count must match data column count
{
die "Label count doesn't match data column count\n"
if $ncols != @labels;
}
else # if there were no labels, create them
{
@labels = map { "c" . $_ } 1 .. $ncols;
}
$col_info_list = init_col_info (@labels);
}
my @val = split (/\t/, $line, 10000);
push (@excess_fields, $line_count) if @val > $ncols;
push (@too_few_fields, $line_count) if @val < $ncols;
for my $i (0 .. $ncols - 1)
{
my $val = ($i < @val ? $val[$i] : ""); # use "" if field is missing
my $info = $col_info_list->[$i];
$info->{max_length} = length ($val)
if $info->{max_length} < length ($val);
if ($val eq "")
{
# column does have empty values
$info->{hasempty} = 1;
next; # no other tests apply
}
$info->{hasnonempty} = 1;
# perform numeric tests if no nonnumeric values have yet been seen
if ($info->{numeric})
{
# numeric test (doesn't recognize scientific notation)
if ($val =~ /^[-+]?(\d+(\.\d*)?|\.\d+)$/)
{
# not int if contains decimal point
$info->{integer} = 0 if $val =~ /\./;
# not unsigned if begins with minus sign
$info->{nonnegative} = 0 if $val =~ /^-/;
# track min/max value
$info->{min_val} = $val
if !defined ($info->{min_val}) || $info->{min_val} > $val;
$info->{max_val} = $val
if !defined ($info->{max_val}) || $info->{max_val} < $val;
}
else
{
# column contains nonnumeric information
$info->{numeric} = 0;
$info->{integer} = 0;
}
}
# perform temporal tests if no nontemporal values have yet been seen
if ($info->{temporal})
{
# date/datetime test
# allow date, date hour:min, date hour:min:sec
if (($info->{date} || $info->{datetime})
&& $val =~ /^(\d+)[-\/](\d+)[-\/](\d+)\s*(\d+:\d+(:\d+)?)?$/)
{
# it's not a time
$info->{time} = 0;
# not a date if time part was present; not a
# datetime if no time part was present
$info->{ defined ($4) ? "date" : "datetime" } = 0;
# use the first three parts to track range of date parts
my @val = ($1, $2, $3);
my $ref = $info->{date_range};
foreach my $i (0 .. 2)
{
# if this is the first value we've checked, create the
# structure to hold the min and max; otherwise compare
# the stored min/max to the current value
if (!defined ($ref->[$i]))
{
$ref->[$i]->{min} = $val[$i];
$ref->[$i]->{max} = $val[$i];
next;
}
$ref->[$i]->{min} = $val[$i]
if $ref->[$i]->{min} > $val[$i];
$ref->[$i]->{max} = $val[$i]
if $ref->[$i]->{max} < $val[$i];
}
}
# time test
# allow hour:min, hour:min:sec
elsif ($info->{time} && $val =~ /^\d+:\d+(:\d+)?$/)
{
# it's not a date or datetime
$info->{date} = 0;
$info->{datetime} = 0;
}
else
{
# column contains nontemporal information
$info->{temporal} = 0;
}
}
}
}
die "Input contained no data lines\n" if $line_count == 0;
die "Input lines all were empty\n" if $ncols == 0;
# Look at columns that look like DATE or DATETIME columns and attempt
# to determine whether they appear to be in ISO, U.S., or British format.
# (Skip columns that are always empty, because these assessments cannot
# be made for such columns.)
for my $i (0 .. $ncols - 1)
{
my $info = $col_info_list->[$i];
next unless $info->{hasnonempty};
next unless $info->{temporal} && ($info->{date} || $info->{datetime});
my $ref = $info->{date_range};
# assume that the column is valid as each of the types until ruled out
my $valid_as_iso = 1; # [CC]YY-MM-DD
my $valid_as_us = 1; # MM-DD-[CC]YY
my $valid_as_br = 1; # DD-MM-[CC]YY
# first segment is U.S. month, British day
my $min = $ref->[0]->{min};
my $max = $ref->[0]->{max};
$valid_as_us = 0 if $min < 0 || $max > 12;
$valid_as_br = 0 if $min < 0 || $max > 31;
# second segment is U.S. day, British month, ISO month
$min = $ref->[1]->{min};
$max = $ref->[1]->{max};
$valid_as_us = 0 if $min < 0 || $max > 31;
$valid_as_br = 0 if $min < 0 || $max > 12;
$valid_as_iso = 0 if $min < 0 || $max > 12;
# third segment is ISO day
$min = $ref->[2]->{min};
$max = $ref->[2]->{max};
$valid_as_iso = 0 if $min < 0 || $max > 31;
if (!$valid_as_iso && !$valid_as_us && !$valid_as_br)
{
$info->{temporal} = 0; # huh! guess it's not a date after all
}
else # save date type results for later
{
$info->{date_type}->{iso} = $valid_as_iso;
$info->{date_type}->{us} = $valid_as_us;
$info->{date_type}->{br} = $valid_as_br;
}
}
warn "# Number of lines = $line_count, columns = $ncols\n";
warn "# Number of lines with too few fields: " . scalar (@too_few_fields) . "\n"
if @too_few_fields;
warn "# Number of lines with excess fields: " . scalar (@excess_fields) . "\n"
if @excess_fields;
if ($report)
{
print_report ($col_info_list);
}
else
{
for my $i (0 .. $ncols - 1)
{
my $info = $col_info_list->[$i];
$info->{label} = lc ($info->{label}) if $lower;
$info->{label} = uc ($info->{label}) if $upper;
}
print_create_table ($tbl_name, $col_info_list, $quote_names);
}
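The script's own to-do list mentions using the collected value range to suggest better INT types. A minimal sketch of that idea follows. The helper `suggest_int_type` is hypothetical and not part of guess_table.pl; it uses the standard MySQL integer ranges and does not check the limits of BIGINT.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of guess_table.pl: pick the narrowest MySQL
# integer type whose range covers [$min, $max].
sub suggest_int_type
{
    my ($min, $max) = @_;
    # [type name, signed minimum, signed maximum, unsigned maximum]
    my @types = (
        [ "TINYINT",   -128,        127,        255        ],
        [ "SMALLINT",  -32768,      32767,      65535      ],
        [ "MEDIUMINT", -8388608,    8388607,    16777215   ],
        [ "INT",       -2147483648, 2147483647, 4294967295 ],
    );
    for my $t (@types)
    {
        my ($name, $smin, $smax, $umax) = @{$t};
        # prefer UNSIGNED when no negative values were seen
        return "$name UNSIGNED" if $min >= 0 && $max <= $umax;
        return $name if $min >= $smin && $max <= $smax;
    }
    # anything wider falls back to BIGINT; its limits are not checked here
    return ($min >= 0 ? "BIGINT UNSIGNED" : "BIGINT");
}

print suggest_int_type (0, 200), "\n";        # TINYINT UNSIGNED
print suggest_int_type (-5, 200), "\n";       # SMALLINT
print suggest_int_type (0, 5000000000), "\n"; # BIGINT UNSIGNED
```

One way to wire this in, as an assumption about how the script could be adapted, would be to call it with `$info->{min_val}` and `$info->{max_val}` in the integer branch of print_create_table, in place of the fixed "INT" and the separate UNSIGNED suffix. The result is still a guess: later data may exceed the range seen in the sample file.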
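For the second part mentioned above, loading the TSV into the generated table, MySQL's LOAD DATA statement is the usual route. The sketch below only builds the statement text. The helper name `build_load_statement` and the file and table names are assumptions for illustration, and it relies on the same file format the script assumes (tab-delimited, linefeed-terminated).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a LOAD DATA statement for a tab-delimited,
# linefeed-terminated file.
sub build_load_statement
{
    my ($file, $tbl_name, $has_labels) = @_;
    (my $q_file = $file) =~ s/(['\\])/\\$1/g; # escape ' and \ in the file name
    (my $q_tbl = $tbl_name) =~ s/`/``/g;      # escape ` in the identifier
    my $sql = "LOAD DATA LOCAL INFILE '$q_file' INTO TABLE `$q_tbl`"
            . " FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'";
    $sql .= " IGNORE 1 LINES" if $has_labels; # skip the row of column labels
    return $sql;
}

print build_load_statement ("data.tsv", "t", 1), ";\n";
```

The resulting string can be pasted into the mysql client or passed to DBI's `$dbh->do`. Loading with LOCAL has to be permitted on both the server and the client, so that setting is worth checking first if the statement is rejected.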