PHP/Perl代码从任意TSV文件创建mysql表_Php_Mysql_Perl_Tsv

PHP/Perl代码从任意TSV文件创建mysql表

php mysql perl

PHP/Perl代码从任意TSV文件创建mysql表,php,mysql,perl,tsv,Php,Mysql,Perl,Tsv,有谁能告诉我一些PHP或Perl代码，它们将从任意TSV文件创建一个mysql表根据找到的数据和一些参数，它将使用其逻辑为每个字段计算出适当的字段类型，创建数据库表，并上载数据。（即，表结构事先不知道）（或者，我可以想象它创建一个具有通用文本类型的初始表，然后运行sql查询来分析数据，然后更改表结构以匹配数据。）我建议它用逗号替换所有选项卡，然后使用一些CSV导入器，Internet上有很多PHP示例代码。例如我找到的唯一解决方案是MYSQL食谱代码示例位于：这是第一部分…分析TSV

有谁能告诉我一些PHP或Perl代码，它们将从任意TSV文件创建一个mysql表

根据找到的数据和一些参数，它将使用其逻辑为每个字段计算出适当的字段类型，创建数据库表，并上载数据。（即，表结构事先不知道）

（或者，我可以想象它创建一个具有通用文本类型的初始表，然后运行sql查询来分析数据，然后更改表结构以匹配数据。）

我建议它用逗号替换所有选项卡，然后使用一些CSV导入器，Internet上有很多PHP示例代码。例如

我找到的唯一解决方案是MYSQL食谱

代码示例位于：

这是第一部分…分析TSV数据并生成适合它的“创建表”。第二部分，将TSV上传到表结构中，简单而常见

guess_table.pl

#!/usr/bin/perl
# guess_table.pl - characterize the contents of a data file and use the
# information to guess a CREATE TABLE statement for the file

# Usage: guess_table.pl table_name data_file

# To do:
# - Use value range information for something.  It's collected but not yet
#   used.  For example, suggest better INT types.
# - Get rid of nonnegative attribute; it can be assessed now from the range.

# Load a data file and read column names and data values.
# Guess the declaration for each of the columns based on what the data
# values look like, and then generate an SQL CREATE TABLE statement for the
# table.  Because the column declarations are just guesses, you'll likely
# want to edit the output, for example, to change a data type or
# length.  You may also want to add indexes.  Nevertheless, using this
# script can be easier than writing the CREATE TABLE statement by hand.

# Some assumptions:
# - Lines are tab-delimited, linefeed-terminated
# - Dates consist of 3 numeric parts, separated by - or /, in y/m/d order

# Here are some ways that guess_table.pl could be improved.  Each of
# them would make it smarter, albeit at the cost of increased processing
# requirements. Some of the suggestions are likely impractical for really
# huge files.

# - For numeric columns, use min/max values to better guess the type.
# - Keep track of the number of unique values in a column.  If there
#   aren't many, the column might be a good candidate for being an ENUM.
#   Testing should not be case sensitive, because ENUM columns are not
#   case sensitive.
# - Make the date guessing code smarter.  Have it recognize non-ISO format
#   and attempt to make suggestions that a column needs to be reformatted.
#   (This actually needs to see entire column, because that would help
#   it distinguish U.S. from British formats WRT order of month and day.)
#   This would need to track min/max for each of the three date parts.
# - If all values in a column are unique, suggest that it should be a PRIMARY
#   KEY or a UNIQUE index.
# - For DATETIME columns, allow some times to be missing without flagging
#   column as neither DATE nor TIME.

# Paul DuBois
# paul@kitebird.com
# 2002-01-31

# 2002-01-31
# - Created.
# 2002-02-19
# - Add code to track ranges for numeric columns and for the three date
#   subparts of columns that look like they contain dates.
# 2002-02-20
# - Added --lower and --upper options to force column labels to lowercase
#   or uppercase.
# 2002-03-01
# - For character columns longer than 255 characters, choose TEXT type based
#   on maximum length.
# 2002-04-04
# - Add --quote-names option to quote table and column names `like this`.
#   The resulting statement requires MySQL 3.23.6 or higher.
# 2002-07-16
# - Fix "uninitialized value" warnings resulting from missing columns in
#   data lines.
# - Don't attempt to assess date characteristics for columns that are always
#   empty.
# 2005-12-28
# - Make --quote-names the default, add --skip-quote-names option so that
#   identifier quoting can be turned off.
# - Default data type now is VARCHAR, not CHAR.
# 2006-06-10
# - Emit UNSIGNED for double/decimal columns if they're unsigned.

use strict;
use warnings;
use Getopt::Long;
$Getopt::Long::ignorecase = 0; # options are case sensitive
$Getopt::Long::bundling = 1;   # allow short options to be bundled

# ----------------------------------------------------------------------

# Create information structures to use for characterizing each column in
# in the data file.  We need to know whether any nonnumeric values are
# found, whether numeric values are always integers, and the maximum length
# of column values.

# Argument is the array of column labels.
# Creates an array of hash references and returns a reference to that array.

sub init_col_info
{
my @labels = @_;
my @col_info;

    for my $i (0 .. @labels - 1)
    {
        my $info = { };
        $info->{label} = $labels[$i];
        $info->{max_length} = 0;

        # these can be tested directly, so they're set false until found
        # to be true
        $info->{hasempty} = 0;    # has empty values
        $info->{hasnonempty} = 0; # has nonempty values

        # these can be assessed only by seeing all the values in the
        # column, so they're set true until discovered by counterexample
        # to be false
        $info->{numeric} = 1;     # used to detect general numeric types
        $info->{integer} = 1;     # used to detect INT
        $info->{nonnegative} = 1; # used to detect UNSIGNED
        $info->{temporal} = 1;    # used to detect general temporal types
        $info->{date} = 1;        # used to detect DATE
        $info->{datetime} = 1;    # used to detect DATETIME
        $info->{time} = 1;        # used to detect TIME

        # track min/max value for numeric columns
        $info->{min_val} = undef;
        $info->{max_val} = undef;

        # track min/max for each of three date parts
        $info->{date_range} = [ undef, undef, undef];

        push (@col_info, $info);
    }
    return (\@col_info);
}

sub print_create_table
{
my ($tbl_name, $col_info_list, $quote) = @_;
my $ncols = @{$col_info_list};
my $s;
my $extra = "";

    $quote = ($quote ? "`" : "");     # quote names?
    for my $i (0 .. $ncols - 1)
    {
        my $info = $col_info_list->[$i];

        $s .= ",\n" if $i > 0;
        $s .= $extra if $extra ne "";
        $extra = "";

        $s .= "  $quote$info->{label}$quote ";

        if (!$info->{hasnonempty})  # column is always empty, make wild guess
        {
            $s .= "CHAR(10)    /* NOTE: column is always empty */";
            next;
        }

        # if the column has nonempty values but one of
        # these hasn't been ruled out, that's a problem
        if ($info->{numeric} && $info->{temporal})
        {
            die "Logic error: $info->{label} was characterized as both"
    . " numeric and temporal\n";
        }

        if ($info->{numeric})
        {
            if ($info->{integer})
            {
    $s .= "INT";
## TO DO: use range to make guess about type
    # Print "might be YEAR" if in range...(0, 1901-2155)
            }
            else
            {
    $s .= "DOUBLE";
            }
            $s .= " UNSIGNED" if $info->{nonnegative};
        }
        elsif ($info->{temporal})
        {
            # if a date column looks more like a U.S. or British
            # date, add some comments to that effect
            if (exists ($info->{date_type}))
            {
    my $ref = $info->{date_type};
    $extra .= "  # $info->{label} might be a U.S. date\n"
            if $ref->{us};
    $extra .= "  # $info->{label} might be a British date\n"
            if $ref->{br};
            }
            if ($info->{date})
            {
    $s .= "DATE";
            }
            elsif ($info->{datetime})
            {
    $s .= "DATETIME";
            }
            elsif ($info->{time})
            {
    $s .= "TIME";
            }
            else
            {
    die "Logic error: $info->{label} is flagged as temporal, but"
        . " not as any of the temporal types\n";
            }
        }
        else
        {
            if ($info->{max_length} < 256)
            {
    $s .= "VARCHAR($info->{max_length})";
            }
            elsif ($info->{max_length} < 65536)
            {
    $s .= "TEXT";
            }
            elsif ($info->{max_length} < 16777216)
            {
    $s .= "MEDIUMTEXT";
            }
            else
            {
    $s .= "LONGTEXT";
            }
        }
        # if a column doesn't have empty values, guess that it cannot be NULL
        $s .= " " . ($info->{hasempty} ? "NULL" : "NOT NULL");
    }

    $s = "CREATE TABLE $quote$tbl_name$quote\n(\n$s\n);\n";
    print $s;
}

sub print_report
{
my $col_info_list = shift;
my $ncols = @{$col_info_list};
my $s;

    for my $i (0 .. $ncols - 1)
    {
        my $info = $col_info_list->[$i];

        printf "Column %d: %s\n", $i+1, $info->{label};
        if (!$info->{hasnonempty})  # column is always empty
        {
            print " column is always empty\n";
            next;
        }

        # if the column has nonempty values but one of
        # these hasn't been ruled out, that's a problem
        if ($info->{numeric} && $info->{temporal})
        {
            die "Logic error: $info->{label} was characterized as both"
    . " numeric and temporal\n";
        }

        print " column has empty values: "
    . ($info->{hasempty} ? "yes" : "no") . "\n";
        printf " column value maximum length = %d\n", $info->{max_length};

        if ($info->{numeric})
        {
            printf " column is numeric (range: %g - %g)\n",
                $info->{min_val}, $info->{max_val};
            if ($info->{integer})
            {
    print " column is integer\n";
    if ($info->{nonnegative})
    {
        print " column is nonnegative\n";
    }
            }
        }
        elsif ($info->{temporal})
        {
            if ($info->{date})
            {
    my $ref = $info->{date_range};
    print " column contains date values";
    printf " (part ranges: %d - %d, %d - %d, %d - %d)\n",
                $ref->[0]->{min}, $ref->[0]->{max},
                $ref->[1]->{min}, $ref->[1]->{max},
                $ref->[2]->{min}, $ref->[2]->{max};
    $ref = $info->{date_type};
    printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n",
                ($ref->{iso} ? "yes" : "no"),
                ($ref->{us} ? "yes" : "no"),
                ($ref->{br} ? "yes" : "no");
            }
            elsif ($info->{datetime})
            {
    my $ref = $info->{date_range};
    print " column contains date+time values";
    printf " (part ranges: %d - %d, %d - %d, %d - %d)\n",
                $ref->[0]->{min}, $ref->[0]->{max},
                $ref->[1]->{min}, $ref->[1]->{max},
                $ref->[2]->{min}, $ref->[2]->{max};
    $ref = $info->{date_type};
    printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n",
                ($ref->{iso} ? "yes" : "no"),
                ($ref->{us} ? "yes" : "no"),
                ($ref->{br} ? "yes" : "no");
            }
            elsif ($info->{time})
            {
    print " column contains time values\n";
            }
            else
            {
    die "Logic error: $info->{label} is flagged as temporal, but"
        . " not as any of the temporal types\n";
            }
        }
        else
        {
            print " column appears to be a string"
    . " (cannot further narrow the type)\n";
        }
    }
}

# ----------------------------------------------------------------------

my $prog = "guess_table.pl";
my $usage = <<EOF;
Usage: $prog [options] [data_file]

Options:
--help
        Print this message
--labels, -l
        Interpret first input line as row of table column labels
        (default = c1, c2, ...)
--lower, --upper
        Force column labels to be in lowercase or uppercase
--quote-names, --skip-quote-names
     Quote or do not quote table and column identifiers with `` characters
     in case they are reserved words (default = quote identifiers)
--report , -r
        Report mode; print findings rather than generating a CREATE
        TABLE statement
--table=tbl_name, -t tbl_name
        Specify table name (default = t)
EOF

my $help;
my $labels;     # expect a line of column labels?
my $tbl_name = "t"; # table name (default: t)
my $report;
my $lower;
my $upper;
my $quote_names = 1;
my $skip_quote_names;

GetOptions (
    # =s means a string value is required after the option
    "help"              => \$help,            # print help message
    "labels|l"          => \$labels,          # expect row of column labels
    "table|t=s"         => \$tbl_name,        # table name
    "report|r"          => \$report,          # report mode
    "lower"             => \$lower,           # lowercase labels
    "upper"             => \$upper,           # uppercase labels
    "quote-names"       => \$quote_names,     # quote identifiers
    "skip-quote-names"  => \$skip_quote_names # don't quote identifiers
) or die "$usage\n";

die  "$usage\n" if defined $help;

$report = defined ($report);  # convert defined/undefined to boolean
$lower = defined ($lower);
$upper = defined ($upper);
$quote_names = defined ($quote_names);
$quote_names = 0 if defined ($skip_quote_names);

die "--lower and --upper were both specified; that makes no sense\n"
            if $lower && $upper;

my $line;
my $line_count = 0;
my @labels;   # column labels
my $ncols;    # number of columns
my $col_info_list;

# If labels are expected, read the first line to get them
if ($labels)
{
    defined ($line = <>) or die;
    chomp ($line);
    @labels = split (/\t/, $line);
}

# Arrays to hold line numbers of lines with too many/too few fields.
# The first line in the file is assumed to be representative.  The
# number of fields it contains becomes the norm against which any following
# lines are assessed.

my @excess_fields;
my @too_few_fields;

while (<>)
{
    chomp ($line = $_);
    ++$line_count;
    if (!defined ($ncols))  # don't know this until first data line read
    {
        # determine number of columns (assume no more than 10,000)
        my @val = split (/\t/, $line, 10000);
        $ncols = @val;
        if (@labels)  # label count must match data column count
        {
            die "Label count doesn't match data column count\n"
                if $ncols != @labels;
        }
        else      # if there were no labels, create them
        {
            @labels = map { "c" . $_ } 1 .. $ncols;
        }
        $col_info_list = init_col_info (@labels);
    }

    my @val = split (/\t/, $line, 10000);
    push (@excess_fields, $line_count) if @val > $ncols;
    push (@too_few_fields, $line_count) if @val < $ncols;
    for my $i (0 .. $ncols - 1)
    {
        my $val = ($i < @val ? $val[$i] : "");  # use "" if field is missing
        my $info = $col_info_list->[$i];

        $info->{max_length} = length ($val)
            if $info->{max_length} < length ($val);

        if ($val eq "")
        {
            # column does have empty values
            $info->{hasempty} = 1;
            next; # no other tests apply
        }
        $info->{hasnonempty} = 1;

        # perform numeric tests if no nonnumeric values have yet been seen

        if ($info->{numeric})
        {
            # numeric test (doesn't recognize scientific notation)
            if ($val =~ /^[-+]?(\d+(\.\d*)?|\.\d+)$/)
            {
    # not int if contains decimal point
    $info->{integer} = 0 if $val =~ /\./;
    # not unsigned if begins with minus sign
    $info->{nonnegative} = 0 if $val =~ /^-/;

    # track min/max value
    $info->{min_val} = $val
        if !defined ($info->{min_val}) || $info->{min_val} > $val;
    $info->{max_val} = $val
        if !defined ($info->{max_val}) || $info->{max_val} < $val;
            }
            else
            {
    # column contains nonnumeric information
    $info->{numeric} = 0;
    $info->{integer} = 0;
            }
        }

        # perform temporal tests if no nontemporal values have yet been seen

        if ($info->{temporal})
        {
            # date/datetime test
            # allow date, date hour:min, date hour:min:sec
            if (($info->{date} || $info->{datetime})
    && $val =~ /^(\d+)[-\/](\d+)[-\/](\d+)\s*(\d+:\d+(:\d+)?)?$/)
            {
    # it's not a time
    $info->{time} = 0;

    # not a date if time part was present; not a
    # datetime if no time part was present
    $info->{ defined ($4) ? "date" : "datetime" } = 0;

    # use the first three parts to track range of date parts
    my @val = ($1, $2, $3);
    my $ref = $info->{date_range};
    foreach my $i (0 .. 2)
    {
        # if this is the first value we've checked, create the
        # structure to hold the min and max; otherwise compare
        # the stored min/max to the current value
        if (!defined ($ref->[$i]))
        {
            $ref->[$i]->{min} = $val[$i];
            $ref->[$i]->{max} = $val[$i];
            next;
        }
        $ref->[$i]->{min} = $val[$i]
            if $ref->[$i]->{min} > $val[$i];
        $ref->[$i]->{max} = $val[$i]
            if $ref->[$i]->{max} < $val[$i];
    }
            }
            # time test
            # allow hour:min, hour:min:sec
            elsif ($info->{time} && $val =~ /^\d+:\d+(:\d+)?$/)
            {
    # it's not a date or datetime
    $info->{date} = 0;
    $info->{datetime} = 0;
            }
            else
            {
    # column contains nontemporal information
    $info->{temporal} = 0;
            }
        }
    }
}

die "Input contained no data lines\n" if $line_count == 0;
die "Input lines all were empty\n" if $ncols == 0;

# Look at columns that look like DATE or DATETIME columns and attempt
# to determine whether they appear to be in ISO, U.S., or British format.
# (Skip columns that are always empty, because these assessments cannot
# be made for such columns.)

for my $i (0 .. $ncols - 1)
{
    my $info = $col_info_list->[$i];
    next unless $info->{hasnonempty};
    next unless $info->{temporal} && ($info->{date} || $info->{datetime});
    my $ref = $info->{date_range};
    # assume that the column is valid as each of the types until ruled out
    my $valid_as_iso = 1; # [CC]YY-MM-DD
    my $valid_as_us = 1;  # MM-DD-[CC]YY
    my $valid_as_br = 1;  # DD-MM-[CC]YY
    # first segment is U.S. month, British day
    my $min = $ref->[0]->{min};
    my $max = $ref->[0]->{max};
    $valid_as_us = 0 if $min < 0 || $max > 12;
    $valid_as_br = 0 if $min < 0 || $max > 31;
    # second segment is U.S. day, British month, ISO month
    $min = $ref->[1]->{min};
    $max = $ref->[1]->{max};
    $valid_as_us = 0 if $min < 0 || $max > 31;
    $valid_as_br = 0 if $min < 0 || $max > 12;
    $valid_as_iso = 0 if $min < 0 || $max > 12;
    # third segment is ISO day
    $min = $ref->[2]->{min};
    $max = $ref->[2]->{max};
    $valid_as_iso = 0 if $min < 0 || $max > 31;
    if (!$valid_as_iso && !$valid_as_us && !$valid_as_br)
    {
        $info->{temporal} = 0;  # huh! guess it's not a date after all
    }
    else    # save date type results for later
    {
        $info->{date_type}->{iso} = $valid_as_iso;
        $info->{date_type}->{us} = $valid_as_us;
        $info->{date_type}->{br} = $valid_as_br;
    }
}


warn "# Number of lines = $line_count, columns = $ncols\n";
warn "# Number of lines with too few fields: " . scalar (@too_few_fields) . "\n"
                    if @too_few_fields;
warn "# Number of lines with excess fields: " . scalar (@excess_fields) . "\n"
                    if @excess_fields;

if ($report)
{
    print_report ($col_info_list);
}
else
{
    for my $i (0 .. $ncols - 1)
    {
        my $info = $col_info_list->[$i];
        $info->{label} = lc ($info->{label}) if $lower;
        $info->{label} = uc ($info->{label}) if $upper;
    }
    print_create_table ($tbl_name, $col_info_list, $quote_names);
}

#/usr/bin/perl
#guess_table.pl-描述数据文件的内容并使用
#猜测文件的CREATETABLE语句的信息
#用法：guess_table.pl table_name data_文件
#要做：
#-对某些内容使用值范围信息。已经收集了，但还没有
#用过。例如，建议更好的INT类型。
#-去除非负属性；现在可以从范围内进行评估。
#加载数据文件并读取列名和数据值。
#根据数据内容猜测每列的声明
#值，然后为
#桌子。因为列声明只是猜测，您很可能
#要编辑输出，例如更改数据类型或
#长度。您可能还需要添加索引。然而，使用这个
#脚本可能比手工编写CREATETABLE语句更容易。
#一些假设：
#-行以制表符分隔，换行符终止
#-日期由3个数字部分组成，以-或/分隔，按y/m/d顺序排列
#下面是一些可以改进guess_table.pl的方法。每个
#他们将使它更智能，尽管这是以增加处理的成本为代价的
#要求。有些建议可能对真正意义上的人来说是不切实际的
#巨大的文件。
#-对于数字列，使用最小/最大值更好地猜测类型。
#-跟踪列中唯一值的数量。如果有
#如果不是很多，该列可能是一个很好的枚举候选者。
#测试不应区分大小写，因为枚举列不区分大小写
#区分大小写。
#-使日期猜测代码更智能。让它识别非ISO格式
#并尝试提出需要重新格式化列的建议。
#（这实际上需要查看整个列，因为这样会有所帮助。）
#它将美国和英国格式区分为月和日。）
#这需要跟踪三个日期部分中每个部分的最小/最大值。
#-如果列中的所有值都是唯一的，则建议它应该是主值
#键或唯一索引。
#-对于DATETIME列，允许在不标记的情况下丢失某些时间
#列，既不是日期也不是时间。
#保罗·杜波伊斯
# paul@kitebird.com
# 2002-01-31
# 2002-01-31
#-创建。
# 2002-02-19
#-为数字列和三个日期的跟踪范围添加代码
#看起来包含日期的列的子部分。
# 2002-02-20
#-添加--lower和--upper选项以强制列标签为小写
#或大写。
# 2002-03-01
#-对于长度超过255个字符的字符列，请选择“基于文本类型”
#在最大长度上。
# 2002-04-04
#-Add--quote names选项引用表名和列名`like this`。
#结果语句需要MySQL 3.23.6或更高版本。
# 2002-07-16
#-修复由于中缺少列而导致的“未初始化值”警告
#数据线。
#-不要尝试评估始终为空的列的日期特征
#空的。
# 2005-12-28
#-将--quote name设置为默认值，添加--skip quote names选项，以便
#可以关闭标识符引用。
#-默认数据类型现在是VARCHAR，而不是CHAR。
# 2006-06-10
#-如果双精度/十进制列为无符号列，则发出无符号列。
严格使用；
使用警告；
使用Getopt:：Long；
$Getopt:：Long:：ignorecase=0；#选项区分大小写
$Getopt:：Long:：bundling=1；#允许绑定短选项
# ----------------------------------------------------------------------
#创建信息结构，用于描述中的每个列
#在数据文件中。我们需要知道是否存在任何非数值
#已找到，数值是否始终为整数，以及最大长度
#列值的类型。
#参数是列标签的数组。
#创建哈希引用数组并返回对该数组的引用。
子初始化列信息
{
我的@labels=@；
我的@col_info；
对于我的$i（0..@labels-1）
{
我的$info={}；
$info->{label}=$labels[$i]；
$info->{max_length}=0；
#这些可以直接测试，所以在找到之前，它们都被设置为false
#说实话
$info->{hasempty}=0；#具有空值
$info->{hasnonempty}=0；#具有非空值
#只有通过查看表中的所有值，才能对其进行评估
#列，所以它们被设置为true，直到通过反例发现
#虚伪
$info->{numeric}=1；#用于检测一般数字类型
$info->{integer}=1；#用于检测INT
$info->{nonnegative}=1；#用于检测无符号
$info->{temporal}=1；#用于检测一般时态类型
$info->{date}=1；#用于检测日期
$info->{datetime}=1；#用于检测日期时间
$info->{time}=1；#用于检测时间
#数字列的跟踪最小/最大值
$info->{min_val}=undef；
$info->{max_val}=undef；
#跟踪三个日期部分中每个部分的最小/最大值
$info->{date
#!/usr/bin/perl
# guess_table.pl - characterize the contents of a data file and use the
# information to guess a CREATE TABLE statement for the file

# Usage: guess_table.pl table_name data_file

# To do:
# - Use value range information for something.  It's collected but not yet
#   used.  For example, suggest better INT types.
# - Get rid of nonnegative attribute; it can be assessed now from the range.

# Load a data file and read column names and data values.
# Guess the declaration for each of the columns based on what the data
# values look like, and then generate an SQL CREATE TABLE statement for the
# table.  Because the column declarations are just guesses, you'll likely
# want to edit the output, for example, to change a data type or
# length.  You may also want to add indexes.  Nevertheless, using this
# script can be easier than writing the CREATE TABLE statement by hand.

# Some assumptions:
# - Lines are tab-delimited, linefeed-terminated
# - Dates consist of 3 numeric parts, separated by - or /, in y/m/d order

# Here are some ways that guess_table.pl could be improved.  Each of
# them would make it smarter, albeit at the cost of increased processing
# requirements. Some of the suggestions are likely impractical for really
# huge files.

# - For numeric columns, use min/max values to better guess the type.
# - Keep track of the number of unique values in a column.  If there
#   aren't many, the column might be a good candidate for being an ENUM.
#   Testing should not be case sensitive, because ENUM columns are not
#   case sensitive.
# - Make the date guessing code smarter.  Have it recognize non-ISO format
#   and attempt to make suggestions that a column needs to be reformatted.
#   (This actually needs to see entire column, because that would help
#   it distinguish U.S. from British formats WRT order of month and day.)
#   This would need to track min/max for each of the three date parts.
# - If all values in a column are unique, suggest that it should be a PRIMARY
#   KEY or a UNIQUE index.
# - For DATETIME columns, allow some times to be missing without flagging
#   column as neither DATE nor TIME.

# Paul DuBois
# paul@kitebird.com
# 2002-01-31

# 2002-01-31
# - Created.
# 2002-02-19
# - Add code to track ranges for numeric columns and for the three date
#   subparts of columns that look like they contain dates.
# 2002-02-20
# - Added --lower and --upper options to force column labels to lowercase
#   or uppercase.
# 2002-03-01
# - For character columns longer than 255 characters, choose TEXT type based
#   on maximum length.
# 2002-04-04
# - Add --quote-names option to quote table and column names `like this`.
#   The resulting statement requires MySQL 3.23.6 or higher.
# 2002-07-16
# - Fix "uninitialized value" warnings resulting from missing columns in
#   data lines.
# - Don't attempt to assess date characteristics for columns that are always
#   empty.
# 2005-12-28
# - Make --quote-names the default, add --skip-quote-names option so that
#   identifier quoting can be turned off.
# - Default data type now is VARCHAR, not CHAR.
# 2006-06-10
# - Emit UNSIGNED for double/decimal columns if they're unsigned.

use strict;
use warnings;
use Getopt::Long;
$Getopt::Long::ignorecase = 0; # options are case sensitive
$Getopt::Long::bundling = 1;   # allow short options to be bundled

# ----------------------------------------------------------------------

# Create information structures to use for characterizing each column in
# in the data file.  We need to know whether any nonnumeric values are
# found, whether numeric values are always integers, and the maximum length
# of column values.

# Argument is the array of column labels.
# Creates an array of hash references and returns a reference to that array.

sub init_col_info
{
my @labels = @_;
my @col_info;

    for my $i (0 .. @labels - 1)
    {
        my $info = { };
        $info->{label} = $labels[$i];
        $info->{max_length} = 0;

        # these can be tested directly, so they're set false until found
        # to be true
        $info->{hasempty} = 0;    # has empty values
        $info->{hasnonempty} = 0; # has nonempty values

        # these can be assessed only by seeing all the values in the
        # column, so they're set true until discovered by counterexample
        # to be false
        $info->{numeric} = 1;     # used to detect general numeric types
        $info->{integer} = 1;     # used to detect INT
        $info->{nonnegative} = 1; # used to detect UNSIGNED
        $info->{temporal} = 1;    # used to detect general temporal types
        $info->{date} = 1;        # used to detect DATE
        $info->{datetime} = 1;    # used to detect DATETIME
        $info->{time} = 1;        # used to detect TIME

        # track min/max value for numeric columns
        $info->{min_val} = undef;
        $info->{max_val} = undef;

        # track min/max for each of three date parts
        $info->{date_range} = [ undef, undef, undef];

        push (@col_info, $info);
    }
    return (\@col_info);
}

sub print_create_table
{
my ($tbl_name, $col_info_list, $quote) = @_;
my $ncols = @{$col_info_list};
my $s;
my $extra = "";

    $quote = ($quote ? "`" : "");     # quote names?
    for my $i (0 .. $ncols - 1)
    {
        my $info = $col_info_list->[$i];

        $s .= ",\n" if $i > 0;
        $s .= $extra if $extra ne "";
        $extra = "";

        $s .= "  $quote$info->{label}$quote ";

        if (!$info->{hasnonempty})  # column is always empty, make wild guess
        {
            $s .= "CHAR(10)    /* NOTE: column is always empty */";
            next;
        }

        # if the column has nonempty values but one of
        # these hasn't been ruled out, that's a problem
        if ($info->{numeric} && $info->{temporal})
        {
            die "Logic error: $info->{label} was characterized as both"
    . " numeric and temporal\n";
        }

        if ($info->{numeric})
        {
            if ($info->{integer})
            {
    $s .= "INT";
## TO DO: use range to make guess about type
    # Print "might be YEAR" if in range...(0, 1901-2155)
            }
            else
            {
    $s .= "DOUBLE";
            }
            $s .= " UNSIGNED" if $info->{nonnegative};
        }
        elsif ($info->{temporal})
        {
            # if a date column looks more like a U.S. or British
            # date, add some comments to that effect
            if (exists ($info->{date_type}))
            {
    my $ref = $info->{date_type};
    $extra .= "  # $info->{label} might be a U.S. date\n"
            if $ref->{us};
    $extra .= "  # $info->{label} might be a British date\n"
            if $ref->{br};
            }
            if ($info->{date})
            {
    $s .= "DATE";
            }
            elsif ($info->{datetime})
            {
    $s .= "DATETIME";
            }
            elsif ($info->{time})
            {
    $s .= "TIME";
            }
            else
            {
    die "Logic error: $info->{label} is flagged as temporal, but"
        . " not as any of the temporal types\n";
            }
        }
        else
        {
            if ($info->{max_length} < 256)
            {
    $s .= "VARCHAR($info->{max_length})";
            }
            elsif ($info->{max_length} < 65536)
            {
    $s .= "TEXT";
            }
            elsif ($info->{max_length} < 16777216)
            {
    $s .= "MEDIUMTEXT";
            }
            else
            {
    $s .= "LONGTEXT";
            }
        }
        # if a column doesn't have empty values, guess that it cannot be NULL
        $s .= " " . ($info->{hasempty} ? "NULL" : "NOT NULL");
    }

    $s = "CREATE TABLE $quote$tbl_name$quote\n(\n$s\n);\n";
    print $s;
}

sub print_report
{
my $col_info_list = shift;
my $ncols = @{$col_info_list};
my $s;

    for my $i (0 .. $ncols - 1)
    {
        my $info = $col_info_list->[$i];

        printf "Column %d: %s\n", $i+1, $info->{label};
        if (!$info->{hasnonempty})  # column is always empty
        {
            print " column is always empty\n";
            next;
        }

        # if the column has nonempty values but one of
        # these hasn't been ruled out, that's a problem
        if ($info->{numeric} && $info->{temporal})
        {
            die "Logic error: $info->{label} was characterized as both"
    . " numeric and temporal\n";
        }

        print " column has empty values: "
    . ($info->{hasempty} ? "yes" : "no") . "\n";
        printf " column value maximum length = %d\n", $info->{max_length};

        if ($info->{numeric})
        {
            printf " column is numeric (range: %g - %g)\n",
                $info->{min_val}, $info->{max_val};
            if ($info->{integer})
            {
    print " column is integer\n";
    if ($info->{nonnegative})
    {
        print " column is nonnegative\n";
    }
            }
        }
        elsif ($info->{temporal})
        {
            if ($info->{date})
            {
    my $ref = $info->{date_range};
    print " column contains date values";
    printf " (part ranges: %d - %d, %d - %d, %d - %d)\n",
                $ref->[0]->{min}, $ref->[0]->{max},
                $ref->[1]->{min}, $ref->[1]->{max},
                $ref->[2]->{min}, $ref->[2]->{max};
    $ref = $info->{date_type};
    printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n",
                ($ref->{iso} ? "yes" : "no"),
                ($ref->{us} ? "yes" : "no"),
                ($ref->{br} ? "yes" : "no");
            }
            elsif ($info->{datetime})
            {
    my $ref = $info->{date_range};
    print " column contains date+time values";
    printf " (part ranges: %d - %d, %d - %d, %d - %d)\n",
                $ref->[0]->{min}, $ref->[0]->{max},
                $ref->[1]->{min}, $ref->[1]->{max},
                $ref->[2]->{min}, $ref->[2]->{max};
    $ref = $info->{date_type};
    printf " most likely date types: ISO: %s; U.S.: %s; British: %s\n",
                ($ref->{iso} ? "yes" : "no"),
                ($ref->{us} ? "yes" : "no"),
                ($ref->{br} ? "yes" : "no");
            }
            elsif ($info->{time})
            {
    print " column contains time values\n";
            }
            else
            {
    die "Logic error: $info->{label} is flagged as temporal, but"
        . " not as any of the temporal types\n";
            }
        }
        else
        {
            print " column appears to be a string"
    . " (cannot further narrow the type)\n";
        }
    }
}

# ----------------------------------------------------------------------

my $prog = "guess_table.pl";
my $usage = <<EOF;
Usage: $prog [options] [data_file]

Options:
--help
        Print this message
--labels, -l
        Interpret first input line as row of table column labels
        (default = c1, c2, ...)
--lower, --upper
        Force column labels to be in lowercase or uppercase
--quote-names, --skip-quote-names
     Quote or do not quote table and column identifiers with `` characters
     in case they are reserved words (default = quote identifiers)
--report , -r
        Report mode; print findings rather than generating a CREATE
        TABLE statement
--table=tbl_name, -t tbl_name
        Specify table name (default = t)
EOF

my $help;
my $labels;     # expect a line of column labels?
my $tbl_name = "t"; # table name (default: t)
my $report;
my $lower;
my $upper;
my $quote_names = 1;
my $skip_quote_names;

GetOptions (
    # =s means a string value is required after the option
    "help"              => \$help,            # print help message
    "labels|l"          => \$labels,          # expect row of column labels
    "table|t=s"         => \$tbl_name,        # table name
    "report|r"          => \$report,          # report mode
    "lower"             => \$lower,           # lowercase labels
    "upper"             => \$upper,           # uppercase labels
    "quote-names"       => \$quote_names,     # quote identifiers
    "skip-quote-names"  => \$skip_quote_names # don't quote identifiers
) or die "$usage\n";

die  "$usage\n" if defined $help;

$report = defined ($report);  # convert defined/undefined to boolean
$lower = defined ($lower);
$upper = defined ($upper);
$quote_names = defined ($quote_names);
$quote_names = 0 if defined ($skip_quote_names);

die "--lower and --upper were both specified; that makes no sense\n"
            if $lower && $upper;

my $line;
my $line_count = 0;
my @labels;   # column labels
my $ncols;    # number of columns
my $col_info_list;

# If labels are expected, read the first line to get them
if ($labels)
{
    defined ($line = <>) or die;
    chomp ($line);
    @labels = split (/\t/, $line);
}

# Arrays to hold line numbers of lines with too many/too few fields.
# The first line in the file is assumed to be representative.  The
# number of fields it contains becomes the norm against which any following
# lines are assessed.

my @excess_fields;
my @too_few_fields;

while (<>)
{
    chomp ($line = $_);
    ++$line_count;
    if (!defined ($ncols))  # don't know this until first data line read
    {
        # determine number of columns (assume no more than 10,000)
        my @val = split (/\t/, $line, 10000);
        $ncols = @val;
        if (@labels)  # label count must match data column count
        {
            die "Label count doesn't match data column count\n"
                if $ncols != @labels;
        }
        else      # if there were no labels, create them
        {
            @labels = map { "c" . $_ } 1 .. $ncols;
        }
        $col_info_list = init_col_info (@labels);
    }

    my @val = split (/\t/, $line, 10000);
    push (@excess_fields, $line_count) if @val > $ncols;
    push (@too_few_fields, $line_count) if @val < $ncols;
    for my $i (0 .. $ncols - 1)
    {
        my $val = ($i < @val ? $val[$i] : "");  # use "" if field is missing
        my $info = $col_info_list->[$i];

        $info->{max_length} = length ($val)
            if $info->{max_length} < length ($val);

        if ($val eq "")
        {
            # column does have empty values
            $info->{hasempty} = 1;
            next; # no other tests apply
        }
        $info->{hasnonempty} = 1;

        # perform numeric tests if no nonnumeric values have yet been seen

        if ($info->{numeric})
        {
            # numeric test (doesn't recognize scientific notation)
            if ($val =~ /^[-+]?(\d+(\.\d*)?|\.\d+)$/)
            {
    # not int if contains decimal point
    $info->{integer} = 0 if $val =~ /\./;
    # not unsigned if begins with minus sign
    $info->{nonnegative} = 0 if $val =~ /^-/;

    # track min/max value
    $info->{min_val} = $val
        if !defined ($info->{min_val}) || $info->{min_val} > $val;
    $info->{max_val} = $val
        if !defined ($info->{max_val}) || $info->{max_val} < $val;
            }
            else
            {
    # column contains nonnumeric information
    $info->{numeric} = 0;
    $info->{integer} = 0;
            }
        }

        # perform temporal tests if no nontemporal values have yet been seen

        if ($info->{temporal})
        {
            # date/datetime test
            # allow date, date hour:min, date hour:min:sec
            if (($info->{date} || $info->{datetime})
    && $val =~ /^(\d+)[-\/](\d+)[-\/](\d+)\s*(\d+:\d+(:\d+)?)?$/)
            {
    # it's not a time
    $info->{time} = 0;

    # not a date if time part was present; not a
    # datetime if no time part was present
    $info->{ defined ($4) ? "date" : "datetime" } = 0;

    # use the first three parts to track range of date parts
    my @val = ($1, $2, $3);
    my $ref = $info->{date_range};
    foreach my $i (0 .. 2)
    {
        # if this is the first value we've checked, create the
        # structure to hold the min and max; otherwise compare
        # the stored min/max to the current value
        if (!defined ($ref->[$i]))
        {
            $ref->[$i]->{min} = $val[$i];
            $ref->[$i]->{max} = $val[$i];
            next;
        }
        $ref->[$i]->{min} = $val[$i]
            if $ref->[$i]->{min} > $val[$i];
        $ref->[$i]->{max} = $val[$i]
            if $ref->[$i]->{max} < $val[$i];
    }
            }
            # time test
            # allow hour:min, hour:min:sec
            elsif ($info->{time} && $val =~ /^\d+:\d+(:\d+)?$/)
            {
    # it's not a date or datetime
    $info->{date} = 0;
    $info->{datetime} = 0;
            }
            else
            {
    # column contains nontemporal information
    $info->{temporal} = 0;
            }
        }
    }
}

die "Input contained no data lines\n" if $line_count == 0;
die "Input lines all were empty\n" if $ncols == 0;

# Look at columns that look like DATE or DATETIME columns and attempt
# to determine whether they appear to be in ISO, U.S., or British format.
# (Skip columns that are always empty, because these assessments cannot
# be made for such columns.)

for my $i (0 .. $ncols - 1)
{
    my $info = $col_info_list->[$i];
    next unless $info->{hasnonempty};
    next unless $info->{temporal} && ($info->{date} || $info->{datetime});
    my $ref = $info->{date_range};
    # assume that the column is valid as each of the types until ruled out
    my $valid_as_iso = 1; # [CC]YY-MM-DD
    my $valid_as_us = 1;  # MM-DD-[CC]YY
    my $valid_as_br = 1;  # DD-MM-[CC]YY
    # first segment is U.S. month, British day
    my $min = $ref->[0]->{min};
    my $max = $ref->[0]->{max};
    $valid_as_us = 0 if $min < 0 || $max > 12;
    $valid_as_br = 0 if $min < 0 || $max > 31;
    # second segment is U.S. day, British month, ISO month
    $min = $ref->[1]->{min};
    $max = $ref->[1]->{max};
    $valid_as_us = 0 if $min < 0 || $max > 31;
    $valid_as_br = 0 if $min < 0 || $max > 12;
    $valid_as_iso = 0 if $min < 0 || $max > 12;
    # third segment is ISO day
    $min = $ref->[2]->{min};
    $max = $ref->[2]->{max};
    $valid_as_iso = 0 if $min < 0 || $max > 31;
    if (!$valid_as_iso && !$valid_as_us && !$valid_as_br)
    {
        $info->{temporal} = 0;  # huh! guess it's not a date after all
    }
    else    # save date type results for later
    {
        $info->{date_type}->{iso} = $valid_as_iso;
        $info->{date_type}->{us} = $valid_as_us;
        $info->{date_type}->{br} = $valid_as_br;
    }
}


warn "# Number of lines = $line_count, columns = $ncols\n";
warn "# Number of lines with too few fields: " . scalar (@too_few_fields) . "\n"
                    if @too_few_fields;
warn "# Number of lines with excess fields: " . scalar (@excess_fields) . "\n"
                    if @excess_fields;

if ($report)
{
    print_report ($col_info_list);
}
else
{
    for my $i (0 .. $ncols - 1)
    {
        my $info = $col_info_list->[$i];
        $info->{label} = lc ($info->{label}) if $lower;
        $info->{label} = uc ($info->{label}) if $upper;
    }
    print_create_table ($tbl_name, $col_info_list, $quote_names);
}