
Regex: Formatting space-separated data as CSV

Tags: regex, csv, awk, formatting, pretty-print

For a long time now, I have been trying to format space-separated data into a CSV structure.

Initial situation

The initial data looks like this:

Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE    Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment   
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic    Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment   
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center     Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
It contains a lot of whitespace and unnecessary information. The fields are laid out like this:

Doctor's name | Degree | Years of experience | Specialization | Hospital name | Address | Fees | Schedule | and an unnecessary book appointment field.
I want to convert it to the following format:

Doctor's name,Specialization,Hospital name,Address,Fees,Schedule
So the data should now look like this:

 Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
 Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM   
 Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM
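So far I have managed to remove the Book Appointment field.

Problem

However, I am having trouble isolating the hospital name, because the spacing around it varies a lot. Is this problem solvable?

EDIT
The output of cat -A file looks like this: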

 Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE ^I Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment $
 Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic ^I Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment $
 Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center ^I Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment

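Unfortunately, given your input, there is no way to separate the specialization from the hospital name. The other fields can be captured, albeit inelegantly, with gawk (probably >= 4.0, but I think 3.x should work).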
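As a rough illustration of that idea, and not the gawk script itself, the fields that can be located unambiguously could be pulled out with a single regular expression. The sketch below is in Python; the names `LINE_RE` and `parse_line` and the group names are hypothetical. It assumes the layout shown in the cat -A output: the word "experience" before the specialization, a tab between the hospital name and the address, and a fee of the form "INR <digits>". The specialization and hospital name are deliberately left as one combined field.

```python
import re

# Assumed layout (from the cat -A output): name, degree, "N years experience",
# specialization + hospital, TAB, "area, city", "INR <fee>", schedule, "Book Appointment".
LINE_RE = re.compile(
    r"^\s*(?P<name>\S+\s+\S+\s+\S+)"       # first three words, e.g. "Dr. Arun Raykar"
    r".*?experience\s+"                     # skip degree and years of experience
    r"(?P<spec_and_hospital>[^\t]+?)"       # everything up to the tab, still unsplit
    r"\s*\t\s*"                             # the tab that precedes the address
    r"(?P<address>[^,]+),"                  # area name, up to the first comma
    r".*?(?P<fees>INR\s+\d+)\s+"            # fee such as "INR 250"
    r"(?P<schedule>\S+)"                    # schedule token, e.g. "MON-SAT7:00PM-9:00PM"
)


def parse_line(line):
    """Return a dict of the captured fields, or None if the line does not fit."""
    match = LINE_RE.search(line)
    return match.groupdict() if match else None


sample = (
    "Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist "
    "V2 E City Family Dental Center \t Electronics City, Bangalore "
    "INR 200 MON-SUN10:00AM-8:00PM Book Appointment"
)
print(parse_line(sample))
```

Lines that do not fit the assumed layout return None, so they can be collected and inspected by hand rather than silently mangled.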


There is no direct way to tell the specialization apart from the hospital name, but under a few assumptions you can do it with perl:

perl -pe 's/^(\S+\s+\S+\s+\S+).+experience\s([^\t]+?)\s+(\b[A-Z0-9]{2}[^\t]+?|(?:(?!\b[A-Z0-9]{2})[^\t])*)\s+\t\s+([^,]+,).+?(INR.+?PM)\s+.*/\1,\2,\3,\4\5/' file
This gives:

Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250 MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200 MON-SUN10:00AM-8:00PM
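Since this is a Perl-style regular expression, you can paste it into a regex debugger to see how it works. The regex is quite simple, but the fact that it has many parts makes it look daunting.

Caveat: the above separates the specialization from the hospital name using two rules:

  • It looks for the first whitespace that is followed by two uppercase characters or digits and, when it finds one, starts matching the hospital name there; or
  • If there are no two consecutive uppercase characters or digits, it takes only the first word as the specialization and the remaining words as the hospital name.

I know this may not solve every case, since there will always be some lines that do not fit the rules above, but it should get you started on cleaning them up. If something is split incorrectly (i.e., when the specialization contains more than one word and the hospital name has no two consecutive uppercase letters or digits), the first word of the specialization will be placed correctly and the rest will end up in the hospital name.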
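The two rules can also be written down directly. The following is a minimal Python sketch of the rules as stated in the prose, not a line-for-line port of the Perl regex, so edge cases may differ. The names `HOSPITAL_START` and `split_specialization` are hypothetical, and the input is assumed to be the text found between "experience" and the tab.

```python
import re

# Rule 1: whitespace followed by two uppercase letters or digits marks
# the start of the hospital name.
HOSPITAL_START = re.compile(r"\s+(?=[A-Z0-9]{2})")


def split_specialization(text):
    """Split 'specialization + hospital' text into a (specialization, hospital) pair."""
    text = text.strip()
    match = HOSPITAL_START.search(text)
    if match:
        return text[:match.start()], text[match.end():]
    # Rule 2 (fallback): first word is the specialization, the rest is the hospital.
    specialization, _, hospital = text.partition(" ")
    return specialization, hospital.strip()


for combined in (
    "Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE",
    "Homeopath Sankirana Homeopathic Clinic",
    "Dentist V2 E City Family Dental Center",
):
    print(",".join(split_specialization(combined)))
```

The failure mode described above shows up as expected: a made-up input such as "General Physician Apollo Clinic" falls through to the fallback rule and is split into "General" and "Physician Apollo Clinic".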

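    It looks like there are some tabs in your original file. Could you run the command cat -A file and update the question with the output?
    I have added the cat -A output in the EDIT section. Is there any way to get some kind of separation between the specialization and the hospital name?
    The horizontal tab is another very commonly used value separator; the comma is not the only character used to separate values. I now wonder whether you removed the delimiter by replacing the tabs with spaces. A tab-separated CSV file can easily be reformatted into a comma-separated CSV file with the desired data in the desired order. CSV files that use a tab as the delimiter can of course also be opened in Excel.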