Replacing rogue double quotes in a vector in R
I have a broken CSV file in which long text fields contain double quotes and commas. I have been able to clean it up to some extent, and now have the tab-separated fields as a vector of whole rows (each value is one row). I then write temp out as a file and read it back in (I found this to be much faster than textConnection). However,
read.table("temp", sep = "\t", quote = "\"", encoding = "UTF-8", colClasses = "character")
chokes on certain lines and gives me the following message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 66951 did not have 29 elements
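One way to narrow down an error like this is to check the tab structure separately from the quoting. The sketch below assumes the data is a character vector with one tab-separated row per element. The `rows` vector and the `count_fields` helper are made-up illustrations, not part of the original file or of any package.

```r
# Hypothetical stand-in for `temp`: three rows, three tab-separated fields each.
rows <- c(
  "\"1\"\t\"alpha\"\t\"ok\"",
  "\"2\"\t\"say \"hi\" now\"\t\"rogue quotes inside\"",
  "\"3\"\t\"gamma\"\t\"ok\""
)

# Number of fields = number of tabs + 1, counted without any quote handling.
count_fields <- function(x) {
  lengths(regmatches(x, gregexpr("\t", x, fixed = TRUE))) + 1L
}

n_fields <- count_fields(rows)
bad_tab_rows <- which(n_fields != 3L)  # rows whose tab count is off

# Double quotes per row; a count above 2 * (number of fields) hints at embedded quotes.
n_quotes <- lengths(regmatches(rows, gregexpr("\"", rows, fixed = TRUE)))
suspect_rows <- which(n_quotes > 2L * n_fields)
```

If `bad_tab_rows` is empty while `suspect_rows` is not, the tabs are consistent and the embedded quotes are the more likely cause of the field-count error.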
I think this is caused by rogue double quotes, such as the one that appears immediately after "TripAdvisor de la sant?" in the offending line.
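I propose replacing the rogue double quotes with single quotes, but I have to preserve the intended quotes. Intended quotes sit immediately before or after a delimiter (tab), at the start of a line (first field only), or at the end of a line. I wrote the following regex attempt, with lookarounds for tabs and for the start and end of the line, but it does not work: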
temp <- gsub("(?<![^\t])\"(?![\t$])", "'", temp, perl = T)
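Your regex (?<![^\t])\"(?![\t$]) matches a " that is not preceded by a character other than a tab (so a tab or the start of the string must come right before the "), and that is not followed by a tab or a literal $ symbol.
Inside a character class, ^ and $ lose their meaning as anchors: a leading ^ negates the class, and $ is an ordinary character.
Replace the character classes with alternation groups: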
gsub("(?<!\t|^)\"(?!\t|$)", "'", temp, perl=TRUE)
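To make the difference between the two patterns concrete, here is a small sketch on a made-up row. The `line_in` string is hypothetical, not taken from the real file. It applies both patterns, then writes the cleaned row to a temporary file and reads it back with the `read.table` arguments from the question.

```r
# Hypothetical row: three quoted fields, the second with embedded quotes.
line_in <- "\"a\"\t\"say \"hi\" now\"\t\"c\""

# Original attempt: ^ and $ sit inside character classes, so they are not anchors.
# It rewrites the legitimate opening quotes and leaves the embedded ones alone.
attempt <- gsub("(?<![^\t])\"(?![\t$])", "'", line_in, perl = TRUE)

# Corrected pattern: alternation keeps ^ and $ as anchors, so only quotes that
# are neither next to a tab nor at the start/end of the line are replaced.
cleaned <- gsub("(?<!\t|^)\"(?!\t|$)", "'", line_in, perl = TRUE)

# Round trip through a temporary file, as described in the question.
tmp <- tempfile()
writeLines(cleaned, tmp)
df <- read.table(tmp, sep = "\t", quote = "\"", encoding = "UTF-8",
                 colClasses = "character")
unlink(tmp)
```

Note that this rule cannot tell a rogue quote from an intended one when the rogue quote sits directly next to a tab or at the very end of a line. Rows like that would still need separate handling.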
For reference, one element of temp (a complete tab-separated row):
temp[181]
[1] "198\torganizations/playfusion\tplayfusion\torganizations/playfusion\torganization/playfusion\tPlayFusion\t\tPlayFusion is a developer of computer games.\tPlayFusion is pioneering the next generation of connected interactive entertainment. PlayFusion's proprietary technology platform fuses video games, robotics, toys, and trans-media entertainment. The company is currently working on its own original IP to trail-blaze its vision ahead of opening its platform to others. PlayFusion is an independent, employee-owned company with offices in Cambridge and Derby in the UK, Douglas in the Isle of Man, and New York and San Francisco in the USA.\thttp://public.crunchbase.com/t_api_images/v1475688372/xnhrd4t254pxj6yxegzt.png\tcompany\t\t\t\t\t2015-01-01\t4\tFALSE\t\t0\t11\t50\t\t\t0\t0\thttp://playfusion.com/#intro\t1475688521\t1475899292"