Replace rogue double quotes in a vector in R


r regex


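I have a broken CSV file in which long text fields contain double quotes and commas. I have been able to clean it up to some extent, and now have the tab-separated fields as a vector of whole lines (each element is one line).

I then write temp to a file and read it back in (I found this to be much faster than textConnection). However,

read.table("temp", sep="\t", quote="\"", encoding="UTF-8", colClasses="character")

chokes on some lines and gives me this message:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 66951 did not have 29 elements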
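One way to check whether the tab structure itself is intact is to count the tabs on each line directly, with no quote handling involved. The sketch below does this on a made-up vector; temp_demo and expected_fields are hypothetical stand-ins for temp and the 29 fields in the error message. If every line reports the expected count, the tabs are fine and the error most likely comes from how the quotes are interpreted.

```r
# Hypothetical stand-in for `temp`: three lines, the second is one field short.
temp_demo <- c("a\tb\tc", "d\te", "f\t\tg")
expected_fields <- 3  # assumption for this demo; 29 in the question

# Count literal tabs per line; number of fields = number of tabs + 1.
tab_hits <- gregexpr("\t", temp_demo, fixed = TRUE)
n_fields <- lengths(regmatches(temp_demo, tab_hits)) + 1

# Indices of lines whose raw field count differs from the expected one.
bad_lines <- which(n_fields != expected_fields)

print(n_fields)
print(bad_lines)
```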

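I think this is caused by rogue double quotes, as shown below (the rogue quote can be found right after "TripAdvisor de la sant?")

I propose to replace the rogue double quotes with single quotes, but I have to preserve the intended quotes. Intended quotes are immediately before or after a delimiter (tab), at the start of the line (first one only) and at the end of the line. I wrote the following attempt, a regular expression with lookarounds for tabs and for the start and end of the line, but it does not work: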

temp <- gsub("(?<![^\t])\"(?![\t$])", "'", temp, perl = T)

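Your (?<![^\t])\"(?![\t$]) pattern matches a " that is not preceded by a character other than a tab (so there must be a tab or the start of the string before the ") and that is not followed by a tab or a literal $ symbol. Inside a character class, ^ and $ lose their anchor meaning: a leading ^ negates the class and $ is just a dollar sign. As a result, the pattern targets the opening quotes of fields rather than the rogue ones.

Replace the character classes with alternation groups:

gsub("(?<!\t|^)\"(?!\t|$)", "'", temp, perl=TRUE)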
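To see the difference between the two patterns, here is a small sketch on a made-up line. The string x is hypothetical sample data, not taken from the question's file: three quoted, tab-separated fields, with a rogue pair of quotes inside the middle one.

```r
# Hypothetical sample line: three quoted fields, rogue quotes around hi.
x <- "\"id\"\t\"He said \"hi\" to me\"\t\"end\""

# Original attempt: inside the classes ^ negates and $ is a literal dollar,
# so it matches quotes that start the string or follow a tab.
attempt <- gsub("(?<![^\t])\"(?![\t$])", "'", x, perl = TRUE)

# Alternation version: leave quotes next to a tab or at either end of the
# string alone, and replace every other quote.
cleaned <- gsub("(?<!\t|^)\"(?!\t|$)", "'", x, perl = TRUE)

cat(attempt, cleaned, sep = "\n")
```

A rogue quote that sits directly next to a tab, or at the very start or end of a line, looks the same as an intended one to this pattern, so such lines would still need separate handling.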


temp[181]
[1] "198\torganizations/playfusion\tplayfusion\torganizations/playfusion\torganization/playfusion\tPlayFusion\t\tPlayFusion is a developer of computer games.\tPlayFusion is pioneering the next generation of connected interactive entertainment. PlayFusion's proprietary technology platform fuses video games, robotics, toys, and trans-media entertainment. The company is currently working on its own original IP to trail-blaze its vision ahead of opening its platform to others.    PlayFusion is an independent, employee-owned company with offices in Cambridge and Derby in the UK, Douglas in the Isle of Man, and New York and San Francisco in the USA.\thttp://public.crunchbase.com/t_api_images/v1475688372/xnhrd4t254pxj6yxegzt.png\tcompany\t\t\t\t\t2015-01-01\t4\tFALSE\t\t0\t11\t50\t\t\t0\t0\thttp://playfusion.com/#intro\t1475688521\t1475899292"
