Writing to CSV from a tuple (strings) in Python

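To explain my current problem, here is some background on the broader problem:

I have a large text file made up of multiple documents. I need to find a way to organize this file into its component parts. Unfortunately, the individual documents all have different formats; the only thing they share is that each document's header contains a date, always written in the same format: dd MONTH yyyy. I use the dates as bookends and separate out the text between them.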

import regex  # third-party module; supports the variable-width lookbehind used below

#every date header has the form "dd MONTH yyyy", e.g. "27 JULY 2004"
months = "JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER"
date = r"\d{1,2}\s(?:" + months + r")\s\d{4}"

#the date pattern with positive lookbehind
bookend_1 = "(?<=" + date + ")"

#The date pattern with positive lookahead
bookend_2 = "(?=" + date + ")"

#using the bookends to find the text in between dates
docs = regex.findall(bookend_1+'(.*?)'+ bookend_2, psc_comm_raw, regex.DOTALL|regex.MULTILINE)
Here are a few rows of psc_comm_tuple:

[('27 JULY 2004',
  ' ADDIS ABABA, ETHIOPIA\n\nPSC/PR/Comm.(XIII)\n\nCOMMUNIQUÉ\n\nPSC/PR/Comm.(XIII) Page l\n\nCOMMUNIQUÉ OF THE THIRTEENTH MEETING OF THE PEACE AND SECURITY COUNCIL\n\nThe Peace and Security Council (PSC) of the African Union (AU), at its thirteenth meeting, held on 27 July 2004, adopted the following communiqué on the crisis in the Darfur region of the Sudan:\n\nCouncil,\n\n1.\tReiterates its deep concern over the grave situation that still prevails in the Darfur region of the Sudan, in particular the continued attacks by the Janjaweed militia against the civilian population, as well as other human rights abuses and the humanitarian crisis;\n\n2.\tUnderlines the urgent need to implement decision AU/Dec.54(111) on Darfur, adopted by the 3rd Ordinary Session of the Assembly...'),
 ('29 JANUARY 2001',
  '\n\nThe Central Organ of the OAU Mechanism for Conflict Prevention, Management and Resolution held its seventy-third * ordinary session at the level of Ambassadors on 29 January 2001, in Addis Ababa. The session was chaired by Ambassador Kati Ohara Korga, Permanent Representative of Togo to the OAU.\n\nHaving considered the Report of the Secretary General on the Democratic Republic of the Congo (DRC) and the situation in that country, the Central Organ:\n\n1.\tstrongly condemns the assassination of Pre...'),
 ('20 MARCH 2001',
  "\n\nThe Central Organ of the OAU Mechanism for Conflict Prevention, Management and Resolution held its 74th ordinary session at ambassadorial level, in Addis Ababa, Ethiopia, on Tuesday March 20, 2001. The session was chaired by Ambassador Ohara Korga, Permanent representative of Togo to the OAU...."),
 ('22 AUGUST 2001',
  '\n\nThe Central Organ of the OAU Mechanism for Conflict Prevention, Management and Resolution held its 75th Ordinary Session at Ambassadorial level in Addis Ababa, Ethiopia, on Wednesday 22 August 2001....')...]
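For comparison, (date, text) pairs like the ones above can also be produced with a single split on the date pattern, using only the standard re module. This is a minimal sketch, not the asker's method: the names MONTHS, DATE_RE and split_on_dates are hypothetical, and it assumes the header dates are always upper-case, as in the sample rows.

```python
import re

MONTHS = ("JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|"
          "SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER")

# One capturing group around the whole date, so re.split keeps the dates
# in its result instead of discarding them.
DATE_RE = re.compile(r"(\b\d{1,2}\s(?:" + MONTHS + r")\s\d{4}\b)")


def split_on_dates(raw):
    """Return (date, text) pairs; any text before the first date is dropped."""
    parts = DATE_RE.split(raw)
    # parts looks like [preamble, date1, text1, date2, text2, ...]
    return list(zip(parts[1::2], parts[2::2]))


sample = ("27 JULY 2004 ADDIS ABABA\n\nCouncil,\n"
          "29 JANUARY 2001\n\nThe Central Organ met on 29 January 2001.")
pairs = split_on_dates(sample)
for date, text in pairs:
    print(repr(date), repr(text))
```

Because the pattern is case-sensitive, the mixed-case "29 January 2001" inside the body text is not treated as a header. Unlike the lookahead version, the split also keeps the text that follows the last date.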
My ultimate goal is to create a CSV with two columns: one for the date and one for the body of text associated with that date.

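import csv

# newline='' lets the csv module control line endings itself; utf-8 keeps characters such as É intact
with open('psc_comm.csv', 'w', newline='', encoding='utf-8') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['date', 'text'])
    # each row is a (date, text) tuple, so it maps directly onto the two columns
    for row in psc_comm_tuple:
        csv_out.writerow(row)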
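When I write the tuple output to the csv, some rows are perfectly fine. But some of the output becomes garbled: the text is split into seemingly random chunks, with blank rows and row after row of sentence fragments. There are hundreds of such cases. When I go back to the original document and find the places where the breaks occur, I can't see anything special or unique about the text itself. There are no special characters. It's just plain text. They do, however, seem to be particularly long sections of text, so I wonder whether there is a limit on how much information a single cell of a CSV file can hold.

My question is: why is the CSV output so funky in some places and not in others? Is there a limit on how much text you can put in each cell?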


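You haven't provided enough information to pinpoint the problem, but Excel tends to have trouble reading CSV cells with embedded newlines, so my first guess is that this is the issue: you have a CSV with embedded newlines, which csvwriter has probably written in a reversible way, but which Excel cannot parse correctly.

In other words, your CSV file is probably fine; the problem lies only in how it is read into Excel. You don't say how you determined that there is a problem.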
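One way to check whether the file itself is intact is to read it back with csv.reader and compare the result with the original tuples. The following is a minimal sketch under assumptions: the sample rows are shortened, made-up stand-ins for psc_comm_tuple, and an in-memory buffer replaces the real file.

```python
import csv
import io

# Shortened stand-ins for the real rows, with embedded newlines, tabs,
# commas and quotes in the text column.
rows = [
    ("27 JULY 2004",
     " ADDIS ABABA, ETHIOPIA\n\nCOMMUNIQUÉ\n\nCouncil,\n\n1.\tReiterates its deep concern"),
    ("29 JANUARY 2001",
     '\n\nThe Central Organ said "no", then adjourned.'),
]

# newline="" keeps the buffer from translating line endings, mirroring
# open(..., newline="") for a real file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["date", "text"])
writer.writerows(rows)

# Read the same data back with the csv module.
buffer.seek(0)
read_back = [tuple(record) for record in csv.reader(buffer)]

# A tool that splits on line breaks sees many more "rows" than csv.reader does.
physical_lines = buffer.getvalue().splitlines()

print(read_back[1:] == rows)
print(len(read_back), "records,", len(physical_lines), "physical lines")
```

If the records read back match the originals, the writer has quoted the embedded newlines correctly, and any fragmentation seen elsewhere comes from the program that opens the file.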


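If your goal is to produce something Excel can read, I would drop the CSV format and write a spreadsheet directly. There is a module that can generate xlsx documents, and it works very well.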

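There is a limit on the amount of text you can see in a spreadsheet; could a chunk of text be getting truncated in Excel? What happens if you open the file as plain text?

Welcome to Stack Overflow, @chickpeaze. Throwing your problem to the net is a nice try, but you are more likely to get help if you narrow your question down. At the very least, give an example of the data you are working with: edit your question and add a few rows from psc_comm_tuple.

It is also possible that csv.writer is adding escapes for characters that would break the format (such as commas), and that your program is not interpreting them the way csv.writer intended.

I think you mean "edit your question".

@alexis, thanks for the tip! I've added a few rows above, and I'll make sure to keep doing that in future questions. You are absolutely right: the problem is Excel. I opened the csv file in TextEdit and the problem entries are actually completely fine. Thank you!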
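On the question of size limits raised in the comments: the CSV format itself sets no per-cell limit, while Excel documents a maximum of 32,767 characters per cell. A rough sketch for flagging rows that exceed that limit is shown below; the function name oversized_rows and the sample data are hypothetical.

```python
EXCEL_CELL_LIMIT = 32767  # Excel's documented maximum number of characters per cell


def oversized_rows(rows, limit=EXCEL_CELL_LIMIT):
    """Yield (index, date, length) for every row whose text exceeds the limit."""
    for index, (date, text) in enumerate(rows):
        if len(text) > limit:
            yield index, date, len(text)


# Made-up stand-in for psc_comm_tuple: one short text and one very long one.
sample_rows = [
    ("27 JULY 2004", "short text"),
    ("29 JANUARY 2001", "x" * 40000),
]

flagged = list(oversized_rows(sample_rows))
for index, date, length in flagged:
    print(f"row {index} ({date}) has {length} characters")
```

Rows reported by such a check would be candidates for truncation or splitting when the file is opened in Excel, even though the CSV file stores them in full.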