Detecting the tab character in Python

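I am trying to separate words from integers in a particular file. The file is laid out so that lines containing a word have no '\t' character, while lines with integers (all of them positive) do. Some of the words are numbers that contain a '-' character.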

So my idea was to tell words and numbers apart by trying to convert each token on the line to a float:

import codecs

def is_number(s):
    # float() raises ValueError for anything it cannot parse
    try:
        float(s)
        return True
    except ValueError:
        return False

with codecs.open("/media/New Volume/3rd_step.txt", 'Ur') as file:  # open file
    for line in file:  # read line by line
        temp_buffer = line.split()  # split on any whitespace, tabs included
        for word in temp_buffer:
            # a number: parses as a float and contains no '-'
            if '-' not in word and is_number(word):
            ....
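
One thing that may be worth checking when the float-based test seems to misfire: float() accepts more than plain digits, so tokens such as "1e5", "nan" or "inf" count as numbers. The sketch below compares the helper from the question with a stricter check. The name is_positive_integer and the sample tokens are hypothetical, and the stricter rule (ASCII digits only) is an assumption about what a number looks like in this file.

```python
def is_number(s):
    """Return True if float() accepts the string (the helper from the question)."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def is_positive_integer(s):
    """Stricter, assumed rule: a non-empty run of ASCII digits and nothing else."""
    return s.isascii() and s.isdigit()


# Hypothetical tokens, not taken from the real file.
samples = ["42", "1e5", "nan", "inf", "12-34", "word"]

# Map each token to (float-based result, digit-based result).
results = {s: (is_number(s), is_positive_integer(s)) for s in samples}

for token, (as_float, as_digits) in results.items():
    print(token, as_float, as_digits)
```

Tokens where the two columns disagree are the ones the float-based test would treat as numbers even though they are not plain integers.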

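So if the token is a word I get the exception, and if not, it is a number. The file is 50 GB, and somewhere in the middle its format seems to go wrong. That leaves the \t character as the only reliable way to separate words from numbers. But how can I detect it? Once I split the line to get the tokens, the special characters are gone.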

Edit:

I was being really silly, sorry for wasting your time. I think this way is easier:

import codecs

with codecs.open("/media/D60A6CE00A6CBEDD/InvertedIndex/1.txt", 'Ur') as file:  # open file
    for line in file:  # read line by line
        if '\t' not in line:  # no tab, so this line holds a word
            print(line)
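
The same idea can be sketched as a generator that checks each raw line for a tab before any splitting happens, so the tab is never thrown away. This is only an illustration: split_by_tab and the sample lines are hypothetical, and the position of the tab inside the integer lines is an assumption, since the question does not show the real layout.

```python
import io


def split_by_tab(lines):
    """Yield ('number', fields) for lines containing a tab, ('word', text) otherwise."""
    for line in lines:
        text = line.rstrip("\r\n")  # strip only the line ending, keep tabs intact
        if "\t" in text:
            yield "number", text.split("\t")  # split on the tab alone
        else:
            yield "word", text


# Hypothetical sample data standing in for the real file.
sample = io.StringIO("apple\n12\t7\n3-4\n")
classified = list(split_by_tab(sample))

for kind, value in classified:
    print(kind, value)
```

Because the generator consumes one line at a time, it can be fed an open file object directly, which matters for a file of this size.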

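You should try passing an argument to split() instead of relying on the default, which splits on all whitespace characters. You can first split on all whitespace except \t. Try this: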

import string

white_str = list(string.whitespace)    # string.whitespace contains all whitespace.
white_str.remove("\t")                 # Remove \t
white_str = ''.join(white_str)         # New whitespace string, without \t

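Then, instead of split(), use split(white_str), with the aim of splitting the line at all whitespace except \t to get the strings; you can then detect the \t you need afterwards. One caveat: str.split treats a multi-character argument as a single separator rather than a set of characters, so splitting on any one of the characters in white_str requires something like re.split with a character class built from it.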

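
A minimal sketch of splitting on any whitespace character other than the tab, using re.split with a character class built from white_str. The helper name split_keep_tabs and the sample line are hypothetical. The naive line shows why passing white_str straight to str.split does not do this: the whole string is taken as one separator.

```python
import re
import string

# All whitespace characters except the tab.
white_str = string.whitespace.replace("\t", "")

# Character class matching one or more of those characters.
non_tab_ws = re.compile("[" + re.escape(white_str) + "]+")


def split_keep_tabs(line):
    """Split on whitespace other than tab, dropping empty pieces."""
    return [piece for piece in non_tab_ws.split(line) if piece]


# Hypothetical line: a word, then two numbers joined by a tab.
pieces = split_keep_tabs("word 12\t7\n")

# The tab survives inside the piece it belongs to, so it can be detected later.
has_tab = ["\t" in piece for piece in pieces]

# str.split looks for the entire white_str as a single separator,
# which does not occur in "a b", so nothing is split here.
naive = "a b".split(white_str)
```

Each piece that still contains "\t" can then be split again on the tab alone.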