Python: remove lines that contain a tab character


How to remove lines that contain a tab

I have a file that looks like this:

0   absinth
Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

1   acidophilus milk
Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

2   adobo
Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.
The desired output has the lines containing tabs removed, i.e.:

Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.
I could do the following in Python to get that result:

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
  for line in fin:
    if '\t' in line:
      continue
    else:
      fout.write(line)
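But I have millions of lines, and this is not very efficient. So I tried using cut to drop the second column and then removing the lines that contain only a single character: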

$ cut -f1 WIKI_WN_food | awk 'length>1' | less
What is a more Pythonic way to get the desired output?

Is there a more efficient way than the cut + awk pipeline I showed above?

You can try tr:

tr -d " \t" < tabbed-file.txt > sanitized-file.txt
--

You can also try sed.

To remove all whitespace, including tabs, from the left up to the first word, issue:

echo "This is a test" | sed -e 's/^[ \t]*//'

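If your code works, you can try to optimize it by looking only at the beginning of the string:

if '\t' not in l[:5]: fout.write(l)
where the length of the substring depends on the largest record number. It might make a difference for long strings that do not match, who knows.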
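
The prefix idea can be sketched as follows. This is only an illustration under the assumption that the tab always directly follows the record number; `drop_tabbed_lines` and `prefix_len` are hypothetical names, not part of the original answer. The last sample line shows why the prefix length has to cover the largest record number.

```python
def drop_tabbed_lines(lines, prefix_len=None):
    """Yield lines without a tab.

    If prefix_len is given, only that many leading characters are inspected.
    """
    for line in lines:
        head = line if prefix_len is None else line[:prefix_len]
        if '\t' not in head:
            yield line


sample = ["0\tabsinth\n", "Bohemian-style absinth\n", "\n", "15731132\tadobo\n"]

# Full scan: every line containing a tab is dropped.
full_scan = list(drop_tabbed_lines(sample))

# Prefix scan with a prefix that is too short: the tab in the last line
# sits at index 8, outside line[:5], so that line is wrongly kept.
short_prefix = list(drop_tabbed_lines(sample, prefix_len=5))

print(full_scan)
print(short_prefix)
```

Whether the shorter search actually saves time is something to measure on your own data rather than assume.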

In addition, you may want to test mawk, grep, etc., as in

# Edit : the following won't work. it strips also blank lines
# mawk -F"\t" "NF==1"  original > stripped
grep -vF "\t"        original > stripped
sed -e "/\t/d"       original > stripped
and see whether they are faster than the Python solution.
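
One thing to watch when trying the commands above: inside double quotes, common shells hand "\t" to grep as a backslash followed by the letter t, and with -F that two-character string is searched for literally. A minimal sketch, assuming grep is available on PATH, that sidesteps shell quoting by calling grep from Python with a real tab character. `grep_command` and `filter_with_grep` are hypothetical helper names.

```python
import subprocess


def grep_command(path):
    """Build the argument list for grep; "\t" here is one real tab character."""
    return ["grep", "-vF", "\t", path]


def filter_with_grep(src, dst):
    """Write to dst every line of src that does not contain a tab."""
    with open(dst, "wb") as fout:
        result = subprocess.run(grep_command(src), stdout=fout)
    # grep exits with 1 when no line is selected; values above 1 mean an error.
    if result.returncode > 1:
        raise RuntimeError(f"grep failed with exit status {result.returncode}")


cmd = grep_command("original")
print(cmd)
```

Because the arguments are passed as a list, no shell is involved and the pattern reaches grep exactly as Python built it.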

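Tests: on my system I built a file by copying yours over and over. Its size is 1418973184 bytes. My approximate timings are: grep 1.6 s, sed 6.4 s, Python 4.6 s. The Python run time does not depend on whether the whole string or only a substring is searched.

Addendum: I tested Jidder's awk solution (the one in the comments on the OP) with mawk, and my time is about 3.2 s. So, for what it's worth... the winner is grep -vF.

Test notes: run times vary by about 0.1 s between executions, and I report only one run per command here... for results that are close, no firm conclusion can be drawn.

On the other hand, where different tools give results that differ by far more than the experimental error, I think we can draw some conclusions.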

% ls -l original 
-rw-r--r-- 1 boffi boffi 1418973184 Dec  8 21:33 original
% cat doit.py
from sys import stdout
with open('original', 'r') as fin:
  for line in fin:
    if '\t' in line: continue
    else: stdout.write(line)
% time wc -l original 
15731133 original

real    0m0.407s
user    0m0.184s
sys     0m0.220s
% time python doit.py | wc -l
12584034

real    0m5.334s
user    0m4.880s
sys     0m1.428s
% time grep -vF "       "  original | wc -l
12584035

real    0m1.527s
user    0m1.112s
sys     0m1.400s
% time grep -v "        "  original | wc -l
12584035

real    0m1.556s
user    0m1.120s
sys     0m1.436s
% time sed -e "/\t/d"  original | wc -l
12584034

real    0m6.481s
user    0m6.104s
sys     0m1.404s
% time mawk '!/\t/'  original | wc -l
12584035

real    0m3.059s
user    0m2.608s
sys     0m1.488s
% time gawk '!/\t/'  original | wc -l
12584035

real    0m9.148s
user    0m8.680s
sys     0m1.468s
% 
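My sample file has a truncated last line, which is why the line counts differ by one between Python and sed on one side and all the other tools on the other.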
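
A possible reading of the off-by-one counts, sketched under the assumption that wc -l counts newline characters and that the Python filter writes the final, unterminated line back unchanged (tools that terminate every output line would then report one more). The data below is made up for illustration, and `filter_tabs` is a hypothetical helper.

```python
import io


def filter_tabs(text):
    """Return text with every line containing a tab removed."""
    return "".join(line for line in io.StringIO(text) if "\t" not in line)


# The last line has no trailing newline, like a truncated file.
data = "keep 1\n0\tdrop\nkeep 2"
out = filter_tabs(data)

newline_count = out.count("\n")      # what a newline counter sees
line_count = len(out.splitlines())   # how many lines a reader sees

print(newline_count, line_count)
```

Here the filtered output holds two lines but only one newline character, which matches the kind of difference seen in the session above.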

grep -v '\t' file

Try using grep with Perl-style regular expressions:

grep -vP "\t" file.in > file.out

If using filter gives you an advantage, try

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    fout.write(''.join([line for line in filter(
             lambda l: r'\t' not in l, fin.readlines())]))
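Test whether the condition r'\t' not in l works for your file. You may need to test for a run of spaces instead of \t, possibly with a regular expression. I had to put the literal characters \t into file.txt for the code to work. That is why I tried a regex substitution instead: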

import re

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    fout.write(re.sub(r'^\d+\s{2,}[^\n]+', '', fin.read(), count=0, flags=re.M))
Only now I get an empty line in place of each line you want to eliminate.

Got it: the regular expression needs a \n at the end to work:

    fout.write(re.sub(r'^\d+\s{2,}[^\n]+\n', '', fin.read(), count=0, flags=re.M))
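
The behaviour described above follows from how Python treats raw strings: r'\t' is two characters (a backslash and a t), while '\t' is a single tab. The regex engine, however, interprets \t inside a raw pattern as a tab. A small sketch under the assumption that the input contains real tab characters; the sample text is abbreviated from the question and the pattern is an illustration, not the answer's own.

```python
import re

# A raw string keeps the backslash, so it is two characters long.
raw_len, tab_len = len(r'\t'), len('\t')

text = (
    "0\tabsinth\n"
    "Bohemian-style absinth\n"
    "\n"
    "1\tacidophilus milk\n"
    "Sweet acidophilus milk\n"
)

# The `in` operator compares characters literally, so only '\t' finds the tab.
raw_found, tab_found = r'\t' in text, '\t' in text

# In a pattern, the regex engine itself turns \t into a tab.
# The trailing \n? removes the line break together with the line.
pattern = re.compile(r'^[^\n]*\t[^\n]*\n?', flags=re.M)
cleaned = pattern.sub('', text)

print(raw_len, tab_len, raw_found, tab_found)
print(cleaned)
```

Blank lines in the input survive, because the pattern only matches lines that contain a tab.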

You can do this with sed:

sed '/\t/d' 'my_file'

It looks for "\t" and deletes the lines that contain it.
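
For comparison, a rough Python analogue of that delete-matching-lines behaviour, written against byte streams so no text decoding is involved. This is a sketch only; `delete_tab_lines` is a hypothetical name, and whether it is competitive with sed or grep would have to be measured.

```python
import io


def delete_tab_lines(src, dst):
    """Copy src to dst, skipping every line that contains a tab byte."""
    dst.writelines(line for line in src if b"\t" not in line)


# In-memory streams stand in for real files in this example.
src = io.BytesIO(b"2\tadobo\nAdobo\n\n")
dst = io.BytesIO()
delete_tab_lines(src, dst)
result = dst.getvalue()

print(result)
```

With real files, the same function can be given objects opened with open(path, 'rb') and open(path, 'wb'), or sys.stdin.buffer and sys.stdout.buffer to use it in a pipeline.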

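That will only delete the tab characters themselves. The second one won't work because there are characters at the start of the line. Maybe sed 's/.*\t.*//' would work better, but the one from etan Reisers in the question comments is the best way to do it with sed.

sed -e '/\t/d' file would be better.

It would be better if you added a comment explaining what this means.

No, that would be worse. The command could not be simpler, and anyone who does not know what it means can find out in about 30 seconds by looking at the grep man page; that way they get the answer and also learn how to find answers in the future.

What's with the mawk? Use the command from my comment on the question instead :)

My solution was broken. I had already tested yours and was updating my test section when I saw your comment... Years ago I was led to believe that mawk is faster than gawk... is that still true after all these years?

Well, testing Jidder's solution with gawk gave me 9.3 s.

Why do you use grep's -F flag in this case? Does it improve execution speed, or something else?

@EdMorton - I had some time while waiting for other data to load, so I made a ~1.7 mil line file to test with, and it turns out (on an unburdened guest VM) that my command is 17-20% slower than Jiddler's. Using awk / nawk / gawk is about 4x worse :)

The default is magic. Good to know.

Why use a Perl regex for a tab?

grep cannot search for a tab without a Perl regex: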
sed '/\t/d' 'my_file'