Python: How to split CSV content by punctuation marks
I have a CSV file with two columns: one holds a person's name and the other holds the words associated with that person. The problem is that many cells in the word column contain several words separated by punctuation marks. I need to split these words so that each row holds one name and exactly one word. For example:
name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine"
Sergi,"runner . athletics"
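This is a sample of the CSV data. As you can see, the words are separated by different punctuation marks (there are a few more than shown here), sometimes with spaces around the punctuation and sometimes without. The result I want is: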
name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Sergi,runner
Sergi,athletics
I have opened the file with pandas and loaded the data into a DataFrame, which is where I need to split the values. This is what I tried:
def splitter(df):
    df['word'] = df['word'].str.split(",")
    df = df.explode("word")
    df['word'] = df['word'].str.split(", ")
    df = df.explode("word")
    df['word'] = df['word'].str.split(" , ")
    df = df.explode("word")
    df['word'] = df['word'].str.split("- ")
    df = df.explode("word")
    df['word'] = df['word'].str.split(" -")
    df = df.explode("word")
    df['word'] = df['word'].str.split(r"\. ")
    df = df.explode("word")
    df['word'] = df['word'].str.split(";")
    df = df.explode("word")
    df['word'] = df['word'].str.split("; ")
    df = df.explode("word")
    df['word'] = df['word'].str.split(" ;")
    df = df.explode("word")
    df['word'] = df['word'].str.split(" ; ")
    df = df.explode("word")
    df['word'] = df['word'].str.split("/ ")
    df = df.explode("word")
    return df
The result is almost what I want, but some words keep a leading space that should not be there:
name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom, striker
Tom, ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara, nba
Sara, ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian, medicine
Sergi,runner
Sergi, athletics
I don't know how to change the code so that everything works. How can I solve this problem and improve the code I wrote?
df['word'] = df['word'].str.strip()
This should remove all spaces, tabs, and newlines from both ends of the text.
BTW: you can probably even use split(";") alone, without split("; "), split(" ;"), etc., because strip() will remove those spaces.
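As a minimal sketch of that point, written in plain Python rather than pandas (the sample string is made up for illustration): splitting on the bare delimiter and then stripping each piece handles every spacing variant at once.

```python
# Hypothetical cell that mixes ";", "; ", " ;" and " ; " as separators.
cell = "basketball; nba ;ball ; hoop"

# Split on the bare delimiter only, then strip whitespace from each piece.
pieces = [part.strip() for part in cell.split(";")]
print(pieces)
```

In pandas the same two steps would be the str.split(";") and explode() calls followed by str.strip(), as in the question's code.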
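If you want to use variants such as split(";"), split("; "), split(" ;"), and split(" ; "), then you should start with the longest one, split(" ; "), then use the shorter split("; ") and split(" ;"), and finally the shortest, split(";"). That way the spaces may be removed as well.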
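The effect of the ordering can be sketched with a single alternation pattern and re.split. This is only an illustration under the assumption that the variants are combined into one regex; the sample string is made up.

```python
import re

# Hypothetical cell with one spaced and one unspaced semicolon.
cell = "mountain ; wind;sun"

# Longest alternative first: " ; " is consumed as one delimiter.
longest_first = re.split(r" ; |; | ;|;", cell)

# Shorter alternatives first: " ;" wins before " ; " is ever tried,
# so the space after the semicolon stays attached to the next word.
shortest_first = re.split(r";| ;|; | ; ", cell)

print(longest_first)
print(shortest_first)
```

The regex engine tries the alternatives from left to right at each position, which is why the longer variants have to come first.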
You could even try a single split('[;,-./]') instead of all the separate split() calls. Inside the pattern you can also use | as OR.
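One caveat with a character class like this, shown as a small sketch with plain re.split: inside [...], the sequence ,-. is read as a range from "," to ".", which covers the comma, the hyphen, and the period. That splits words such as web3.0 and e-learning too. Listing only the safe characters and spelling out the spaced variants with | avoids this.

```python
import re

cell = "informatic/web3.0/e-learning"

# ",-." inside the brackets is a range covering ",", "-" and ".",
# so every hyphen and period becomes a delimiter.
too_greedy = re.split(r"[;,-./]", cell)

# Only ";", "," and "/" split unconditionally; "." and "-" split
# only when a space is next to them.
careful = re.split(r"[;,/]|\. |- | -", cell)

print(too_greedy)
print(careful)
```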
EDIT:
A minimal working example with the data directly in the code, so everyone can test it:
import pandas as pd
import io
text = '''name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine"
Sergi,"runner . athletics"'''
# text to dataframe
df = pd.read_csv(io.StringIO(text))
df['word'] = df['word'].str.split(r'[;,/]|\. |- | -')
df = df.explode('word')
df['word'] = df['word'].str.strip()
# dataframe to text
output = io.StringIO()
df.to_csv(output, index=False)
output.seek(0)
text = output.read()
print(text)
Result:
name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Sergi,runner
Sergi,athletics
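EDIT:
The same without strip().
I use " ?" to match an optional space after the characters ; , / and before the period.
I also put ' - ' before '- ' and ' -' so that the longest version is tried first.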
df['word'] = df['word'].str.split(r'[;,/] ?| ?\. | - |- | -')
df = df.explode('word')
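To check what this pattern does to individual cells, here is a small sketch with plain re.split. The helper name split_words and the last test string are assumptions made for illustration.

```python
import re

# Same pattern as in the edit above; nothing here is pandas-specific.
PATTERN = r"[;,/] ?| ?\. | - |- | -"

def split_words(cell):
    """Split one cell on the delimiters, with no strip() afterwards."""
    return re.split(PATTERN, cell)

for cell in ["football, striker, ball",
             "doctor - medicine",
             "runner . athletics",
             "informatic/web3.0/e-learning"]:
    print(split_words(cell))

# A space *before* ";" is not covered by "[;,/] ?", so this
# hypothetical cell keeps a trailing space on the first word.
print(split_words("basketball ; nba"))
```

If the real data may contain spaces before the delimiters, keeping the final str.strip() is a cheap safety net.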
EDIT:
An example that uses replacement to keep (data, science) as one string, without splitting it:
import pandas as pd
import io
text = '''name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball; (date1, time1)"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine - (date2, time2) - date3, time3"
Sergi,"runner . athletics"'''
# text to dataframe
df = pd.read_csv(io.StringIO(text))
# Find all `(...)`
found = df['word'].str.findall(r'\(.*?\)')
print(found)
# Flatten it
found = sum(found, [])
print(found)
# Create dict to put pattern in place of `(...)`.
# Because later I will use `regex=True` so I have to use `\(...\)` instead of `(...)`
patterns = {rf'\({value[1:-1]}\)': f'XXX{i}' for i, value in enumerate(found)}
print(patterns)
df['word'] = df['word'].replace(patterns, regex=True)
# --- normal splitting ---
df['word'] = df['word'].str.split(r'[;,/]|\. |- | -')
df = df.explode('word')
df['word'] = df['word'].str.strip()
# Create dict to put later `(...)` in place of pattern.
patterns_back = {f'XXX{i}':value for i, value in enumerate(found)}
print(patterns_back)
df['word'] = df['word'].replace(patterns_back, regex=True)
# dataframe to text
output = io.StringIO()
df.to_csv(output, index=False)
output.seek(0)
text = output.read()
print(text)
Result:
0 []
1 []
2 []
3 [(date1, time1)]
4 []
5 [(date2, time2)]
6 []
Name: word, dtype: object
['(date1, time1)', '(date2, time2)']
{'\\(date1, time1\\)': 'XXX0', '\\(date2, time2\\)': 'XXX1'}
{'XXX0': '(date1, time1)', 'XXX1': '(date2, time2)'}
name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Sara,"(date1, time1)"
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Christian,"(date2, time2)"
Christian,date3
Christian,time3
Sergi,runner
Sergi,athletics
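A different technique, not the one used in the answer above, is to make the split pattern itself skip delimiters that sit inside parentheses by adding a negative lookahead. The sketch below assumes the parentheses are balanced and never nested; the function name is made up.

```python
import re

# A delimiter counts only if it is NOT followed by ")" with no "(" or ")"
# in between, i.e. only if it is not inside a "(...)" group.
# Assumption: parentheses are balanced and not nested.
OUTSIDE_PARENS = r"(?:[;,/]|\. |- | -)(?![^()]*\))"

def split_outside_parens(cell):
    """Split on delimiters outside parentheses and strip each piece."""
    return [part.strip() for part in re.split(OUTSIDE_PARENS, cell)]

print(split_outside_parens("basketball; nba; ball; (date1, time1)"))
print(split_outside_parens("doctor - medicine - (date2, time2) - date3, time3"))
```

This avoids the placeholder round trip, at the cost of a pattern that is harder to read.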
I don't know much about pandas, but the code below may help you:
import re
# [name,word]
data = [["Oliver", "water,surf,windsurf"],
        ["Tom", "football, striker, ball"],
        ["Anna", "mountain;wind;sun"],
        ["Sara", "basketball; nba; ball"],
        ["Mark", "informatic/web3.0/e-learning"],
        ["Christian", "doctor - medicine"],
        ["Sergi", "runner . athletics"]]
result = []
for item in data:
    # The period is escaped so that it matches a literal "." only
    words = re.split(r'\s*;\s*|\s*,\s*|/|\s+-\s+|\s+\.\s+', item[1])
    result.extend([(item[0], w) for w in words])
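You can split the words with the re module. The result is then a list of (name, word) tuples.
Do you know strip()?
I tried it together with a regex, but it did not work. That is why I tried split() instead.
Did you try strip() after all of that code? For example, df['word'].str.strip(). You could even try df.explode("word")['word'].str.strip().
Thank you very much, this explanation helped me a lot! I have one problem I cannot solve. Some words are in parentheses, such as (data, science). The problem is that these get split as well, which I do not want. I tried to handle it with some conditions, but without success. How can I solve this?
I found similar questions, but they use very complicated regular expressions and split only on spaces, so they would need a lot of modification. Another idea: first find all the parenthesized substrings and remember them in a dictionary under unique keys, i.e. XXX1 for the first substring, XXX2 for the second, and so on, then put these keys into the original text in place of the substrings. Next use the normal split(). Finally, put the original substrings back in place of XXX1, XXX2, etc.
I added an example that uses replacement with XXX1, XXX2.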
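The steps described in the last comments can be sketched for a single cell in plain Python. The function name, the zero-padded key format, and the assumption that the text "XXX0000" never occurs in the real data are all mine, not part of the thread.

```python
import re

DELIMITERS = r"[;,/]|\. |- | -"

def split_keeping_parentheses(cell):
    """Split a cell on the delimiters but keep every "(...)" group whole."""
    saved = {}

    def stash(match):
        # Zero-padded keys, so one key can never be a prefix of another.
        key = f"XXX{len(saved):04d}"
        saved[key] = match.group(0)
        return key

    # 1. Replace every "(...)" substring with a unique key.
    protected = re.sub(r"\(.*?\)", stash, cell)
    # 2. Ordinary split and strip.
    words = [w.strip() for w in re.split(DELIMITERS, protected)]
    # 3. Put the original substrings back in place of the keys.
    restored = []
    for word in words:
        for key, original in saved.items():
            word = word.replace(key, original)
        restored.append(word)
    return restored

print(split_keeping_parentheses("doctor - medicine - (date2, time2) - date3, time3"))
```

The restore step uses plain str.replace, so the saved substrings never need regex escaping.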