Python: improving DataFrame (isin) processing performance
My Python script runs into an OOM (out of memory) error when processing large files. (It works fine for a small set of records, ~10K.) I am working with 2 files:

- companys.csv (file size 19 MB), ~43K records
- competitor_companies.csv (file size 427 MB), ~4.5M records

In file 1 I have a field named uuid. I need to compare the two files:

- If the company name is the same in file 1 and file 2, copy the uuid field from file 1 into the competitor companies dataframe.
- If the website is the same in file 1 and file 2, copy the uuid field from file 1 into the competitor dataframe.

When I run the script on a server (~30 GB of RAM), it reaches this line:
logging.info('Matching TLD.')
match_tld = competitor_companies.tld.isin(companies.tld)
The script then stops, and I see this line in /var/log/syslog:
Out of memory: Kill process 177106 (company_generat) score 923 or sacrifice child
The Python code:
def MatchCompanies(
    companies: pandas.DataFrame,
    competitor_companies: pandas.DataFrame) -> Optional[Sequence[str]]:
  """Find competitor companies in companies dataframe and generate a new list.

  Args:
    companies: A dataframe with company information from CSV file.
    competitor_companies: A dataframe with competitor information from CSV file.

  Returns:
    A sequence of matched companies and their UUID.

  Raises:
    ValueError: No companies found.
  """
  if _IsEmpty(companies):
    raise ValueError('No companies found')
  # Clean up empty fields.
  companies = companies.fillna('')
  logging.info('Found: %d records.', len(competitor_companies))
  competitor_companies = competitor_companies.fillna('')
  # Create a column to define if we found a match or not.
  competitor_companies['match'] = False
  # Add Top Level Domain (tld) column to compare matching companies.
  companies.rename(columns={'website': 'tld'}, inplace=True)
  logging.info('Cleaning up company name.')
  companies.company_name = companies.company_name.apply(_NormalizeText)
  competitor_companies.company_name = competitor_companies.company_name.apply(
      _NormalizeText)
  # Create a new column since AppAnnie already contains TLD in company_url.
  competitor_companies.rename(columns={'company_url': 'tld'}, inplace=True)
  logging.info('Matching TLD.')
  match_tld = competitor_companies.tld.isin(companies.tld)
  logging.info('Matching Company Name.')
  match_company_name = competitor_companies.company_name.isin(
      companies.company_name)
  # Update match column if TLD or company_name matches.
  competitor_companies['match'] = match_tld | match_company_name
  # Extract UUID for TLD matches.
  logging.info('Extracting UUID')
  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')
  # Extract UUID for company name matches.
  merge_company_name = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combine dataframes.
  competitor_companies['uuid'] = merge_tld['uuid'].combine_first(
      merge_company_name['uuid'])
  match_companies = len(competitor_companies[competitor_companies['match']])
  total_companies = len(competitor_companies)
  logging.info('Results found: %d out of %d', match_companies, total_companies)
  competitor_companies.drop('match', axis=1, inplace=True)
  competitor_companies.rename(columns={'tld': 'company_url'}, inplace=True)
  return competitor_companies
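As a point of comparison (this is not from the original post), the two-merge UUID lookup above can also be expressed with `Series.map` against dictionaries built from the smaller frame, which avoids materializing two ~4.5M-row merge results at once. The tiny frames below are illustrative; the column names mirror the ones used in the function:

```python
import pandas as pd

# Small illustrative frames; real data would come from the two CSV files.
companies = pd.DataFrame({
    'uuid': ['u1', 'u2'],
    'company_name': ['acme', 'globex'],
    'tld': ['acme.com', 'globex.com'],  # the 'website' column after the rename
})
competitor_companies = pd.DataFrame({
    'company_name': ['acme', 'initech', 'globex'],
    'tld': ['other.com', 'initech.com', 'globex.com'],
})

# Build plain dicts once from the 43K-row frame; .map then does one hash
# lookup per competitor row instead of a full merge.
by_tld = dict(zip(companies['tld'], companies['uuid']))
by_name = dict(zip(companies['company_name'], companies['uuid']))

uuid_from_tld = competitor_companies['tld'].map(by_tld)
uuid_from_name = competitor_companies['company_name'].map(by_name)
# Prefer the TLD match, fall back to the company-name match.
competitor_companies['uuid'] = uuid_from_tld.combine_first(uuid_from_name)
```

Rows with no match in either dictionary end up with a NaN uuid, matching the left-merge behavior of the original code.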
Here is how I read the files:
def LoadDataSet(filename: str) -> pandas.DataFrame:
  """Reads CSV file where company information is stored.

  Header information exists in CSV file.

  Args:
    filename: Source CSV file. Header is present in file.

  Returns:
    A pandas dataframe with company information.

  Raises:
    FileError: Unable to read filename.
  """
  with open(filename) as input_file:
    data = input_file.read()
  # The file is opened in text mode, so wrap the str in StringIO
  # (BytesIO would raise TypeError on a str).
  dataframe = pandas.read_csv(
      io.StringIO(data), header=0, low_memory=False, memory_map=True)
  return dataframe.where(pandas.notnull(dataframe), None)
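One likely contributor to the memory spike: the loader above slurps the whole file into a Python string and then hands a second in-memory copy to pandas. A leaner variant (a sketch; the `usecols` list and `dtype` are assumptions about the file layout, not taken from the post) lets pandas read the file directly and loads only the columns the matching step needs:

```python
import pandas

def LoadDataSet(filename: str) -> pandas.DataFrame:
  """Reads a company CSV directly with pandas, keeping only needed columns."""
  # read_csv accepts a path, so there is no need to read the file into a
  # string first; usecols drops columns the matching logic never touches.
  return pandas.read_csv(
      filename,
      header=0,
      usecols=['uuid', 'company_name', 'website'],
      dtype=str)
```

For the 427 MB competitor file, `usecols` would list that file's columns instead (e.g. `company_name` and `company_url`).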
Looking for suggestions on how to improve my code.
Output of the top command while the script is running:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
190875 myuser 20 0 4000944 2.5g 107532 R 100.7 8.5 5:01.93 company_generat
Why not just use pd.merge?

You can create two dataframes, one for company-name matches and a second for website matches, then left-merge the competitor dataframe with each of them:
# Create 2 matching tables
c_website = companies[['uuid', 'website']].rename(columns={'uuid': 'uuid_from_website'})
c_name = companies[['uuid', 'company_name']].rename(columns={'uuid': 'uuid_from_name'})
# Merge on each of these tables
result = competitor_companies\
.merge(c_website, how='left', on='website')\
.merge(c_name, how='left', on='company_name')
Then you need to reconcile the two values, for example by prioritizing uuid_from_name over uuid_from_website:
result['uuid'] = np.where(result.uuid_from_name.notnull(), result.uuid_from_name, result.uuid_from_website)
del result['uuid_from_name']
del result['uuid_from_website']
It should be much faster than using pd.Series.isin.
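The answer's suggestion, assembled into a runnable sketch (the tiny frames are illustrative stand-ins for the two CSV files; in the original code the website column is named `company_url` on the competitor side, so one of the frames would need a rename first):

```python
import numpy as np
import pandas as pd

companies = pd.DataFrame({
    'uuid': ['u1', 'u2'],
    'company_name': ['acme', 'globex'],
    'website': ['acme.com', 'globex.com'],
})
competitor_companies = pd.DataFrame({
    'company_name': ['acme', 'initech'],
    'website': ['other.com', 'initech.com'],
})

# Two small lookup tables, one per matching key.
c_website = companies[['uuid', 'website']].rename(columns={'uuid': 'uuid_from_website'})
c_name = companies[['uuid', 'company_name']].rename(columns={'uuid': 'uuid_from_name'})

# Left-merge the competitor frame against each lookup table.
result = (competitor_companies
          .merge(c_website, how='left', on='website')
          .merge(c_name, how='left', on='company_name'))

# Reconcile: prefer the name match, fall back to the website match.
result['uuid'] = np.where(result.uuid_from_name.notnull(),
                          result.uuid_from_name, result.uuid_from_website)
result = result.drop(columns=['uuid_from_name', 'uuid_from_website'])
```

Rows with no match on either key keep a NaN uuid, just as in the original left merges.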
Comment: Have you tried profiling the application to see if there is a memory leak? Reply: I haven't yet, but it looks like the pandas isin function is slow and takes up most of the memory.
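On the profiling question, a quick first step (a sketch, not from the thread) is pandas' own `memory_usage`, which reports per-column bytes. With `deep=True` it also counts the Python string objects, which is where object-dtype columns like company names really spend their memory:

```python
import pandas as pd

# Illustrative frame; the real one would be loaded from the 427 MB CSV.
df = pd.DataFrame({
    'company_name': ['acme corporation'] * 1000,
    'website': ['acme.com'] * 1000,
})

# Shallow numbers only count the pointer arrays; deep=True follows the
# per-row Python string objects as well.
shallow = df.memory_usage().sum()
deep = df.memory_usage(deep=True).sum()
print(f'shallow={shallow} bytes, deep={deep} bytes')
```

Comparing the deep totals before and after each step of the matching function would show which operation actually drives the 30 GB spike; converting repetitive string columns to the `category` dtype is one common follow-up once the heavy columns are identified.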