Python: How do I handle big data in this complex scenario? Recursive CTE and pandas are not working
My scenario:
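- User A is a fraudster.
- User B is not a fraudster. However, the system will not allow user B to perform any action, because B uses the same phone number as A (an attribute shared with a fraudulent user). (1 layer)
- User D is not a fraudster. However, D uses the same device ID as B, and B shares an attribute with a fraudulent user, so user D is blocked as well. In this case there are two layers: D is compared with B, and B is compared with A.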
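The layered blocking rule above can be sketched in plain Python as a breadth-first search over shared attribute values. This is only an illustrative sketch, not the method used in the question or the answer: the function name `blocked_ids`, the sample rows, and the choice of the four linking columns are assumptions (the column names follow the `tableuser` columns used in the queries).

```python
from collections import defaultdict, deque

# Columns that link two users when they hold the same value (assumed).
ATTRS = ("Email", "MobileNo", "DeviceId", "IPAddress")

def blocked_ids(users):
    """Return the IDs of all fraudsters plus everyone linked to them,
    directly or through any number of layers, by a shared attribute."""
    by_id = {u["ID"]: u for u in users}
    # Index: (column, value) -> IDs of the users holding that value.
    index = defaultdict(list)
    for u in users:
        for a in ATTRS:
            if u.get(a) is not None:
                index[(a, u[a])].append(u["ID"])

    seen = {u["ID"] for u in users if u["IsFraudsterStatus"] == 1}
    queue = deque(seen)
    while queue:
        cur = by_id[queue.popleft()]
        for a in ATTRS:
            if cur.get(a) is None:
                continue
            for other in index[(a, cur[a])]:
                if other not in seen:  # each user is enqueued at most once
                    seen.add(other)
                    queue.append(other)
    return seen

# Hypothetical sample data mirroring users A, B and D from the scenario.
users = [
    {"ID": "A", "Email": "a@x.test", "MobileNo": "111", "DeviceId": "d1",
     "IPAddress": "10.0.0.1", "IsFraudsterStatus": 1},
    {"ID": "B", "Email": "b@x.test", "MobileNo": "111", "DeviceId": "d2",
     "IPAddress": "10.0.0.2", "IsFraudsterStatus": 0},
    {"ID": "C", "Email": "c@x.test", "MobileNo": "333", "DeviceId": "d3",
     "IPAddress": "10.0.0.3", "IsFraudsterStatus": 0},
    {"ID": "D", "Email": "d@x.test", "MobileNo": "444", "DeviceId": "d2",
     "IPAddress": "10.0.0.4", "IsFraudsterStatus": 0},
]
result = blocked_ids(users)
```

Here B is reached through A's phone number and D through B's device ID, while C shares nothing and stays unblocked. Because every user enters the queue at most once, the loop terminates without needing a tracking list.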
Recursive CTE (fails once the data grows to 1000 rows)
Code:
with recursive cte as (
    select ID, Email, MobileNo, DeviceId, IPAddress, id as tracking
    from tableuser
    where isfraudsterstatus = 1
    union all
    select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress, concat_ws(',', cte.tracking, u.id)
    from cte join
         tableuser u
         on u.email = cte.email or
            u.mobileno = cte.mobileno or
            u.deviceid = cte.deviceid or
            u.ipaddress = cte.ipaddress
    where find_in_set(u.id, cte.tracking) = 0
)
select *
from cte;
Error:
Using pandas (fails when the data grows from 1000 rows to 500000 rows)
Code:
import mysql.connector
from mysql.connector import Error
import pandas as pd

#DATABASE CONNECTION
##
try:
    connection = mysql.connector.connect(host='localhost',
                                         database='database',
                                         user='root',
                                         password='')
    cursor = connection.cursor()
    #Create Dataframe (temporary data)
    #df = pd.read_sql("select * from MOCK_DATA",con=connection)
    df = pd.read_sql("select * from tableuser",con=connection)
    ##
    def expand_fraud(no_fraud, fraud, col_name):
        t = pd.merge(no_fraud, fraud, on=col_name)
        if len(t):
            df.loc[df.ID.isin(t.ID_x), "IsFraudsterStatus"] = 1
            return True
        return False
    while True:
        added_fraud = False
        fraud = df[df.IsFraudsterStatus == 1]
        no_fraud = df[df.IsFraudsterStatus == 0]
        added_fraud |= expand_fraud(no_fraud, fraud, "DeviceId")
        added_fraud |= expand_fraud(no_fraud, fraud, "Email")
        added_fraud |= expand_fraud(no_fraud, fraud, "MobileNo")
        if not added_fraud:
            break
    print(df)
    Id_list = df.values.tolist()
except Error as e:
    print("Error reading data from MySQL table", e)
finally:
    if (connection.is_connected()):
        connection.close()
        cursor.close()
        print("MySQL connection is closed")
Error:
How can I deal with this?
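Is there another way to do this?
The problem here (for the MySQL part) appears to be your stop condition. You keep track of a list of ids to prevent infinite loops (e.g. A,B,C,D). Unfortunately, that column takes the data type of id, probably something like varchar(10), which effectively means the tracking list has a limited length.
If you reach that depth, you will get an error message: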
Error Code: 1406. Data too long for column 'tracking' at row 1
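Also unfortunately, you may have suppressed that error by disabling strict SQL mode, a common way of working around certain problems instead of fixing the code, with the side effect that you may end up with invalid data.
In your case, this causes the tracking value to stop tracking (without throwing an error). With varchar(10), for example, the list may end up as A,B,C,D,E, and F can no longer be appended to it, so F keeps being added to the result set, which results in an infinite loop.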
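The truncation effect described above can be reproduced with a small Python simulation. It is a sketch under stated assumptions: the helper names `append_id` and `find_in_set` are hypothetical stand-ins for `concat_ws(',', ...)` stored in a 10-character column and for MySQL's `find_in_set`, and the silent cut-off models what happens when the "Data too long" error is suppressed.

```python
def append_id(tracking: str, new_id: str, max_len: int = 10) -> str:
    """Append new_id to the comma-separated list, then silently cut the
    result to max_len characters (assumed non-strict column behaviour)."""
    return f"{tracking},{new_id}"[:max_len]

def find_in_set(needle: str, csv: str) -> bool:
    """True if needle is one of the comma-separated items in csv."""
    return needle in csv.split(",")

tracking = "A"
for user_id in ["B", "C", "D", "E"]:
    tracking = append_id(tracking, user_id)
# The list still fits: it is 9 characters long at this point.

after_f = append_id(tracking, "F")          # "F" falls off the end
f_is_tracked = find_in_set("F", after_f)    # the stop condition never sees F
```

After the last call the stored value is `A,B,C,D,E,`, so the check `find_in_set(u.id, cte.tracking) = 0` keeps passing for F and the recursion keeps adding it.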
MySQL actually has a safeguard against infinite loops, so you will probably get
Error Code: 3636. Recursive query aborted after 1001 iterations.
Try increasing @@cte_max_recursion_depth to a larger value.
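But it only protects you in specific situations: if every iteration adds multiple rows, and each of those adds multiple rows again in the next iteration, you will run into a resource limit (or a timeout) long before the result set reaches 2^1000 rows.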
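To get a feeling for that growth, here is a rough back-of-the-envelope sketch. It assumes the simplest case in which every row produces exactly two new rows per iteration; the function name `iterations_until` and the row limit of one billion are made up for illustration.

```python
def iterations_until(limit: int, branching: int = 2) -> int:
    """Number of iterations until the row count first exceeds limit,
    starting from one row and multiplying by branching each iteration."""
    rows, iterations = 1, 0
    while rows <= limit:
        rows *= branching
        iterations += 1
    return iterations

# Iterations needed to pass one billion rows when the row count doubles.
steps_to_a_billion = iterations_until(10**9)

# Number of decimal digits in 2^1000, the row count after 1000 doublings.
digits_of_2_pow_1000 = len(str(2**1000))
```

With doubling, a billion rows is passed after 30 iterations, while 2^1000 has 302 digits, so the limit of 1001 iterations is never the first thing to stop such a query.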
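How to fix it
If you don't actually need the information from the tracker (and since your pandas code doesn't produce it, it looks like you added it only to prevent loops), you can use union distinct and let MySQL handle the duplicates: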
with recursive cte as (
    select ID, Email, MobileNo, DeviceId, IPAddress
    from tableuser
    where isfraudsterstatus = 1
    union distinct -- distinct!
    select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress
    from cte join tableuser u
         on u.email = cte.email or u.mobileno = cte.mobileno
            or u.deviceid = cte.deviceid or u.ipaddress = cte.ipaddress
)
select * from cte;
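If you like, you can also extend this to keep track of the "original fraudster". If a chain contains more than one fraudster (e.g. A and B are both flagged as fraudsters and A has the same MobileNo as B), this can produce duplicates, but you can remove those again with a group by: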
with recursive cte as (
    select ID, Email, MobileNo, DeviceId, IPAddress, id as original_fraudster
    from tableuser
    where isfraudsterstatus = 1
    union distinct
    select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress,
           cte.original_fraudster
    from cte join tableuser u
         on u.email = cte.email or u.mobileno = cte.mobileno
            or u.deviceid = cte.deviceid or u.ipaddress = cte.ipaddress
)
select ID, Email, MobileNo, DeviceId, IPAddress,
       min(original_fraudster) as original_fraudster
from cte
group by ID, Email, MobileNo, DeviceId, IPAddress;
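The same idea can be mirrored in Python to see why it terminates: rows are (user, original fraudster) pairs kept in a set, so a pair that was already produced is never added again, and the loop ends once an iteration yields nothing new. This is a minimal sketch for illustration only; `min_origin`, `linked` and the sample rows are assumptions, and the pairwise comparison is quadratic, so it is not meant for 500000 rows.

```python
ATTRS = ("Email", "MobileNo", "DeviceId", "IPAddress")

def linked(u, v):
    """True if the two users share a value in at least one attribute."""
    return any(u[a] == v[a] for a in ATTRS)

def min_origin(users):
    """Map each blocked user ID to the smallest original fraudster ID."""
    by_id = {u["ID"]: u for u in users}
    # Seed rows: every fraudster is its own original fraudster.
    rows = {(u["ID"], u["ID"]) for u in users if u["IsFraudsterStatus"] == 1}
    frontier = set(rows)
    while frontier:
        # Set difference plays the role of "union distinct".
        new = {(v["ID"], origin)
               for uid, origin in frontier
               for v in users if linked(by_id[uid], v)} - rows
        rows |= new
        frontier = new
    # Equivalent of: min(original_fraudster) ... group by ID.
    result = {}
    for uid, origin in rows:
        result[uid] = min(origin, result.get(uid, origin))
    return result

# Hypothetical data: A and B are both fraudsters and share a MobileNo,
# D shares a DeviceId with B, C shares nothing.
users = [
    {"ID": "A", "Email": "a@x.test", "MobileNo": "111", "DeviceId": "d1",
     "IPAddress": "10.0.0.1", "IsFraudsterStatus": 1},
    {"ID": "B", "Email": "b@x.test", "MobileNo": "111", "DeviceId": "d2",
     "IPAddress": "10.0.0.2", "IsFraudsterStatus": 1},
    {"ID": "C", "Email": "c@x.test", "MobileNo": "333", "DeviceId": "d3",
     "IPAddress": "10.0.0.3", "IsFraudsterStatus": 0},
    {"ID": "D", "Email": "d@x.test", "MobileNo": "444", "DeviceId": "d2",
     "IPAddress": "10.0.0.4", "IsFraudsterStatus": 0},
]
origins = min_origin(users)
```

In this sample the intermediate set contains both (D, A) and (D, B), which is the duplication the group by removes; the final mapping keeps only the smallest origin per user.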
Technically, you could also avoid the original problem (the limited length of the id column) by explicitly defining your own length, e.g.
with recursive cte as (
select ID, Email, MobileNo, DeviceId, IPAddress,
cast(id as char(1000)) as tracking
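However, this just moves the problem to some point in the future when that length may not be enough either, but you can judge whether that is a potential concern for you.
Your pandas code and your MySQL code have different problems (and solutions); they are basically only related because they try to solve the same problem (and both don't work), so you should split them into two separate questions. Also, 500k rows isn't "bigdata", so you may want to remove that tag.
@Solarflare Noted. I will split this into another question, and the bigdata tag has been removed. Thanks for the correction.
Thank you very much, I get it now. I will test it and let you know whether it works.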