How do I handle big data in this complex scenario? Recursive CTE and pandas are not working (python, mysql, pandas)


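My scenario:

  • User A is a fraudster.
  • User B is not a fraudster. However, the system will not allow user B to perform any action, because B uses the same phone number as A (an attribute shared with a fraudulent user). (1 layer)
  • User D is not a fraudster. But D uses the same device ID as B, and B shares an attribute with a fraudulent user, so user D is blocked as well. In this case there are 2 layers: D is compared with B, and B is compared with A.

Recursive CTE (fails when the data grows to 1,000 rows)

Code: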

with recursive cte as (
      select ID, Email, MobileNo, DeviceId, IPAddress, id as tracking
      from tableuser
      where isfraudsterstatus = 1
      union all
      select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress , concat_ws(',', cte.tracking, u.id)
      from cte join
           tableuser u
           on u.email = cte.email or
              u.mobileno = cte.mobileno or
              u.deviceid = cte.deviceid or 
              u.ipaddress = cte.ipaddress
      where find_in_set(u.id, cte.tracking) = 0
     )
select *
from cte;
Error:


Using pandas (fails when the data grows from 1,000 to 500,000 rows)

Code:

import mysql.connector
from mysql.connector import Error
import pandas as pd
#DATABASE CONNECTION
##
try:
    connection = mysql.connector.connect(host='localhost',
                                         database='database',
                                         user='root',
                                         password='')
    cursor = connection.cursor()
    #Create Dataframe (temporary data)
    #df = pd.read_sql("select * from MOCK_DATA",con=connection)
    df = pd.read_sql("select * from tableuser",con=connection)
##
    def expand_fraud(no_fraud, fraud, col_name):
        t = pd.merge(no_fraud, fraud, on=col_name)
        if len(t):
            df.loc[df.ID.isin(t.ID_x), "IsFraudsterStatus"] = 1
            return True
        return False

    while True:
        added_fraud = False
        fraud = df[df.IsFraudsterStatus == 1]
        no_fraud = df[df.IsFraudsterStatus == 0]
        added_fraud |= expand_fraud(no_fraud, fraud, "DeviceId")
        added_fraud |= expand_fraud(no_fraud, fraud, "Email")
        added_fraud |= expand_fraud(no_fraud, fraud, "MobileNo")
        if not added_fraud:
            break
    print(df)

    Id_list = df.values.tolist()

except Error as e:
    print("Error reading data from MySQL table", e)
finally:
    if (connection.is_connected()):
        connection.close()
        cursor.close()
        print("MySQL connection is closed")

Error:

How can I deal with this? Is there another way to do it?

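The problem here (for the MySQL part) seems to be your stop condition. You keep a list of ids to prevent infinite loops (e.g. A,B,C,D). Unfortunately, that column takes the datatype of "id", probably varchar(10), which effectively means the tracking list has a limited length.

If you reach that depth, you will get an error message: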

Error Code: 1406. Data too long for column 'tracking' at row 1
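
Also unfortunately, you may have suppressed that error by disabling strict SQL mode. That is a common workaround for certain problems in place of fixing the code, but it has the side effect that you may get invalid data.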

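In your case, this causes the tracking value to stop tracking (without throwing an error). For example, with varchar(10) the list may end up as A,B,C,D,E, and F can no longer be added to it, so the query keeps adding F to the result set, which results in an infinite loop.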

MySQL actually has a safeguard against infinite loops, so you might get:

Error Code: 3636. Recursive query aborted after 1001 iterations. 
Try increasing @@cte_max_recursion_depth to a larger value.
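
However, it only protects you in specific situations: if every iteration adds multiple rows, and each of those rows adds multiple rows again in the next iteration, you will hit a resource limit (or a timeout) long before the result set reaches 2^1000 rows.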

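How to fix it

If you don't actually need the information from the tracker (and since your pandas code doesn't use it, you seem to have added it only to prevent loops), you can use union distinct and let MySQL handle the duplicates: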

with recursive cte as (
  select ID, Email, MobileNo, DeviceId, IPAddress
  from tableuser
  where isfraudsterstatus = 1
  union distinct  -- distinct!
  select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress
  from cte join tableuser u
  on u.email = cte.email or u.mobileno = cte.mobileno 
     or u.deviceid = cte.deviceid or u.ipaddress = cte.ipaddress
)
select * from cte;
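
The same "stop when nothing new is found" idea can also be sketched outside the database. The following is a minimal Python sketch, not the answer's method: it assumes the rows of tableuser are available as dictionaries with the column names used above, and `blocked_users` is a hypothetical helper name. It expands from the known fraudsters breadth-first and uses a visited set in place of the tracking string.

```python
from collections import defaultdict, deque

ATTRIBUTES = ("Email", "MobileNo", "DeviceId", "IPAddress")


def blocked_users(users):
    """Return the ids of all fraudsters plus every user linked to them,
    directly or indirectly, through a shared attribute value."""
    # Index: (attribute name, value) -> ids of the users that have this value.
    by_value = defaultdict(set)
    for user in users:
        for attr in ATTRIBUTES:
            if user.get(attr) is not None:  # NULL never matches, as in SQL
                by_value[(attr, user[attr])].add(user["ID"])
    by_id = {user["ID"]: user for user in users}

    seen = {user["ID"] for user in users if user["IsFraudsterStatus"] == 1}
    queue = deque(seen)
    while queue:
        current = by_id[queue.popleft()]
        for attr in ATTRIBUTES:
            value = current.get(attr)
            if value is None:
                continue
            # pop() consumes the bucket, so each value is scanned only once.
            for other in by_value.pop((attr, value), ()):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return seen


# Made-up sample rows matching the scenario in the question.
users = [
    {"ID": "A", "Email": "a@x.test", "MobileNo": "111", "DeviceId": "dev1",
     "IPAddress": "10.0.0.1", "IsFraudsterStatus": 1},
    {"ID": "B", "Email": "b@x.test", "MobileNo": "111", "DeviceId": "dev2",
     "IPAddress": "10.0.0.2", "IsFraudsterStatus": 0},
    {"ID": "D", "Email": "d@x.test", "MobileNo": "222", "DeviceId": "dev2",
     "IPAddress": "10.0.0.3", "IsFraudsterStatus": 0},
    {"ID": "E", "Email": "e@x.test", "MobileNo": "333", "DeviceId": "dev3",
     "IPAddress": "10.0.0.4", "IsFraudsterStatus": 0},
]
blocked = blocked_users(users)
print(sorted(blocked))
```

Each user and each attribute value is processed once, so the loop ends as soon as no new user is reachable, however deep the chain is.
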
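
If you like, you can also extend the query to track the "original fraudster". If a chain contains several fraudsters (e.g. A and B are both marked as fraudsters, and A has the same MobileNo as B), this can produce duplicates, but you can remove them again with a group by: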

with recursive cte as (
  select ID, Email, MobileNo, DeviceId, IPAddress, id as original_fraudster
  from tableuser
  where isfraudsterstatus = 1
  union distinct
  select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress,
     cte.original_fraudster
  from cte join tableuser u
  on u.email = cte.email or u.mobileno = cte.mobileno 
     or u.deviceid = cte.deviceid or u.ipaddress = cte.ipaddress
)
select ID, Email, MobileNo, DeviceId, IPAddress, 
   min(original_fraudster) as original_fraudster
from cte
group by ID, Email, MobileNo, DeviceId, IPAddress;
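
For comparison, here is a hedged Python sketch of the same "smallest original fraudster per blocked user" result. It is an illustration under assumptions, not a translation of the query: rows are assumed to be dictionaries with the column names used above, ids are assumed to be comparable, and `original_fraudsters` is a hypothetical helper name.

```python
from collections import defaultdict, deque


def original_fraudsters(users, attributes=("Email", "MobileNo", "DeviceId", "IPAddress")):
    """Map each blocked user id to the smallest fraudster id it is linked to."""
    by_value = defaultdict(set)
    for user in users:
        for attr in attributes:
            if user.get(attr) is not None:
                by_value[(attr, user[attr])].add(user["ID"])
    by_id = {user["ID"]: user for user in users}

    origin = {}
    # Visiting fraudsters in ascending order means the first one to reach a
    # user is also the smallest one, which corresponds to min(original_fraudster).
    for start in sorted(u["ID"] for u in users if u["IsFraudsterStatus"] == 1):
        if start in origin:
            continue  # already reached from a smaller fraudster id
        origin[start] = start
        queue = deque([start])
        while queue:
            current = by_id[queue.popleft()]
            for attr in attributes:
                value = current.get(attr)
                if value is None:
                    continue
                for other in by_value[(attr, value)]:
                    if other not in origin:
                        origin[other] = start
                        queue.append(other)
    return origin


# Made-up rows: 1 and 2 are fraudsters sharing a MobileNo, 3 shares a device with 2.
rows = [
    {"ID": 1, "Email": "a@x.test", "MobileNo": "111", "DeviceId": "dev1",
     "IPAddress": "10.0.0.1", "IsFraudsterStatus": 1},
    {"ID": 2, "Email": "b@x.test", "MobileNo": "111", "DeviceId": "dev2",
     "IPAddress": "10.0.0.2", "IsFraudsterStatus": 1},
    {"ID": 3, "Email": "c@x.test", "MobileNo": "222", "DeviceId": "dev2",
     "IPAddress": "10.0.0.3", "IsFraudsterStatus": 0},
    {"ID": 4, "Email": "d@x.test", "MobileNo": "333", "DeviceId": "dev3",
     "IPAddress": "10.0.0.4", "IsFraudsterStatus": 0},
]
origins = original_fraudsters(rows)
print(origins)
```
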
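
Technically, you can also avoid the original problem (i.e. the limited length of the "id" column) by explicitly defining a length yourself, e.g.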

with recursive cte as (
  select ID, Email, MobileNo, DeviceId, IPAddress, 
     cast(id as char(1000)) as tracking  

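This only shifts the problem to some point in the future, when that length might not be sufficient either, but you can judge whether that is a potential problem for you.

Your pandas code and your MySQL code have different problems (and solutions) and are basically only related because they try to solve the same problem (and neither works), so you should split them into two separate questions. Also, 500k rows is not "bigdata", so you may want to remove that tag.

@Solarflare Noted. I will split it into another question. The bigdata tag has been removed. Thanks for the correction.

Thank you very much, I get it now. I will test it and let you know whether it works.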