隐语SecretFlow实际场景MPC算法开发实践#

This tutorial is only available in Chinese.

推荐使用conda创建一个新环境 > conda create -n sf python=3.8

直接使用pip安装secretflow > pip install -U secretflow

基于secreflow:0.7.7b1版本

此代码示例主要是展示了如何基于secreflow以及SPU隐私计算设备完成一个实际的应用的开发,推荐先看前一个教程spu_basics熟悉基本的SPU概念。

任务介绍#

Vehicle Insurance Claim Fraud Detection

该数据集来源于kaggle,包含 - 车辆数据集-属性、模型、事故详细信息等 - 保单详细信息-保单类型、有效期等

目标是检测索赔申请是否欺诈: 字段FraudFound_P (0 or 1) 即为预测的target值,是一个典型的二分类场景

实验目标#

在本次实验中,我们将会利用一个开源数据集在隐语上完成隐私保护的逻辑斯特回归、神经网络模型和XGB模型。主要涉及到如下的几个流程: 1. 数据加载 2. 数据洞察 3. 数据预处理 4. 模型构建 5. 模型的训练与预测

前置工作#

Ray集群启动(多机部署)#

考虑多机部署的情况,在启动secretflow之前需要先将ray集群启动。在header节点和worker节点上各自执行下述的指令。 > P.S. 启动集群之后,可以执行ray status看一下集群是否正确启动完成

Header节点

RAY_DISABLE_REMOTE_CODE=true \
ray start --head --node-ip-address="head_ip" --port="head_port" --resources='{"alice": 20}' --include-dashboard=False

Worker节点

RAY_DISABLE_REMOTE_CODE=true \
ray start --address="head_ip:head_port" --resources='{"bob": 20}'
[1]:
# 如下是多机版初始化secretflow的代码,需要给出header节点的IP和PORT
# head_ip = "xxx"
# head_port = "xxx"
# sf.init(address=f'{head_ip}:{head_port}')

单机部署#

我们在此使用单机部署的方式做一个样例展示。 通过调用sf.init()我们实例化了一个ray集群,有5个节点,也就对应了5个物理设备。

[2]:
import secretflow as sf
sf.shutdown()
# Standalone Mode
sf.init(
    ['alice', 'bob', 'carol', 'davy', 'eric'],
    address='local'
)
2022-11-09 20:12:39.876005: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/rh-ruby25/root/usr/local/lib64:/opt/rh/rh-ruby25/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst

定义明文计算设备PYU#

我们在启动了上述5个节点之后,明确隐语中的逻辑设备。这里我们将alice、bob、carol三方作为数据的提供方,可以本地执行明文计算,也就是 PYU (PYthon runtime Unit) 设备。

[3]:
alice = sf.PYU('alice')
bob = sf.PYU('bob')
carol = sf.PYU('carol')

print(alice)
alice_

定义密文计算设备SPU (3PC)#

进一步,我们以SPU (Secure Processing Unit) 为例,选择3个物理节点组成基于MPC(下例为三方的ABY3协议)的隐私计算设备。

[4]:
import spu
from secretflow.utils.testing import unused_tcp_port

aby3_cluster_def = {
    'nodes': [
        {
            'party': 'alice',
            'id': 'local:0',
            'address': f'127.0.0.1:{unused_tcp_port()}',
        },
        {'party': 'bob', 'id': 'local:1', 'address': f'127.0.0.1:{unused_tcp_port()}'},
        {
            'party': 'carol',
            'id': 'local:2',
            'address': f'127.0.0.1:{unused_tcp_port()}',
        },
    ],
    'runtime_config': {
        'protocol': spu.spu_pb2.ABY3,
        'field': spu.spu_pb2.FM64,
        'enable_pphlo_profile': False,
        'enable_hal_profile': False,
        'enable_pphlo_trace': False,
        'enable_action_trace': False,
    },
}

my_spu = sf.SPU(aby3_cluster_def)

数据加载#

Load Data (Mock)#

在定义好隐语中的逻辑设备概念之后,我们演示一下如何进行数据的读入。这里使用一个mock的data load方法get_data_mock()做一个演示。

[5]:
def get_data_mock():
    return 2

x_plaintext = get_data_mock()
print(f"x_plaintext: {x_plaintext}")
x_plaintext: 2

指定PYU设备读取数据

[6]:
x_alice_pyu = alice(get_data_mock)()

print(f"Plaintext Python Object: {x_plaintext}, PYU object: {x_alice_pyu}")
print(f"Reveal PYU object: {sf.reveal(x_alice_pyu)}")
Plaintext Python Object: 2, PYU object: <secretflow.device.device.pyu.PYUObject object at 0x7f0931996fd0>
Reveal PYU object: 2

PYU->SPU 数据转换

[7]:
x_alice_spu = sf.to(my_spu, x_alice_pyu)
print(f"SPU object: {x_alice_spu}")

print(f"Reveal SPU object: {sf.reveal(x_alice_spu)}")
SPU object: <secretflow.device.device.spu.SPUObject object at 0x7f0a38d9b460>
Reveal SPU object: 2

Load Data (Distributed)#

我们下面考虑对一个实际应用场景的数据进行读取,也就是全集数据垂直分布在不同的参与方中。 > 出于演示的目的,我们这里将中心化的明文数据进行垂直分割的拆分,首先观察下此数据集的特征。

读入明文全集数据#

[28]:
import os

"""
Create dir to save dataset files
This will create a directory `data` to store the dataset file
"""
if not os.path.exists('data'):
    os.mkdir('data')

"""
The original data is from Kaggle: https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection.
We promise we only use the data for demo only.
"""
path = "https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/vehicle_nsurance_claim/fraud_oracle.csv"
if not os.path.exists('data/fraud_oracle.csv'):
    res = os.system('cd data && wget {}'.format(path))
    if res != 0:
        raise Exception('File: {} download fails!'.format(path))
else:
    print(f'File already downloaded.')
--2022-11-09 20:25:18--  https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/vehicle_nsurance_claim/fraud_oracle.csv
Resolving secretflow-data.oss-accelerate.aliyuncs.com (secretflow-data.oss-accelerate.aliyuncs.com)... 101.133.111.250
Connecting to secretflow-data.oss-accelerate.aliyuncs.com (secretflow-data.oss-accelerate.aliyuncs.com)|101.133.111.250|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3618564 (3.5M) [text/csv]
Saving to: ‘fraud_oracle.csv’

     0K .......... .......... .......... .......... ..........  1% 3.73M 1s
    50K .......... .......... .......... .......... ..........  2% 6.70M 1s
   100K .......... .......... .......... .......... ..........  4% 10.2M 1s
   150K .......... .......... .......... .......... ..........  5% 18.5M 0s
   200K .......... .......... .......... .......... ..........  7% 13.9M 0s
   250K .......... .......... .......... .......... ..........  8% 33.5M 0s
   300K .......... .......... .......... .......... ..........  9% 21.8M 0s
   350K .......... .......... .......... .......... .......... 11% 47.4M 0s
   400K .......... .......... .......... .......... .......... 12% 31.8M 0s
   450K .......... .......... .......... .......... .......... 14% 38.6M 0s
   500K .......... .......... .......... .......... .......... 15% 1.04M 0s
   550K .......... .......... .......... .......... .......... 16% 92.8M 0s
   600K .......... .......... .......... .......... .......... 18% 81.2M 0s
   650K .......... .......... .......... .......... .......... 19% 91.2M 0s
   700K .......... .......... .......... .......... .......... 21%  101M 0s
   750K .......... .......... .......... .......... .......... 22% 79.5M 0s
   800K .......... .......... .......... .......... .......... 24% 84.9M 0s
   850K .......... .......... .......... .......... .......... 25% 83.2M 0s
   900K .......... .......... .......... .......... .......... 26% 79.3M 0s
   950K .......... .......... .......... .......... .......... 28%  110M 0s
  1000K .......... .......... .......... .......... .......... 29% 1014K 0s
  1050K .......... .......... .......... .......... .......... 31%  106M 0s
  1100K .......... .......... .......... .......... .......... 32%  104M 0s
  1150K .......... .......... .......... .......... .......... 33%  108M 0s
  1200K .......... .......... .......... .......... .......... 35%  113M 0s
  1250K .......... .......... .......... .......... .......... 36%  109M 0s
  1300K .......... .......... .......... .......... .......... 38%  114M 0s
  1350K .......... .......... .......... .......... .......... 39%  112M 0s
  1400K .......... .......... .......... .......... .......... 41%  113M 0s
  1450K .......... .......... .......... .......... .......... 42%  111M 0s
  1500K .......... .......... .......... .......... .......... 43%  111M 0s
  1550K .......... .......... .......... .......... .......... 45%  115M 0s
  1600K .......... .......... .......... .......... .......... 46% 15.4M 0s
  1650K .......... .......... .......... .......... .......... 48%  108M 0s
  1700K .......... .......... .......... .......... .......... 49%  104M 0s
  1750K .......... .......... .......... .......... .......... 50%  110M 0s
  1800K .......... .......... .......... .......... .......... 52% 88.6M 0s
  1850K .......... .......... .......... .......... .......... 53%  103M 0s
  1900K .......... .......... .......... .......... .......... 55%  114M 0s
  1950K .......... .......... .......... .......... .......... 56%  110M 0s
  2000K .......... .......... .......... .......... .......... 58%  876K 0s
  2050K .......... .......... .......... .......... .......... 59% 94.3M 0s
  2100K .......... .......... .......... .......... .......... 60%  107M 0s
  2150K .......... .......... .......... .......... .......... 62%  105M 0s
  2200K .......... .......... .......... .......... .......... 63%  103M 0s
  2250K .......... .......... .......... .......... .......... 65%  105M 0s
  2300K .......... .......... .......... .......... .......... 66%  103M 0s
  2350K .......... .......... .......... .......... .......... 67%  108M 0s
  2400K .......... .......... .......... .......... .......... 69%  106M 0s
  2450K .......... .......... .......... .......... .......... 70%  106M 0s
  2500K .......... .......... .......... .......... .......... 72%  110M 0s
  2550K .......... .......... .......... .......... .......... 73%  108M 0s
  2600K .......... .......... .......... .......... .......... 74%  112M 0s
  2650K .......... .......... .......... .......... .......... 76%  108M 0s
  2700K .......... .......... .......... .......... .......... 77%  112M 0s
  2750K .......... .......... .......... .......... .......... 79%  110M 0s
  2800K .......... .......... .......... .......... .......... 80%  111M 0s
  2850K .......... .......... .......... .......... .......... 82% 98.3M 0s
  2900K .......... .......... .......... .......... .......... 83% 89.0M 0s
  2950K .......... .......... .......... .......... .......... 84%  115M 0s
  3000K .......... .......... .......... .......... .......... 86%  109M 0s
  3050K .......... .......... .......... .......... .......... 87% 32.2M 0s
  3100K .......... .......... .......... .......... .......... 89%  104M 0s
  3150K .......... .......... .......... .......... .......... 90%  108M 0s
  3200K .......... .......... .......... .......... .......... 91%  101M 0s
  3250K .......... .......... .......... .......... .......... 93%  110M 0s
  3300K .......... .......... .......... .......... .......... 94%  112M 0s
  3350K .......... .......... .......... .......... .......... 96%  113M 0s
  3400K .......... .......... .......... .......... .......... 97%  114M 0s
  3450K .......... .......... .......... .......... .......... 99%  114M 0s
  3500K .......... .......... .......... ...                  100%  109M=0.2s

2022-11-09 20:25:18 (15.5 MB/s) - ‘fraud_oracle.csv’ saved [3618564/3618564]

[9]:
from sklearn.model_selection import train_test_split
import pandas as pd

"""
This should point to the data downloaded from Kaggle.
By default, the .csv file shall be in the data directory
"""
full_data_path = 'data/fraud_oracle.csv'
df = pd.read_csv(full_data_path)
df.head()
[9]:
Month WeekOfMonth DayOfWeek Make AccidentArea DayOfWeekClaimed MonthClaimed WeekOfMonthClaimed Sex MaritalStatus ... AgeOfVehicle AgeOfPolicyHolder PoliceReportFiled WitnessPresent AgentType NumberOfSuppliments AddressChange_Claim NumberOfCars Year BasePolicy
0 Dec 5 Wednesday Honda Urban Tuesday Jan 1 Female Single ... 3 years 26 to 30 No No External none 1 year 3 to 4 1994 Liability
1 Jan 3 Wednesday Honda Urban Monday Jan 4 Male Single ... 6 years 31 to 35 Yes No External none no change 1 vehicle 1994 Collision
2 Oct 5 Friday Honda Urban Thursday Nov 2 Male Married ... 7 years 41 to 50 No No External none no change 1 vehicle 1994 Collision
3 Jun 2 Saturday Toyota Rural Friday Jul 1 Male Married ... more than 7 51 to 65 Yes No External more than 5 no change 1 vehicle 1994 Liability
4 Jan 5 Monday Honda Urban Tuesday Feb 2 Female Single ... 5 years 31 to 35 No No External none no change 1 vehicle 1994 Collision

5 rows × 33 columns

数据三方垂直拆分#

我们首先对这个数据进行一个拆分的处理,来模拟一个数据垂直分割的三方场景:

  • alice持有前10个属性

  • bob持有中间的10个属性

  • carol持有剩下的所有属性以及标签值

同时为了方便各方之间的样本做对齐,我们加了一个新的特征UID来标识数据样本。

我们预先基于sklearn将全集数据拆分成训练集和测试集,方便后续进行模型训练效果的验证。

[10]:
train_alice_path = "data/alice_train.csv"
train_bob_path = "data/bob_train.csv"
train_carol_path = "data/carol_train.csv"

test_alice_path = "data/alice_test.csv"
test_bob_path = "data/bob_test.csv"
test_carol_path = "data/carol_test.csv"

def load_dataset_full(data_path):
    df = pd.read_csv(data_path)
    df = df.drop([0])
    df = df.loc[df['DayOfWeekClaimed']!='0']
    y = df['FraudFound_P']
    X = df.drop(columns='FraudFound_P')
    return X, y

def split_data():
    x, y = load_dataset_full(full_data_path)
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.3, random_state=10
    )

    print(x_train.shape)
    train_alice_csv = x_train.iloc[:, :10]
    train_bob_csv = x_train.iloc[:, 10:20]
    train_carol_csv = pd.concat([x_train.iloc[:, 20:], y_train], axis=1)

    train_alice_csv.to_csv(train_alice_path, index_label='UID')
    train_bob_csv.to_csv(train_bob_path, index_label='UID')
    train_carol_csv.to_csv(train_carol_path, index_label='UID')

    print(x_test.shape)
    test_alice_csv = x_test.iloc[:, :10]
    test_bob_csv = x_test.iloc[:, 10:20]
    test_carol_csv = pd.concat([x_test.iloc[:, 20:], y_test], axis=1)

    test_alice_csv.to_csv(test_alice_path, index_label='UID')
    test_bob_csv.to_csv(test_bob_path, index_label='UID')
    test_carol_csv.to_csv(test_carol_path, index_label='UID')

split_data()
(10792, 32)
(4626, 32)
[11]:
alice_train_df = pd.read_csv(train_alice_path)
alice_train_df.head()
[11]:
UID Month WeekOfMonth DayOfWeek Make AccidentArea DayOfWeekClaimed MonthClaimed WeekOfMonthClaimed Sex MaritalStatus
0 2853 Mar 4 Sunday Toyota Urban Friday Apr 1 Male Married
1 7261 Apr 4 Saturday Honda Urban Monday Apr 4 Male Married
2 9862 Jun 4 Sunday Toyota Rural Monday Jun 4 Female Single
3 14037 Mar 2 Monday Mazda Urban Monday Mar 2 Male Single
4 10199 Jun 3 Friday Mazda Urban Tuesday Jun 4 Female Single
[12]:
bob_train_df = pd.read_csv(train_bob_path)
bob_train_df.head()
[12]:
UID Age Fault PolicyType VehicleCategory VehiclePrice PolicyNumber RepNumber Deductible DriverRating Days_Policy_Accident
0 2853 39 Policy Holder Sedan - All Perils Sedan 20000 to 29000 2854 8 400 2 more than 30
1 7261 58 Policy Holder Sedan - Liability Sport 20000 to 29000 7262 4 400 4 more than 30
2 9862 28 Policy Holder Sedan - All Perils Sedan less than 20000 9863 5 400 4 more than 30
3 14037 28 Policy Holder Sedan - Collision Sedan 20000 to 29000 14038 11 400 4 more than 30
4 10199 35 Policy Holder Sedan - Collision Sedan 20000 to 29000 10200 12 400 4 more than 30
[13]:
carol_train_df = pd.read_csv(train_carol_path)
carol_train_df.head()
[13]:
UID Days_Policy_Claim PastNumberOfClaims AgeOfVehicle AgeOfPolicyHolder PoliceReportFiled WitnessPresent AgentType NumberOfSuppliments AddressChange_Claim NumberOfCars Year BasePolicy FraudFound_P
0 2853 more than 30 1 7 years 36 to 40 No No External more than 5 no change 1 vehicle 1994 All Perils 0
1 7261 more than 30 none more than 7 51 to 65 No No External 1 to 2 no change 1 vehicle 1995 Liability 0
2 9862 more than 30 none 7 years 31 to 35 No No External none no change 1 vehicle 1995 All Perils 0
3 14037 more than 30 1 6 years 31 to 35 No No External none no change 1 vehicle 1996 Collision 0
4 10199 more than 30 none 5 years 31 to 35 No No Internal none no change 1 vehicle 1995 Collision 0

三方数据加载#

注意:这里的接口里面需要显示地指明用于多方之间样本对齐的key,以及明确使用何种设备来执行PSI。

[14]:

from secretflow.data.vertical import read_csv as v_read_csv train_ds = v_read_csv({alice: train_alice_path, bob: train_bob_path, carol: train_carol_path}, keys='UID', drop_keys='UID', spu=my_spu) test_ds = v_read_csv({alice: test_alice_path, bob: test_bob_path, carol: test_carol_path}, keys='UID', drop_keys='UID', spu=my_spu) print(train_ds) print(train_ds.columns)
VDataFrame(partitions={<secretflow.device.device.pyu.PYU object at 0x7f09319baf70>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f09319a2910>), <secretflow.device.device.pyu.PYU object at 0x7f09319bac10>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38d1acd0>), <secretflow.device.device.pyu.PYU object at 0x7f09319bafd0>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38cfe190>)}, aligned=True)
Index(['Month', 'WeekOfMonth', 'DayOfWeek', 'Make', 'AccidentArea',
       'DayOfWeekClaimed', 'MonthClaimed', 'WeekOfMonthClaimed', 'Sex',
       'MaritalStatus', 'Age', 'Fault', 'PolicyType', 'VehicleCategory',
       'VehiclePrice', 'PolicyNumber', 'RepNumber', 'Deductible',
       'DriverRating', 'Days_Policy_Accident', 'Days_Policy_Claim',
       'PastNumberOfClaims', 'AgeOfVehicle', 'AgeOfPolicyHolder',
       'PoliceReportFiled', 'WitnessPresent', 'AgentType',
       'NumberOfSuppliments', 'AddressChange_Claim', 'NumberOfCars', 'Year',
       'BasePolicy', 'FraudFound_P'],
      dtype='object')

数据洞察#

基于上层封装的VDataFrame抽象,隐语提供了多种数据分析的API,例如统计信息、查改某些列的信息等。

[15]:
print(train_ds['WeekOfMonth'].count())

print(train_ds['WeekOfMonth'].max())
print(train_ds['WeekOfMonth'].min())
WeekOfMonth    10792
dtype: int64
WeekOfMonth    5
dtype: int64
WeekOfMonth    1
dtype: int64

数据预处理#

在读取完数据之后,下面我们演示如何在隐语上对一个实际多方持有的数据进行数据预处理。

Label Encoder#

对无序且二值的值,我们可以使用label encoding,转化为0/1表示

[16]:
from secretflow.preprocessing import LabelEncoder

cols = ['AccidentArea', 'Sex', 'Fault', 'PoliceReportFiled', 'WitnessPresent', 'AgentType']
for col in cols:
    print(f"Col name {col}: {df[col].unique()}")

train_ds_v1 = train_ds.copy()
test_ds_v1 = test_ds.copy()

label_encoder = LabelEncoder()
for col in cols:
    label_encoder.fit(train_ds_v1[col])
    train_ds_v1[col] = label_encoder.transform(train_ds_v1[col])
    test_ds_v1[col] = label_encoder.transform(test_ds_v1[col])
Col name AccidentArea: ['Urban' 'Rural']
Col name Sex: ['Female' 'Male']
Col name Fault: ['Policy Holder' 'Third Party']
Col name PoliceReportFiled: ['No' 'Yes']
Col name WitnessPresent: ['No' 'Yes']
Col name AgentType: ['External' 'Internal']

(Ordinal) Categorical Features#

对于有序的类别数据,我们构建映射,将类别数据转化为0~n-1的整数

[17]:
cols1 = [
    "Days_Policy_Accident",
    "Days_Policy_Claim",
    "AgeOfPolicyHolder",
    "AddressChange_Claim",
    "NumberOfCars",
]
col_disc = [
    {
        "Days_Policy_Accident": {
            "more than 30": 31,
            "15 to 30": 22.5,
            "none": 0,
            "1 to 7": 4,
            "8 to 15": 11.5,
        }
    },
    {
        "Days_Policy_Claim": {
            "more than 30": 31,
            "15 to 30": 22.5,
            "8 to 15": 11.5,
            "none": 0,
        }
    },
    {
        "AgeOfPolicyHolder": {
            "26 to 30": 28,
            "31 to 35": 33,
            "41 to 50": 45.5,
            "51 to 65": 58,
            "21 to 25": 23,
            "36 to 40": 38,
            "16 to 17": 16.5,
            "over 65": 66,
            "18 to 20": 19,
        }
    },
    {
        "AddressChange_Claim": {
            "1 year": 1,
            "no change": 0,
            "4 to 8 years": 6,
            "2 to 3 years": 2.5,
            "under 6 months": 0.5,
        }
    },
    {
        "NumberOfCars": {
            "3 to 4": 3.5,
            "1 vehicle": 1,
            "2 vehicles": 2,
            "5 to 8": 6.5,
            "more than 8": 9,
        }
    },
]

cols2 = [
    "Month",
    "DayOfWeek",
    "DayOfWeekClaimed",
    "MonthClaimed",
    "PastNumberOfClaims",
    "NumberOfSuppliments",
    "VehiclePrice",
    "AgeOfVehicle",
]
col_ordering = [
    {
        "Month": {
            "Jan": 1,
            "Feb": 2,
            "Mar": 3,
            "Apr": 4,
            "May": 5,
            "Jun": 6,
            "Jul": 7,
            "Aug": 8,
            "Sep": 9,
            "Oct": 10,
            "Nov": 11,
            "Dec": 12,
        }
    },
    {
        "DayOfWeek": {
            "Monday": 1,
            "Tuesday": 2,
            "Wednesday": 3,
            "Thursday": 4,
            "Friday": 5,
            "Saturday": 6,
            "Sunday": 7,
        }
    },
    {
        "DayOfWeekClaimed": {
            "Monday": 1,
            "Tuesday": 2,
            "Wednesday": 3,
            "Thursday": 4,
            "Friday": 5,
            "Saturday": 6,
            "Sunday": 7,
        }
    },
    {
        "MonthClaimed": {
            "Jan": 1,
            "Feb": 2,
            "Mar": 3,
            "Apr": 4,
            "May": 5,
            "Jun": 6,
            "Jul": 7,
            "Aug": 8,
            "Sep": 9,
            "Oct": 10,
            "Nov": 11,
            "Dec": 12,
        }
    },
    {"PastNumberOfClaims": {"none": 0, "1": 1, "2 to 4": 2, "more than 4": 5}},
    {"NumberOfSuppliments": {"none": 0, "1 to 2": 1, "3 to 5": 3, "more than 5": 6}},
    {
        "VehiclePrice": {
            "more than 69000": 69001,
            "20000 to 29000": 24500,
            "30000 to 39000": 34500,
            "less than 20000": 19999,
            "40000 to 59000": 49500,
            "60000 to 69000": 64500,
        }
    },
    {
        "AgeOfVehicle": {
            "3 years": 3,
            "6 years": 6,
            "7 years": 7,
            "more than 7": 8,
            "5 years": 5,
            "new": 0,
            "4 years": 4,
            "2 years": 2,
        }
    },
]

from secretflow.data.vertical import VDataFrame

def replace(df, col_maps):
    df = df.copy()

    def func_(df, col_map):
        col_name = list(col_map.keys())[0]
        col_dict = list(col_map.values())[0]
        if col_name not in df.columns:
            return
        new_list = []
        for i in df[col_name]:
            new_list.append(col_dict[i])
        df[col_name] = new_list

    for col_map in col_maps:
        func_(df, col_map)
    return df

col_maps = col_disc + col_ordering

train_ds_v2 = train_ds_v1.copy()
test_ds_v2 = test_ds_v1.copy()

# NOTE: Reveal is only used for demo only!!
print(f"orig ds in alice:\n {sf.reveal(train_ds_v2.partitions[alice].data)}")
train_ds_v2.partitions[alice].data = alice(replace)(
    train_ds_v2.partitions[alice].data, col_maps
)
train_ds_v2.partitions[bob].data = bob(replace)(
    train_ds_v2.partitions[bob].data, col_maps
)
train_ds_v2.partitions[carol].data = carol(replace)(
    train_ds_v2.partitions[carol].data, col_maps
)

print(f"orig ds in alice:\n {sf.reveal(train_ds_v2.partitions[alice].data)}")

test_ds_v2.partitions[alice].data = alice(replace)(
    test_ds_v2.partitions[alice].data, col_maps
)
test_ds_v2.partitions[bob].data = bob(replace)(
    test_ds_v2.partitions[bob].data, col_maps
)
test_ds_v2.partitions[carol].data = carol(replace)(
    test_ds_v2.partitions[carol].data, col_maps
)
orig ds in alice:
       Month  WeekOfMonth  DayOfWeek       Make  AccidentArea DayOfWeekClaimed  \
0       Mar            4     Sunday     Toyota             1           Friday
1       Apr            4   Saturday      Honda             1           Monday
2       Jun            4     Sunday     Toyota             0           Monday
3       Mar            2     Monday      Mazda             1           Monday
4       Jun            3     Friday      Mazda             1          Tuesday
...     ...          ...        ...        ...           ...              ...
10787   Sep            2  Wednesday  Chevrolet             1           Monday
10788   Apr            4    Tuesday      Honda             1          Tuesday
10789   Jul            1     Monday       Ford             0        Wednesday
10790   Feb            3  Wednesday    Pontiac             1           Monday
10791   Feb            5     Monday    Mercury             1           Friday

      MonthClaimed  WeekOfMonthClaimed  Sex MaritalStatus
0              Apr                   1    1       Married
1              Apr                   4    1       Married
2              Jun                   4    0        Single
3              Mar                   2    1        Single
4              Jun                   4    0        Single
...            ...                 ...  ...           ...
10787          Sep                   2    0       Married
10788          May                   1    1       Married
10789          Jul                   1    1       Married
10790          Feb                   3    1       Married
10791          Mar                   1    1       Married

[10792 rows x 10 columns]
orig ds in alice:
        Month  WeekOfMonth  DayOfWeek       Make  AccidentArea  \
0          3            4          7     Toyota             1
1          4            4          6      Honda             1
2          6            4          7     Toyota             0
3          3            2          1      Mazda             1
4          6            3          5      Mazda             1
...      ...          ...        ...        ...           ...
10787      9            2          3  Chevrolet             1
10788      4            4          2      Honda             1
10789      7            1          1       Ford             0
10790      2            3          3    Pontiac             1
10791      2            5          1    Mercury             1

       DayOfWeekClaimed  MonthClaimed  WeekOfMonthClaimed  Sex MaritalStatus
0                     5             4                   1    1       Married
1                     1             4                   4    1       Married
2                     1             6                   4    0        Single
3                     1             3                   2    1        Single
4                     2             6                   4    0        Single
...                 ...           ...                 ...  ...           ...
10787                 1             9                   2    0       Married
10788                 2             5                   1    1       Married
10789                 3             7                   1    1       Married
10790                 1             2                   3    1       Married
10791                 5             3                   1    1       Married

[10792 rows x 10 columns]

(Nominal) Categorical Features#

无序的类别数据,我们直接采用onehot encoder进行01编码

Onehot Encoder#

[18]:
from secretflow.preprocessing import OneHotEncoder
onehot_cols = ['Make','MaritalStatus','PolicyType','VehicleCategory','BasePolicy']

onehot_encoder = OneHotEncoder()
onehot_encoder.fit(train_ds_v2[onehot_cols])

enc_feats = onehot_encoder.transform(train_ds_v2[onehot_cols])
feature_names = enc_feats.columns
train_ds_v3 = train_ds_v2.drop(columns=onehot_cols)
train_ds_v3[feature_names] = enc_feats


enc_feats = onehot_encoder.transform(test_ds_v2[onehot_cols])
test_ds_v3 = test_ds_v2.drop(columns=onehot_cols)
test_ds_v3[feature_names] = enc_feats

print(f"orig ds in alice:\n {sf.reveal(train_ds_v3.partitions[alice].data)}")
orig ds in alice:
        Month  WeekOfMonth  DayOfWeek  AccidentArea  DayOfWeekClaimed  \
0          3            4          7             1                 5
1          4            4          6             1                 1
2          6            4          7             0                 1
3          3            2          1             1                 1
4          6            3          5             1                 2
...      ...          ...        ...           ...               ...
10787      9            2          3             1                 1
10788      4            4          2             1                 2
10789      7            1          1             0                 3
10790      2            3          3             1                 1
10791      2            5          1             1                 5

       MonthClaimed  WeekOfMonthClaimed  Sex  Make_Accura  Make_BMW  ...  \
0                 4                   1    1          0.0       0.0  ...
1                 4                   4    1          0.0       0.0  ...
2                 6                   4    0          0.0       0.0  ...
3                 3                   2    1          0.0       0.0  ...
4                 6                   4    0          0.0       0.0  ...
...             ...                 ...  ...          ...       ...  ...
10787             9                   2    0          0.0       0.0  ...
10788             5                   1    1          0.0       0.0  ...
10789             7                   1    1          0.0       0.0  ...
10790             2                   3    1          0.0       0.0  ...
10791             3                   1    1          0.0       0.0  ...

       Make_Pontiac  Make_Porche  Make_Saab  Make_Saturn  Make_Toyota  \
0               0.0          0.0        0.0          0.0          1.0
1               0.0          0.0        0.0          0.0          0.0
2               0.0          0.0        0.0          0.0          1.0
3               0.0          0.0        0.0          0.0          0.0
4               0.0          0.0        0.0          0.0          0.0
...             ...          ...        ...          ...          ...
10787           0.0          0.0        0.0          0.0          0.0
10788           0.0          0.0        0.0          0.0          0.0
10789           0.0          0.0        0.0          0.0          0.0
10790           1.0          0.0        0.0          0.0          0.0
10791           0.0          0.0        0.0          0.0          0.0

       Make_VW  MaritalStatus_Divorced  MaritalStatus_Married  \
0          0.0                     0.0                    1.0
1          0.0                     0.0                    1.0
2          0.0                     0.0                    0.0
3          0.0                     0.0                    0.0
4          0.0                     0.0                    0.0
...        ...                     ...                    ...
10787      0.0                     0.0                    1.0
10788      0.0                     0.0                    1.0
10789      0.0                     0.0                    1.0
10790      0.0                     0.0                    1.0
10791      0.0                     0.0                    1.0

       MaritalStatus_Single  MaritalStatus_Widow
0                       0.0                  0.0
1                       0.0                  0.0
2                       1.0                  0.0
3                       1.0                  0.0
4                       1.0                  0.0
...                     ...                  ...
10787                   0.0                  0.0
10788                   0.0                  0.0
10789                   0.0                  0.0
10790                   0.0                  0.0
10791                   0.0                  0.0

[10792 rows x 31 columns]
[19]:
train_ds_final = train_ds_v3.copy()
test_ds_final = test_ds_v3.copy()

X_train = train_ds_v3.drop(columns=['FraudFound_P'])
y_train = train_ds_final['FraudFound_P']
X_test = test_ds_final.drop(columns='FraudFound_P')
y_test = test_ds_final['FraudFound_P']

print("data load done")
data load done

数据对象转换#

此处我们将PYUObject 转化为 SPUObject,方便输入到SPU device执行基于MPC协议的隐私计算

[20]:
import jax
import jax.numpy as jnp

"""
Convert the VDataFrame object to SPUObject
"""
def vdataframe_to_spu(vdf: VDataFrame):
    spu_partitions = []
    for device in [alice, bob, carol]:
        spu_partitions.append(sf.to(my_spu, vdf.partitions[device].data))
    base_partition = spu_partitions[0]
    for i in range(1, len(spu_partitions)):
        base_partition = my_spu(lambda x, y: jnp.concatenate([x, y], axis=1))(
            base_partition, spu_partitions[i]
        )
    return base_partition


X_train_spu = vdataframe_to_spu(X_train)
y_train_spu = sf.to(my_spu, y_train.partitions[carol].data)
X_test_spu = vdataframe_to_spu(X_test)
y_test_spu = sf.to(my_spu, y_test.partitions[carol].data)
print(f"X_train type: {X_train}\n\nX_train_spu type: {X_train_spu}")

"""
NOTE: This is only for demo only!! This shall not be used in production.
"""
X_train_plaintext = sf.reveal(X_train_spu)
y_train_plaintext = sf.reveal(y_train_spu)
X_test_plaintext = sf.reveal(X_test_spu)
y_test_plaintext = sf.reveal(y_test_spu)

print(f'X_train_plaintext: \n{X_train_plaintext}')
X_train type: VDataFrame(partitions={<secretflow.device.device.pyu.PYU object at 0x7f09319baf70>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38d1a8b0>), <secretflow.device.device.pyu.PYU object at 0x7f09319bac10>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38d1a850>), <secretflow.device.device.pyu.PYU object at 0x7f09319bafd0>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38c98070>)}, aligned=True)

X_train_spu type: <secretflow.device.device.spu.SPUObject object at 0x7f0a38cd7850>
X_train_plaintext:
[[3. 4. 7. ... 1. 0. 0.]
 [4. 4. 6. ... 0. 0. 1.]
 [6. 4. 7. ... 1. 0. 0.]
 ...
 [7. 1. 1. ... 0. 1. 0.]
 [2. 3. 3. ... 0. 0. 1.]
 [2. 5. 1. ... 0. 0. 1.]]

模型构建#

在完成数据的读入之后,下面我们进行模型的构建。在本demo中,主要提供了三种模型的构建: - LR: 逻辑斯特回归 - NN:神经网络模型 - XGB: XGBoost 树模型

注意,本示例主要是演示在隐语上进行算法开发的流程,并没有针对模型 (LR, NN) 进行调参。我们分别提供了明文和密文的计算结果,实验结果显示两者的输出是基本一致的,表明隐语的密态计算能够和明文计算保持精度一致。

LR ( jax ) using SPU#

[21]:
from jax.example_libraries import optimizers, stax
from jax.example_libraries.stax import (
    Conv,
    MaxPool,
    AvgPool,
    Flatten,
    Dense,
    Relu,
    Sigmoid,
    LogSoftmax,
    Softmax,
    BatchNorm,
)


def sigmoid(x):
    x = (x - jnp.min(x)) / (jnp.max(x) - jnp.min(x))
    return 1 / (1 + jnp.exp(-x))

# Outputs probability of a label being true.
def predict_lr(W, b, inputs):
    return sigmoid(jnp.dot(inputs, W) + b)

# Training loss is the negative log-likelihood of the training examples.
def loss_lr(W, b, inputs, targets):
    preds = predict_lr(W, b, inputs)
    label_probs = preds * targets + (1 - preds) * (1 - targets)
    return -jnp.mean(jnp.log(label_probs))

def train_step(W, b, X, y, learning_rate):
    loss_value, Wb_grad = jax.value_and_grad(loss_lr, (0, 1))(W, b, X, y)
    W -= learning_rate * Wb_grad[0]
    b -= learning_rate * Wb_grad[1]
    return loss_value, W, b

def fit(W, b, X, y, epochs=1, learning_rate=1e-2, batch_size=128):
    losses = jnp.array([])

    xs = jnp.array_split(X, len(X) / batch_size, axis=0)
    ys = jnp.array_split(y, len(y) / batch_size, axis=0)

    for _ in range(epochs):
        for (batch_x, batch_y) in zip(xs, ys):
            l, W, b = train_step(
                W, b, batch_x, batch_y, learning_rate=learning_rate
            )
            losses = jnp.append(losses, l)
    return losses, W, b

[22]:
from jax import random
import sys
import time
import logging
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().setLevel(logging.INFO)

from sklearn.metrics import roc_auc_score


# Hyperparameter
key = random.PRNGKey(42)
W = jax.random.normal(key, shape=(64,))
b = 0.0
epochs = 1
learning_rate = 1e-2
batch_size = 128

"""
CPU-version plaintext computation
"""
losses_cpu, W_cpu, b_cpu = fit(
    W,
    b,
    X_train_plaintext,
    y_train_plaintext,
    epochs=epochs,
    learning_rate=learning_rate,
    batch_size=batch_size,
)
y_pred_cpu = predict_lr(W_cpu, b_cpu, X_test_plaintext)
print(f"\033[31m(Jax LR CPU) auc: {roc_auc_score(y_test_plaintext, y_pred_cpu)}\033[0m")

"""
SPU-version secure computation
"""
W_, b_ = (
    sf.to(alice, W).to(my_spu),
    sf.to(alice, b).to(my_spu),
)
losses_spu, W_spu, b_spu = my_spu(
    fit,
    static_argnames=["epochs", "learning_rate", "batch_size"],
    num_returns_policy=sf.device.SPUCompilerNumReturnsPolicy.FROM_USER,
    user_specified_num_returns=3,
)(
    W_,
    b_,
    X_train_spu,
    y_train_spu,
    epochs=epochs,
    learning_rate=learning_rate,
    batch_size=batch_size,
)

y_pred_spu = my_spu(predict_lr)(W_spu, b_spu, X_test_spu)
y_pred = sf.reveal(y_pred_spu)
print(f"\033[31m(Jax LR SPU) auc: {roc_auc_score(y_test_plaintext, y_pred)}\033[0m")
INFO:absl:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
INFO:absl:Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Interpreter Host
INFO:absl:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
(Jax LR CPU) auc: 0.5243556504755678
(Jax LR SPU) auc: 0.5249493212966679

NN ( jax + flax ) using SPU#

[ ]:
import sys
!{sys.executable} -m pip install flax
[23]:
from typing import Sequence
import flax.linen as nn


class MLP(nn.Module):
    features: Sequence[int]

    @nn.compact
    def __call__(self, x):
        for feat in self.features[:-1]:
            x = nn.relu(nn.Dense(feat)(x))
        x = nn.Dense(self.features[-1])(x)
        return x

FEATURES = [1]
flax_nn = MLP(FEATURES)

def predict(params, x):
    from typing import Sequence
    import flax.linen as nn
    class MLP(nn.Module):
        features: Sequence[int]

        @nn.compact
        def __call__(self, x):
            for feat in self.features[:-1]:
                x = nn.relu(nn.Dense(feat)(x))
            x = nn.Dense(self.features[-1])(x)
            return x
    FEATURES = [1]
    flax_nn = MLP(FEATURES)
    return flax_nn.apply(params, x)


def loss_func(params, x, y):
    preds = predict(params, x)
    label_probs = preds * y + (1 - preds) * (1 - y)
    return -jnp.mean(jnp.log(label_probs))


def train_auto_grad(X, y, params, batch_size=10, epochs=10, learning_rate=0.01):
    xs = jnp.array_split(X, len(X) / batch_size, axis=0)
    ys = jnp.array_split(y, len(y) / batch_size, axis=0)

    for _ in range(epochs):
        for (batch_x, batch_y) in zip(xs, ys):
            _, grads = jax.value_and_grad(loss_func)(params, batch_x, batch_y)
            params = jax.tree_util.tree_map(
                lambda p, g: p - learning_rate * g, params, grads
            )
    return params

epochs = 1
learning_rate = 1e-2
batch_size = 128

feature_dim = 64  # from the dataset
init_params = flax_nn.init(jax.random.PRNGKey(1), jnp.ones((batch_size, feature_dim)))

"""
CPU-version plaintext computation
"""
params = train_auto_grad(
    X_train_plaintext, y_train_plaintext, init_params, batch_size, epochs, learning_rate
)
y_pred = predict(params, X_test_plaintext)
print(f"\033[31m(Flax NN CPU) auc: {roc_auc_score(y_test_plaintext, y_pred)}\033[0m")

"""
SPU-version secure computation
"""
params_spu = sf.to(alice, init_params).to(my_spu)
params_spu = my_spu(train_auto_grad, static_argnames=['batch_size', 'epochs', 'learning_rate'])(
    X_train_spu, y_train_spu, params_spu, batch_size=batch_size, epochs=epochs, learning_rate=learning_rate
)
y_pred_spu = my_spu(predict)(params_spu, X_test_spu)
y_pred_ = sf.reveal(y_pred_spu)
print(f"\033[31m(Flax NN SPU) auc: {roc_auc_score(y_test_plaintext, y_pred_)}\033[0m")
(Flax NN CPU) auc: 0.5022025986877814
(Flax NN SPU) auc: 0.5022042816667214

XGB ( jax ) using SPU#

[24]:
from secretflow.ml.boost.ss_xgb_v import Xgb
import time
from sklearn.metrics import roc_auc_score

"""
SPU-version Secure computation
"""
xgb = Xgb(my_spu)
params = {
    # <<< !!! >>> change args to your test settings.
    # for more detail, see Xgb.train.__doc__
    'num_boost_round': 10,
    'max_depth': 4,
    'learning_rate': 0.05,
    'sketch_eps': 0.05,
    'objective': 'logistic',
    'reg_lambda': 1,
    'subsample': 0.75,
    'colsample_bytree': 1,
    'base_score': 0.5,
}

start = time.time()
model = xgb.train(params, X_train, y_train)
print(f"train time: {time.time() - start}")

start =time.time()
spu_yhat = model.predict(X_test)
print(f"predict time: {time.time() - start}")

yhat = sf.reveal(spu_yhat)
print(f"\033[31m(SS-XGB) auc: {roc_auc_score(y_test_plaintext, yhat)}\033[0m")
INFO:root:global_setup time 5.707190036773682s
INFO:root:epoch 0 time 13.447269678115845s
INFO:root:epoch 1 time 11.487506866455078s
INFO:root:epoch 2 time 16.403863430023193s
INFO:root:epoch 3 time 10.947153091430664s
INFO:root:epoch 4 time 10.782062530517578s
INFO:root:epoch 5 time 10.983924865722656s
INFO:root:epoch 6 time 13.287509441375732s
INFO:root:epoch 7 time 10.768491506576538s
INFO:root:epoch 8 time 11.075066804885864s
INFO:root:epoch 9 time 11.336798429489136s
train time: 126.23692321777344
predict time: 0.0565030574798584
(SS-XGB) auc: 0.6917051858471569
[29]:
"""
Plaintext baseline
"""
import xgboost as SKxgb

params = {
    # <<< !!! >>> change args to your test settings.
    # for more detail, see Xgb.train.__doc__
    "n_estimators": 10,
    "max_depth": 4,
    'eval_metric': 'auc',
    "learning_rate": 0.05,
    "sketch_eps": 0.05,
    "objective": "binary:logistic",
    "reg_lambda": 1,
    "subsample": 0.75,
    "colsample_bytree": 1,
    "base_score": 0.5,
}
raw_xgb = SKxgb.XGBClassifier()
raw_xgb.fit(X_train_plaintext, y_train_plaintext)
y_pred = raw_xgb.predict(X_test_plaintext)
print(f"\033[31m(Sklearn-XGB) auc: {roc_auc_score(y_test_plaintext, y_pred)}\033[0m")
/home/haoqi.whq/miniconda3/envs/sf-demo/lib/python3.8/site-packages/xgboost/sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
/home/haoqi.whq/miniconda3/envs/sf-demo/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:98: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/haoqi.whq/miniconda3/envs/sf-demo/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:133: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
[20:26:03] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
(Sklearn-XGB) auc: 0.7106100882806603

The End#

显示地调用sf.shutdown()关闭实例化的集群。 > 注意:如果是在.py文件中运行代码,不需要显示地执行shutdown,在程序进程运行结束后会隐式地执行shutdown函数。

[26]:
sf.shutdown()

小结一下#

  • 介绍了如何针对一个实际场景的应用,在隐语上进行开发,提供隐私保护的能力

  • 隐语上的数据加载、预处理、建模、训练流程

  • 下一步,自己实现任意的计算(jax实现的计算),对于TF,pytorch的支持WIP