DataFrame#

The following code is a demo only. It is NOT intended for production use due to system security concerns; please do NOT use it directly in production.

It is recommended to use Jupyter to run this tutorial.

SecretFlow provides federated data encapsulation in the form of a DataFrame. A DataFrame is composed of data blocks from multiple parties and supports horizontally or vertically partitioned data.

(Figure: dataframe.png — a DataFrame composed of data blocks from multiple parties)

Currently, secretflow.DataFrame provides a subset of pandas operations whose behavior is essentially the same as in pandas. During computation, the original data stays with its holder and does not leave that party's domain.

The following demonstrates how to use a DataFrame.
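As a purely conceptual illustration (this is not the actual secretflow internals), a federated DataFrame can be pictured as a mapping from party to local data block: each party computes on its own block, and only aggregated results are combined.

```python
import pandas as pd

# Conceptual sketch only -- NOT the real secretflow data structures.
# Each party's raw rows stay in its own block; only derived results
# (here, per-party row counts) are combined.
partitions = {
    'alice': pd.DataFrame({'x': [1.0, 2.0]}),
    'bob': pd.DataFrame({'x': [3.0, 4.0]}),
}

# Combine only the per-party counts, never the raw rows.
total_count = sum(len(block) for block in partitions.values())
assert total_count == 4
```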

Preparation#

Initialize SecretFlow and create three parties: alice, bob, and carol.

[1]:
import secretflow as sf

# In case you have a running secretflow runtime already.
sf.shutdown()

sf.init(['alice', 'bob', 'carol'], address='local')
alice, bob, carol = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('carol')

Data preparation#

Here we use the iris dataset as example data.

[2]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
data = pd.concat([iris.data, iris.target], axis=1)
data
[2]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2

150 rows × 5 columns

We partition the data in both horizontal mode (parties share the same features, and each holds some of the samples) and vertical mode (each party holds some of the features) to facilitate the subsequent demonstration.

[3]:
# Horizontal partitioning.
h_alice, h_bob, h_carol = data.iloc[:40, :], data.iloc[40:100, :], data.iloc[100:, :]

# Save to temporary files.
import tempfile
import os

temp_dir = tempfile.mkdtemp()

h_alice_path = os.path.join(temp_dir, 'h_alice.csv')
h_bob_path = os.path.join(temp_dir, 'h_bob.csv')
h_carol_path = os.path.join(temp_dir, 'h_carol.csv')
h_alice.to_csv(h_alice_path, index=False)
h_bob.to_csv(h_bob_path, index=False)
h_carol.to_csv(h_carol_path, index=False)
[4]:
h_alice.head(), h_bob.head(), h_carol.head()
[4]:
(   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 0                5.1               3.5                1.4               0.2
 1                4.9               3.0                1.4               0.2
 2                4.7               3.2                1.3               0.2
 3                4.6               3.1                1.5               0.2
 4                5.0               3.6                1.4               0.2

    target
 0       0
 1       0
 2       0
 3       0
 4       0  ,
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 40                5.0               3.5                1.3               0.3
 41                4.5               2.3                1.3               0.3
 42                4.4               3.2                1.3               0.2
 43                5.0               3.5                1.6               0.6
 44                5.1               3.8                1.9               0.4

     target
 40       0
 41       0
 42       0
 43       0
 44       0  ,
      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 100                6.3               3.3                6.0               2.5
 101                5.8               2.7                5.1               1.9
 102                7.1               3.0                5.9               2.1
 103                6.3               2.9                5.6               1.8
 104                6.5               3.0                5.8               2.2

      target
 100       2
 101       2
 102       2
 103       2
 104       2  )
[5]:
# Vertical partitioning.
v_alice, v_bob, v_carol = data.iloc[:, :2], data.iloc[:, 2:4], data.iloc[:, 4:]

# Save to temporary files.
v_alice_path = os.path.join(temp_dir, 'v_alice.csv')
v_bob_path = os.path.join(temp_dir, 'v_bob.csv')
v_carol_path = os.path.join(temp_dir, 'v_carol.csv')
v_alice.to_csv(v_alice_path, index=False)
v_bob.to_csv(v_bob_path, index=False)
v_carol.to_csv(v_carol_path, index=False)
[6]:
v_alice, v_bob, v_carol
[6]:
(     sepal length (cm)  sepal width (cm)
 0                  5.1               3.5
 1                  4.9               3.0
 2                  4.7               3.2
 3                  4.6               3.1
 4                  5.0               3.6
 ..                 ...               ...
 145                6.7               3.0
 146                6.3               2.5
 147                6.5               3.0
 148                6.2               3.4
 149                5.9               3.0

 [150 rows x 2 columns],
      petal length (cm)  petal width (cm)
 0                  1.4               0.2
 1                  1.4               0.2
 2                  1.3               0.2
 3                  1.5               0.2
 4                  1.4               0.2
 ..                 ...               ...
 145                5.2               2.3
 146                5.0               1.9
 147                5.2               2.0
 148                5.4               2.3
 149                5.1               1.8

 [150 rows x 2 columns],
      target
 0         0
 1         0
 2         0
 3         0
 4         0
 ..      ...
 145       2
 146       2
 147       2
 148       2
 149       2

 [150 rows x 1 columns])
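The two partition modes can be checked in plain pandas: horizontal blocks share the same columns and concatenate back row-wise, while vertical blocks share the same index and concatenate back column-wise. A minimal sketch on toy data (independent of the iris frame above):

```python
import pandas as pd

# Toy stand-in for the full dataset (4 rows, 3 columns).
data = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [5.0, 6.0, 7.0, 8.0],
    'y': [0, 0, 1, 1],
})

# Horizontal partitioning: same columns, disjoint rows.
h1, h2 = data.iloc[:2], data.iloc[2:]
assert list(h1.columns) == list(h2.columns)
assert pd.concat([h1, h2]).equals(data)

# Vertical partitioning: same (aligned) index, disjoint columns.
v1, v2 = data.iloc[:, :2], data.iloc[:, 2:]
assert v1.index.equals(v2.index)
assert pd.concat([v1, v2], axis=1).equals(data)
```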

Creation#

Horizontal DataFrame#

Create a DataFrame consisting of horizontally partitioned data.

💡 The original data is still stored locally in the data holder and is not transmitted out of the domain.

Here, as a simple showcase, we choose secure aggregation and SPU comparison. You can refer to Secure Aggregation to learn more about secure aggregation solutions, and implement appropriate security policies according to your needs.

[7]:
from secretflow.data.horizontal import read_csv as h_read_csv
from secretflow.security.aggregation import SecureAggregator
from secretflow.security.compare import SPUComparator

# The aggregator and comparator are respectively used to aggregate
# or compare data in subsequent data analysis operations.
aggr = SecureAggregator(device=alice, participants=[alice, bob, carol])

spu = sf.SPU(sf.utils.testing.cluster_def(parties=['alice', 'bob', 'carol']))
comp = SPUComparator(spu)
hdf = h_read_csv({alice: h_alice_path, bob: h_bob_path, carol: h_carol_path},
                 aggregator=aggr,
                 comparator=comp)

Vertical DataFrame#

Create a DataFrame consisting of vertically partitioned data.

💡 The original data is still stored locally in the data holder and is not transmitted out of the domain.

[8]:
from secretflow.data.vertical import read_csv as v_read_csv

vdf = v_read_csv({alice: v_alice_path, bob: v_bob_path, carol: v_carol_path})

Data analysis#

For data privacy protection, DataFrame does not allow viewing of the raw data. Instead, it provides pandas-like interfaces for users to analyze the data. These interfaces generally support both horizontally and vertically partitioned data.

💡 During the following operations, the original data of the DataFrame is still stored locally on the node and is not transmitted out of the domain.

[9]:
hdf.columns
[9]:
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')
[10]:
vdf.columns
[10]:
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

Get the minimum value; you can see that it is consistent with the original data.

[11]:
print('Horizontal df:\n', hdf.min())
print('\nVertical df:\n', vdf.min())
print('\nPandas:\n', data.min())
Horizontal df:
 sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
target               0.0
dtype: float64

Vertical df:
 sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
target               0.0
dtype: float64

Pandas:
 sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
target               0.0
dtype: float64

You can also view information such as the maximum value, mean, and count.

[12]:
hdf.max()
[12]:
sepal length (cm)    7.9
sepal width (cm)     4.4
petal length (cm)    6.9
petal width (cm)     2.5
target               2.0
dtype: float64
[13]:
vdf.max()
[13]:
sepal length (cm)    7.9
sepal width (cm)     4.4
petal length (cm)    6.9
petal width (cm)     2.5
target               2.0
dtype: float64
[14]:
hdf.mean(numeric_only=True)
[14]:
sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64
[15]:
vdf.mean(numeric_only=True)
[15]:
sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64
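Similarly, the mean over horizontally partitioned data is the total of the per-party sums divided by the total count, which is why an aggregator is needed. A plaintext sketch of the arithmetic that the secure aggregator computes (toy data, illustration only):

```python
import pandas as pd

# Two parties' horizontal blocks of the same column.
h1 = pd.DataFrame({'x': [1.0, 2.0]})
h2 = pd.DataFrame({'x': [3.0, 4.0, 5.0]})

# Each party contributes only its local sum and count; in secretflow
# these are combined via secure aggregation, not in the clear.
total_sum = h1['x'].sum() + h2['x'].sum()
total_count = len(h1) + len(h2)
assert total_sum / total_count == 3.0
```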
[16]:
hdf.count()
[16]:
sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
target               150
dtype: int64
[17]:
vdf.count()
[17]:
sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
target               150
dtype: int64

Selection#

Get partial columns.

[18]:
hdf_part = hdf[['sepal length (cm)', 'target']]
hdf_part.mean(numeric_only=True)
[18]:
sepal length (cm)    5.843333
target               1.000000
dtype: float64
[19]:
vdf_part = vdf[['sepal width (cm)', 'target']]
vdf_part.mean(numeric_only=True)
[19]:
sepal width (cm)    3.057333
target              1.000000
dtype: float64

Modification#

Horizontal DataFrame.

[20]:
hdf_copy = hdf.copy()
print('Min of target: ', hdf_copy['target'].min()[0])
print('Max of target: ', hdf_copy['target'].max()[0])
Min of target:  0.0
Max of target:  2.0
[21]:
# Set target to 1.
hdf_copy['target'] = 1

# You can see that the value of target has become 1.
print('Min of target: ', hdf_copy['target'].min()[0])
print('Max of target: ', hdf_copy['target'].max()[0])
Min of target:  1.0
Max of target:  1.0

Vertical DataFrame.

[22]:
vdf_copy = vdf.copy()
print('Min of sepal width (cm): ', vdf_copy['sepal width (cm)'].min()[0])
print('Max of sepal width (cm): ', vdf_copy['sepal width (cm)'].max()[0])
Min of sepal width (cm):  2.0
Max of sepal width (cm):  4.4
[23]:
# Set sepal width (cm) to 20.
vdf_copy['sepal width (cm)'] = 20

# You can see that the value of sepal width (cm) has become 20.
print('Min of sepal width (cm): ', vdf_copy['sepal width (cm)'].min()[0])
print('Max of sepal width (cm): ', vdf_copy['sepal width (cm)'].max()[0])
Min of sepal width (cm):  20
Max of sepal width (cm):  20

Ending#

[24]:
# Clean up temporary files.

import shutil

shutil.rmtree(temp_dir, ignore_errors=True)

What’s Next?#

Learn how to do data preprocessing with DataFrame in this tutorial.