.. _anonymize:

=========
Anonymize
=========

This module provides methods to **anonymize or pseudo-anonymize a dataset**.

**Anonymizing** the dataset means removing all the confidential data and 
**not retaining any key** to enable it to be reconstructed. Also the 
remaining patterns in the data should not enable the identity of 
individuals to be identified via other means.

Pseudo-anonymizing the dataset also removes all the confidential
data from the dataset. But we also retain a key or separate
dataset enabling us to decode the original entities if required
at some point in the future.

The anonymizer does not delete the confidential data. Instead it 
**one-way encrypts (hashes) the data** so that the original text or
numbers cannot be retrieved. 

Encryption is better than deletion because the pattern of data is 
useful information we want our models to be able to use. For example 
it may be that one customer has multiple contracts. Deleting the 
customer number loses the information that these contracts are linked
but encryption retains those relationships.


Encryption method
=================

The method used is similar to password hashing.

*   A **salt** is randomly created for this encryption
    run by the system. If desired for reproducability, the
    user can override the salt with a passphrase of their 
    choice.

*   The salt is prepended to each value to be encrypted

*   The encryption algorithm converts the salted value into
    a long hexidecimal hash. This is our encrypted value.

Encryption is done using the **python hashlib package**. 
Using the **md5 algorithm**. More details can be found at  
https://docs.python.org/3/library/hashlib.html

The randomized salt is created using the **bcrypt package**.
More details can be found at https://pypi.org/project/bcrypt/

.. note::

    In a password setting, usually for each 
    individual encryption, a new random salt value
    would be created. This way you get very strong encryption
    with a key (the salt) that changes for every password.
    
    However in our algorithm we only set the salt once for the 
    full dataset. This is on purpose so that we can retain the 
    relationship structure between values whilst still providing 
    a good level of encryption by using a hash algorithm and 
    a salt value. 
    
    What this means is that the same customer
    occuring in multiple rows will have the same hashed value
    using our method.  


Usage
=====

Import the anonymizer class::

    from inforcehub import Anonymize
    
On initializing, the object will use a new randomized salt passphrase::

    anon = Anonymize()

You need to decide which columns in your Pandas dataframe
should be anonymized.

To transform a dataframe use the **transform** method on the anon
object.
We pass in the dataframe to be transformed, and a list of the 
columns to be anonymized::

    anon.transform(df, ['columnA', 'colummZ'])

The dataframe **df** itself will be anonymized to save improve 
memory usage and speed for large datasets. To
retain a copy of the original make a deep copy of the dataframe
before transforming it::

    original_df = df.copy(deep=True)  # Do this first

If **pseudo-anonymization** is required instead of full anonymization
the **lookup** dataframe of encrypted and unencrypted values is 
returned. This can be used later as a lookup to return to the 
confidential data::

    lookup_df = anon.transform(df, ['columnA', 'colummZ'])


Module details
==============

.. automodule:: inforcehub.anonymize
   	:members: