Private AI Developer Documentation

This is the end-user documentation for Private AI’s container-based de-identification system. Private AI’s solution is provided as a Docker image and communicates via POST requests.

To be able to run the container, the following are required:
1. Docker Image
2. Container Orchestration Platform
3. API key

For commercial use or demonstration purposes, the API key has to be obtained from Private AI (info@private-ai.ca).

Text Web Demo

If testing is to be done without deploying the container, please use the Web Demo service available at https://private-ai.ca/webdemo/.
In the left box, please enter the text that you would like to de-identify. The results will appear in the right box after you click the “Remove PII” button.

An API endpoint is also available for testing the text de-identification model at https://demoprivateai.com. See Text Sample Commands below for an example curl command.

Installation Instructions for Text, Image, and Video Packages

1. Install Docker:

$ sudo apt install docker.io

2. Make sure Docker can address at least 2GB RAM

3. Download and decrypt the Docker image:

$ sudo apt install gnupg
$ wget http://www.private-ai.ca/238734287342/private_ai_<version_number>.tar.gpg
$ gpg private_ai_<version_number>.tar.gpg

Enter the password when prompted.

4. Load the docker image:

$ docker load -i private_ai_<version_number>.tar

5. Run the image with the following command:

$ docker run --rm -p 8080:8080 -p 8081:8081 -it deid:<version_number>

Note: Temporary video files are written to the /tmp folder. Use tmpfs for maximum performance & security: https://docs.docker.com/storage/tmpfs/

For more information on Docker, please visit https://docs.docker.com/

General & Deployment Tips

Usage is metered by API calls, where an API call is:

  • Text: 128 words, where a word is a whitespace separated piece of text
  • Image: A single image
  • Video: Minutes of video processed, rounded up (e.g. 23s will be rounded up to 1 minute)

The number of API calls used is returned in the POST response; see the “api_calls_used” field.
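
The metering rules above can be sketched in Python as follows. This is an illustrative estimate only: the function names are hypothetical, and the sketch assumes that text usage is rounded up to the next 128-word block (as video minutes are) and that every request costs at least one call. The authoritative figure is always the “api_calls_used” field in the response.

```python
import math

WORDS_PER_CALL = 128  # from the text metering rule


def estimate_text_calls(text: str) -> int:
    """Estimate API calls for a text request.

    Assumption: partial blocks are rounded up and the minimum is one call.
    """
    words = len(text.split())  # split() with no arguments splits on whitespace
    return max(1, math.ceil(words / WORDS_PER_CALL))


def estimate_video_calls(duration_seconds: float) -> int:
    """Estimate API calls for a video: minutes processed, rounded up."""
    return max(1, math.ceil(duration_seconds / 60))
```

Under these assumptions, a 23-second video is estimated at one call, and a 129-word text at two.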

Whilst the Private AI docker solution can make use of all available CPU cores, it is optimised to run on a single CPU core machine.

The Docker solution relies on the Green Unicorn (Gunicorn) web server to service requests. Gunicorn can accept multiple simultaneous requests; however, this will not improve processing speed beyond hiding network latency, because inputs are still processed sequentially by our code.

For more information on Gunicorn, please visit https://gunicorn.org/#docs

The recommended deployment setup is Kubernetes with a cluster of single CPU core nodes, together with a load balancer to distribute requests. This can be an on-premise deployment or a cloud provider such as GCP, AWS or Azure.

The recommended worker type is a single core Intel Cascade Lake with 4GB RAM; a 2GB RAM option can be delivered upon request. Other CPU types with AVX512 VNNI support will also perform well.

Please contact us if you’d like to know how your infrastructure can best utilize the runtime.

The health of the container can be monitored by calling ‘healthz’ as follows: 

$ curl -X GET localhost:8081/healthz
{
"last_auth_call_successful": false,
"success": true
}

“last_auth_call_successful” displays whether the attempted call to the Private AI authentication servers was successful. Note that this value defaults to false on startup, until the first deid API call has been made successfully.
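
The same check can be scripted. Below is a minimal Python sketch using only the standard library; the function names are hypothetical, and it assumes the container is reachable on port 8081 as in the run command above.

```python
import json
import urllib.request


def parse_health(body: str) -> dict:
    """Extract the two documented fields from a /healthz response body."""
    data = json.loads(body)
    return {
        "success": bool(data.get("success", False)),
        "last_auth_call_successful": bool(data.get("last_auth_call_successful", False)),
    }


def check_health(base_url: str = "http://localhost:8081", timeout: float = 5.0) -> dict:
    """Call /healthz on a running container and return the parsed fields."""
    with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as response:
        return parse_health(response.read().decode("utf-8"))
```

Because “last_auth_call_successful” is false until the first successful de-identification call, a monitoring script should not treat a false value immediately after startup as a failure.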

Authentication & External Communications

Private AI’s de-identification suite is designed to run entirely on-device, on-premise, or in private cloud. The only outside communications made are for authentication & usage reporting with Private AI’s servers. These communications do not contain any customer data. If training data is required, it must be given to Private AI separately. An authentication call is made upon the first API call after the Docker image is started, and again at pre-defined intervals based on your subscription.

Text

Once the Docker container is running, you can make requests to de-identify text. This is a POST request to the /deidentify_text endpoint with a JSON body that has the following arguments:

  • text: Text to de-identify
  • key: Your API key, as shown in the sample commands below
  • unique_pii_markers (optional, default True): Specifies whether PII markers in the text should uniquely identify PII.
  • accuracy_mode (optional, default “standard”): Controls the speed/accuracy tradeoff. Can be set to “standard_high” or “high” to enable higher accuracy at the expense of processing speed.
  • enabled_classes (optional, default all classes): Controls which types of PII are removed. See Supported Entity Types below for the list of possible entities.

The API returns a JSON body containing the following fields:

  • result: The de-identified text, with each entity found replaced by a marker
  • pii: A list of all entities found in the text. Each PII entry has the following fields:
    • marker: The corresponding marker in the de-identified text (‘result’ field), where the entity exists
    • text: The entity text
    • best_label: The entity label with the highest likelihood
    • stt_idx: Start character index of the entity, in the original text
    • end_idx: End character index of the entity, in the original text
    • labels: A dictionary of all possible labels, together with associated likelihoods. Note that these are not strictly probabilities and do not sum to 1, as a word can belong to multiple classes
  • api_calls_used: The number of API calls used to process a request
  • output_checks_passed: Reports whether the output validity checks passed or not. These checks test:
    • Whether replacing each entity marker with the corresponding information matches the input
    • That every entity marker is bounded by whitespace or punctuation
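
As a rough illustration of the request described above, the following Python sketch builds the JSON body and posts it with the standard library. The function names are hypothetical. The “key” field and the string form of “unique_pii_markers” mirror the sample commands in this document; the URL assumes the container is running locally on port 8080.

```python
import json
import urllib.request

ACCURACY_MODES = ("standard", "standard_high", "high")


def build_text_request(text, key, unique_pii_markers=True,
                       accuracy_mode="standard", enabled_classes=None):
    """Build the JSON body for /deidentify_text from the documented arguments."""
    if accuracy_mode not in ACCURACY_MODES:
        raise ValueError(f"unknown accuracy_mode: {accuracy_mode!r}")
    body = {
        "text": text,
        "key": key,
        # The sample commands pass this flag as a string, so mirror that here.
        "unique_pii_markers": "True" if unique_pii_markers else "False",
        "accuracy_mode": accuracy_mode,
    }
    if enabled_classes is not None:  # omit the field to keep the default (all classes)
        body["enabled_classes"] = list(enabled_classes)
    return body


def deidentify_text(body, url="http://localhost:8080/deidentify_text", timeout=30.0):
    """POST the body to a running container and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating the body construction from the network call makes it easy to log or inspect the request before it is sent.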

Sample Commands

Below are some sample commands and corresponding outputs displaying the different options.

Unique PII markers (default)

$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace", "key": "<customer key>"}'
{
    "result": "My name is [NAME_1] and my friend is [NAME_2]",
    "pii": [
        {
            "marker": "NAME_1",
            "text": "John",
            "best_label": "NAME",
            "stt_idx": 11,
            "end_idx": 15,
            "labels":
                {
                    "NAME": 0.923
                }
        },
        {
            "marker": "NAME_2",
            "text": "Grace",
            "best_label": "NAME",
            "stt_idx": 33,
            "end_idx": 38,
            "labels":
                {
                    "NAME": 0.9135
                }
        }
    ],
    "api_calls_used": 1,
    "output_checks_passed": true
}
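
A response like the one above can be sanity-checked on the client side. The Python sketch below is a hypothetical helper, not part of the product: it confirms that each entity’s character offsets point at the entity text in the original input, and that substituting the entities back into “result” reproduces the input. It assumes that markers appear in “result” wrapped in square brackets, that “end_idx” is exclusive, and that “pii” lists one entry per occurrence in order of appearance, as in the samples.

```python
def verify_response(original: str, response: dict) -> bool:
    """Re-run two consistency checks on a /deidentify_text response."""
    restored = response["result"]
    for entity in response["pii"]:
        # Offsets index into the original text; end_idx is exclusive in the samples.
        if original[entity["stt_idx"]:entity["end_idx"]] != entity["text"]:
            return False
        # Replace only the first remaining occurrence, so non-unique markers
        # such as [NAME] are restored in order of appearance.
        restored = restored.replace(f"[{entity['marker']}]", entity["text"], 1)
    return restored == original
```

This complements, rather than replaces, the “output_checks_passed” field computed by the container.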

Non-unique PII markers

$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace", "unique_pii_markers": "False", "key": "<customer key>"}'
{
    "result": "My name is [NAME] and my friend is [NAME]",
    "pii": [
        {
            "marker": "NAME",
            "text": "John",
            "stt_idx": 11,
            "end_idx": 15,
            "best_label": "NAME",
            "labels":
                {
                    "NAME": 0.923
                }
        },
        {
            "marker": "NAME",
            "text": "Grace",
            "stt_idx": 33,
            "end_idx": 38,
            "best_label": "NAME",
            "labels":
                {
                    "NAME": 0.9135
                }
        }
    ],
    "api_calls_used": 1,
    "output_checks_passed": true
}

Enabled classes

$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace and we live in Barcelona", "key": "<customer key>", "enabled_classes": ["AGE", "LOCATION"]}'
{
    "result": "My name is John and my friend is Grace and we live in [LOCATION_1]",
    "pii": [
        {
            "marker": "LOCATION_1",
            "text": "Barcelona",
            "stt_idx": 54,
            "end_idx": 63,
            "best_label": "LOCATION",
            "labels":
                {
                    "LOCATION": 0.9211
                }
        }
    ],
    "api_calls_used": 1,
    "output_checks_passed": true
}

Private AI Demo Server

$ curl -X POST https://demoprivateai.com -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace", "key": "<customer key>"}'
{
    "result": "My name is [NAME_1] and my friend is [NAME_2]",
    "pii": [
        {
            "marker": "NAME_1",
            "text": "John",
            "stt_idx": 11,
            "end_idx": 15,
            "best_label": "NAME",
            "labels":
                {
                    "NAME": 0.923
                }
        },
        {
            "marker": "NAME_2",
            "text": "Grace",
            "stt_idx": 33,
            "end_idx": 38,
            "best_label": "NAME",
            "labels":
                {
                    "NAME": 0.9135
                }
        }
    ],
    "api_calls_used": 1,
    "output_checks_passed": true
}

Supported Entity Types

The currently supported entities are listed below. Note that an entity can have multiple types. For example, “Mayor of Boston” is an OCCUPATION with a LOCATION mentioned, so “Boston” is labeled as both OCCUPATION and LOCATION.
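
Because an entity can carry several types, it can be useful to look at the whole “labels” dictionary of a PII entry rather than only “best_label”. The Python sketch below is a hypothetical helper; the 0.5 threshold is an arbitrary illustrative value, not a recommendation from Private AI.

```python
def labels_above(entity: dict, threshold: float = 0.5) -> list:
    """Return every label whose likelihood meets the threshold, highest first."""
    scored = sorted(entity["labels"].items(), key=lambda item: item[1], reverse=True)
    return [label for label, score in scored if score >= threshold]
```

Since the likelihoods are not strict probabilities and do not sum to 1, several labels can pass the threshold for the same entity.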

General

Label                   Description
AGE                     Number associated with an age, e.g. 27
CREDIT_CARD             Credit card number, e.g. 0123 0123 0123 0123
CREDIT_CARD_EXPIRATION  E.g. Expires: 2/28
CVV                     Credit card verification code, e.g. CVV: 080
DATE                    E.g. December 18 or 2011-2014
DOB                     Date of birth, e.g. Date of Birth: March 7, 1961
EVENT                   E.g. Olympics
FILENAME                Name of a computer file, e.g. brad_tax_returns.txt, koalabear.jpg
IP_ADDRESS              Internet IP address, e.g. 192.168.0.1
LANGUAGE                E.g. English, French
LOCATION                E.g. Eritrea, Italy
MONEY                   E.g. 15 dollars, $94.50
NAME                    Person name, e.g. Harry Potter, Dwayne Johnson
NUMERICAL_PII           Numeric PII that doesn’t fall into other categories or that the model is uncertain about
ORGANIZATION            E.g. BHP, McDonalds
OCCUPATION              E.g. professor, actors, engineer, MBA, CPA
ORIGIN                  Nationalities, ethnicities, and races, e.g. Canadian, American, Caucasian
PHONE_NUMBER            E.g. +4917643476050
RELIGION                E.g. Hindu
SSN                     Social Security Number, e.g. 078-05-1120
TIME                    E.g. 19:37:28
URL                     Internet URL, e.g. www.private-ai.ca

PHI (Protected Health Information)

Label                   Description
BLOOD_TYPE              Blood type, e.g. O-
CONDITION               A medical condition. Includes diseases, syndromes, deficits, disorders. E.g. chronic fatigue syndrome, arrhythmia, depression.
DRUG                    Medical drug, including vitamins and minerals. E.g. Advil, Acetaminophen, Panadol
INJURY                  Human injury, e.g. I broke my arm, I have a sprained wrist. Includes mutations, miscarriages and dislocations.
MEDICAL_PROCESS         Medical process, including treatments, procedures and tests. E.g. ‘heart surgery’, ‘CT scan’.

Coming Soon

Label                   Description
AWARD                   E.g. Nobel Prize
HEALTHCARE_NUMBER       Healthcare number, e.g. 5584-486-674-YM
ID_NUMBER               Passport number or driver’s license number, e.g. D6101-40706-60905
PASSWORD                E.g. secret_password
PHYSICAL_ATTRIBUTE      A body attribute, e.g. I’m 190cm tall.
POLITICAL_AFFILIATION   E.g. Democrat, Republican
USERNAME                User name or handle, e.g. privateairocks, @_PrivateAI
ZODIAC_SIGN             E.g. Aquarius

Text Performance Tips

Private AI’s solution uses AI to detect PII based on context. Therefore, for best performance it is advisable to send text through in the largest possible chunks. For example, the following chat log should be sent through in one call, as opposed to line-by-line:

“Hi John, how are you?
I’m good thanks
Great, hope Atlanta is treating you well”

Similarly, text documents should be sent through in a single request, rather than by paragraph or sentence. In addition to improving accuracy, this will minimize the number of API calls made.
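
The effect of batching can be sketched in Python using the chat log above. The helper names are hypothetical, and the call estimate assumes that each request is billed in 128-word blocks, rounded up, with a minimum of one call per request.

```python
import math


def as_single_request(lines):
    """Join chat lines into one text so the model sees the full context."""
    return "\n".join(line.strip() for line in lines if line.strip())


def calls_needed(text, words_per_call=128):
    """Estimated API calls for one request (assumes rounding up, minimum one)."""
    return max(1, math.ceil(len(text.split()) / words_per_call))


chat = [
    "Hi John, how are you?",
    "I’m good thanks",
    "Great, hope Atlanta is treating you well",
]
combined = as_single_request(chat)
per_line_calls = sum(calls_needed(line) for line in chat)  # one request per line
combined_calls = calls_needed(combined)                    # one request in total
```

Under these assumptions the line-by-line approach is estimated at three calls and the combined request at one, in addition to giving the model the context it needs to recognize “Atlanta” and “John”.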

The AI model has also been optimised for normal English capitalization, e.g. “Robert is from Sydney, Australia. Muhab is from Wales”. If this is not the case for your data, please contact Private AI so that we can provide you with the optimal model for your use case. Our solution will still work, but some performance will be lost.

Image

Images can be processed via a POST request with a JSON body that has an “image_b64” field, containing a base64-encoded image file, and the optional fields “blur_shape” and “text_mode”. “blur_shape” specifies how the blurred region should appear, whilst “text_mode” specifies what type of text-in-image mode should be used.

Valid options for “blur_shape”:

  • “box”, standard blur shape
  • “oval”
  • “rounded_edges”

Valid options for “text_mode”:

  • “document”, which is optimized for legible images such as photocopier scans
  • “image”, slower than document but optimized to detect hard to read text present in real images
  • “none”, don’t perform text-in-image anonymization

The API will return a de-identified image. For an example of image de-identification, please see the Python example ‘image_deid_test.py’.
You can run it as follows:

$ python3 image_deid_test.py --image PATH_TO_IMAGE --blur_shape oval --text_mode none --key API_KEY

When a text-in-image mode is selected (“text_mode” set to “document” or “image”), the “output_checks_passed” self-check described in the Text section is also returned.
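
For readers who want to build the request body themselves, the Python sketch below encodes an image and validates the documented options. It is a hypothetical helper, not the contents of ‘image_deid_test.py’. The “key” field is assumed by analogy with the text requests, and the image endpoint is not stated in this document, so the sketch stops at building the body.

```python
import base64

BLUR_SHAPES = ("box", "oval", "rounded_edges")
TEXT_MODES = ("document", "image", "none")


def build_image_request(image_bytes: bytes, key: str,
                        blur_shape: str = None, text_mode: str = None) -> dict:
    """Build the JSON body for an image de-identification request."""
    if blur_shape is not None and blur_shape not in BLUR_SHAPES:
        raise ValueError(f"unknown blur_shape: {blur_shape!r}")
    if text_mode is not None and text_mode not in TEXT_MODES:
        raise ValueError(f"unknown text_mode: {text_mode!r}")
    body = {
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        "key": key,  # assumption: same field name as in the text requests
    }
    # Optional fields are omitted when not given, so the server defaults apply.
    if blur_shape is not None:
        body["blur_shape"] = blur_shape
    if text_mode is not None:
        body["text_mode"] = text_mode
    return body
```

Validating the options before encoding avoids sending a large upload that would be rejected for a misspelled option.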

Video

Videos can be processed via a POST request with a JSON body that has a “video_b64” field, containing a base64-encoded video file, and the optional field “blur_shape”. “blur_shape” specifies how the blurred region should appear. Currently only .mp4 videos are supported.

Valid options for “blur_shape”:

  • “box”, standard blur shape
  • “oval”
  • “rounded_edges”

The API will return a de-identified video. For an example of video de-identification, please see the Python example ‘video_deid_test.py’.
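
The request body for a video can be prepared in the same way. The Python sketch below is a hypothetical helper, not the contents of ‘video_deid_test.py’. It rejects files that are not .mp4 before encoding them. The “key” field is assumed by analogy with the text requests, and the video endpoint is not stated in this document, so the sketch stops at building the body.

```python
import base64
from pathlib import Path

BLUR_SHAPES = ("box", "oval", "rounded_edges")


def build_video_request(path, key: str, blur_shape: str = None) -> dict:
    """Build the JSON body for a video de-identification request."""
    path = Path(path)
    if path.suffix.lower() != ".mp4":  # only .mp4 videos are currently supported
        raise ValueError(f"unsupported video type: {path.suffix!r}")
    if blur_shape is not None and blur_shape not in BLUR_SHAPES:
        raise ValueError(f"unknown blur_shape: {blur_shape!r}")
    body = {
        "video_b64": base64.b64encode(path.read_bytes()).decode("ascii"),
        "key": key,  # assumption: same field name as in the text requests
    }
    if blur_shape is not None:  # omit to use the server default
        body["blur_shape"] = blur_shape
    return body
```

Note that base64 encoding increases the payload size by roughly one third, which is worth accounting for when sending long videos.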