Text

Once the Docker container is running, you can make requests to de-identify text. Send a POST request to the deidentify_text endpoint with a JSON body containing the following arguments:

  • text: Text to de-identify

  • key: Your customer key (shown as “<customer key>” in the sample commands below)

  • unique_pii_markers (optional, default True): Specifies whether PII markers in the text should uniquely identify PII.

  • accuracy_mode (optional, default standard): Controls the speed/accuracy tradeoff. Can be set to “standard_high”, “high” or “high_multilingual” for higher accuracy at the expense of processing speed. Support for non-English languages is enabled by choosing a “*_multilingual” model.

  • enabled_classes (optional, default all): Controls which types of PII are removed. When not specified, all classes are enabled. See ‘Supported Entity Types’ below for the list of possible entities.

  • Beta: fake_entity_accuracy_mode (optional, default none): Enables fake entity generation using the specified model. This feature is currently in beta and only supports “standard”.

  • Beta: preserve_relationships (optional, default True): Specifies whether multiple instances of the same entity should be replaced by the same generated fake entity. For example, with preserve_relationships enabled: “Hi John and Rosha, John nice to meet you” -> “Hi Harry and Alev, Harry nice to meet you”. With it disabled: “Hi John and Rosha, John nice to meet you” -> “Hi Harry and Alev, Sulav nice to meet you”.
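
The arguments above can also be assembled programmatically. The following is a minimal Python sketch rather than an official client: the helper names build_payload and deidentify_text are hypothetical, and the URL, the content-type header and the “key” field are taken from the sample commands below. The sample commands pass unique_pii_markers as the string "False"; whether a JSON boolean is also accepted is not stated here, so pass whichever form your deployment expects.

```python
import json
import urllib.request

# Endpoint used in the sample commands; adjust host and port to your deployment.
URL = "http://localhost:8080/deidentify_text"

# Optional arguments listed in this section.
ALLOWED_OPTIONS = {
    "unique_pii_markers",
    "accuracy_mode",
    "enabled_classes",
    "fake_entity_accuracy_mode",
    "preserve_relationships",
}


def build_payload(text, key, **options):
    """Build the JSON body; optional arguments are included only when set."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    payload = {"text": text, "key": key}
    # Leave out unset options so the documented defaults apply.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


def deidentify_text(text, key, url=URL, **options):
    """POST the payload and return the decoded JSON response."""
    body = json.dumps(build_payload(text, key, **options)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a container running, a call such as deidentify_text("My name is John", "<customer key>", enabled_classes=["NAME"]) would send the same kind of request as the curl samples.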

The API returns a JSON body containing the following fields:

  • result: The de-identified text, with each entity found replaced by a marker

  • Beta: result_fake: The pseudonymized (fake) text with each entity found replaced by a generated entity

  • pii: A list of all entities found in the text. Each PII entry has the following fields:

    • marker: The marker that replaces the entity in the de-identified text (‘result’ field)

    • text: The entity text

    • best_label: The entity label with the highest likelihood

    • stt_idx: Start character index of the entity, in the original text

    • end_idx: End character index of the entity, in the original text

    • labels: A dictionary of all possible labels, together with associated likelihoods. Note that these are not strictly probabilities and do not sum to 1, as a word can belong to multiple classes

    • fake_text: The fake entity that was generated to replace the original

    • fake_stt_idx: Start character index of the fake entity, in the pseudonymized/fake text

    • fake_end_idx: End character index of the fake entity, in the pseudonymized/fake text

  • api_calls_used: The number of API calls used to process a request

  • output_checks_passed: Reports whether the output validity checks passed or not. These checks test:

    • Whether replacing each entity marker with the corresponding entity text reproduces the input

    • Whether every entity marker is bounded by whitespace or punctuation
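
As an illustration of how these fields fit together, here is a small Python sketch that works on an already decoded response. The function names check_spans and restore_markers are hypothetical, and the sample data is copied from the “Unique PII markers” example below. It assumes, as that example suggests, that end_idx is exclusive and that markers appear in ‘result’ wrapped in square brackets.

```python
TEXT = "My name is John and my friend is Grace"

# Response copied from the "Unique PII markers" sample output.
SAMPLE = {
    "result": "My name is [NAME_1] and my friend is [NAME_2]",
    "pii": [
        {"marker": "NAME_1", "text": "John", "best_label": "NAME",
         "stt_idx": 11, "end_idx": 15, "labels": {"NAME": 0.923}},
        {"marker": "NAME_2", "text": "Grace", "best_label": "NAME",
         "stt_idx": 33, "end_idx": 38, "labels": {"NAME": 0.9135}},
    ],
    "api_calls_used": 1,
    "output_checks_passed": True,
}


def check_spans(original_text, response):
    """Return True if each entity's indices slice its text out of the input."""
    return all(
        original_text[e["stt_idx"]:e["end_idx"]] == e["text"]
        for e in response["pii"]
    )


def restore_markers(response):
    """Replace each [MARKER] in 'result' with the entity text.

    Assumes unique markers, so each marker maps to exactly one entity.
    """
    restored = response["result"]
    for e in response["pii"]:
        restored = restored.replace(f"[{e['marker']}]", e["text"], 1)
    return restored


print(check_spans(TEXT, SAMPLE))
print(restore_markers(SAMPLE) == TEXT)
```

The second function mirrors the first output check described above; it is a client-side approximation, not the service's own validation logic.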

Sample Commands

Below are some sample commands and corresponding outputs displaying the different options.

Unique PII markers (default)
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace", "key": "<customer key>"}'
{"result": "My name is [NAME_1] and my friend is [NAME_2]",
 "pii": [{"marker": "NAME_1",
          "text": "John",
          "best_label": "NAME",
          "stt_idx": 11,
          "end_idx": 15,
          "labels": {"NAME": 0.923}},
         {"marker": "NAME_2",
          "text": "Grace",
          "best_label": "NAME",
          "stt_idx": 33,
          "end_idx": 38,
          "labels": {"NAME": 0.9135}}
        ],
 "api_calls_used": 1,
 "output_checks_passed": true
}
Non-unique PII markers
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace", "unique_pii_markers": "False", "key": "<customer key>"}'
{"result": "My name is [NAME] and my friend is [NAME]",
 "pii": [{"marker": "NAME",
          "text": "John",
          "best_label": "NAME",
          "stt_idx": 11,
          "end_idx": 15,
          "labels": {"NAME": 0.923}},
         {"marker": "NAME",
          "text": "Grace",
          "best_label": "NAME",
          "stt_idx": 33,
          "end_idx": 38,
          "labels": {"NAME": 0.9135}}
        ],
 "api_calls_used": 1,
 "output_checks_passed": true
}
Enabled classes
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace and we live in Barcelona", "key": "<customer key>", "enabled_classes": ["AGE", "LOCATION"]}'
{"result": "My name is John and my friend is Grace and we live in [LOCATION_1]",
 "pii": [{"marker": "LOCATION_1",
          "text": "Barcelona",
          "best_label": "LOCATION",
          "stt_idx": 54,
          "end_idx": 63,
          "labels": {"LOCATION": 0.9211}}
        ],
 "api_calls_used": 1,
 "output_checks_passed": true
}
Private AI Demo Server
$ curl -X POST https://n1fan2hnhf.execute-api.us-east-1.amazonaws.com/May_19_2021/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace", "key": "<customer key>"}'
{"result": "My name is [NAME_1] and my friend is [NAME_2]",
 "pii": [{"marker": "NAME_1",
          "text": "John",
          "best_label": "NAME",
          "stt_idx": 11,
          "end_idx": 15,
          "labels": {"NAME": 0.923}},
         {"marker": "NAME_2",
          "text": "Grace",
          "best_label": "NAME",
          "stt_idx": 33,
          "end_idx": 38,
          "labels": {"NAME": 0.9135}}
        ],
 "api_calls_used": 1,
 "output_checks_passed": true
}
Fake entity generation
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace and we live in Barcelona", "key": "<customer key>", "fake_entity_accuracy_mode": "standard"}'
{"result": "My name is [NAME_1] and my friend is [NAME_2] and we live in [LOCATION_1]",
 "result_fake": "My name is Sarah and my friend is Sarah and we live in California",
 "pii": [{"marker": "NAME_1",
          "text": "John",
          "best_label": "NAME",
          "stt_idx": 11,
          "end_idx": 15,
          "labels": {"NAME":0.9061},
          "fake_text": ["Sarah"],
          "fake_stt_idx": 11,
          "fake_end_idx": 16},
         {"marker": "NAME_2",
          "text": "Grace",
          "best_label": "NAME",
          "stt_idx": 33,
          "end_idx": 38,
          "labels": {"NAME": 0.9032},
          "fake_text": ["Sarah"],
          "fake_stt_idx": 34,
          "fake_end_idx": 39},
         {"marker": "LOCATION_1",
          "text": "Barcelona",
          "best_label": "LOCATION",
          "stt_idx": 54,
          "end_idx": 63,
          "labels": {"LOCATION": 0.8985},
          "fake_text": ["California"],
          "fake_stt_idx": 55,
          "fake_end_idx": 65}
         ],
 "api_calls_used": 1,
 "output_checks_passed": true
}
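
The fake entity fields in the output above can be read in the same way as the original indices. The sketch below is an assumption-laden illustration: fake_pairs is a hypothetical helper, the data is an abbreviated copy of the sample output, and it treats fake_text as a one-element list because that is how it appears in the sample (a plain string is accepted too).

```python
# Abbreviated copy of the fake entity generation sample output.
SAMPLE_FAKE = {
    "result_fake": "My name is Sarah and my friend is Sarah and we live in California",
    "pii": [
        {"text": "John", "fake_text": ["Sarah"],
         "fake_stt_idx": 11, "fake_end_idx": 16},
        {"text": "Grace", "fake_text": ["Sarah"],
         "fake_stt_idx": 34, "fake_end_idx": 39},
        {"text": "Barcelona", "fake_text": ["California"],
         "fake_stt_idx": 55, "fake_end_idx": 65},
    ],
}


def fake_pairs(response):
    """Return (original, fake) pairs after checking the fake indices."""
    fake_doc = response["result_fake"]
    pairs = []
    for e in response["pii"]:
        fake = e["fake_text"]
        # The sample output wraps the fake entity in a list.
        if isinstance(fake, list):
            fake = fake[0]
        # The indices refer to the pseudonymized text, not the original.
        if fake_doc[e["fake_stt_idx"]:e["fake_end_idx"]] != fake:
            raise ValueError(f"fake indices do not match for {e['text']!r}")
        pairs.append((e["text"], fake))
    return pairs


print(fake_pairs(SAMPLE_FAKE))
```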

Supported Entity Types

The currently supported entities are listed below. Note that an entity can have multiple types. For example, “Mayor of Boston” is an OCCUPATION that mentions a LOCATION, so “Boston” is labeled as both OCCUPATION and LOCATION.
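
Because an entity can carry several labels, client code may want to look at the whole ‘labels’ dictionary rather than only ‘best_label’. The following Python sketch shows one way to do that. The helper labels_above and the 0.5 threshold are hypothetical choices, and the likelihood values in the example entity are made up for illustration, not real output.

```python
def labels_above(entity, threshold=0.5):
    """Return labels whose likelihood meets the threshold, highest first."""
    scored = sorted(entity["labels"].items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, score in scored if score >= threshold]


# Made-up entity for illustration only; the likelihoods are not real output.
boston = {
    "text": "Boston",
    "best_label": "LOCATION",
    "labels": {"LOCATION": 0.9, "OCCUPATION": 0.6},
}

print(labels_above(boston))
```

Since the likelihoods are not probabilities and do not sum to 1, any threshold is a tuning choice for your data rather than a calibrated cutoff.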

General

Label                   Description
AGE                     Number associated with an age, e.g. 27
CREDIT_CARD             Credit card number, e.g. 0123 0123 0123 0123
CREDIT_CARD_EXPIRATION  E.g. Expires: 2/28
CVV                     Credit card verification code, e.g. CVV: 080
DATE                    E.g. December 18 or 2011-2014
DOB                     Date of birth, e.g. Date of Birth: March 7, 1961
EMAIL_ADDRESS           E.g. info@private-ai.ca
EVENT                   E.g. Olympics
FILENAME                Name of a computer file, e.g. brad_tax_returns.txt, koalabear.jpg
IP_ADDRESS              Internet IP address, e.g. 192.168.0.1
LANGUAGE                E.g. English, French
LOCATION                E.g. Eritrea, Italy
MONEY                   E.g. 15 dollars, $94.50
NAME                    Person name, e.g. Harry Potter, Dwayne Johnson
NUMERICAL_PII           Numeric PII that doesn’t fall into other categories or that the model is uncertain about
ORGANIZATION            E.g. BHP, McDonalds
OCCUPATION              E.g. professor, actors, engineer, MBA, CPA
ORIGIN                  Nationalities, ethnicities, and races, e.g. Canadian, American, Caucasian
PHONE_NUMBER            E.g. +4917643476050
RELIGION                E.g. Hindu
SSN                     Social Security Number, e.g. 078-05-1120
TIME                    E.g. 19:37:28
URL                     Internet URL, e.g. www.private-ai.ca
USERNAME                User name or handle, e.g. privateairocks, @_PrivateAI
ZODIAC_SIGN             E.g. Aquarius

Protected Health Information (PHI)

Label                   Description
BLOOD_TYPE              Blood type, e.g., O-
CONDITION               A medical condition. Includes diseases, syndromes, deficits, disorders. E.g., chronic fatigue syndrome, arrhythmia, depression.
DRUG                    Medical drug, including vitamins and minerals. E.g., Advil, Acetaminophen, Panadol
INJURY                  Human injury, e.g., I broke my arm, I have a sprained wrist. Includes mutations, miscarriages and dislocations.
MEDICAL_PROCESS         Medical process, including treatments, procedures and tests. E.g., ‘heart surgery’, ‘CT scan’.

Coming Soon

Label                   Description
AWARD                   E.g. Nobel Prize
HEALTHCARE_NUMBER       Healthcare number, e.g. 5584-486-674-YM
ID_NUMBER               Passport number or driver’s license number, e.g. D6101-40706-60905
PASSWORD                E.g. secret_password
PHYSICAL_ATTRIBUTE      A body attribute, e.g. I’m 190cm tall.
POLITICAL_AFFILIATION   E.g. Democrat, Republican

Performance Tips

Private AI’s solution uses AI to detect PII based on context. Therefore, for best performance it is advisable to send text through in the largest possible chunks. For example, the following chat log should be sent through in one call, as opposed to line-by-line:

  “Hi John, how are you?”
  “I’m good thanks”
  “Great, hope Atlanta is treating you well”

Similarly, text documents should be sent through in a single request, rather than by paragraph or sentence. In addition to improving accuracy, this will minimize the number of API calls made.
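
One way to follow this advice while still knowing which line each entity came from is to join the lines before sending and map the returned character indices back afterwards. The Python sketch below illustrates the idea. The helpers join_lines and locate are hypothetical, and joining with a newline is an assumption, since the separator to use is not specified here.

```python
from bisect import bisect_right


def join_lines(lines, sep="\n"):
    """Join lines into one text and record each line's start offset."""
    offsets = []
    pos = 0
    for line in lines:
        offsets.append(pos)
        pos += len(line) + len(sep)
    return sep.join(lines), offsets


def locate(offsets, char_idx):
    """Map a character index in the joined text to (line_number, column)."""
    line_no = bisect_right(offsets, char_idx) - 1
    return line_no, char_idx - offsets[line_no]


chat = [
    "Hi John, how are you?",
    "I'm good thanks",
    "Great, hope Atlanta is treating you well",
]
text, offsets = join_lines(chat)
# 'text' would be sent as a single request; a returned stt_idx can then be
# passed to locate() to find the originating line.
print(locate(offsets, text.index("Atlanta")))
```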

The AI model has also been optimized for normal English capitalization, e.g. “Robert is from Sydney, Australia. Muhab is from Wales”. If your data does not follow this capitalization, our solution will still work, but some performance will be lost. In that case, please contact Private AI so that we can provide you with the optimal model for your use case.