Researchers are increasingly asked to share the data they generate in the course of their research. However, some of this data contains information about study participants, and sharing it unaltered would breach the confidentiality of those participants. Removing direct and indirect personal identifiers from research data can substantially reduce the risk of sharing this sensitive data.

Johns Hopkins Data Services has compiled a list of de-identification software tools and applications that can be used in de-identifying research data for public sharing. The information on this page is provided for informational purposes only and does not constitute an endorsement of any particular tool for data de-identification. Investigators and researchers should ensure that they follow data governance policies and procedures that apply to their data. Refer to your IRB and the Johns Hopkins Privacy Office for further information. Medical and health research subject to oversight by the JHM Data Trust Council has additional requirements regarding disclosure protection and de-identification. Please refer to the Data Trust FAQ for details.

See also our overview document on protecting and removing personal identifiers of research subjects for data sharing: De-Identifying Human Subjects Data (JHU version) (version for non-JHU visitors).

Johns Hopkins researchers are encouraged to talk with Johns Hopkins Data Services for advice and guidance on de-identification of human subjects research data. Please contact us at dataservices@jhu.edu.

Tools for De-identifying Unstructured Text

  • NLM-Scrubber
    • Software description: “A freely available, HIPAA compliant, clinical text de-identification tool designed and developed at the National Library of Medicine.” Works on records converted to ASCII text and runs from a command-line/terminal interface in Linux or Windows.
    • Intended purpose: Uses natural language processing to automatically redact direct identifiers typically found in medical records, including addresses below state level, names, dates, and alphanumeric identifiers such as patient account numbers. It attempts to follow HIPAA rules on the level of specificity to retain or remove (for example, removing the city but retaining the state).
  • deid software package
    • Software description: “includes code and dictionaries for automated location and removal of protected health information (PHI) in free text from medical records”
    • Intended purpose: For free text in medical records
  • ATLAS.ti
    • Software description: “a suite of tools that supports analysis of written texts, audio clips, video files, and visual/graphic data” (from Wikipedia)
    • Intended purpose: Does not automatically de-identify, but users can create coding terms to manually mark identifiers in text or in audio/video segments, and then easily search for those coded clips to redact or alter them.
  • NVivo
    • Software description: “a qualitative data analysis computer software package designed for qualitative researchers working with very rich text-based and/or multimedia information” (from Wikipedia)
    • Intended purpose: Does not automatically de-identify, but users can create coding terms to manually mark identifiers in text or in audio/video segments, and then easily search for those coded clips to redact or alter them.
  • PARAT Text (Privacy Analytics Lexicon)
    • Software description: “Using PARAT Text for anonymization enables organizations to…Extend the practice of anonymization to unstructured formats residing in electronic health and other data formats”
    • Intended purpose: Unstructured medical records

Skill needed: * For those technically proficient enough not to be frightened off by spending a couple of hours learning a new application; ** For users with coding experience
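
To make the idea of automated redaction concrete, here is a minimal Python sketch of pattern-based scrubbing of free text. It is not how NLM-Scrubber, deid, or any other tool listed above works internally; those tools combine dictionaries and language processing with many more rules. The patterns, the placeholder labels, and the `redact` function name are all assumptions chosen for illustration.

```python
import re

# Illustrative patterns only (assumptions, not taken from any listed tool).
# Each entry pairs a placeholder label with a regular expression.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("MRN", re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE)),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


note = (
    "Patient Jane Roe seen on 03/14/2015, MRN: 00123456. "
    "Call 410-555-0100 or jdoe@example.org."
)
scrubbed = redact(note)
print(scrubbed)
```

Note that the name in the sample note is left untouched: simple patterns cannot recognize names or free-form addresses, which is one reason dedicated tools and a manual review step are still needed.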

Tools for De-identifying Data in Digital Images

  • DicomCleaner
    • Software description: “DicomCleaner™ is a free open source tool with a user interface for importing, ‘cleaning’ and saving sets of DICOM instances (files)”
    • Intended purpose: Medical Images in DICOM (Digital Imaging and Communications in Medicine) format

Skill needed: * For those technically proficient enough not to be frightened off by spending a couple of hours learning a new application

Tools for De-identifying Microdata, Tabular or Otherwise Structured Data

  • Cornell Anonymization Toolkit (CAT)
    • Software description: “designed for interactively anonymizing published dataset to limit identification disclosure of records under various attacker models”
    • Intended purpose: Medical records – tabular data
  • OpenRefine
    • Software description: “a powerful tool for working with messy data: cleaning it; transforming it from one format into another; extending it with web services; and linking it to databases like Freebase”
    • Intended purpose: Working with messy data
  • PARAT Core (Privacy Analytics Eclipse)
    • Software description: “PARAT software masks and de-identifies personal information using a risk-based approach that optimizes the analytic utility of de-identified data sets”
    • Intended purpose: Working with structured medical records
  • mu-Argus 5.1
    • Software description: “μ-ARGUS is a software program designed to create safe micro-data files. Initially developed as a closed-source project but was converted to open source”
    • Intended purpose: Statistical Disclosure Control for microdata
  • tau-Argus 4.1
    • Software description: “τ-ARGUS is a software program designed to protect statistical tables. Initially developed as a closed-source project but was converted to open source”
    • Intended purpose: Statistical Disclosure Control for tabular data
  • The sdcMicro package in R
    • Software description: “This package can be used for the generation of anonymized (micro)data, i.e. for the creation of public- and scientific-use files. In addition, various risk estimation methods are included”
    • Intended purpose: Statistical Disclosure Control for microdata
  • The sdcTable package in R
    • Software description: “Methods for statistical disclosure control in tabular data such as primary and secondary cell suppression are covered in this package”
    • Intended purpose: Statistical Disclosure Control for tabular data
  • The University of Texas at Dallas Anonymization Toolbox
    • Software description: a researcher-compiled implementation (from UT Dallas Data Security and Privacy Lab) of various anonymization methods into a toolbox for public use by researchers
    • Intended purpose: Unstructured text files
  • ARX Data Anonymization Tool
    • Software description: “A comprehensive software for risk- and utility-based privacy-preserving microdata publishing” developed at Technical University of Munich, Germany.
    • Intended purpose: “an open source tool for transforming structured (i.e. tabular) sensitive personal data using selected methods from the broad area of statistical disclosure control.”

Skill needed: * For those technically proficient enough not to be frightened off by spending a couple of hours learning a new application; ** For users with coding experience
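
Several of the tools above assess re-identification risk in structured data. One widely used measure in this area is k-anonymity: every combination of quasi-identifier values (indirect identifiers such as age or ZIP code) must be shared by at least k records. The Python sketch below illustrates the idea under stated assumptions; the field names, the sample records, and the helper functions are hypothetical and do not reflect the interface of any tool listed here.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same quasi-identifier values (0 if there are no records)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def generalize_age(record, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


def generalize_zip(record, keep=3):
    """Keep the first `keep` characters of the ZIP code and mask the rest."""
    zip_code = record["zip"]
    return {**record, "zip": zip_code[:keep] + "*" * (len(zip_code) - keep)}


# Hypothetical sample data, invented for this illustration.
records = [
    {"age": 34, "zip": "21218", "diagnosis": "asthma"},
    {"age": 37, "zip": "21210", "diagnosis": "diabetes"},
    {"age": 52, "zip": "21287", "diagnosis": "asthma"},
    {"age": 58, "zip": "21205", "diagnosis": "hypertension"},
]
quasi_identifiers = ["age", "zip"]

k_before = k_anonymity(records, quasi_identifiers)
generalized = [generalize_zip(generalize_age(r)) for r in records]
k_after = k_anonymity(generalized, quasi_identifiers)
print(k_before, k_after)
```

In this sample every record is unique on age and ZIP code before generalization, while afterwards each record shares its quasi-identifier values with one other record. Generalization trades analytic detail for lower risk, and a single number like k does not capture every disclosure risk, so the dedicated tools above offer additional measures and methods.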

(This page was last updated in September 2016.)