pgEdge Anonymizer Tutorial

Anonymizer lets you create an experimental data set that preserves the shape and integrity of a Postgres database in just three steps:

1. Create a configuration file that specifies the replacement patterns for your columns.
2. Build and run pgedge-anonymizer to anonymize your columns.
3. Review the results.

Before running pgedge-anonymizer, create a configuration file named pgedge-anonymizer.yaml. The file should contain:

a database section, with connection details for your database.
a columns section, listing the fully-qualified columns that you wish to anonymize (in schema_name.table_name.column_name format).
a pattern property for each column, specifying the form that the replacement content will take.

For example:

database:
  host: localhost
  port: 5432
  database: myapp
  user: anonymizer

columns:
  - column: public.users.email
    pattern: EMAIL

  - column: public.users.phone
    pattern: US_PHONE

  - column: public.users.ssn
    pattern: US_SSN
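
Because each entry in the columns section must use the schema_name.table_name.column_name format, it can help to check the names before running the tool. The following is a minimal sketch of such a check, not part of pgedge-anonymizer: the function name unqualified_columns is hypothetical, the configuration is written inline as Python data rather than read from the YAML file, and the rule assumes simple unquoted identifiers.

```python
import re

# Assumed rule (not taken from the tool): exactly three dot-separated parts,
# each a simple unquoted identifier such as "public", "users", or "email".
_QUALIFIED = re.compile(
    r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){2}$"
)


def unqualified_columns(columns):
    """Return the 'column' values that are not in schema.table.column form."""
    bad = []
    for entry in columns:
        name = entry.get("column", "")
        if not _QUALIFIED.match(name):
            bad.append(name)
    return bad


# Inline stand-in for the columns section of pgedge-anonymizer.yaml.
columns = [
    {"column": "public.users.email", "pattern": "EMAIL"},
    {"column": "public.users.phone", "pattern": "US_PHONE"},
    {"column": "users.ssn", "pattern": "US_SSN"},  # schema omitted on purpose
]

print(unqualified_columns(columns))
```

Here the third entry is reported because it lacks a schema name; the entries from the example configuration above would all pass.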

After creating a configuration file, run the anonymizer:

pgedge-anonymizer run

As pgedge-anonymizer runs, it reports its progress for each column and then displays summary statistics. Review the output:

Processing public.users.email (est. 50000 rows)...
  10000 rows processed
  20000 rows processed
  30000 rows processed
  40000 rows processed
  50000 rows processed
  Completed: 50000 rows, 48234 values anonymized

=== Anonymization Statistics ===
Total columns processed: 1
Total rows processed:    50000
Total values anonymized: 48234
Total duration:          2.34s
Throughput:              21367 rows/sec
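
When reviewing the statistics, the summary figures can be cross-checked with simple arithmetic. The sketch below assumes that throughput is total rows divided by total duration, truncated to a whole number. That assumption is inferred from the sample output above and is not a documented formula. The function name summarize is hypothetical.

```python
def summarize(total_rows, values_anonymized, duration_seconds):
    """Return (rows per second, share of rows whose value was anonymized)."""
    # Assumption: the tool truncates rather than rounds the throughput.
    rows_per_sec = int(total_rows / duration_seconds)
    share = values_anonymized / total_rows
    return rows_per_sec, share


# Figures copied from the sample output above.
rows_per_sec, share = summarize(50000, 48234, 2.34)

print(rows_per_sec)    # 21367
print(f"{share:.1%}")  # 96.5%
```

In the sample run, 1766 of the 50000 rows were processed without being anonymized. The output does not say why, so it is worth inspecting those rows when you review the results.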