After reading my last blogpost on Anonimatron, you must have asked yourself: “Great, but how do I actually use Anonimatron to de-personalize my database?” I tried my best to make basic Anonimatron configuration as self-explanatory as possible: just start it without any command line arguments and it will tell you.
Less adventurous or in a big hurry? This blogpost will show how simple it is to install and configure Anonimatron on an example MySQL database.
Setting up a test database
To demonstrate what Anonimatron can do to your data, we will create a little test database to play with. Anonimatron connects to all kinds of databases, including MySQL, PostgreSQL and Oracle. In this example, we use MySQL. Here are all the statements you need to create a little database with 2 tables, a user, and some “private” data:
    create database mydb;
    create user myuser identified by 'mypassword';
    grant all privileges on *.* to 'myuser'@'localhost' identified by 'mypassword' with grant option;

    create table mydb.userdata (
      id int not null auto_increment primary key,
      firstname varchar(20),
      lastname varchar(20),
      creditcardnr varchar(20)
    );

    create table mydb.lastnames (
      id int not null auto_increment primary key,
      lastname varchar(20)
    );

    insert into mydb.userdata (firstname, lastname, creditcardnr) values
      ('Homer', 'Simpson', '1234'),
      ('Marge', 'Simpson', '5678'),
      ('Ned', 'Flanders', '3456'),
      ('Charles', 'Burns', '3456');

    insert into mydb.lastnames (lastname) values
      ('Simpson'),
      ('Flanders'),
      ('Burns');
After all the hard work, you should be able to connect to the database with ‘myuser’ and see the “private” data in there:
select * from mydb.userdata;
id | firstname | lastname | creditcardnr |
---|---|---|---|
1 | Homer | Simpson | 1234 |
2 | Marge | Simpson | 5678 |
3 | Ned | Flanders | 3456 |
4 | Charles | Burns | 3456 |
select * from mydb.lastnames;
id | lastname |
---|---|
1 | Simpson |
2 | Flanders |
3 | Burns |
Let’s pretend that this is a copy of a production database, and you want to de-personalize or anonymize it.
Installing Anonimatron
To anonymize your data, download Anonimatron and unzip it in a directory of your choice. You should find an “anonimatron.sh” and an “anonimatron.bat” file there. Run the one that matches your system, without arguments. If you have Java installed on your system you should see something like this:
    $ ./anonimatron.sh
    This is Anonimatron 1.7, a command line tool to consistently replace live data in your database with data data which can not be traced back to the original data.
    You can use this tool to transform a dump from a production database into a large representative dataset you can share with your development and test team.
    Use the -configexample command line option to get an idea of what your configuration file needs to look like.

    usage: java -jar anonimatron.jar
     -config          The XML Configuration file describing what to anonymize.
     -configexample   Prints out a demo/template configuration file.
     -dryrun          Do not make changes to the database.
     -synonyms        The XML file to read/write synonyms to. If the file does not exist it will be created.
Victory! You’ve installed Anonimatron. Yes, life can really be that easy.
Configuring Anonimatron
Next, we need to tell Anonimatron how to connect to our database, which tables and columns to process, and how to process them. The hardest part of this is probably figuring out how to create a JDBC connect string. Anonimatron can help you with that. If you start Anonimatron with the “-configexample” parameter, it will scan the jdbcdrivers directory for available and supported drivers, and will show you what a JDBC URL for each of these should look like:
    $ ./anonimatron.sh -configexample
    Supported Database URL formats:
    Jdbc URL format                           By Driver
    jdbc:oracle:oci8:@[SID]                   oracle.jdbc.driver.OracleDriver
    jdbc:oracle:thin:@[HOST]:[PORT]:[SID]     oracle.jdbc.driver.OracleDriver
    jdbc:oracle:oci:@[SID]                    oracle.jdbc.driver.OracleDriver
    jdbc:postgresql://[HOST]:[PORT]/[DB]      org.postgresql.Driver
    jdbc:mysql://[HOST]:[PORT]/[DB]           org.gjt.mm.mysql.Driver

    Anonimatron will try to autodetect drivers which are stored in the lib directory. Add you driver there.
    ...
In this example, we have just created a MySQL database, so we’ll use that URL and fill in the blanks. We use the rest of the configuration example and with some copy-pasting we come up with the following configuration:
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration jdbcurl="jdbc:mysql://localhost:3306/mydb"
                   userid="mydata" password="mypassword">
      <table name="userdata">
        <column name="firstname" type="ROMAN_NAME" />
        <column name="lastname" type="ELVEN_NAME" />
        <column name="creditcardnr" type="RANDOMDIGITS"/>
      </table>
      <table name="lastnames">
        <column name="lastname" type="ELVEN_NAME" />
      </table>
    </configuration>
This simple configuration file will tell Anonimatron the following things:
- How to connect to the mydb database
- The values in userdata.firstname should be processed with the ROMAN_NAME Anonymizer. Anonymizers are little plugins which are able to generate data with certain properties, sometimes based on the original data. This particular Anonymizer generates Roman names by picking syllables from a builtin list.
- The values in userdata.lastname should be replaced by Elven names. This is almost identical to Roman names but with a different syllable file.
- The userdata.creditcardnr should be replaced by a set of random digits of the same length. In our case, 4 digits will be replaced by 4 different digits. Should you really need numbers which are semantically correct credit card numbers, you could write your own Anonymizer plugin. We’ll cover that in a later blogpost.
- The lastnames.lastname column is also an Elven name. Because of the way Anonimatron handles data, strings in this column will be processed exactly the same way as the userdata.lastname column, as we will see below.
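To make the idea of consistent replacement concrete, here is a minimal sketch in Python. It is not Anonimatron's actual code (Anonimatron is a Java tool); the class `SynonymCache` and its methods are hypothetical names made up for this illustration. The assumption is simply that a synonym is remembered per combination of Anonymizer type and original value, so the same input always yields the same output, no matter which table or column it comes from.

```python
import random


class SynonymCache:
    """Hypothetical stand-in for synonym bookkeeping, for illustration only."""

    def __init__(self, seed=None):
        # Maps (anonymizer type, original value) -> generated synonym.
        self._synonyms = {}
        self._rng = random.Random(seed)

    def random_digits(self, original):
        # Produce a string of random digits with the same length as the input.
        return "".join(self._rng.choice("0123456789") for _ in original)

    def anonymize(self, type_name, original, generator):
        # Generate a synonym only the first time a (type, value) pair is seen,
        # and reuse the stored synonym on every later occurrence.
        key = (type_name, original)
        if key not in self._synonyms:
            self._synonyms[key] = generator(original)
        return self._synonyms[key]


cache = SynonymCache(seed=42)
first = cache.anonymize("RANDOMDIGITS", "3456", cache.random_digits)
second = cache.anonymize("RANDOMDIGITS", "3456", cache.random_digits)
print(first == second, len(first))  # prints: True 4
```

The same lookup would apply to the two lastname columns: both use the ELVEN_NAME type, so an identical original string in either column resolves to the same stored synonym.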
Anonymize!
Now that we have configured Anonimatron, it’s time to start it up and tell it to use our configuration file and store synonyms. It will be finished in the blink of an eye, and your output should look somewhat like this:
    $ ./anonimatron.sh -config config.xml -synonyms synonyms.xml
    Anonymization process started
    Jdbc url : jdbc:mysql://localhost:3306/mydb
    Database user : mydata
    To do : 2 tables.
    Anonymizing table 'lastnames', total progress [100%, ETA 11:36:56 PM]
    Anonymization process completed.
    Writing Synonyms to synonyms.xml ...[done].
If Anonimatron complains or does not run, you might want to check out the anonimatron.log file for clues. Most log entries will be pretty self-explanatory. If not, please register an issue and we’ll see what we can do to fix that.
Let’s check the results. First, we can check what synonyms were generated by looking into the synonyms.xml file we told it to create:
$ cat synonyms.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <synonyms>
      <string type="ELVEN_NAME" from="QnVybnM=" to="RGhvZWxsaWFu"/>
      <string type="ELVEN_NAME" from="RmxhbmRlcnM=" to="QWhlbGhhbGRldGhlc3M="/>
      <string type="ELVEN_NAME" from="U2ltcHNvbg==" to="QWhkdWxlbGhhbGVs"/>
      <string type="ROMAN_NAME" from="SG9tZXI=" to="QmVudWxhdWJlbGl1cw=="/>
      <string type="ROMAN_NAME" from="TmVk" to="RWN1cw=="/>
      <string type="ROMAN_NAME" from="TWFyZ2U=" to="QWxudWxhdWN1cw=="/>
      <string type="ROMAN_NAME" from="Q2hhcmxlcw==" to="QWxudXM="/>
      <string type="RANDOMDIGITS" from="NTY3OA==" to="ODY5OA=="/>
      <string type="RANDOMDIGITS" from="MTIzNA==" to="NDM0Mw=="/>
      <string type="RANDOMDIGITS" from="MzQ1Ng==" to="NjEyNQ=="/>
    </synonyms>
You’ll note that the “from” and “to” values look a bit garbled. This is because Anonimatron uses Base64 encoding to store the values of synonyms, so that we can store the values bit for bit without worrying about encodings. If you wanted to, you could easily decode these strings by writing a little program.
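Such a little decoding program could look like the following Python sketch. The function name `decode_synonym_value` is made up for this example, and it assumes the stored bytes are UTF-8 text, which holds for the plain ASCII names and digits used in this post but is an assumption for other data.

```python
import base64


def decode_synonym_value(encoded):
    # Base64 decoding yields raw bytes; interpreting them as UTF-8 text is an
    # assumption that fits the example data in this post.
    return base64.b64decode(encoded).decode("utf-8")


# The third ELVEN_NAME entry from the synonyms.xml shown above.
original = decode_synonym_value("U2ltcHNvbg==")
synonym = decode_synonym_value("QWhkdWxlbGhhbGVs")
print(original, "->", synonym)  # prints: Simpson -> Ahdulelhalel
```

This also shows why the synonym file deserves care: anyone who has it can recover the original values just as easily.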
Even without decoding we can see some interesting things about this file. Remember we had lastnames configured as Elven names? Although we have 4 entries in the userdata table, we see only 3 Elven names. That is because Homer and Marge have the same lastname. These same synonyms are also used by the lastnames.lastname column. The same goes for the credit card numbers: as you might have noticed, Ned Flanders and Charles Burns use the same credit card number in this system.
Enough staring at XML. Let’s get to what matters most: our database. First, let’s see if the names and numbers have changed:
select * from mydb.userdata;
id | firstname | lastname | creditcardnr |
---|---|---|---|
1 | Benulaubelius | Ahdulelhalel | 4343 |
2 | Alnulaucus | Ahdulelhalel | 8698 |
3 | Ecus | Ahelhaldethess | 6125 |
4 | Alnus | Dhoellian | 6125 |
That looks much better. The first and lastnames are (almost) pronounceable names which probably would look realistic in a screenshot or testcase, yet there is no trace left of the original data that was there. When we check the lastnames table we see that lastnames are being translated consistently with the userdata table:
select * from mydb.lastnames;
id | lastname |
---|---|
1 | Ahdulelhalel |
2 | Ahelhaldethess |
3 | Dhoellian |
This consistent behavior makes sure that queries where the userdata table and the lastnames table are joined based on lastname will still work.
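As a small illustration of that claim, the Python sketch below joins the two tables on lastname before and after replacement and compares the number of matching rows. It uses an in-memory SQLite database instead of MySQL purely to keep the example self-contained, and the helper `count_joined_rows` is a hypothetical name. The synonyms are the ones from the run shown above.

```python
import sqlite3

# Lastname synonyms as produced in the example run above.
SYNONYMS = {
    "Simpson": "Ahdulelhalel",
    "Flanders": "Ahelhaldethess",
    "Burns": "Dhoellian",
}


def count_joined_rows(user_lastnames, lastnames):
    # Build throwaway copies of both tables and count the rows that still
    # match when joined on the lastname column.
    conn = sqlite3.connect(":memory:")
    conn.execute("create table userdata (id integer primary key, lastname text)")
    conn.execute("create table lastnames (id integer primary key, lastname text)")
    conn.executemany(
        "insert into userdata (lastname) values (?)", [(n,) for n in user_lastnames]
    )
    conn.executemany(
        "insert into lastnames (lastname) values (?)", [(n,) for n in lastnames]
    )
    (count,) = conn.execute(
        "select count(*) from userdata u join lastnames l on u.lastname = l.lastname"
    ).fetchone()
    conn.close()
    return count


original_users = ["Simpson", "Simpson", "Flanders", "Burns"]
original_lastnames = ["Simpson", "Flanders", "Burns"]

before = count_joined_rows(original_users, original_lastnames)
after = count_joined_rows(
    [SYNONYMS[n] for n in original_users],
    [SYNONYMS[n] for n in original_lastnames],
)
print(before, after)  # prints: 4 4
```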
If you want to play some more with Anonimatron, recreate the original tables, add additional (overlapping) data and re-run Anonimatron against it with the synonym file you just created. You’ll notice that any “Simpson” lastname will be translated to “Ahdulelhalel” consistently on each run. If you don’t want that to happen, simply throw away the synonym file or don’t tell Anonimatron to use it. You can also generate the synonym file first without doing anything to the database by using the -dryrun option, and later do the same run based on the synonyms generated earlier.
Remember: The private data “moved” from the database into the synonyms.xml file we created. So that becomes the new data to watch. Store it in a safe place where nobody can access it.
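One simple precaution, sketched here in Python, is to make the synonym file readable and writable by its owner only. This assumes a POSIX file system (on Windows, `os.chmod` only toggles the read-only flag), and the helper name `restrict_to_owner` is made up for this example. It is no substitute for storing the file somewhere properly secured.

```python
import os
import stat


def restrict_to_owner(path):
    # Owner may read and write; group and others get no access (mode 0600).
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


if os.path.exists("synonyms.xml"):
    restrict_to_owner("synonyms.xml")
```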
Have fun experimenting!
Looked promising, but when I download the latest stable release, “anonimatron.sh” is not in the top-level folder of the zip file; it has been moved under resources/scripts/. The script throws: Exception in thread “main” java.lang.NoClassDefFoundError: com/rolfje/anonimatron/Anonimatron; so it appears that it will not just run out of the box right now (running OSX, java 1.6.0_65). (Tried filing a bug, but SourceForge apparently dropped my account.)
Never mind — my bad — downloaded the wrong package!
No problem, thanks for giving Anonimatron a try. Let me know what you think.
Any clues what’s happening here, please? Just downloaded the latest Java from their website and installed it, unzipped the download into C:\Program Files\Anonimatron\anonimatron-1.7 and tried to run the batch file. Win XP SP3.
Thanks….
C:\Program Files\Anonimatron\anonimatron-1.7>anonimatron.bat
Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
C:\Program Files\Anonimatron\anonimatron-1.7>
I’ve done a little searching (I’m new to Java) and it’s a memory issue. The most I can get going without an error is :
java -Xmx1600m …
Is this enough to run your program successfully? Thanks
Hi, when not connecting to a database, Anonimatron uses very little memory. The default memory settings should be sufficient. Only when anonymizing very large databases do you need to increase the heap memory (-Xmx), because Anonimatron keeps the synonyms in memory.
In release 1.7, I added an -Xmx2G option to the batch file. Maybe Windows tries to allocate 2 gigabytes (even when it doesn’t need them) and your machine doesn’t have that much left. You can set it much lower; it only needs that memory for converting large databases.
Can you try to use: “-Xms100m -Xmx2G” to see if you can force it to start at 100m and only use 2G when needed?
Would be much appreciated.
That didn’t work, but “-Xms100m -Xmx1600m” did.
Sorry for delay replying, only doing this at the office, not at home over the weekend 🙂
Thanks for trying. If you run into anything or if you are missing a feature you’re welcome to register those at: http://sourceforge.net/projects/anonimatron/support
OK, I’ve got it to work and I replaced a whole column of names with ROMAN_NAMEs – thanks.
However, maybe I’m being thick but I can’t find a reference list of all the “anonymizers” (i.e. ROMAN_NAME, ELVEN_NAME etc) anywhere online or in any readme.txt files – does such a thing exist anywhere?
Thanks
Listing all the registered Anonymizers is actually very easy: just run “anonimatron.bat -configexample”. Anonimatron will give you an example configuration, and the configuration will contain all Anonymizers which are currently registered.
In the configuration, you will find a “CUSTOM_TYPES_TABLE” configuration which contains all “custom”, or specific, Anonymizers. The “DEFAULT_TYPES_TABLE” will show you the default types Anonimatron uses when you don’t specify which Anonymizer to use. The “DISCRIMINATOR_TABLE” shows you how to make Anonimatron switch between Anonymizers based on the contents of a column.
I realize that you were looking for this information. Should I add a clarification in the configexample, or should that be in the README, or both?
Both, I’d say. Would it be easier to put as much as possible in the README and have the -configexample simply spew out the contents of that txt file, rather than duplicate the effort?
The main README could also usefully mention that there is another one in the ‘anonymizers’ folder.
Many Thanks, it’s been interesting. I might roll my own solution using the MS AdventureWorks data, but it’s handy to know this is available. Good Luck with your endeavors.
Working great on a DB I’m trying… except for a date field in the format YYYY-MM-DD. I can’t seem to get this converted using any of the built-in types, including DATE, always erroring with “java.sql.Date cannot be cast to java.lang.String” or similar.
Any suggestions for this java and mysql newbie?
Thanks for a super handy tool!
Hi Phillip, great to hear you like Anonimatron. I’d like to help you out with the Date problem. This comment thread is somewhat limited, so I’ve created a forum on the Anonimatron SourceForge page as a tryout. Please feel free to create a topic on https://sourceforge.net/p/anonimatron/supportforum/userquestions/
[root@localhost scripts]# sh anonimatron.sh
Exception in thread “main” java.lang.NoClassDefFoundError: com/rolfje/anonimatron/Anonimatron
please help me out in resolving this issue.
Hello Akshay, unzipping the downloaded zip file and starting it like you did should work. To help you further, I need a bit more information. Please start a question in the Anonimatron forum at https://sourceforge.net/p/anonimatron/supportforum/ and explain exactly what you did so I can help you.
Works perfectly, like a charm. However, I am wondering how to easily integrate additional anonymizers, e.g. for street name and number information, or for just parts of a field (like in Excel, where you could split fields or extract the first or last string…).
The idea is to keep the entire content as “original” and intact as possible for testing purposes….
Anyway: perfect tool! Thanks for your effort.
Hi, glad to see you like it. I’ve moved the sources to GitHub here: https://github.com/realrolfje/anonimatron so you may want to check that out. I need to update the SourceForge page to reflect the new location.
Depending on your exact needs, adding an anonymizer should be relatively simple. Have a look at the email anonymizer here for example: https://github.com/realrolfje/anonimatron/blob/master/src/main/java/com/rolfje/anonimatron/anonymizer/EmailAddressAnonymizer.java
You can add your own anonymizers to the classpath and then reference them in your configuration; Anonimatron should pick them up automatically.
If you need more help or information, feel free to add a feature request here: https://github.com/realrolfje/anonimatron/issues