DirtDB: A plugin-based network daemon to dig up Internet dirt
I’m currently working on DirtDB, an important component of the Cloud Mirror for installation at the 2010 Sundance Film Festival. DirtDB is a network daemon that answers requests via XML-RPC to “dig up dirt” for people on the Internet.
It’s been up and working for two hours, and in that time I’ve developed a Twitter plugin and an IMDb module. Next up is the National Sex Offender Registry and Flickr, with plenty more to come in the weeks ahead.
DirtDB expects a small pool of information about a victim to be available in a MongoDB database. It then invokes a series of plugin modules, each of which uses that corpus to mine the Internet (or local databases) for interesting information about the person. It augments the person’s MongoDB document with interesting finds. Because DirtDB is designed with the Cloud Mirror in mind, the plugins are biased to produce small snippets of text (for presentation in the victim’s thought bubble) and images (for augmentation of the victim’s badge).
But DirtDB is a general framework for human data-mining operations. It follows a blackboard pattern I’ve been fixated upon recently.
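To make the blackboard idea concrete, here is a minimal sketch of the plugin loop, not DirtDB’s actual code. A plain dict stands in for the person’s MongoDB document, and the plugin names, field names, and URL are all hypothetical placeholders.

```python
from typing import Callable, Dict, List

# Assumption: a plain dict stands in for the person's MongoDB document.
Document = Dict[str, object]
# Assumption: a plugin reads the document and returns lists of finds by field.
Plugin = Callable[[Document], Dict[str, List[str]]]


def snippet_plugin(doc: Document) -> Dict[str, List[str]]:
    """Hypothetical plugin: produce a short text snippet from the name."""
    name = doc.get("name")
    if not name:
        return {}
    return {"snippets": [f"Placeholder snippet about {name}"]}


def image_plugin(doc: Document) -> Dict[str, List[str]]:
    """Hypothetical plugin: only acts once a username is on the blackboard."""
    username = doc.get("username")
    if not username:
        return {}
    return {"images": [f"http://example.com/{username}.jpg"]}


def run_plugins(doc: Document, plugins: List[Plugin]) -> Document:
    """Let each plugin read the shared document and append its finds to it."""
    for plugin in plugins:
        finds = plugin(doc)
        for field, values in finds.items():
            doc.setdefault(field, []).extend(values)
    return doc


person = {"name": "Jane Doe", "username": "jdoe"}
run_plugins(person, [snippet_plugin, image_plugin])
```

Each plugin sees everything written so far, so a later plugin can build on an earlier plugin’s finds without the two knowing about each other.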
Developing DirtDB has exposed an interesting problem: plenty of people share a name. How then to distinguish you from all the other people who call themselves you? The solution lies in a technique that I have yet to relate to a formal model, but for which one certainly exists.
Assume that DirtDB knows some additional information about you, beyond merely your name. Say we know your email address. Even armed with this information, many Internet databases and APIs allow you to search by name, but not by email. Realizing, then, that a name-only search is likely to bring up both you and your doppelgängers, you might believe that knowing your email address is not useful in these cases. But in fact, using a name-only search, DirtDB can mine the Internet for representations of many possible identities. Each name-only search yields a new proto-person with attributes that, though juicy, don’t conclusively belong to the victim.
Once a sufficient number of these proto-identities are built, DirtDB can try to collapse them by looking for correlations between their attributes. Did two name-only searches yield a similar physical address? Then those two identities are the same identity! Did one of those results include an email address that matches the one already on file? Then we have a chain of implication.
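The collapsing step can be sketched as follows. This is one possible approach, not necessarily the one DirtDB uses: it merges proto-identities with a union-find structure whenever they share an attribute value. It assumes that “similar” means equal after lowercasing and collapsing whitespace, and that the name itself is ignored, since every proto-identity shares it. All function and field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(value: str) -> str:
    """Assumption: 'similar' means equal after lowercasing and trimming spaces."""
    return " ".join(value.lower().split())


def collapse(identities: List[Dict[str, str]],
             ignore: Tuple[str, ...] = ("name",)) -> List[List[int]]:
    """Group proto-identities (by index) that share any non-ignored attribute."""
    parent = list(range(len(identities)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen: Dict[Tuple[str, str], int] = {}
    for index, identity in enumerate(identities):
        for key, value in identity.items():
            if key in ignore:
                continue
            token = (key, normalize(value))
            if token in first_seen:
                # Shared attribute: merge this identity with the earlier one.
                parent[find(index)] = find(first_seen[token])
            else:
                first_seen[token] = index

    groups: Dict[int, List[int]] = defaultdict(list)
    for index in range(len(identities)):
        groups[find(index)].append(index)
    return list(groups.values())


def matching_group(identities: List[Dict[str, str]], known_email: str) -> List[int]:
    """Return the collapsed group containing the email already on file, if any."""
    target = normalize(known_email)
    for group in collapse(identities):
        if any(normalize(identities[i].get("email", "")) == target for i in group):
            return group
    return []


protos = [
    {"name": "Jane Doe", "address": "12 Elm St"},
    {"name": "Jane Doe", "address": "12 elm st", "email": "jane@example.com"},
    {"name": "Jane Doe", "address": "9 Oak Ave"},
]
print(matching_group(protos, "jane@example.com"))
```

In this example the first proto-identity has no email at all, yet it ends up attributed to the victim because it shares an address with the second one, which does carry the known email. That is the chain of implication in miniature.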
Now that I think about it, this sounds like “ABox reasoning,” which I learned about in semantic web research.
tobin coziahr
December 26, 2009
Is there any public interface so we can play with it?