Visions of Aestia

27 Jun 2005

RDF vs control

Filed under: PlanetRDF — JBowtie @ 1:01 pm

Lately I’ve been struggling with a bit of a dilemma. I’m not sure there is an answer or even a consensus that it’s a problem, but I’ve been thinking about it nonetheless.

If we’re serious about bringing about the Semantic Web, there’s a couple of problems that we will have to contend with, some obvious, some less so.

The first issue we will face is that liars, scam artists, advertisers, and zealots of various persuasions are going to start contaminating our machine-readable data. In other words, we need to find a reasonable, easy-to-implement solution to the trust problem before we are drowning in a sea of useless data.

Currently, people who filter their RDF (if they do so at all) use blacklists, whitelists, or spam-processing code. But as the amount of machine-readable data reaches epic proportions, all of these mechanisms start to break down. We need to well and truly distribute the work and build the processing in at the parser level, or we will never get a handle on it. I mean, what good are software agents going to be if you ask them to restock the wine cabinet and they order herbal supplements?
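As a minimal sketch of what "building the processing in at the parser level" might look like: filter statements by source as they stream out of the parser, before they ever reach the store. Everything here is illustrative - the names, the whitelist policy, and the tuple shape are assumptions, and the trust check is exactly the hook where a more distributed reputation mechanism would plug in.

```python
# Hypothetical sketch: trust filtering applied while parsing,
# not after the data has already landed in the triplestore.

TRUSTED_SOURCES = {"http://example.org/people/alice"}

def trusted(source):
    """Placeholder trust policy: a simple whitelist here, but this is
    the hook where a distributed web-of-trust check would plug in."""
    return source in TRUSTED_SOURCES

def filtering_parser(triples_with_source):
    """Drop statements from untrusted sources before they reach the store."""
    for subject, predicate, obj, source in triples_with_source:
        if trusted(source):
            yield (subject, predicate, obj)

incoming = [
    ("ex:wine", "ex:price", "12.00", "http://example.org/people/alice"),
    ("ex:wine", "ex:price", "0.99", "http://spam.example/herbal-deals"),
]
print(list(filtering_parser(incoming)))  # only Alice's statement survives
```

The point of the sketch is where the check happens: the herbal-supplement spam never gets a chance to pollute the store, so downstream agents never have to reason about it.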

Even assuming we can eliminate spam, there are other, more subtle problems that creep in. People will lie in their FOAF files (or even serve them up selectively) to attract potential dates or deflect attention. RDF feeds will end up carrying propaganda or advertisements. Wikipedia-type wars will rage (where two sides make contradictory assertions). Triplestores will fill up with inconsistent, misattributed data.

There’s also the issue of sensitive data. Personal information may be serialized into the wrong files. If your bot wrongly sucks up my tax ID number, how do I ask it to forget it or not disclose it? And if I can make that request, what keeps me from asking it to forget or prevent disclosure of public information, like a Senator’s voting record?

Secrecy and privacy are already under serious threat due to data aggregation. What happens when an autonomous software agent discloses information under court seal? What happens when a computer intelligence is able to infer the identity of a protected witness or victim?

As long as “real” AI is still 20 years off, we can (and have) deferred thinking about these issues. But once we have powerful and reasonably autonomous reasoners harvesting triples and drawing conclusions, the data becomes a black box. We no longer keep track of where the data comes from, how connections are made, or get involved in weighting or filtering information. Instead, we start relying on the computer to do it for us - in fact, we need the machine to do the filtering because otherwise we end up completely overwhelmed by the mountains of data.
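One safeguard against the black-box problem is simply to keep the source attached to every statement - quads rather than triples, so every conclusion stays auditable. A toy sketch (all names hypothetical):

```python
from collections import defaultdict

class ProvenanceStore:
    """Toy triplestore that records which source asserted each statement,
    so a statement can later be traced, weighted, or retracted by origin."""

    def __init__(self):
        self.provenance = defaultdict(set)  # triple -> set of sources

    def add(self, triple, source):
        self.provenance[triple].add(source)

    def sources_for(self, triple):
        return self.provenance[triple]

store = ProvenanceStore()
vote = ("ex:Senator", "ex:votedFor", "ex:Bill42")
store.add(vote, "http://congress.example/records")
store.add(vote, "http://blog.example/feed.rdf")
print(store.sources_for(vote))  # both sources, so the claim can be audited
```

With provenance kept, "where does this come from?" remains answerable even after the reasoner has drawn its conclusions - which is exactly what gets lost when we treat the data as a black box.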

I can no longer manually process the amount of spam and spam comments I get. I receive so much e-mail that I don’t have time to sort it by hand; if I tried, I’d never get to read any of it - as it is, I only cope by scanning the subject lines in the pre-sorted folders. I need a general-purpose AI available to me in the next 10 years, because I am barely keeping up with the things I care about as it is. I need people to start publishing machine-readable metadata, or they will become invisible to me. I need planet aggregators and categorized posts. Like most democratic citizens, I need information about the various candidates summarized and their positions analyzed, because I don’t have the leisure time to sift through the raw material and can no longer rely on the media to do so reliably.

But that circles back to my original issues. How do I know which sources of information to trust? How do I track trustworthiness over time? How do I verify information? How do I detect and weed out mistakes and falsehoods? How do I know when to throw out my assumptions? How do I find bugs in a reasoning engine, and what do I do when multiple reasoners disagree?

Look - these are all the same issues we struggle with in regard to people, and it’s silly to think we can solve them definitively anytime soon. But before we start relying too heavily on our software, we should build in what safeguards we can. I know one day the computer will know more than me, reason more rigorously than I can, write and maintain programs that I couldn’t touch, and be off having refined conversations with other AIs (probably in some N3-derivative language). When that day comes, I want it to be able to exercise critical thinking and have some thought for humanity’s welfare.

2 Responses to “RDF vs control”

  1. Nikolas 'Atrus' Coukouma Says:

    I believe the first step is to track where you get your information. Next, sign documents using the usual public key cryptography (example). The only problem with this web-of-trust approach is that it’s very easy to fabricate identities. The happy news is that although you yourself only know a relatively small number of people, you’re usually connected to a very large number within a short number of hops. For example, on average you can reach any point in the LiveJournal friends network in fewer than 6 hops (reference).
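The hop-count claim above is easy to check on any FOAF-style friends graph with a breadth-first search. A minimal sketch, with a made-up network:

```python
from collections import deque

def hops(graph, start, goal):
    """Breadth-first search: minimum number of friend-links between two people."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        if person == goal:
            return dist
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None  # not connected at all

# Illustrative friends network (a chain, for clarity)
friends = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
}
print(hops(friends, "alice", "dave"))  # → 3
```

Running this over a real network (LiveJournal-sized) is how the "less than 6 hops on average" figure would be measured.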

    Error correction in distributed environments is obviously challenging and there hasn’t been much work done. I haven’t seen anything out of the Semantic Web department, but people interested in P2P are trying to tackle the problem. They’re also trying to figure out how to distribute the information, so they tend to tackle it at the protocol level. Still, the algorithms developed should be applicable to the HTTP-bound (so far) Semantic Web.

  2. Frank Smith Says:

    The trust issue is a very significant one. Because of this I think the first semantic web killer application will involve smaller groups of people in controlled environments — e.g. corporate intranets — where you can realistically make an assumption that everyone is trusted, and not have to solve the trust problem right away.

    IMHO, the first killer application for the semantic web will be knowledge sharing within organizations, a la del.icio.us, with all of the information nestled snugly within the walls of the organization. Once someone comes up with an easy-to-install, easy-to-use tool to accomplish this, I think the semantic web will start to gather momentum. I’m working on such an application myself.

    -Frank
