Pythonism

code and the oracular

A review of Question Answering Systems and more musings on NLP


True Knowledge is a company that offers to answer your questions. Interestingly, they scrape data from a number of sources on the web, such as Askville and answers.com. If a question has already been answered somewhere, it makes sense to use a look-up method rather than the more compute-intensive route of letting the system try to ‘work it out’ itself. They also solicit answers from users, which completes their three-pronged approach: mine data from question sites, invite users to answer questions themselves, and only if both of those fail activate the real engine, which parses the question and answers it from curated data held on site.
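A minimal sketch of that kind of fall-through cascade, with invented helpers and toy data purely to illustrate the logic (this is my own guess at the shape of it, not True Knowledge's actual implementation):

```python
# A toy "look up first, compute last" cascade. The data sources and the
# work_it_out() helper are invented for illustration only -- this is not
# True Knowledge's actual architecture, just the fall-through idea.

def answer(question, mined_qa, user_qa):
    """Try cheap lookups before falling back to the expensive engine."""
    # 1. Answers previously mined from question sites (Askville, answers.com, ...)
    if question in mined_qa:
        return mined_qa[question]
    # 2. Answers contributed directly by users of the service
    if question in user_qa:
        return user_qa[question]
    # 3. Last resort: parse the question and reason over curated data
    return work_it_out(question)

def work_it_out(question):
    # Placeholder for the real NLP engine -- the genuinely hard part.
    return "no stored answer for %r; would need to parse and reason here" % question

if __name__ == "__main__":
    mined = {"What is the capital of France?": "Paris"}
    users = {"Who wrote Hamlet?": "William Shakespeare"}
    print(answer("Who wrote Hamlet?", mined, users))
    print(answer("How far away is the Moon?", mined, users))
```

The look-ups are trivially cheap; all the interesting difficulty hides inside that last step.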

Are there any systems like this in the open-source world? Ephyra is a Java-based one. I would love to be able to scrape data from all across the web, and also from Usenet, which would be a fine mine of helpful questions and answers. From the developer's viewpoint this is all much easier than building an NLP-capable program that parses a question and comes closer to actually ‘understanding’ it, let alone ‘working out’ the answer itself without cheating by looking it up in a static database.

One question that occurs to me is how large the set of all possible human questions is, or more broadly the set of all possible search terms. It can’t be infinite (almost nothing concrete is…) but it must be pretty large. Is this set so large that a look-up approach becomes impossible? If the database of all Q&A records grows too huge, it may become easier for a program to work the answer out afresh each time. This reminds me of the old adage that “remembering is rediscovery”: if I want to remember an equation, sometimes the easiest way to make it stick is just to work through its derivation a few times, whereas photographically memorising it slavishly and without context is harder in the end. Certainly it is a greater coding challenge to build a system that calculates the answer too.

START, run from MIT, is another one you can play with.

Of course there is also Wolfram|Alpha. Its site refers to a large repository of ‘curated data’:

As of now, Wolfram|Alpha contains 10+ trillion pieces of data, 50,000+ types of algorithms and models, and linguistic capabilities for 1000+ domains. Built with Mathematica—which is itself the result of more than 20 years of development at Wolfram Research—Wolfram|Alpha’s core code base now exceeds 5 million lines of symbolic Mathematica code. Running on supercomputer-class compute clusters, Wolfram|Alpha makes extensive use of the latest generation of web and parallel computing technologies, including webMathematica and gridMathematica.

But surely it is more than just a database; how much more? I still find it frustrating that my query “body fluid compared with brine” doesn’t confirm my hunch that mixing sea water with mineral water should be about right for rehydrating after my long summer runs. Wolfram says it is improving all the time, so I’ll come back to that one in a few months!

My Google snippet-querying code is part of a drive towards getting a bot to do its own mining of data so that it can respond better. Google used to allow automated querying through its search API, but that seems to have been stopped now, except for the ‘toy’ service from the AJAX API I used to get the snippets in my last post. I also wanted to mine Wikipedia, but when I wrote some Python to do that the great Wiki in the sky refused my requests for pages. Code to mine Usenet would also be cool, as I suggested above. With the snippet engine I need a way to judge which is the best snippet out of the handful the code obtains (a crude first pass is sketched below); this is the Hard Problem again, in my view.

I still think the “chatbot as a front end for Google” idea is a good one. Clever IRC bots already do things like this (the classic skill is giving dictionary definitions of a word), so I am not travelling uncharted territory here, but it is still fun. Using snippets has some advantages too, like being automatically ‘trendy’, since Google rankings change much more dynamically and adaptively over time than dictionary.com definitions. Trendy, sure, and also populist, since Google’s snippets reflect a consensus of public web opinion more than the view of an academic elite, which is better for a bot in my opinion. I also plan that “snippetBot” won’t have its own curated database: it will be small and light and sit on top of the web, and the whole web will *be* its database.
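For the “which snippet is best” problem, a crude first pass could just score each snippet by how many of the query’s content words it contains. This is a sketch of my own rather than the code from the last post:

```python
# Rough sketch: rank a handful of search snippets against the original query
# by simple word overlap. Purely illustrative -- real snippet selection would
# need something smarter than bag-of-words counting.
import re

STOPWORDS = {"the", "a", "an", "of", "is", "are", "to", "in", "and", "what", "who", "how"}

def content_words(text):
    """Lower-case the text and keep only non-stopword word tokens."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def best_snippet(query, snippets):
    """Return the snippet sharing the most content words with the query."""
    q = content_words(query)
    scored = [(len(q & content_words(s)), s) for s in snippets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1] if scored else None

if __name__ == "__main__":
    snippets = [
        "Paris is the capital and largest city of France.",
        "France is a country in Western Europe.",
        "Book cheap flights to Paris today!",
    ]
    print(best_snippet("what is the capital of France", snippets))
```

Anything serious would need to weight rarer words more heavily, but even this beats picking a snippet at random.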

But just chatting and interacting ‘socially’ seems a bit vacuous to me; I want to focus on the question-answering approach. I have posed a question on comp.ai.nat-lang asking whether there is a corpus of questions and answers available from the linguistics community. This would be useful as a training resource for the bot. If such a thing doesn’t exist then I’ll have to mine the questions myself, as the systems mentioned above do.

Regarding document summarisation (which could be part of a good question-answering system), it would be nice to be able to mine Wikipedia, because the first paragraph of any of its pages is usually a succinct and well-constructed summary of the whole article. Google Translate uses a statistical method trained on pages in English and mirror pages in the foreign language; if you could create a system that statistically inferred how to summarise by being fed whole Wikipedia articles alongside their lead paragraphs, that would be cool. This would also work with huge collections of scientific papers and their abstracts, which are summaries for free.
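If I do get Wikipedia mining working, fetching just the lead paragraph is the natural first step. A hedged sketch via the MediaWiki API, assuming the TextExtracts ‘prop=extracts’ endpoint returns the intro section (and sending a descriptive User-Agent, the lack of which is probably why my earlier page requests were refused):

```python
# Sketch: fetch the lead section (intro paragraph) of a Wikipedia article via
# the MediaWiki API. Assumes the TextExtracts "prop=extracts" endpoint; Wikipedia
# expects a descriptive User-Agent, so bare scripted requests may get refused.
import json
import urllib.parse
import urllib.request

def wikipedia_intro(title):
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "extracts",
        "exintro": 1,          # only the section before the first heading
        "explaintext": 1,      # plain text rather than HTML
        "format": "json",
        "titles": title,
    })
    url = "https://en.wikipedia.org/w/api.php?" + params
    req = urllib.request.Request(
        url, headers={"User-Agent": "summary-miner-sketch/0.1"})  # hypothetical agent string
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    pages = data["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

if __name__ == "__main__":
    print(wikipedia_intro("Question answering")[:300])
```

Pair each fetched intro with the rest of the article text and you have exactly the (document, summary) training pairs described above.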

If I got rich I’d subscribe to all the repositories of scientific papers on the web, acquire every single paper, and fileshare them all: Robin Hood of Open Source, here I come…

Again it seems that real understanding slips through the fingers of the student of AI (and particularly NLP). Each time a problem domain is examined, what we glimpse is a static algorithm that reduces a given human task to a dead, lifeless set of instructions and contingencies. The broad and incredibly effective contingency-handling ability of real human intelligence cannot be encapsulated in any code I can dream of (yet).


Written by Luke Dunn

July 11, 2010 at 1:03 pm

3 Responses


  1. is QA system functioning as a chatbot?

    cock

    July 22, 2010 at 1:05 pm

  2. Nice article!!!
    Helped me understand the basics of QA systems.

    Boubacar

    September 29, 2011 at 9:00 pm

  3. Helpful info. Lucky me, I discovered your web site accidentally, and I am shocked
    this coincidence didn’t happen earlier! I bookmarked it.

    nlp

    January 10, 2014 at 7:30 am

