Pythonism

code and the oracular

Knowledge Discovery


I wanted my chatbot to respond to input statements with random facts that are somehow relevant to those statements. The idea: let the bot search the input for a term it ‘knows’ about and then give back a fact about that term. The Schools Wikipedia dump I had torrented seemed useful for this. I converted it into plain text with a script and set about writing code to strip out sentences or phrases, then to index those against a selection of sub-phrases which loosely correspond to the subject of each sentence. A dictionary would be fine for this, and since the whole of that Wikipedia was about 150MB of text it should be doable.
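Something like the following is what I have in mind for the index – a minimal sketch, assuming the dump has already been converted to one plain-text file (the filename ‘wiki_plain.txt’ and the crude sentence split are my own stand-ins):

# build an index: candidate subject term -> sentences mentioning it
from collections import defaultdict

index = defaultdict(list)
with open('wiki_plain.txt') as f:
    text = f.read()

for sentence in text.split('. '):       # crude sentence splitter
    for term in sentence.split():
        key = term.strip(',;:()').lower()
        if len(key) > 3:                # skip little words like 'the'
            index[key].append(sentence)

# the bot can then serve up random.choice(index['newton'])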

My first approach was to let the bot converse about famous people. The naive way to extract information like this is to look for two or three consecutive capitalised terms and assume that together they form a proper name.

As I soon found, this code returned a lot of data, some of which were names and some not. I considered my options and decided it was fine to stick with any capitalised entity, since countries, cities, elements and so on are equally worth having facts about. Every first word of a sentence is capitalised too, so I taught my program to ignore those. A small rate of error, such as when the input text has incorrect capitalisation or when a sentence begins with a proper name, was hard to avoid. This seems to me nigh universal in NLP: the input data is so unstructured that, unless your program is a full-blown AI, no algorithm can cover every bizarre case of language. 98% is pretty useful though.

Once I had the names I built a dictionary of ‘facts’ for each name, so my bot would have something it could talk about. Yeeha! So far, so not AI.
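To give a flavour of the lookup, here is a sketch with toy data (not my actual bot code):

import random

# name -> facts harvested for that name
facts = {'Isaac Newton': ['Isaac Newton was a physicist.',
                          'Newton formulated the laws of motion.']}

def reply(user_input):
    # scan the input for a name the bot 'knows', return a random fact
    for name in facts:
        if name in user_input:
            return random.choice(facts[name])
    return 'Tell me more...'

print(reply('I was reading about Isaac Newton today'))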

Then a new idea occurred to me which was far more exciting. I got the germ of it from a journalistic piece I read about NELL – the Never-Ending Language Learner. It goes as follows:

We know from previous posts that predicate logic, with its language of triples defining relationships, is a very powerful way to represent knowledge. I had wondered before how to mine unstructured text and extract triples from it, but now I saw a new approach. Start the system off with an existing triple of knowledge as a ‘seed’, then see if you can get it to look for a sentence or phrase which contains some or all of the terms in the triple. If it finds some text with the triple embedded in it, then since it knows the triple is true, it can infer that the structure of the statement it reads *might* be an indicator of the fact.

I decided to start with the following fact: Isaac Newton was a physicist. This might be represented as a triple along the following lines:

"Isaac Newton <has_profession> physicist"

I thought I’d look for sentences containing the strings “Isaac Newton” and “physicist”, then design some code that could save the structure of these matching sentences and use it to infer further facts.

If it finds a string such as “Isaac Newton, the leading physicist of his day”, it will save that string as a relevant indicator of the truth of the triple.
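In code the matching step might look like this – a sketch, with slot markers of my own invention:

seed = ('Isaac Newton', 'has_profession', 'physicist')

def extract_pattern(sentence, subject, obj):
    # if both terms are present, keep the sentence as a template
    # with the terms replaced by slots
    if subject in sentence and obj in sentence:
        return sentence.replace(subject, '{SUBJ}').replace(obj, '{OBJ}')
    return None

print(extract_pattern('Isaac Newton, the leading physicist of his day',
                      seed[0], seed[2]))
# {SUBJ}, the leading {OBJ} of his day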

Here’s the big next step: what if the system could generalise? This is how it might work: if it finds a sentence like “Charles Darwin, the leading biologist of his day”, there is a chance that the inference implied by the triple

"Charles Darwin <has_profession> biologist"

could be deduced. After all, the human-written text has an identical structure, so the evidence is there. So now the system knows two facts… and it keeps looking out for sentences which may be ‘unpacked’ triples. Later, it might see “The great British biologist Charles Darwin” and save that structure too, whereupon if it meets text such as “the great British computer scientist Alan Turing” it could infer the triple (again by generalising):

"Alan Turing  <has_profession> computer scientist "

Suddenly, in a cascade of knowledge discovery, your NLP system is really learning: growing data from its seeded facts and ever expanding. Then you can whack the knowledge base into a chatbot and you have certainly started down the road of Question Answering Systems, albeit with the kind of faltering steps that Newton, Darwin or Turing themselves may once have taken!

This is what NELL does, and with a bit of hand-tweaking it seems to be working. One of the memorable errors it came up with was that an ‘internet cookie’ was a ‘baked item’ or similar. This kind of cruft or bad data will always creep in, in my view, but we humans make such mistakes too, so human-like AI might unavoidably inherit them from us. Goodenough is good enough…

Anyway, I plugged the code into Wikipedia and examined the sentences found with Newton and his profession both present, which were as follows:

(…)Prominent mathematical physicists: Isaac Newton(…)

(…)The great seventeenth century English physicist and mathematician Isaac Newton [1642-1727] developed a wealth of new mathematics (for example, calculus and several numerical methods (most notably Newton’s method)) to solve problems in physics(…)

(…)Sir Isaac Newton, founder of classical mechanics and famous for the laws of motion and law of gravity(…)

(…)The scientific revolution is considered to have culminated with the publication of the Philosophiae Naturalis Principia Mathematica in 1687 by the mathematician, physicist, alchemist and inventor Sir Isaac Newton (1643-1727)(…)

(…)Related subjects: Astronomers and physicists; Mathematicians; Mathematics; Physics: Sir Isaac Newton(…)

(…)Sir Isaac Newton FRS (4 January 1643 – 31 March 1727 [OS: 25 December 1642 – 20 March 1727]) was an English physicist, mathematician, astronomer, natural philosopher, alchemist and theologian(…)

I am now studying this text and working out the best way to create the pattern-recognition code that will remember a given sentence structure and process it appropriately. I need a way to ignore words like adjectives and adverbs that intervene between the chief symbols of the string; a sketch of one possible approach follows.
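One possibility (an assumption on my part, not finished code) is to allow a bounded run of filler words between the anchor tokens:

import re

def gap_pattern(*anchors, max_gap=4):
    # match the anchors in order, allowing up to max_gap
    # arbitrary words between each consecutive pair
    filler = r'[ ,]+(?:\w+[ ,]+){0,%d}' % max_gap
    return re.compile(filler.join(re.escape(a) for a in anchors),
                      re.IGNORECASE)

rx = gap_pattern('the', 'physicist', 'Isaac Newton')
print(bool(rx.search('The great seventeenth century English physicist '
                     'and mathematician Isaac Newton')))   # True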

When this is done my mind-child will be into its next foetal stage!! hahahahaaaaa.. demonic laughter fills the room.

PS the torrent for Wikipedia Schools is easily googled. Some of the code for getting proper names is below. I won’t give out any more of my code on this, because it is a far more ethical teaching technique to explain the idea and let people develop their own code – no more homework solutions!

# makes guesses about what constitutes a proper name
import pickle

# pickle files must be opened in binary mode
with open('phrases_list', 'rb') as a:
    sentences = pickle.load(a)

names = []
for sentence in sentences:
    werds = sentence.split(' ')
    sentence_length = len(werds)
    # start at index 1: the first word of a sentence is always
    # capitalised, so it tells us nothing (see above)
    for ind in range(1, sentence_length - 1):
        if not (werds[ind].istitle() and werds[ind + 1].istitle()):
            continue
        # prefer a run of three capitalised words over two
        if ind + 2 < sentence_length and werds[ind + 2].istitle():
            names.append(' '.join(werds[ind:ind + 3]))
        else:
            names.append(' '.join(werds[ind:ind + 2]))

# dedupe and write one candidate name per line
with open('proper_names', 'w') as b:
    for name in sorted(set(names)):
        b.write(name + '\n')


Written by Luke Dunn

November 24, 2010 at 4:08 pm
