code and the oracular

Archive for the ‘Natural Language Processing’ Category

Simple Python Project: Markov Text

with 2 comments

Consider this sentence:

“the cat sat on the mat”

We can see the following about it:

the word “the” is followed by “cat” and “mat”
the word “cat” is followed by “sat”
the word “sat” is followed by “on”
the word “on” is followed by “the”

So from this sentence we can construct a dictionary like this:

catsat = {"the":["cat","mat"],
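The excerpt cuts off here. As a rough sketch of the idea described above (not the code from the full post), the dictionary can be built by pairing each word with the word after it, and new text can be generated by repeatedly picking a random follower. The function names `build_chain` and `generate` are made up for this illustration.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words=10, rng=random):
    """Walk the chain from `start`, choosing a random follower at each step."""
    word = start
    output = [word]
    for _ in range(max_words - 1):
        followers = chain.get(word)
        if not followers:  # dead end: "mat" is never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


catsat = build_chain("the cat sat on the mat")
print(catsat)
print(generate(catsat, "the"))
```

With a single short sentence the output can only loop around "the cat sat on the ..." until it happens to pick "mat"; a larger corpus gives each word more followers and more varied text.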

Read the rest of this entry »


Written by Luke Dunn

December 31, 2015 at 9:09 am

Five more unusual adjectives



seem merely the customary platitudinous british holding up of horrified hands at american slavery

joan rattled on with the platitudinous originality of youth

Read the rest of this entry »

Written by Luke Dunn

March 5, 2014 at 9:00 am

Five unusual adjectives from Gutenberg



1. swollen and distended or congested
2. (of language or style) tediously pompous or bombastic

he was never so happy as when he was wrapping up some commonplace thought in a garment of sonorous but turgid rhetoric

i not only committed to memory the more turgid poems of the late lord byron

Read the rest of this entry »

Written by Luke Dunn

March 5, 2014 at 8:21 am

Mining Project Gutenberg and using Graphviz to display word data


I downloaded the Project Gutenberg DVD from here:

I mounted the ISO and copied the files across to a folder, preserving structure.

I used this code to unpack the zip archives, ~32,000 in all, into a flat folder to make an easily usable corpus.
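The code itself is in the full post. As a hedged sketch of what such a step might look like (an assumption, not the original script), the following walks a folder tree, opens every zip archive it finds, and writes each member into one flat destination folder. The function name `unpack_flat` and the example paths are hypothetical.

```python
import os
import shutil
import zipfile


def unpack_flat(src_root, dest_dir):
    """Extract every file from every .zip under src_root into dest_dir.

    Directory structure inside the archives is discarded, so files that
    share a name overwrite each other. Returns the number of files written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    count = 0
    for folder, _subdirs, filenames in os.walk(src_root):
        for name in filenames:
            if not name.lower().endswith(".zip"):
                continue
            path = os.path.join(folder, name)
            try:
                with zipfile.ZipFile(path) as archive:
                    for member in archive.infolist():
                        if member.is_dir():
                            continue
                        flat_name = os.path.basename(member.filename)
                        target = os.path.join(dest_dir, flat_name)
                        with archive.open(member) as src, open(target, "wb") as dst:
                            shutil.copyfileobj(src, dst)
                        count += 1
            except zipfile.BadZipFile:
                print("skipping unreadable archive:", path)
    return count


# Example call with made-up paths:
# unpack_flat("/data/gutenberg_dvd", "/data/corpus")
```

Flattening is convenient for a corpus because later scripts can simply list one folder, at the cost of losing any same-named files.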
Read the rest of this entry »

Written by Luke Dunn

February 28, 2014 at 9:05 pm

How to make a chatbot in Python

with 4 comments

The first stop for building a real chatbot in Python is PyAIML, which can be downloaded here

AIML (Artificial Intelligence Markup Language) is an XML-based format for encoding a chatbot’s “brain”. It was developed by Richard Wallace, and the resulting bot, ALICE, was the best at the time.

You can also download the standard ALICE brain here
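To give a feel for what an AIML brain looks like, here is a toy sketch using only the standard library. It is not PyAIML and not the ALICE brain: the two categories, the bot name and the functions `load_brain` and `respond` are invented for illustration, and it only handles exact pattern matches, whereas real AIML also supports wildcards and other tags.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written AIML document: each category pairs a pattern with a reply.
AIML_SOURCE = """
<aiml version="1.0">
  <category>
    <pattern>HELLO</pattern>
    <template>Hi there!</template>
  </category>
  <category>
    <pattern>WHAT IS YOUR NAME</pattern>
    <template>My name is ToyBot.</template>
  </category>
</aiml>
"""


def load_brain(aiml_text):
    """Parse AIML text into a dict mapping each pattern to its template."""
    root = ET.fromstring(aiml_text)
    brain = {}
    for category in root.iter("category"):
        pattern = category.findtext("pattern").strip().upper()
        template = category.findtext("template").strip()
        brain[pattern] = template
    return brain


def respond(brain, user_input, default="I have no answer for that."):
    """Normalise the input (uppercase, no punctuation) and look up a reply."""
    cleaned = "".join(ch for ch in user_input.upper() if ch.isalnum() or ch.isspace())
    key = " ".join(cleaned.split())
    return brain.get(key, default)


brain = load_brain(AIML_SOURCE)
print(respond(brain, "Hello!"))              # Hi there!
print(respond(brain, "what is your name?"))  # My name is ToyBot.
```

A full AIML interpreter does much more than this lookup, which is why using an existing library and brain is the easier starting point.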

Read the rest of this entry »

Written by Luke Dunn

October 7, 2013 at 9:35 am

Pocketsphinx Voice Recognition with Python

with 9 comments

I downloaded pocketsphinx and the corresponding python module with:

sudo apt-get install python-pocketsphinx pocketsphinx-hmm-wsj1 pocketsphinx-lm-wsj

and then downloaded PyAudio from

Pocketsphinx needs a 16-bit mono WAV file at a sample rate of 16 kHz; as you can see, I set this in the code.
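If recognition results look wrong, it can be worth confirming that a recording really has this format. The following is a small sketch using the standard `wave` module; the helper name `check_wav` is made up for this example.

```python
import wave


def check_wav(path, rate=16000, channels=1, sample_width=2):
    """Return True if the WAV file is mono, 16-bit (2 bytes per sample) at `rate` Hz."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getnchannels() == channels
            and wf.getsampwidth() == sample_width
            and wf.getframerate() == rate
        )


# Example call with a made-up filename:
# print(check_wav("o0.wav"))
```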

This code lets you record a bit of speech and then reads it back to you, just to test the idea. It could be the beginning of a very useful speech recognition system, but for me the fidelity wasn’t that good. If you want to train Sphinx to your voice, you need to create your own acoustic model, which takes some time and is detailed here. I may do this later. The real dream is to have a talking chatbot.

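import subprocess
import wave

import pyaudio

# Model paths as installed by the apt packages above.
hmdir = "/usr/share/pocketsphinx/model/hmm/wsj1"
lmd   = "/usr/share/pocketsphinx/model/lm/wsj/"
dictd = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic"

def decodeSpeech(hmmd, lmdir, dictp, wavfile):

    import pocketsphinx as ps
    import sphinxbase

    # Decoder arguments and decode_raw follow the old python-pocketsphinx
    # binding used in this post; newer releases use a different API.
    speechRec = ps.Decoder(hmm=hmmd, lm=lmdir, dict=dictp)
    with open(wavfile, 'rb') as wavFile:
        wavFile.seek(44)  # skip the 44-byte WAV header, leaving raw PCM
        speechRec.decode_raw(wavFile)
    result = speechRec.get_hyp()

    return result[0]

CHUNK = 1024
FORMAT = pyaudio.paInt16  # 16-bit samples
CHANNELS = 1              # mono
RATE = 16000              # 16 kHz sample rate
RECORD_SECONDS = 5        # assumed value; the original setting is not shown

for x in range(10):
    fn = "o" + str(x) + ".wav"
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK)
    print("* recording")
    frames = []
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
    print("* done recording")
    stream.stop_stream()
    stream.close()
    p.terminate()

    # Write the recorded frames out as a WAV file for the decoder.
    wf = wave.open(fn, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

    wavfile = fn
    recognised = decodeSpeech(hmdir, lmd, dictd, wavfile)
    print(recognised)
    # Speak the recognised text back with espeak.
    subprocess.call(["espeak", recognised])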

For a screencast example of using Pocketsphinx for a talking chatbot, see here

Written by Luke Dunn

June 6, 2013 at 9:12 am

The second rap: questionnaire typology

with one comment

All human knowledge is about questions.

  • questions = problems in a high school calculus class
  • questions = problems in an IQ test

This new system could revolutionise IQ testing, because you could ask the subject about things they were truly interested in, rather than expecting everyone to be interested in trivial logic puzzles or puzzles involving shapes. This would be fairer, and less prone to the cultural bias where students fail to answer optimally simply because the question didn’t interest them, rather than because they assessed the question and found it too hard!

A type 3/4 questionnaire in an educational setting would pinpoint the optimal learning modality of a student, so their learning could be accelerated by issuing questions and problems that stretched them in just the right ways. There would also be a benefit in making university exams into type 3/4 questionnaires, with a suitably designed rating system to maximise achievement and fairness of assessment.

Read the rest of this entry »

Written by Luke Dunn

June 29, 2012 at 4:12 pm