Pythonism

code and the oracular

Archive for the ‘Python’ Category

Simple Python Project: Markov Text

with 2 comments

Consider this sentence:

“the cat sat on the mat”

We can see the following about it:

the word “the” is followed by “cat” and “mat”
the word “cat” is followed by “sat”
the word “sat” is followed by “on”
the word “on” is followed by “the”

So from this sentence we can construct a dictionary like this:

catsat = {"the":["cat","mat"],
          "cat":["sat",],
          "sat":["on",],
          "on":["the",],
          "mat":[]}

Read the rest of this entry »

Written by Luke Dunn

December 31, 2015 at 9:09 am

Mining Project Gutenberg and using Graphviz to display word data

leave a comment »

I downloaded the Project Gutenberg DVD from here: http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project

I mounted the ISO and copied the files across to a folder, preserving structure.

I used this code to unpack the zip archives, about 32,000 in all, into a flat folder to make an easily usable corpus.
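
The unpacking code itself is in the full post; as a rough sketch of the approach (the folder names here are just placeholders), you can walk the copied tree and flatten the text file out of every archive into a single directory:

import os
import zipfile

src = "gutenberg_dvd"   # folder the ISO contents were copied into (placeholder name)
dest = "corpus"         # flat output folder
if not os.path.exists(dest):
    os.makedirs(dest)

for root, dirs, files in os.walk(src):
    for name in files:
        if not name.lower().endswith(".zip"):
            continue
        path = os.path.join(root, name)
        try:
            with zipfile.ZipFile(path) as zf:
                for member in zf.namelist():
                    if member.lower().endswith(".txt"):
                        # flatten: keep only the file name, drop the internal path
                        out = os.path.join(dest, os.path.basename(member))
                        with open(out, "wb") as f:
                            f.write(zf.read(member))
        except zipfile.BadZipfile:
            print("skipping bad archive: " + path)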
Read the rest of this entry »

Written by Luke Dunn

February 28, 2014 at 9:05 pm

How to make a Talking Chatbot

with 5 comments

Written by Luke Dunn

February 5, 2014 at 10:31 am

Making Ebooks with Python

leave a comment »

The average ebook reader, like the Kindle (which is the one I have), accepts a variety of file formats. Of course the ultimate vanilla form would be a plain text file like my_file.txt. The Kindle renders a file like this perfectly well and readably, but you can’t navigate around the document so easily without page numbers, because the text file is just one stream that the device breaks into pages when it loads.

There are also AZW, EPUB and MOBI, the three main formats designed specifically for ebook reader devices. If you want to read these on a tablet, laptop or larger machine you’ll need an app, which is a pain. So I conclude that the ultimate all-round format to stay with is simply PDF.
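
The post’s own method is behind the link; as one illustrative sketch (using the ReportLab library, which may well not be what the post uses), a plain text file can be turned into a simple paginated PDF like this:

from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

def text_to_pdf(txt_path, pdf_path, lines_per_page=50):
    # Draw each line of the text file onto A4 pages, starting a new
    # page every lines_per_page lines.
    width, height = A4
    c = canvas.Canvas(pdf_path, pagesize=A4)
    y = height - 50
    count = 0
    with open(txt_path) as f:
        for line in f.read().splitlines():
            c.drawString(50, y, line)
            y -= 14
            count += 1
            if count % lines_per_page == 0:
                c.showPage()        # finish this page and start a fresh one
                y = height - 50
    c.save()

text_to_pdf("my_file.txt", "my_file.pdf")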

Read the rest of this entry »

Written by Luke Dunn

November 8, 2013 at 11:07 am

How to make a chatbot in Python

with 4 comments

The first stop for building a real chatbot in Python would be PyAIML, which can be downloaded here:

http://sourceforge.net/projects/pyaiml/files/

AIML (Artificial Intelligence Markup Language) is an XML-based format for encoding a chatbot’s “brain”. It was developed by Richard Wallace, and the resulting bot, ALICE, was the best of its time.

You can also download the standard ALICE brain here:

http://sourceforge.net/projects/pyaiml/files/Other%20Files/Standard%20AIML%20set/
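
The rest of the walkthrough is in the full post; the basic PyAIML pattern looks roughly like this (the startup file name is an assumption and depends on how you unpack the ALICE set):

import aiml

kernel = aiml.Kernel()
kernel.learn("std-startup.xml")   # a loader file that <learn>s the *.aiml files
kernel.respond("load aiml b")     # the pattern conventionally used to trigger loading

print(kernel.respond("Hello"))
print(kernel.respond("What is your name"))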

Read the rest of this entry »

Written by Luke Dunn

October 7, 2013 at 9:35 am

Pocketsphinx Voice Recognition with Python

with 5 comments

I downloaded pocketsphinx and the corresponding Python module with:

sudo apt-get install python-pocketsphinx pocketsphinx-hmm-wsj1 pocketsphinx-lm-wsj

and then downloaded PyAudio from http://people.csail.mit.edu/hubert/pyaudio/#downloads

Pocketsphinx needs a 16-bit mono wav file at a 16 kHz sample rate; as you can see, I set this in the code.

This code lets you record a bit of speech and then reads it back to you, just to test the idea. It could be the beginning of a really useful speech recognition system, but for me the fidelity wasn’t that good. If you want to train Sphinx to your voice, that means creating your own acoustic model, which takes some time and is detailed here: http://cmusphinx.sourceforge.net/wiki/tutorialam. I may do this later. The real dream is to have a talking chatbot.

import sys, os
import pyaudio
import wave

# Paths to the acoustic model, language model and dictionary installed
# by the pocketsphinx packages above.
hmdir = "/usr/share/pocketsphinx/model/hmm/wsj1"
lmd   = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP"
dictd = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic"

def decodeSpeech(hmmd, lmdir, dictp, wavfile):
    # Decode a 16 kHz, 16-bit mono wav file and return the hypothesis text.
    import pocketsphinx as ps
    import sphinxbase

    speechRec = ps.Decoder(hmm=hmmd, lm=lmdir, dict=dictp)
    wavFile = open(wavfile, 'rb')
    wavFile.seek(44)               # skip the 44-byte wav header; the rest is raw PCM
    speechRec.decode_raw(wavFile)
    result = speechRec.get_hyp()   # (hypothesis, utterance id, score)

    return result[0]

# Recording parameters: 16-bit mono at 16 kHz, as pocketsphinx expects.
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
RECORD_SECONDS = 10

for x in range(10):
    fn = "o" + str(x) + ".wav"

    # Record RECORD_SECONDS of audio from the default input device.
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)
    print("* recording")
    frames = []
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
    print("* done recording")
    stream.stop_stream()
    stream.close()
    p.terminate()

    # Write the captured frames out as a wav file.
    wf = wave.open(fn, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

    # Decode the recording and read the result back aloud with espeak.
    recognised = decodeSpeech(hmdir, lmd, dictd, fn)
    print(recognised)
    cm = 'espeak "' + recognised + '"'
    os.system(cm)

For a screencast example of using pocketsphinx for a talking chatbot, see here.

Written by Luke Dunn

June 6, 2013 at 9:12 am

Google Sets, keywords & taxonomies

leave a comment »

I am slightly fascinated by Google Sets because of its connection to ontologies, knowledge extraction, data mining and so on. I came up with a script that can be used to auto-query Sets to build families of keywords that all share a very approximate subject domain. I used this for some personal research into using page meta keywords to assign pages to a given subject domain. For each subject (“Physics”, “Chemistry”, “Biology” and so on) I had the subject name as the dictionary key, and the corresponding value was a list of keywords whose presence among a page’s meta keywords indicated that the text was likely about that subject.

In the small print Google says no use for commercial purposes, but they lock you out if you automate more than a handful of queries anyway, so I don’t think I was being that evil.

I wondered if the meta keywords could provide a sufficiently accurate way to classify documents, so I fed my system some dummy documents and it came out right about 80% of the time. In my code I started with a seed list of 10 or so keywords, which are then “grown” in Sets to populate the list with more words from the same domain. I believe the algorithm behind Sets has been used in some ontology work elsewhere, including NELL and Google Squared.
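
As a hypothetical sketch of the classification step (the keyword lists here are placeholders; the real ones were grown out by querying Sets), the subject dictionary and the matching can look like this:

subjects = {
    "Physics":   ["quantum", "relativity", "particle", "momentum", "photon"],
    "Chemistry": ["molecule", "reaction", "organic", "compound", "acid"],
    "Biology":   ["cell", "gene", "evolution", "species", "protein"],
}

def classify(meta_keywords):
    # Score each subject by how many of its keywords appear among the
    # page's meta keywords, and pick the highest-scoring one.
    kws = set(k.lower() for k in meta_keywords)
    scores = dict((subj, len(kws & set(words))) for subj, words in subjects.items())
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify(["gene", "protein", "dna sequencing"]))   # -> Biology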

I visualised my document collection as a bag of rice grains scattered onto the floor. My subject categories were like playing cards thrown onto the grains. In a taxonomic system you want maximum coverage by the cards and minimum overlap in the terminal categories of the bottom level. There are two kinds of overlap: true overlap, where two cards each have parts the other does not cover, and enclosure, where a bigger card completely covers smaller ones. For a taxonomy that covers every instance you want the terminals to have no overlap. Parent and ancestor categories can cover by inclusion, but overlap among the terminals causes ambiguity, which is bad for the system. If a user is looking for “biophysics” papers and some of those have been placed, unbeknownst to her, under “physical biology”, then the user has been failed by the system.

Read the rest of this entry »

Written by Luke Dunn

November 24, 2010 at 6:00 pm