Pythonism

code and the oracular

mining project gutenberg and using graphviz to display word data


I downloaded the Project Gutenberg DVD from here: http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project

I mounted the ISO and copied the files across to a folder, preserving structure.

I used this code to unpack the zip archives (~32,000 in all) into a flat folder, to make an easily usable corpus.


#!/usr/bin/python
# walk the copied DVD tree and unpack every zip into one flat folder
import zipfile
import os

pat = 'Gutenberg'
for root, dirs, files in os.walk(pat):
    for fil in files:
        if fil.endswith('.zip'):
            fh = open(os.path.join(root, fil), 'rb')
            zf = zipfile.ZipFile(fh)   # don't reuse the os.walk variables
            zf.extractall('gutty')     # everything lands flat in 'gutty'
            fh.close()

then I separated the text files into six subfolders using bash, as sketched below. This was necessary to speed up the runtime with parallelisation later (I have a 6-core machine).
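the split itself is just a round-robin move; I did it in bash, but the same idea in Python looks like this (a sketch, assuming the flat gutty folder from the unpacking step):


import glob
import os
import shutil

# round-robin the corpus into gutty/1 .. gutty/6, one folder per worker
for i in range(1, 7):
    if not os.path.isdir('gutty/' + str(i)):
        os.makedirs('gutty/' + str(i))
for n, path in enumerate(glob.glob('gutty/*.txt')):
    shutil.move(path, 'gutty/' + str(n % 6 + 1))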

then I was free to conduct some mining of the thousands of books.

I used this code to run the jobs in parallel (using the Parallel Python pp module):


import sys
import glob, pp, time

start_time = time.time()
job_server = pp.Server()
paths = ['gutty/1/*.txt', 'gutty/2/*.txt', 'gutty/3/*.txt',
         'gutty/4/*.txt', 'gutty/5/*.txt', 'gutty/6/*.txt']

def fin(pattern, word):
    # collect a window of context around every "<word> is" occurrence
    stri = ''
    for tex in glob.glob(pattern):
        fh = open(tex)
        words = fh.read().lower().split(' ')
        fh.close()
        for i in range(len(words) - 1):
            if words[i] == word and words[i + 1] == 'is':
                start = max(i - 3, 0)   # guard against a negative slice start
                stri += ' '.join(words[start:i + 8]) + '\n'
    return stri

search_string = sys.argv[1]
jobs = [job_server.submit(fin, (pat, search_string), (), ("glob",))
        for pat in paths]
out = ''
for job in jobs:
    out += job()
outfile = open(search_string + '1', 'w')   # e.g. 'love1'
outfile.write(out)
outfile.close()
print "Time elapsed: ", time.time() - start_time, "s"

then I called the script from the command line:

python countword.py love

python countword.py hate

python countword.py joy

python countword.py pain

this created text files full of snippets containing the phrases “love is *”, “pain is *”, etc. I then ran:


# re-scan the collected text and print each "love is ..." window
a = open('love')
z = a.read().split(' ')
a.close()
for word_index in range(len(z) - 1):
    if z[word_index] == 'love' and z[word_index + 1] == 'is':
        print ' '.join(z[max(word_index - 3, 0):word_index + 8])

as:

python count_word.py > word.txt

and then


import collections

a = open('love_is')
b = a.readlines()
a.close()
q = collections.Counter(b)        # count identical snippet lines
for phrase, count in q.most_common(200):
    print phrase,                 # trailing comma: each line already ends in \n

on each file, to pull out the immediate textual neighbourhood of the phrases “love is *”, “hate is *”, etc.
this surfaced the most commonly occurring instances of the search strings in the entire Gutenberg corpus. The resulting text files still had to be edited because of the approximate, kludgey nature of the process: crude NLP methods like these leave cruft and errors. About ten minutes of hand-editing left me with roughly 100 phrases per word, each a commonly found example of “love is *”, “joy is *”, etc.
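a crude pre-filter could have saved some of that hand-editing; something like this sketch (illustrative only, not part of the pipeline I actually ran) drops lines with stray characters before the counting step:


import re
import sys

# keep only snippet lines made of plain lowercase words and light punctuation
clean = re.compile(r"^[a-z][a-z ,.';-]*$")
for line in sys.stdin:
    if clean.match(line.strip()):
        sys.stdout.write(line)


used as python filter.py < love_is > love_is_clean.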

then I used this script to turn the data into DOT graphs:


import sys

name = sys.argv[1]
a = open(name + '.dot', 'w')
# graph-wide styling; \\n writes a literal \n escape into the DOT label
a.write("""digraph G {
    node [  color = "#597566",
            margin="0.2,0.13",
            style="filled",
            fillcolor="#CFDBA9",
            shape = Mrecord,
            fontname = "Helvetica-Outline",
            fontcolor ="#634320",
            fontsize=22];
    graph [ overlap=scale,
            margin=0.5,
            pad=0.5,
            splines=true,
            bgcolor="#E8CCCC",
            fontname = "Helvetica-Oblique",
            label= "\\n\\n\\n*""" + name + """ is* according\\n to project Gutenberg",
            fontsize = 50,
            nodesep=0.15,
            fontcolor="#352C45",
            ranksep=0.75];

    edge [  penwidth=4,
            weight=2.5];
""")
b = open(name)
c = b.readlines()
b.close()
# central node for the search word itself
a.write('    ' + name + '[label="' + name.upper()
        + '", color = "#D585AD", fillcolor="#E0CDF6", shape=Mrecord, fontsize=54];\n')
# one node per phrase, escaping double quotes so the DOT stays valid
for i, line in enumerate(c):
    a.write('    node' + str(i) + '[label="' + line.rstrip('\n').replace('"', r'\"') + '"];\n')
for i in range(len(c)):
    a.write('    ' + name + ' -> node' + str(i) + ';\n')
a.write('}\n')
a.close()

then I ran


sfdp -Tjpg hate.dot > hate.jpg

using Graphviz's sfdp layout engine (scalable force-directed placement, built for large graphs), once per word.

finally I ran


convert pain.jpg -resize 35% pain2.jpg

on each resulting jpeg (convert is ImageMagick) to shrink it to a web-friendly file size. the sketch below batches both of these last steps.
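with four words to process, the render-and-shrink steps loop nicely; here is a sketch using subprocess (it assumes sfdp and ImageMagick's convert are on the PATH and the .dot files exist):


import subprocess

# render each DOT graph with sfdp, then shrink the jpeg for the web
for word in ['love', 'joy', 'hate', 'pain']:
    subprocess.call(['sfdp', '-Tjpg', '-o', word + '.jpg', word + '.dot'])
    subprocess.call(['convert', word + '.jpg', '-resize', '35%', word + '2.jpg'])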

These are the images I obtained through this process:

love

joy

hate

pain


Written by Luke Dunn

February 28, 2014 at 9:05 pm
