code and the oracular


mining project gutenberg and using graphviz to display word data


I downloaded the Project Gutenberg DVD from here:

I mounted the ISO and copied the files across to a folder, preserving structure.

I used this code to unpack the zip archives, ~32,000 in all, into a single flat folder to make an easily usable corpus.
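
A minimal sketch of this kind of unpacking step, assuming Python 3 and only the standard library. The function name `flatten_zips`, its parameters, and the renaming scheme for duplicate filenames are illustrative assumptions, not the original script. It walks a folder tree, opens every `.zip` it finds, and writes each contained file into one flat output folder, discarding the directory structure stored inside the archives.

```python
import shutil
import zipfile
from pathlib import Path


def flatten_zips(src_root, dest_dir):
    """Extract every file from every .zip under src_root into dest_dir.

    Hypothetical helper: folder structure inside and outside the archives
    is discarded, so all extracted files end up side by side in dest_dir.
    Returns the number of files written.
    """
    src_root = Path(src_root)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    extracted = 0
    for zip_path in sorted(src_root.rglob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path) as archive:
                for member in archive.infolist():
                    if member.is_dir():
                        continue  # skip directory entries, keep only files

                    # Keep just the base filename to flatten the layout.
                    name = Path(member.filename).name
                    target = dest_dir / name

                    # Assumed policy: on a name clash, prefix the archive
                    # name and a counter rather than overwrite.
                    counter = 1
                    while target.exists():
                        target = dest_dir / f"{zip_path.stem}_{counter}_{name}"
                        counter += 1

                    with archive.open(member) as src, open(target, "wb") as dst:
                        shutil.copyfileobj(src, dst)
                    extracted += 1
        except zipfile.BadZipFile:
            # A damaged archive should not stop the whole run.
            print(f"skipping unreadable archive: {zip_path}")
    return extracted
```

Calling something like `flatten_zips("gutenberg_dvd", "corpus")` (both folder names are placeholders) would then return the number of files written. Streaming each member with `shutil.copyfileobj` avoids loading whole books into memory, which matters little for one text but adds up over tens of thousands of archives.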


Written by Luke Dunn

February 28, 2014 at 9:05 pm