Archive for the ‘parallelisation’ Category
I downloaded the Project Gutenberg DVD from here: http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project
I mounted the ISO and copied the files across to a folder, preserving structure.
I used this code to unpack the zip archives, ~32,000 in all, into a flat folder to make an easily usable corpus.
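As a rough illustration of the unpacking step described above, here is a minimal Python sketch. It is not the code from the original post; it assumes the copied DVD contents sit under a source folder, that every archive ends in `.zip`, and that a small thread pool is an acceptable way to parallelise the work. The names `flatten_zip`, `unpack_all`, `src_root`, `dest` and `workers` are all made up for this example.

```python
import shutil
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def flatten_zip(zip_path: Path, dest: Path) -> list[str]:
    """Extract every file in one archive straight into dest, dropping subfolders."""
    written = []
    try:
        with zipfile.ZipFile(zip_path) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                # Assumption: prefixing with the archive stem is enough to stop
                # same-named members from different archives colliding. Two
                # archives with the same stem and member name would still clash.
                name = f"{zip_path.stem}_{Path(info.filename).name}"
                with zf.open(info) as src, open(dest / name, "wb") as out:
                    shutil.copyfileobj(src, out)
                written.append(name)
    except zipfile.BadZipFile:
        # Skip archives that cannot be read rather than aborting the whole run.
        pass
    return written


def unpack_all(src_root: Path, dest: Path, workers: int = 4) -> int:
    """Unpack every .zip found under src_root into dest; return the file count."""
    dest.mkdir(parents=True, exist_ok=True)
    archives = sorted(src_root.rglob("*.zip"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: flatten_zip(p, dest), archives)
        return sum(len(names) for names in results)
```

Calling `unpack_all(Path("gutenberg_dvd"), Path("corpus"))` would then walk the copied tree and write everything into one folder; both folder names here are placeholders. Keep the destination outside the source tree so the search for archives never picks up extracted files.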