Text to Speech – how to make free audiobooks

7 12 2011

I have recently tried to break my habit of listening to the BBC World Service at bedtime, for a number of reasons. One: it was an addiction which made me less flexible in life, because when no radio is available you get withdrawals. Two: I am currently concerned that most broadcast media is just propaganda. Three: the particular tone of voice and inflection that newsreaders use became annoying. Four: many of them, including newsreaders, are cokeheads and hypocrites who care as much about the problems of humanity as their idol Attila the Hun, who they are usually well to the right of on the real number line. I learned this after spending time with some London-based journalists through a friend.

Wanting something to listen to, I decided to use audiobooks, mostly torrented because I am a cheapskate and avowed supporter of the Pirate Party. The audio version of “Dune” was superbly done (google to find it). John Shirley’s “Demons” was good too. After a while, though, I became frustrated that not many audiobooks have actually been produced. Most of those that exist are a bit too populist for my taste, so I began to feel audio-impoverished.

Then I realised that text to speech must offer a solution. The small Linux utility I found for this purpose was “espeak”. It should come preinstalled with many distros, certainly with Ubuntu.

The program takes a plain text file as input and renders it to a WAV file, with a few options needing to be specified. I used:

espeak -s 120 --split="60" -f some_book.txt -w some_book.wav

The -s 120 switch sets the speed in words per minute. --split produces a series of output files, each 60 minutes long. -f names the input text file and -w names the output file(s).

To convert to MP3 I used lame, doing a

sudo apt-get install lame

first and then

lame -f some_book.wav some_book.mp3

There’s also the GUI-based “Sound Converter”, but I noticed that it is much slower than lame.

The -f switch tells lame to use its fast mode, which gives lower audio quality. I chose it on the assumption that the speech produced by espeak is already quite lo-fi, so I might as well save time and space rather than encode a lo-fi voice at high resolution.
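Since --split writes several WAV files rather than one, the lame step can be looped over all of them. What follows is only a minimal sketch, assuming the chunks sit together in one directory and end in .wav (it does not rely on the exact names espeak gives them). The function name convert_chunks and the ENCODER variable are made up for illustration and are not part of espeak or lame.

```shell
# Convert every WAV chunk in a directory to MP3 with lame's fast mode.
# convert_chunks and ENCODER are hypothetical names; ENCODER defaults to lame.
convert_chunks() {
    dir=${1:-.}
    for wav in "$dir"/*.wav; do
        # If nothing matched, the glob stays literal, so skip it.
        [ -e "$wav" ] || continue
        mp3="${wav%.wav}.mp3"            # same name, .mp3 extension
        "${ENCODER:-lame}" -f "$wav" "$mp3" || return 1
    done
}
```

Running convert_chunks . in the folder holding the espeak output should leave one MP3 next to each WAV.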

Now I have Stephen Hawking reading my bedtime stories :-) Any book in the world that I can obtain as text may now be an instant audiobook, for using while out and about or at home. My little mp3 player has an external speaker which I use when at home, and headphones for runs, walks, shopping etc.

Many people might imagine that a computer voice is not that good to listen to, but I now have to say I disagree. Once you are used to it, and to the occasional mispronunciation, it is perfectly acceptable. You learn to tune out noise and non-human inflection, and eventually it is no worse than a human reader. After all, when we read off paper, the words are what we focus on, not the typeface. My computer voice is just a neutral way of getting words as text into words as audio. I am loving it.

I have the ocean of text available to me to listen to in a more leisurely and relaxed way than poring over screens or paper. I’ve already copy-pasted stuff like blog posts and Wikipedia articles, too. I tend to amass a few hundred KB of text, which I put into one text file. As a guideline, using my methods a 500KB text file will result in about 12 hours of MP3 audio.
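The 12 hour figure can be sanity-checked with a little arithmetic. This is a rough sketch, assuming about 6 bytes per English word (five letters plus a space) and the 120 words per minute speed used above. The function name estimate_minutes is made up for illustration.

```shell
# Rough listening time in minutes for a text file of a given size.
# Assumption: about 6 bytes per word; speed defaults to 120 words per minute.
estimate_minutes() {
    bytes=$1
    wpm=${2:-120}
    echo $(( bytes / 6 / wpm ))
}

estimate_minutes 500000 120   # prints 694
```

694 minutes is roughly eleven and a half hours, which is in line with the guideline. To check a real file, something like estimate_minutes "$(wc -c < some_book.txt)" should work.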

It occurred to me that this should be built into iPods. I know that the Kindle has a reader for the blind, which has been said to be fairly bad. I’d love an iPod which I could upload a text file to and which would read it to me. It should be a definite feature in the future, in my view. Espeak for Linux is not a very advanced text to speech program anyway. I have heard that the Apple one is better, and various Windows ones too. I suppose that to be good, a reader needs a lot of pronunciation rules developed by linguists, and perhaps even some semantic parsing to allow a stab at natural inflection based on meaning as well as on pronunciation rules. It would be a good coding challenge. The way different phonemes elide together needs to be coded, and perhaps with some training on large amounts of audio data it could be done a lot better than espeak. Linux needs a really good app to do this so it can compete with Windows and Mac, in my view. A note to this effect was made in an Ubuntu suggested-features forum. Let’s go for it, people.

