Pythonism

code and the oracular

Making Ebooks with Python

The average ebook reader, like the Kindle I have, accepts a variety of file formats. The ultimate vanilla form is a plain text file like my_file.txt. The Kindle renders a file like this perfectly readably, but you can’t navigate around the document very well without page numbers, because the text file is just one stream broken into pages by the device on load.

There are also azw, epub and mobi, the three main formats designed specifically for ebook reader devices. If you want to read these on a tablet, laptop or larger machine you’ll need an app, which is a pain. So I conclude that the best all-round format to stick with is simply pdf.

One problem – a pdf formatted in the most common way has a page size of A4. This makes the text too small to read on my Kindle, for example.

I tried using calibre to convert A4 pdfs to one of the ebook formats
above but it
didn’t perform well at preserving the formatting and usually didn’t
wrap lines
properly so you get files that read like this where lines have been
truncated.

This was annoying so I set about learning how to make my own nicely formatted ebooks. I wanted lots of mathematical and Python documentation on my Kindle. This usually started in either text or html form. To get as much use out of the device as possible I was happy to dump quite large files onto it, over 50MB for example.

One such supply of data came from a download of part of OEIS, the On-Line Encyclopedia of Integer Sequences. I had a text file of 70MB with details about hundreds of thousands of integer sequences. Bedtime reading but not for the fainthearted 😉

So I wrote a script to take the text and put it into a simple html file with neatly formatted headings and page breaks, to print at A5. Page breaks and paper size can be set from css, then you can feed the html to a converter which will render it as pdf.

I hunted for ways to convert html to pdf. The most obvious is to open your html in LibreOffice Writer and use the GUI to export a pdf. With a filesize of 90MB on my machine with 4GB RAM this looked like it was going to take forever to load. There is a headless LibreOffice document conversion utility called unoconv which I thought might make things quicker. I tried it but it was still very slow.

Finally I settled on pisa, which is a fairly mature Python-based system. It can be run from the command line with

pisa input_htm_file output_pdf_file

which isn’t too challenging in the way of complex command switches!

But it still got constipation with a 90MB html file as input. So I changed the html generating script to break the output into chunks, and batch converted them all with pisa. This finally worked and I was happy.
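A rough sketch of that chunking step, not the original script: the helper names chunks and chunk_filenames, the four-digit numbering and the chunk size are all assumptions made for illustration. The only thing taken from the post is that the output files should match a seq*.htm pattern.

```python
from itertools import islice


def chunks(records, size):
    """Yield successive lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def chunk_filenames(n_chunks, stem="seq", ext=".htm"):
    """Hypothetical naming scheme (seq0000.htm, seq0001.htm, ...) that a seq*.htm glob will match."""
    return ["%s%04d%s" % (stem, i, ext) for i in range(n_chunks)]


if __name__ == "__main__":
    # Seven made-up records split into groups of three
    demo = ["entry %d" % i for i in range(7)]
    for name, batch in zip(chunk_filenames(3), chunks(demo, 3)):
        print(name, batch)
```

Each batch would then be written out as its own small html file. What chunk size keeps the converter happy will depend on your data and machine, so treat the number as something to experiment with.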

I’ll only show the helpful bit of my code, since it is then easier for someone to re-use for their specific need. All I did was to

open('output.html','w')

then write a header which was

'''<html><head><style>
    .break { page-break-before: always; }
    p {font-size: 130%;}
    .page {margin: 0px; padding: 0px;}
    @page {size: A5;}
    </style></head><body>
'''

then used iterators to put in all the content, like

'<h1 class="break">'+next(title)+'</h1>' '<p>'+res[next(paragraph)]+'</p>'

and closed the html with

'</body></html>'
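Pulling the header, the content loop and the closing tag together, a minimal sketch might look like the following. It is a reconstruction under assumptions, not the original script: the function name build_page, the (title, text) pair layout, the heading level and the placement of the break class are all guesses, and the html.escape calls are an addition so that characters like < in the source text don't break the markup.

```python
import html

HEADER = '''<html><head><style>
    .break { page-break-before: always; }
    p {font-size: 130%;}
    .page {margin: 0px; padding: 0px;}
    @page {size: A5;}
    </style></head><body>
'''
FOOTER = '\n</body></html>'


def build_page(entries):
    """Turn an iterable of (title, text) pairs into one complete html string."""
    parts = [HEADER]
    for title, text in entries:
        # Assumed markup: each title starts a new printed page via the .break class
        parts.append('<h1 class="break">%s</h1>\n' % html.escape(title))
        parts.append('<p>%s</p>\n' % html.escape(text))
    parts.append(FOOTER)
    return ''.join(parts)


if __name__ == "__main__":
    # One sample entry, purely for demonstration
    print(build_page([("A000045", "Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8, ...")]))
```

The returned string can be written straight to a file opened with open('output.html','w'). Building a list of parts and joining once at the end avoids repeated string concatenation, which matters a little when the input runs to tens of megabytes.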

then I ran

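import glob,os
# convert each html chunk to a pdf with the same base name
for x in glob.glob('/home/luke/seq*.htm'):
    os.system('pisa '+x+' '+x[:-3]+'pdf')   # x[:-3] drops 'htm' but keeps the dot
    print(".", end=" ")                      # one progress dot per file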

and copied it all onto my Kindle, so my home lab now has a very useful additional reference library!
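If you want the batch loop to cope with spaces in file names and to tell you which chunks failed, a variant along these lines may help. This is a sketch under assumptions: the names pdf_name and convert_all are invented here, and it assumes the converter reports failure through a non-zero exit status, which I have not verified for pisa.

```python
import glob
import os
import subprocess


def pdf_name(html_path):
    """Swap the file extension for .pdf, e.g. seq0001.htm -> seq0001.pdf."""
    return os.path.splitext(html_path)[0] + '.pdf'


def convert_all(pattern, converter='pisa'):
    """Run the converter on every file matching pattern; return the inputs that failed."""
    failed = []
    for src in sorted(glob.glob(pattern)):
        # Arguments are passed as a list with no shell involved,
        # so paths containing spaces are handed over intact.
        status = subprocess.call([converter, src, pdf_name(src)])
        if status != 0:  # assumption: non-zero exit status means the conversion failed
            failed.append(src)
    return failed


if __name__ == "__main__":
    print(convert_all('/home/luke/seq*.htm'))
```

An empty list back from convert_all means every chunk converted, at least as far as the exit status can tell.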

Addendum: What I left out when I first wrote this was all the rushing around looking at how to make a document with rich-ish formatting using code. I tried another module while on this quest, which was PyRTF. I reasoned that RTF, which was created by Microsoft, was a simple format that could support the usual word processing forms. I tried PyRTF and tried again… and again. I had to read the code because there is ZERO documentation available with this module.

My conclusion, which may sound extreme unless you bear with me, O reader, is that for all the stuff I want to do word processors are nearly pointless. I already know html, and with the additional possibility of setting a specific page size (usually A4 or A5, although A6 is ok on a Kindle too) from CSS for the print view, far and away the easiest way I could see to autogenerate reports and large structured documents is to write them to html and then convert to pdf. ODT is the format mostly used by OpenOffice, and these files are zip archives of XML rather than one plain text file, so I couldn’t see a quick hacky way to get Python to write them. RTF is not really designed to be human readable like html, although it seems to have a tag-based structure.

So I can honestly say that if you have biggish data that you need to present to a user, print or browse, you can steer clear of all word processor formats. Whether this means that word processing is moving to obsolescence I am not sure. WYSIWYG webpage editors are still flawed (or at least were when I last looked), but ultimately that is *all* a word processing app is doing: the docx format, for example, is still an xml variant which is tag based, so the app is reading and writing a code file full of tags and presenting it to the user WYSIWYG.

So why not save yourself the learning curves, learn html and do all your document generation from there? You’re killing two birds with one stone because you can share via the web and make neat print documents with the same process. Blow me out of the sky, Gates, but I think I’m right!

Written by Luke Dunn

November 8, 2013 at 11:07 am
