Wednesday, December 7, 2011

How to obtain high-quality ebooks for reading on a Kindle

In this post I record some tips for getting high-quality ebooks to read on a Kindle. All the tools described here are readily available on Ubuntu Linux, and most of them should have Windows or Mac equivalents too.
  • azw and mobi files. These are the native Kindle formats, so ebooks in these formats can usually be opened directly on the Kindle. However, some azw files are DRM-protected, and in that case there is not much you can do.
  • txt, html (zipped or unzipped), chm and epub files. Although these formats are not native to Kindle, they can usually be converted to high-quality mobi files easily using Calibre. chm books with a complicated structure might cause some trouble. However, since a chm file is essentially a compressed archive of html files, you can always decompile it and then process the resulting html with Calibre.
  • pdf files. These files work poorly on Kindle, for the following reasons: (1) The font size cannot be adjusted. (2) Dictionary look-up usually does not work (although sometimes it does). (3) The table of contents and internal links do not work either, which makes it very difficult to navigate inside the document.
    • For pdf files whose text is 100% copyable (such as novels), you can try converting them with Calibre. The results are usually very good.
    • For pdf files that are scanned or contain a lot of math equations, figures or tables, conversion results are usually unsatisfactory. What you can do instead is cut off the white margins and read the file in landscape mode. I usually cut the margins with pdfcrop, which can use different bounding boxes for odd and even pages. For non-scanned pdf files, pdfcrop.pl can usually calculate the white margins automatically (even so, I usually add 2 to the top and bottom margins, so that the black border Kindle draws around the pdf does not hide content near those edges). For scanned pdf files, you usually have to find the bounding boxes manually by trial and error. For scanned pdf files with uneven margins, this process can be painful and quite impractical.
    • For dual-column pdf files, I first trim the white margins using pdfcrop.pl, and then use pdfquarter to split each page down the middle. The outcome is usually very good, because the text ends up larger than in single-column files.
    • The cropped pdf can be very large (100 MB+ in some cases). After applying pdfcrop.pl, I usually run a nautilus script to compress the pdf, which typically reduces the file size significantly.
    • For pdf files with non-embedded Chinese fonts, I usually first print the file to ps from acroread, and then use "ps2pdf -dEmbedAllFonts=true ***.ps" to get a pdf file with the fonts embedded. The resulting pdf file can then be processed using pdfcrop.pl. See this web page for more details.
  • latex files. There are many latex-to-html utilities, such as latex2html, tex4ht, tth, pandoc and plastex. You can also import the document into WYSIWYG scientific writing software such as TeXmacs or LyX, and export to html from there. For small, simple documents you can usually get very good html files, from which you can then create mobi ebooks using Calibre. For anything more complex, however, most of these utilities are far from satisfactory. For example, most of them seem to accept only eps figures and do not like pdf figures, and only standard latex packages are supported. Another way to get latex documents onto the Kindle is to compile them to pdf, but with a layout for a small page (such as A5) instead of the large A4 page. The following forum thread provides several such templates.
  • By the way, in case you don't know already, Calibre can fetch news from online sources, turn it into high-quality newspapers and send them to your Kindle over wifi. You just choose from the 1000+ news sources inside Calibre, and configure Calibre to send them to "yourname@free.kindle.com". Then every morning, when you turn on the wifi on your Kindle, the selected newspapers are downloaded automatically. Isn't that amazing? Just remember that Calibre has to be kept running for this to work (kind of obvious).
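
The routing described in the list above (native formats go straight to the Kindle, text-like formats go through Calibre, pdf files get their margins cropped) could be scripted roughly as follows. This is only a sketch under a few assumptions that are not from the steps above: the function name plan_commands is made up for illustration, Calibre's command-line converter is assumed to be available as ebook-convert, and pdfcrop.pl is assumed to be installed as pdfcrop and to accept --margins "left top right bottom". The script only builds the command lines; it does not run anything.

```python
from pathlib import Path

# Formats that are native to Kindle and need no processing.
NATIVE_FORMATS = {".azw", ".mobi"}
# Formats that usually convert well to mobi with Calibre.
CALIBRE_FORMATS = {".txt", ".html", ".chm", ".epub"}


def plan_commands(path, extra_margin=2):
    """Return the command lines (as argument lists) suggested for one file.

    Hypothetical helper: it only plans the commands, it does not execute them.
    """
    src = Path(path)
    ext = src.suffix.lower()

    if ext in NATIVE_FORMATS:
        # Nothing to do: copy the file to the Kindle as it is.
        return []

    if ext in CALIBRE_FORMATS:
        # Assumes Calibre's ebook-convert tool is on the PATH.
        return [["ebook-convert", str(src), str(src.with_suffix(".mobi"))]]

    if ext == ".pdf":
        # Crop the white margins, adding a little extra space at the
        # top and bottom (assumed pdfcrop margin order: left top right bottom).
        cropped = src.with_name(src.stem + "-crop.pdf")
        margins = f"0 {extra_margin} 0 {extra_margin}"
        return [["pdfcrop", "--margins", margins, str(src), str(cropped)]]

    if ext == ".ps":
        # Re-create a pdf with all fonts embedded.
        return [["ps2pdf", "-dEmbedAllFonts=true", str(src),
                 str(src.with_suffix(".pdf"))]]

    raise ValueError(f"no suggestion for files of type {ext!r}")


if __name__ == "__main__":
    for name in ["novel.epub", "paper.pdf", "notes.mobi"]:
        print(name, "->", plan_commands(name))
```

Each returned list could be passed to subprocess.run once you have checked that the tools are installed under those names on your system. Scanned pdf files and dual-column files would still need the manual steps described above.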
