Learning how to parse PDF
Since I joined OASIS, I’ve decided I’d better bone up on existing document formats in order to maximise my contribution to the OpenDocument spec. So I jumped headfirst into the wonderful world of PDF parsing.
Actually, the Adobe spec is really well written and organized. Sadly, the same cannot be said of some of the poppler code; the newer stuff is all right, but the older XPDF-based code is pretty hard to slog through.
After dealing with the poorly implemented and poorly documented ball of lint that is RTF, I was bracing myself for the worst. PDF is actually not that hard to follow; it has some odd quirks, but nothing too daunting. I’ve written an analysis over at my company site. Hopefully I will be able to implement some of this in the near future.