From piotr_chamera at poczta.onet.pl Fri Jun 10 16:48:25 2011
From: piotr_chamera at poczta.onet.pl (Piotr Chamera)
Date: Fri, 10 Jun 2011 18:48:25 +0200
Subject: [cl-pdf-devel] Some problems with pdf-parser
Message-ID: <4DF24AD9.90601@poczta.onet.pl>

Hi,

I just started with cl-pdf and it works great for me :) but I found
some problems in pdf-parser and need advice on how to fix them.
I am a rather novice Lisper, so I may be wrong in my guesses below...

1. In cl-pdf, the function find-cross-reference-start searches for
'startxref' in the buffer _from the beginning_ and can find the wrong
place if the end of the file (the buffer) contains two such sections
(e.g. a small incremental change at the end of the file).

Proposition: change

    (let ((position (search "startxref" buffer)))

to

    (let ((position (search "startxref" buffer :from-end t)))

2. In cl-pdf, function make-indirect-object:

    (defun make-indirect-object (obj-number gen-number position)
      (let ((object (or (car (gethash (cons obj-number gen-number)
                                      *indirect-objects*))
                        (make-instance 'indirect-object
                                       :obj-number obj-number
                                       :gen-number gen-number
                                       :content :unread
                                       :no-link t))))
        (setf (gethash (cons obj-number gen-number) *indirect-objects*)
              (cons object position))
        object))

I am working on a file generated by Adobe Acrobat Distiller and then
cropped in Adobe Acrobat, so at the end of the file there are a few
modified objects with duplicate numbers (and generations ??? which may
be a bug in Acrobat?). When indirect-object objects are read from the
file (in the order of the cross-reference tables, which are read from
newest to oldest), the newer ones are overwritten by older ones with
the same number. We end up with a readable PDF, but with some object
revisions dropped.
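To illustrate point 1, here is a minimal sketch of how the :from-end
keyword of SEARCH changes which 'startxref' is found. The names
*sample-tail* and startxref-offset are hypothetical and are not part of
cl-pdf; the buffer is assumed to be a string holding the tail of a file
with one incremental update, and the two offsets are made up.

```lisp
;; Hypothetical tail of a PDF with one incremental update: it contains
;; two "startxref" keywords, and a reader should use the last one.
(defparameter *sample-tail*
  (format nil "startxref~%116~%%%EOF~%startxref~%89502~%%%EOF~%"))

(defun startxref-offset (buffer &key from-end)
  "Return the integer that follows the selected \"startxref\" keyword
in BUFFER, or NIL if the keyword is absent."
  (let ((position (search "startxref" buffer :from-end from-end)))
    (when position
      ;; PARSE-INTEGER skips the leading newline; :junk-allowed lets it
      ;; stop at the first character that is not part of the number.
      (parse-integer buffer
                     :start (+ position (length "startxref"))
                     :junk-allowed t))))

;; Searching from the beginning finds the older keyword.
(startxref-offset *sample-tail*)              ; => 116
;; Searching from the end finds the most recent one.
(startxref-offset *sample-tail* :from-end t)  ; => 89502
```

With the default search direction the older offset is returned, which
matches the behaviour described in point 1.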
I added some debugging prints to the above function (and some others),
and for a sample file got this reading order:

    startxref position: 89502
    xref position: 89502
    making obj: 4 0 position 85386
    making obj: 5 0 position 89106
    making obj: 8 0 position 89309
    making obj: 7 0 position 0
    xref position: 116
    making obj: 6 0 position 16
    making obj: 7 0 position 1150
    making obj: 8 0 position 1227
    making obj: 9 0 position 1411
    making obj: 10 0 position 1554
    (...)
    making obj: 37 0 position 936
    xref position: 85210
    making obj: 1 0 position 81250
    making obj: 2 0 position 81284
    making obj: 3 0 position 81308
    making obj: 4 0 position 81359
    making obj: 5 0 position 85007

This shows that the file contains 4 duplicated objects and that they
are overwritten by older versions (4 0, 5 0, 8 0, 7 0).

I think the solution would be to drop older objects when a newer
version with the same number and generation has already been read.
Something like this:

    (defun make-indirect-object (obj-number gen-number position)
      (let ((object (gethash (cons obj-number gen-number) *indirect-objects*)))
        (if object
            ;; An entry from a newer xref section exists: keep it.
            (progn
              (format T "obj already present: ~s ~s at position ~s (dropped older one at position ~s)~%"
                      obj-number gen-number (cdr object) position)
              (car object))
            (progn
              (format T "making obj: ~s ~s position ~s ~%"
                      obj-number gen-number position)
              (let ((new-object (make-instance 'indirect-object
                                               :obj-number obj-number
                                               :gen-number gen-number
                                               :content :unread
                                               :no-link t)))
                (setf (gethash (cons obj-number gen-number) *indirect-objects*)
                      (cons new-object position))
                new-object)))))

On the same example file this gives:

    startxref position: 89502
    xref position: 89502
    making obj: 4 0 position 85386
    making obj: 5 0 position 89106
    making obj: 8 0 position 89309
    making obj: 7 0 position 0
    xref position: 116
    making obj: 6 0 position 16
    obj already present: 7 0 at position 0 (dropped older one at position 1150)
    obj already present: 8 0 at position 89309 (dropped older one at position 1227)
    making obj: 9 0 position 1411
    making obj: 10 0 position 1554
    (...)
    making obj: 37 0 position 936
    xref position: 85210
    making obj: 1 0 position 81250
    making obj: 2 0 position 81284
    making obj: 3 0 position 81308
    obj already present: 4 0 at position 85386 (dropped older one at position 81359)
    obj already present: 5 0 at position 89106 (dropped older one at position 85007)

But this reveals another problem, in read-xref-and-trailer:

    (defun read-xref-and-trailer (position)
      (let (first-trailer)
        (loop
          (format T "xref position: ~s~%" position)
          (read-cross-reference-subsections position)
          (let* ((trailer (read-trailer)))
            (unless first-trailer
              (setf first-trailer trailer))
            (let ((prev-position (get-dict-value trailer "/Prev")))
              (if prev-position
                  (setq position prev-position)
                  (return first-trailer)))))))

If I read it correctly, it reads trailers from the most recent to the
oldest and returns the oldest instead of the first one read? So in
read-pdf the document gets incorrect information.

Can someone review the above and tell me whether I am searching in the
right direction or am entirely wrong...

--
Regards,
Piotr Chamera

From piotr_chamera at poczta.onet.pl Fri Jun 10 18:12:24 2011
From: piotr_chamera at poczta.onet.pl (Piotr Chamera)
Date: Fri, 10 Jun 2011 20:12:24 +0200
Subject: [cl-pdf-devel] Some problems with pdf-parser
In-Reply-To: <4DF24AD9.90601@poczta.onet.pl>
References: <4DF24AD9.90601@poczta.onet.pl>
Message-ID: <4DF25E88.5080809@poczta.onet.pl>

Some corrections to my previous post.

Re point 2 (function make-indirect-object in cl-pdf):

This code seems to work for me and corrects the unexpected behaviour,
but I am not sure it is correct in every case (I added some format
calls for debugging):

    (defun make-indirect-object (obj-number gen-number position)
      (let ((object (gethash (cons obj-number gen-number) *indirect-objects*)))
        (if object
            (progn
              (format T "obj already present: ~s ~s at position ~s"
                      obj-number gen-number (cdr object))
              (if (zerop (cdr object))
                  ;; some objects are created and marked with invalid position = 0
                  (progn
                    (format T " (marked for update with position=0), update position to ~s~%"
                            position)
                    (setf (gethash (cons obj-number gen-number) *indirect-objects*)
                          (cons (car object) position)))
                  (format T " (dropped older one at position ~s)~%" position))
              (car object))
            (progn
              (format T "making obj: ~s ~s position ~s ~%"
                      obj-number gen-number position)
              (let ((new-object (make-instance 'indirect-object
                                               :obj-number obj-number
                                               :gen-number gen-number
                                               :content :unread
                                               :no-link t)))
                (setf (gethash (cons obj-number gen-number) *indirect-objects*)
                      (cons new-object position))
                new-object)))))

Now the order of reading the example file is as follows:

    startxref position: 89502
    xref position: 89502
    making obj: 4 0 position 85386
    making obj: 5 0 position 89106
    making obj: 8 0 position 89309
    making obj: 7 0 position 0
    xref position: 116
    making obj: 6 0 position 16
    obj already present: 7 0 at position 0 (marked for update with position=0), update position to 1150
    obj already present: 8 0 at position 89309 (dropped older one at position 1227)
    making obj: 9 0 position 1411
    making obj: 10 0 position 1554
    (...)
    making obj: 37 0 position 936
    xref position: 85210
    making obj: 1 0 position 81250
    making obj: 2 0 position 81284
    making obj: 3 0 position 81308
    obj already present: 4 0 at position 85386 (dropped older one at position 81359)
    obj already present: 5 0 at position 89106 (dropped older one at position 85007)

> But this reveals another problem in read-xref-and-trailer ...
Please ignore this; I misread the function, and it seems OK.

--
Regards,
Piotr Chamera
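
The policy proposed in this thread (keep the entry from the newest
cross-reference section, but let an older section fill in an entry
whose position is 0) can be tried in isolation with a minimal sketch
like the one below. It is only a model under stated assumptions:
*positions*, note-xref-entry and replay-sample are hypothetical names,
the table stores bare positions instead of cl-pdf's (object . position)
pairs, and the replayed entries are a subset of the log above.

```lisp
;; Hypothetical stand-in for *indirect-objects*:
;; maps (obj-number . gen-number) to a file position.
(defparameter *positions* (make-hash-table :test #'equal))

(defun note-xref-entry (obj-number gen-number position)
  "Record POSITION unless a usable (non-zero) position is already
known for this object. Returns the position that is kept."
  (let* ((key (cons obj-number gen-number))
         (known (gethash key *positions*)))
    (if (and known (not (zerop known)))
        known                                   ; newer entry wins
        (setf (gethash key *positions*) position))))

(defun replay-sample ()
  "Feed a few entries in newest-first order and return the positions
kept for objects 4, 7 and 8."
  (clrhash *positions*)
  (loop for (obj gen pos) in '((4 0 85386) (7 0 0) (8 0 89309)
                               (7 0 1150) (8 0 1227) (4 0 81359))
        do (note-xref-entry obj gen pos))
  (list (gethash '(4 . 0) *positions*)
        (gethash '(7 . 0) *positions*)
        (gethash '(8 . 0) *positions*)))

(replay-sample) ; => (85386 1150 89309)
```

Objects 4 and 8 keep the positions from the newest section, while
object 7, first seen with position 0, takes the position from the older
one. Whether filling in a position-0 entry is right in every case is
the open question raised above; the sketch only reproduces that
behaviour.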