[Webtest] verifyPdfText error

Lisa Crispin webtest@lists.canoo.com
Tue, 04 Oct 2005 16:28:05 +0000


Hi Etienne,
Thank you so much.  I will wait and see what Paul & Co. are able to do.  I know it takes a lot of time!  I sure appreciate you and the other contributors to WebTest.  
-- Lisa

 -------------- Original message ----------------------
From: Etienne Studer <etienne.studer.mailinglist@canoo.com>
> Hi Lisa
> 
> I'm glad you found the source of your PDF problem!
> 
> You're right, upgrading to the newest PDFBox version involves more than 
> just replacing the jar file since some PDFBox APIs have changed in the 
> meantime.
> 
> Some weeks ago, I tried the PDFBox upgrade myself. a) Adapting to the 
> new API was quite easy. b) One big change in PDFBox behaviour is that, 
> for PDF files with the same input field occurring multiple times, the 
> field is returned only once, which was detected by my unit tests. I 
> found no quick solution for that issue. But I'm sure Ben (author of 
> PDFBox) would know how to deal with these duplicated input fields (a 
> real world scenario, at least in one big banking project that I know of).
> 
> I handed over the whole pdfunit code to Paul King of the webtest 
> community. The idea is that webtest maintains pdfunit in the future 
> (since I cannot do it anymore due to time constraints).
> 
> Of course, it would be great if pdfunit (and webtest) works with the 
> newest PDFBox version!
> 
> --Etienne
> 
> 
> 
> Lisa Crispin wrote:
> > Hi Etienne,
> > One of my coworkers helped me and found this:
> > PDFBox was choking on a use of a font called "Symbol" or a character of that
> > font. He tried PDFBox 0.7.1 and it had no problem with the document.
> > 
> > I tried downloading the latest PDFBox version and replacing the jar file in
> > the WebTest lib, but this caused a big blow up on verifyPdfText.  How would
> > I go about integrating a newer version of PDFBox? Or is it something the
> > WebTest developer community would consider doing?
> > Thank you,
> > Lisa
> > 
> > -------------- Forwarded Message: --------------
> > From: Etienne Studer <etienne.studer.mailinglist@canoo.com>
> > To: webtest@lists.canoo.com
> > Subject: Re: [Webtest] verifyPdfText error
> > Date: Fri, 2 Sep 2005 14:36:25 +0000
> > 
> >>Hi Lisa
> >>
> >>I suggest sending the PDF file and your question to the developers of 
> >>http://www.pdfbox.org/ since this is the tool used by pdftest for text 
> >>extraction. I'm sure Ben will be able to help you out.
> >>
> >>The code below will allow you to see the whole text that pdfbox extracts 
> >>from a given PDF file. You can run it as a standalone app having the 
> >>pdfbox lib on your classpath.
> >>
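> >>     // Assumes fPdfFile (the PDF to read) and startPage/endPage (ints) are
> >>     // defined, and that the PDFBox classes used below are imported.
> >>     PDFParser pdfParser = new PDFParser(new FileInputStream(fPdfFile));
> >>     pdfParser.parse();
> >>     PDDocument fPdDocument = pdfParser.getPDDocument();
> >>
> >>     // Use a space for line and page breaks so the text comes out as one line.
> >>     PDFTextStripper textStripper = new PDFTextStripper();
> >>     textStripper.setLineSeparator(" ");
> >>     textStripper.setPageSeparator(" ");
> >>     textStripper.setStartPage(startPage);
> >>     textStripper.setEndPage(endPage);
> >>     String text = textStripper.getText(fPdDocument);
> >>     System.out.println(text);  // show everything that was extracted
> >>     fPdDocument.close();       // release the document's resources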
> >>
> >>--Etienne
> >>
> >>
> >>
> >>Lisa Crispin wrote:
> >>
> >>>Hi Paul,
> >>>Our PDFs are not encrypted.  A programmer I work with who has worked
> >>>a lot with PDFs looked at one of the docs that gets the error.  He
> >>>didn't see anything too odd about it, except that it does embed font
> >>>information, like for wingding characters.  The PDFs from which
> >>>WebTest is able to extract text are created with a third-party tool
> >>>called Windward; they don't have any embedded font info.
> >>>Could this be what is tripping up WebTest?
> >>
> >>>thanks
> >>>Lisa
> >>>
> >>> -------------- Original message ----------------------
> >>>From: Paul King <paulk@asert.com.au>
> >>>
> >>>>Hi Lisa, we had problems between 40-bit and 128-bit encryption at
> >>>>one point. From memory, I think 40-bit worked but 128-bit didn't. We
> >>>>changed things around and got what we needed working but I haven't
> >>>>had time to go back and test this further. I have slotted this on my
> >>>>todo to explore further but it will take me some time (recent
> >>>>versions of pdfbox including the one we are using are supposed to
> >>>>support different encryption strengths automatically).
> >>>>
> >>>>If you get time and are able to test between different strengths,
> >>>>that would be useful feedback for me when I do get a chance to look
> >>>>at it again.
> >>>>
> >>>>Cheers, Paul.
> >>>>
> >>>>Lisa Crispin wrote:
> >>>>
> >>>>
> >>>>>I've posted this question before and gotten no response, but hope
> >>>>>springs eternal.
> >>>>>
> >>>>>With some of our PDF docs, verifyPdfText works just fine.  For
> >>>>>others, verifyPdfTitle works, but verifyPdfText gets this error:
> >>>>>
> >>>>>com.canoo.webtest.engine.StepFailedException: Error while extracting
> >>>>>text from document., Step: VerifyPdfTextStep at ...
> >>>>>
> >>>>>And here is what it points to:
> >>>>>	<verifyPdfText description="verify advisor disclosure"
> >>>>>	          text="OMB Approval"/>
> >>>>>
> >>>>>The PDF was created from Microsoft Word.  I don't know if that
> >>>>>somehow makes it unreadable to verifyPdfText?  When I open the pdf
> >>>>>in acrobat reader, it looks normal to me.
> >>>>>
> >>>>>We tend to get problems with these documents as they differ for each
> >>>>>of our different 'brands', so we would really like to have automated
> >>>>>regression tests for them.
> >>>>>
> >>>>>Has anyone else ever had this error?
> >>>>>thanks,
> >>>>>Lisa
> >>>>>_______________________________________________
> >>>>>WebTest mailing list
> >>>>>WebTest@lists.canoo.com
> >>>>>http://lists.canoo.com/mailman/listinfo/webtest
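
A note on the separator settings in Etienne's snippet: setting the line and page separators to a single space presumably flattens the extracted text, so that a phrase can be found even when it wraps across a line or page in the PDF. The same idea can be tried without PDFBox on any already-extracted string. The sketch below is only an illustration of that idea. The class PdfTextMatch and its methods are hypothetical names, not part of WebTest or PDFBox, and this is not how verifyPdfText is actually implemented.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of WebTest or PDFBox. */
final class PdfTextMatch {
    // Matches any run of spaces, tabs, line breaks or form feeds.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private PdfTextMatch() {}

    /** Collapses every run of whitespace into a single space and trims the ends. */
    static String normalize(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    /** Returns true if the expected phrase occurs in the extracted text, ignoring layout. */
    static boolean containsNormalized(String extracted, String expected) {
        return normalize(extracted).contains(normalize(expected));
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF, with line breaks and a form feed.
        String extracted = "Form ADV\nOMB\n  Approval\fPage 2";
        System.out.println(containsNormalized(extracted, "OMB Approval")); // prints true
        System.out.println(containsNormalized(extracted, "OMB Denial"));   // prints false
    }
}
```

If a check like this passes on the text printed by the standalone extraction code but the WebTest step still fails, the problem is more likely in the extraction itself (as with the "Symbol" font case above) than in the text comparison.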