[Webtest] verifyPdfText error
Lisa Crispin
webtest@lists.canoo.com
Tue, 04 Oct 2005 16:28:05 +0000
Hi Etienne,
Thank you so much. I will wait and see what Paul & Co. are able to do. I know it takes a lot of time! I sure appreciate you and the other contributors to WebTest.
-- Lisa
-------------- Original message ----------------------
From: Etienne Studer <etienne.studer.mailinglist@canoo.com>
> Hi Lisa
>
> I'm glad you found the source of your PDF problem!
>
> You're right, upgrading to the newest PDFBox version involves more than
> just replacing the jar file since some PDFBox APIs have changed in the
> meantime.
>
> Some weeks ago, I tried the PDFBox upgrade myself. a) Adapting to the
> new API was quite easy. b) One big change in PDFBox behaviour is that
> for PDF files in which the same input field occurs multiple times, the
> field is returned only once, which was detected by my unit tests. I
> found no quick solution for that issue. But I'm sure Ben (author of
> PDFBox) would know how to deal with these duplicated input fields (a
> real-world scenario, at least in one big banking project that I know of).
>
> I handed over the whole pdfunit code to Paul King of the webtest
> community. The idea is that webtest maintains pdfunit in the future
> (since I cannot do it anymore due to time constraints).
>
> Of course, it would be great if pdfunit (and webtest) worked with the
> newest PDFBox version!
>
> --Etienne
>
>
>
> Lisa Crispin wrote:
> > Hi Etienne,
> > One of my coworkers helped me and found this:
> > PDFBox was choking on a use of a font called "Symbol" or a character of that
> > font. He tried PDFBox 0.7.1 and it had no problem with the document.
> >
> > I tried downloading the latest PDFBox version and replacing the jar file in
> > the WebTest lib, but this caused a big blow up on verifyPdfText. How would
> > I go about integrating a newer version of PDFBox? Or is it something the
> > WebTest developer community would consider doing?
> > Thank you,
> > Lisa
> >
> > -------------- Forwarded Message: --------------
> > From: Etienne Studer <etienne.studer.mailinglist@canoo.com>
> > To: webtest@lists.canoo.com
> > Subject: Re: [Webtest] verifyPdfText error
> > Date: Fri, 2 Sep 2005 14:36:25 +0000
> >
> >>Hi Lisa
> >>
> >>I suggest sending the PDF file and your question to the developers of
> >>http://www.pdfbox.org/ since this is the tool used by pdftest for text
> >>extraction. I'm sure Ben will be able to help you out.
> >>
> >>The code below will allow you to see the whole text that pdfbox extracts
> >>from a given PDF file. You can run it as a standalone app having the
> >>pdfbox lib on your classpath.
> >>
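> >> // Imports assume the org.pdfbox package layout of the PDFBox 0.7.x line.
> >> import java.io.FileInputStream;
> >> import org.pdfbox.pdfparser.PDFParser;
> >> import org.pdfbox.pdmodel.PDDocument;
> >> import org.pdfbox.util.PDFTextStripper;
> >>
> >> // Assumes fPdfFile (path to the PDF), startPage and endPage are defined.
> >> PDFParser pdfParser = new PDFParser(new FileInputStream(fPdfFile));
> >> pdfParser.parse();
> >> PDDocument fPdDocument = pdfParser.getPDDocument();
> >>
> >> // Extract the text of the chosen pages, joining lines and pages with spaces.
> >> PDFTextStripper textStripper = new PDFTextStripper();
> >> textStripper.setLineSeparator(" ");
> >> textStripper.setPageSeparator(" ");
> >> textStripper.setStartPage(startPage);
> >> textStripper.setEndPage(endPage);
> >> String text = textStripper.getText(fPdDocument);
> >> System.out.println(text);
> >> fPdDocument.close();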
> >>
> >>--Etienne
> >>
> >>
> >>
> >>Lisa Crispin wrote:
> >>
> >>>Hi Paul,
> >>>Our PDFs are not encrypted. A programmer I work with who has worked a lot with
> >>>PDFs looked at one of the docs that gets the error, he didn't see anything too
> >>>odd about it, except that it does embed font information, like for wingding
> >>>characters. The PDFs from which WebTest is able to extract text are created
> >>>with a third-party tool called Windward, they don't have any embedded font info.
> >>>Could this be what is tripping up WebTest?
> >>
> >>>thanks
> >>>Lisa
> >>>
> >>> -------------- Original message ----------------------
> >>>From: Paul King <paulk@asert.com.au>
> >>>
> >>>>Hi Lisa, we had problems between 40-bit and 128-bit encryption at one point.
> >>>>From memory, I think 40-bit worked but 128-bit didn't. We changed things around
> >>>>and got what we needed working but I haven't had time to go back and test this
> >>>>further. I have slotted this on my todo to explore further but it will take me
> >>>>some time (recent versions of pdfbox including the one we are using are supposed
> >>>>to support different encryption strengths automatically).
> >>>>
> >>>>If you get time and are able to test between different strengths, that would be
> >>>>useful feedback for me when I do get a chance to look at it again.
> >>>>
> >>>>Cheers, Paul.
> >>>>
> >>>>Lisa Crispin wrote:
> >>>>
> >>>>
> >>>>>I've posted this question before and gotten no response, but hope springs
> >>>>>eternal.
> >>>>>
> >>>>>With some of our PDF docs, verifyPdfText works just fine. For others,
> >>>>>verifyPdfTitle works, but verifyPdfText gets this error:
> >>>>>
> >>>>>com.canoo.webtest.engine.StepFailedException: Error while extracting text from
> >>>>>document., Step: VerifyPdfTextStep at ...
> >>>>>
> >>>>>And here is what it points to:
> >>>>> <verifyPdfText description="verify advisor disclosure"
> >>>>> text="OMB Approval"/>
> >>>>>
> >>>>>The PDF was created from Microsoft Word. I don't know if that somehow makes
> >>>>>it unreadable to verifyPdfText? When I open the pdf in acrobat reader, it looks
> >>>>>normal to me.
> >>>>>
> >>>>>We tend to get problems with these documents as they differ for each of our
> >>>>>different 'brands', so we would really like to have automated regression tests
> >>>>>for them.
> >>>>
> >>>>
> >>>>>Has anyone else ever had this error?
> >>>>>thanks,
> >>>>>Lisa
> >>>>>_______________________________________________
> >>>>>WebTest mailing list
> >>>>>WebTest@lists.canoo.com
> >>>>>http://lists.canoo.com/mailman/listinfo/webtest