[Webtest] verifyPdfText error
Paul King
webtest@lists.canoo.com
Sun, 09 Oct 2005 09:17:19 +1000
Hi Lisa, I do plan to get back to looking at pdfunit (and to see if we can
leverage off jpdfunit too if that makes sense) but it won't be for a couple
of weeks. I have to prepare for a couple of Agile conference talks and that
is taking up my free time at the moment.
Cheers, Paul.
Lisa Crispin wrote:
> Hi Etienne,
> Thank you so much. I will wait and see what Paul & Co. are able to do. I know it takes a lot of time! I sure appreciate you and the other contributors to WebTest.
> -- Lisa
>
> -------------- Original message ----------------------
> From: Etienne Studer <etienne.studer.mailinglist@canoo.com>
>
>>Hi Lisa
>>
>>I'm glad you found the source of your PDF problem!
>>
>>You're right, upgrading to the newest PDFBox version involves more than
>>just replacing the jar file since some PDFBox APIs have changed in the
>>meantime.
>>
>>Some weeks ago, I tried the PDFBox upgrade myself. a) Adapting to the
>>new API was quite easy. b) One big change in PDFBox behaviour is that,
>>for PDF files with the same input field occurring multiple times, the
>>field is returned only once - which was detected by my unit tests. I
>>found no quick solution for that issue. But I'm sure Ben (author of
>>PDFBox) would know how to deal with these duplicated input fields (a
>>real world scenario, at least in one big banking project that I know of).
>>
>>I handed over the whole pdfunit code to Paul King of the webtest
>>community. The idea is that webtest maintains pdfunit in the future
>>(since I cannot do it anymore due to time constraints).
>>
>>Of course, it would be great if pdfunit (and webtest) works with the
>>newest PDFBox version!
>>
>>--Etienne
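
To make the duplicated-field case above easier to spot in a unit test, something along these lines might help. It is only a sketch under assumptions: `DuplicateFieldSketch` and `duplicatedNames` are hypothetical names, the list of field names is made-up stand-in data for whatever the PDF library returns, and nothing here calls PDFBox or pdfunit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of PDFBox, pdfunit, or WebTest. */
final class DuplicateFieldSketch {

    /** Returns the field names that occur more than once, in first-seen order. */
    static List<String> duplicatedNames(List<String> fieldNames) {
        // Count how often each name appears, keeping insertion order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : fieldNames) {
            counts.merge(name, 1, Integer::sum);
        }
        // Keep only the names seen at least twice.
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Made-up field names standing in for the fields of a real form.
        List<String> names = Arrays.asList("accountNumber", "customerName", "accountNumber");
        System.out.println(duplicatedNames(names)); // prints [accountNumber]
    }
}
```

A test that knows how many occurrences of each field a form should contain could compare that expectation with what the library reports, which is roughly how a "returned only once" change would show up.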
>>
>>
>>
>>Lisa Crispin wrote:
>>
>>>Hi Etienne,
>>>One of my coworkers helped me and found this:
>>>PDFBox was choking on a use of a font called "Symbol" or a character of that
>>>font. He tried PDFBox 0.7.1 and it had no problem with the document.
>>>
>>>I tried downloading the latest PDFBox version and replacing the jar file in
>>>the WebTest lib, but this caused a big blow up on verifyPdfText. How would
>>>I go about integrating a newer version of PDFBox? Or is it something the
>>>WebTest developer community would consider doing?
>>>Thank you,
>>>Lisa
>>>
>>>-------------- Forwarded Message: --------------
>>>From: Etienne Studer <etienne.studer.mailinglist@canoo.com>
>>>To: webtest@lists.canoo.com
>>>Subject: Re: [Webtest] verifyPdfText error
>>>Date: Fri, 2 Sep 2005 14:36:25 +0000
>>>
>>>
>>>>Hi Lisa
>>>>
>>>>I suggest sending the PDF file and your question to the developers of
>>>>http://www.pdfbox.org/ since this is the tool used by pdftest for text
>>>>extraction. I'm sure Ben will be able to help you out.
>>>>
>>>>The code below will allow you to see the whole text that pdfbox extracts
>>>>from a given PDF file. You can run it as a standalone app with the
>>>>pdfbox lib on your classpath.
>>>>
>>>> // Parse the PDF (fPdfFile is the file to inspect)
>>>> PDFParser pdfParser = new PDFParser(new FileInputStream(fPdfFile));
>>>> pdfParser.parse();
>>>> PDDocument fPdDocument = pdfParser.getPDDocument();
>>>>
>>>> // Extract the text of pages startPage..endPage as one long string
>>>> PDFTextStripper textStripper = new PDFTextStripper();
>>>> textStripper.setLineSeparator(" ");
>>>> textStripper.setPageSeparator(" ");
>>>> textStripper.setStartPage(startPage);
>>>> textStripper.setEndPage(endPage);
>>>> String text = textStripper.getText(fPdDocument);
>>>> System.out.println(text);
>>>>
>>>>--Etienne
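
Once the extracted string is available, a check like the one below can show whether the expected text is really in it. This is a minimal sketch, not WebTest's actual verifyPdfText logic: `ExtractedTextCheck`, `normalize`, and `containsText` are hypothetical names, the sample string is made up, and the whitespace handling is an assumption about what a useful comparison looks like.

```java
/** Hypothetical helper; not part of WebTest, pdfunit, or PDFBox. */
final class ExtractedTextCheck {

    /** Collapses every run of whitespace to a single space and trims the ends. */
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    /** True if the expected fragment occurs in the extracted text after normalization. */
    static boolean containsText(String extracted, String expected) {
        return normalize(extracted).contains(normalize(expected));
    }

    public static void main(String[] args) {
        // Made-up stand-in for the string returned by the text stripper.
        String extracted = "Page one\nOMB   Approval \n page two";
        System.out.println(containsText(extracted, "OMB Approval"));    // prints true
        System.out.println(containsText(extracted, "Not in document")); // prints false
    }
}
```

Normalizing whitespace on both sides keeps line breaks or double spaces in the extracted text from hiding a match.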
>>>>
>>>>
>>>>
>>>>Lisa Crispin wrote:
>>>>
>>>>
>>>>>Hi Paul,
>>>>>Our PDFs are not encrypted. A programmer I work with who has worked a
>>>>>lot with PDFs looked at one of the docs that gets the error. He didn't
>>>>>see anything too odd about it, except that it does embed font
>>>>>information, like for wingding characters. The PDFs from which WebTest
>>>>>is able to extract text are created with a third-party tool called
>>>>>Windward; they don't have any embedded font info. Could this be what is
>>>>>tripping up WebTest?
>>>>>
>>>>>thanks
>>>>>Lisa
>>>>>
>>>>>-------------- Original message ----------------------
>>>>>From: Paul King <paulk@asert.com.au>
>>>>>
>>>>>>Hi Lisa, we had problems between 40-bit and 128-bit encryption at one
>>>>>>point. From memory, I think 40-bit worked but 128-bit didn't. We changed
>>>>>>things around and got what we needed working but I haven't had time to
>>>>>>go back and test this further. I have slotted this on my todo to explore
>>>>>>further but it will take me some time (recent versions of pdfbox
>>>>>>including the one we are using are supposed to support different
>>>>>>encryption strengths automatically).
>>>>>>
>>>>>>If you get time and are able to test between different strengths, that
>>>>>>would be useful feedback for me when I do get a chance to look at it
>>>>>>again.
>>>>>>
>>>>>>Cheers, Paul.
>>>>>>
>>>>>>Lisa Crispin wrote:
>>>>>>
>>>>>>>I've posted this question before and gotten no response, but hope
>>>>>>>springs eternal.
>>>>>>>
>>>>>>>With some of our PDF docs, verifyPdfText works just fine. For others,
>>>>>>>verifyPdfTitle works, but verifyPdfText gets this error:
>>>>>>>
>>>>>>>com.canoo.webtest.engine.StepFailedException: Error while extracting
>>>>>>>text from document., Step: VerifyPdfTextStep at ...
>>>>>>>
>>>>>>>And here is what it points to:
>>>>>>> <verifyPdfText description="verify advisor disclosure"
>>>>>>> text="OMB Approval"/>
>>>>>>>
>>>>>>>The PDF was created from Microsoft Word. I don't know if that somehow
>>>>>>>makes it unreadable to verifyPdfText? When I open the pdf in acrobat
>>>>>>>reader, it looks normal to me.
>>>>>>>
>>>>>>>We tend to get problems with these documents as they differ for each of
>>>>>>>our different 'brands', so we would really like to have automated
>>>>>>>regression tests for them.
>>>>>>>
>>>>>>>Has anyone else ever had this error?
>>>>>>>thanks,
>>>>>>>Lisa