[Webtest] verifyPdfText error
Etienne Studer
webtest@lists.canoo.com
Mon, 03 Oct 2005 22:13:00 -0700
Hi Lisa
I'm glad you found the source of your PDF problem!
You're right, upgrading to the newest PDFBox version involves more than
just replacing the jar file, since some PDFBox APIs have changed in the
meantime.
Some weeks ago, I tried the PDFBox upgrade myself. a) Adapting to the
new API was quite easy. b) One big change in PDFBox behaviour is that,
for PDF files in which the same input field occurs multiple times, the
field is returned only once, which was detected by my unit tests. I found
no quick solution for that issue, but I'm sure Ben (author of PDFBox)
would know how to deal with these duplicated input fields (a real-world
scenario, at least in one big banking project that I know of).
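The symptom in (b) can be illustrated without PDFBox. The following is only a minimal sketch under assumptions: the class and method names are hypothetical, the field names are invented, and it does not reproduce PDFBox or pdfunit code. It simply shows how a test that expects every occurrence of a field would notice when fields are collapsed by name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical names, not PDFBox or pdfunit code.
class DuplicateFieldSketch {

    // Keeps every occurrence, which is what a test expecting duplicates relies on.
    static List<String> allOccurrences(List<String> fieldNames) {
        return new ArrayList<>(fieldNames);
    }

    // Keys fields by name, so repeated names collapse into a single entry.
    // The value records how many times each name occurred.
    static Map<String, Integer> collapsedByName(List<String> fieldNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : fieldNames) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented field names; "accountNo" appears twice in the document.
        List<String> names = Arrays.asList("accountNo", "customerName", "accountNo");
        System.out.println(allOccurrences(names).size());            // prints 3
        System.out.println(collapsedByName(names).size());           // prints 2
        System.out.println(collapsedByName(names).get("accountNo")); // prints 2
    }
}
```

A unit test asserting three fields would pass against the first method and fail against the second, which is the kind of difference described above.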
I handed over the whole pdfunit code to Paul King of the webtest
community. The idea is that webtest maintains pdfunit in the future
(since I cannot do it anymore due to time constraints).
Of course, it would be great if pdfunit (and webtest) worked with the
newest PDFBox version!
--Etienne
Lisa Crispin wrote:
> Hi Etienne,
> One of my coworkers helped me and found this:
> PDFBox was choking on a use of a font called "Symbol" or a character of that
> font. He tried PDFBox 0.7.1 and it had no problem with the document.
>
> I tried downloading the latest PDFBox version and replacing the jar file in
> the WebTest lib, but this caused a big blow up on verifyPdfText. How would
> I go about integrating a newer version of PDFBox? Or is it something the
> WebTest developer community would consider doing?
> Thank you,
> Lisa
>
> -------------- Forwarded Message: --------------
> From: Etienne Studer <etienne.studer.mailinglist@canoo.com>
> To: webtest@lists.canoo.com
> Subject: Re: [Webtest] verifyPdfText error
> Date: Fri, 2 Sep 2005 14:36:25 +0000
>
>>Hi Lisa
>>
>>I suggest sending the PDF file and your question to the developers of
>>http://www.pdfbox.org/ since this is the tool used by pdftest for text
>>extraction. I'm sure Ben will be able to help you out.
>>
>>The code below will allow you to see the whole text that pdfbox extracts
>>from a given PDF file. You can run it as a standalone app with the
>>pdfbox lib on your classpath.
>>
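>> // Needs (PDFBox 0.7.x package names): org.pdfbox.pdfparser.PDFParser,
>> //   org.pdfbox.pdmodel.PDDocument, org.pdfbox.util.PDFTextStripper,
>> //   java.io.FileInputStream
>> // fPdfFile, startPage and endPage are supplied by the caller.
>> PDFParser pdfParser = new PDFParser(new FileInputStream(fPdfFile));
>> pdfParser.parse();
>> PDDocument fPdDocument = pdfParser.getPDDocument();
>>
>> // Join lines and pages with single spaces so the text is one long string.
>> PDFTextStripper textStripper = new PDFTextStripper();
>> textStripper.setLineSeparator(" ");
>> textStripper.setPageSeparator(" ");
>> textStripper.setStartPage(startPage);
>> textStripper.setEndPage(endPage);
>> String text = textStripper.getText(fPdDocument);
>> System.out.println(text);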
>>
>>--Etienne
>>
>>
>>
>>Lisa Crispin wrote:
>>
>>>Hi Paul,
>>>Our PDFs are not encrypted. A programmer I work with who has worked a
>>>lot with PDFs looked at one of the docs that gets the error. He didn't
>>>see anything too odd about it, except that it does embed font
>>>information, like for wingding characters. The PDFs from which WebTest
>>>is able to extract text are created with a third-party tool called
>>>Windward; they don't have any embedded font info.
>>>Could this be what is tripping up WebTest?
>>
>>>thanks
>>>Lisa
>>>
>>> -------------- Original message ----------------------
>>>From: Paul King <paulk@asert.com.au>
>>>
>>>>Hi Lisa, we had problems between 40-bit and 128-bit encryption at one
>>>>point. From memory, I think 40-bit worked but 128-bit didn't. We
>>>>changed things around and got what we needed working but I haven't had
>>>>time to go back and test this further. I have slotted this on my todo
>>>>to explore further but it will take me some time (recent versions of
>>>>pdfbox including the one we are using are supposed to support
>>>>different encryption strengths automatically).
>>>>
>>>>If you get time and are able to test between different strengths, that
>>>>would be useful feedback for me when I do get a chance to look at it
>>>>again.
>>>>
>>>>Cheers, Paul.
>>>>
>>>>Lisa Crispin wrote:
>>>>
>>>>
>>>>>I've posted this question before and gotten no response, but hope
>>>>>springs eternal.
>>>>>
>>>>>With some of our PDF docs, verifyPdfText works just fine. For others,
>>>>>verifyPdfTitle works, but verifyPdfText gets this error:
>>>>>
>>>>>com.canoo.webtest.engine.StepFailedException: Error while extracting
>>>>>text from document., Step: VerifyPdfTextStep at ...
>>>>>
>>>>>And here is what it points to:
>>>>> <verifyPdfText description="verify advisor disclosure"
>>>>> text="OMB Approval"/>
>>>>>
>>>>>The PDF was created from Microsoft Word. I don't know if that somehow
>>>>>makes it unreadable to verifyPdfText? When I open the pdf in acrobat
>>>>>reader, it looks normal to me.
>>>>>
>>>>>We tend to get problems with these documents as they differ for each
>>>>>of our different 'brands', so we would really like to have automated
>>>>>regression tests for them.
>>>>>
>>>>>Has anyone else ever had this error?
>>>>>thanks,
>>>>>Lisa
>>>>>_______________________________________________
>>>>>WebTest mailing list
>>>>>WebTest@lists.canoo.com
>>>>>http://lists.canoo.com/mailman/listinfo/webtest