[Webtest] ReportSite step
Jon Poulton [kinomi]
webtest@lists.canoo.com
Tue, 10 Jan 2006 11:36:31 -0000
Hi Denis,
Multiple errors (including StepFailedExceptions thrown by any steps the
container runs) are simply recorded by the Spider for later access by the
reportSite step. At the moment this behavior is not configurable, but
perhaps it would be useful for some people to cause reportSite to fail as
soon as it hits its first error. When the spider finishes its run,
reportSite then calls methods on the spider to obtain any errors that have
occurred; if it finds any, it simply logs them to LOG.error, and throws a
StepFailedException. Perhaps this isn't the neatest way of approaching
things, but it's made life a little easier for me.
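As a rough illustration of the "record errors during the crawl, fail once at the end" approach described above, here is a minimal sketch. It is not the attached code: the names CollectThenFail, PageCheck, runAll and reportAndFail are hypothetical, plain strings stand in for pages, and IllegalStateException stands in for StepFailedException. The failFast flag shows how the "fail on first error" option mentioned above might look if it were made configurable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: these names are hypothetical and not part of WebTest. */
final class CollectThenFail {

    /** Stand-in for the steps run against each page. */
    interface PageCheck {
        void run(String url) throws Exception;
    }

    private final List<String> failures = new ArrayList<String>();
    private final boolean failFast;

    CollectThenFail(boolean failFast) {
        this.failFast = failFast;
    }

    /** Runs the check on every URL; failures are recorded rather than ending the run. */
    void runAll(List<String> urls, PageCheck check) {
        for (String url : urls) {
            try {
                check.run(url);
            } catch (Exception e) {
                failures.add(url + ": " + e.getMessage());
                if (failFast) {
                    // Optional behaviour: stop at the first error.
                    throw new IllegalStateException("Failed on " + url, e);
                }
            }
        }
    }

    List<String> getFailures() {
        return failures;
    }

    /** Logs everything that was recorded, then fails once if anything went wrong. */
    void reportAndFail() {
        for (String failure : failures) {
            System.err.println("Failure: " + failure);
        }
        if (!failures.isEmpty()) {
            throw new IllegalStateException(failures.size() + " page(s) failed");
        }
    }

    public static void main(String[] args) {
        CollectThenFail crawl = new CollectThenFail(false);
        crawl.runAll(Arrays.asList("/ok", "/broken", "/missing"), new PageCheck() {
            public void run(String url) throws Exception {
                if (!url.equals("/ok")) {
                    throw new Exception("check failed");
                }
            }
        });
        try {
            crawl.reportAndFail();
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

With failFast set to false every page is checked and all failures are reported together; with true, the run stops at the first failing page.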
I found the set of classes used for reporting and verification slightly
confusing, and decided that the best course of action was to strip the
package down to its bare bones so I could better understand what was going
on. As a result there are currently only two classes (Spider and
ReportSite).
I've sent you the source code if you fancy a look/play. As I said, they need
tidying up, but they do work. I've placed the two modified classes in a
different package to your original code.
Jon
-----Original Message-----
From: webtest-admin@gate2.canoo.com [mailto:webtest-admin@gate2.canoo.com]
On Behalf Of Denis N. Antonioli
Sent: 10 January 2006 11:10
To: webtest@gate2.canoo.com
Subject: Re: [Webtest] ReportSite step
Hi
On 9 Jan. 06, at 17:38, Jon Poulton [kinomi] wrote:
> I found a largely undocumented class which seemed to do something
> along these lines called <reportSite>, which is an extension step
> written by someone called Denis Antonioli.
Indeed. I wrote the step for the previous version of webtest (http-based),
improved it for a client's project, and never quite finished the
port to the newer webtest version.
> After having a look at the source it seemed to do more or less what
> was required, although unfortunately the Spider class it used
> wasn't quite working correctly (it tries to click mailto: links,
> and occasionally does other strange things). Anyway, I've written a
> replacement class in my own package. It's not entirely finished,
> but it does the job as far as our requirements go. I was wondering
> a few things:
>
>
>
> 1) What are the spider package extension steps supposed to
> do, exactly? Are they still under development (hence lack of
> documentation)? Is anyone using them?
I'm still using it with webtest 1-6, and plan to move it to the newer
version.
> 2) Is anyone interested in the replacement classes I've written?
If it's working better, definitely.
> 3) If so, what would be the best way forward? Replace the
> existing spider classes? Deprecate the existing classes and place
> mine in a separate package?
I would replace the existing classes, but I don't know who else is
using them.
> Let me know what you think,
In my experience, the weaknesses of the existing spider were its
handling of multiple errors and its reporting, which was quite
difficult to read.
How does your solution handle this?
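One possible way to make a multi-error report easier to read is to group the recorded failures by page, so each URL appears once with its reasons listed beneath it. The sketch below is only an illustration under that assumption: FailureReport and its methods are hypothetical names, not part of WebTest or of the attached classes, and pages are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: a hypothetical report that groups failure reasons by page URL. */
final class FailureReport {

    // LinkedHashMap keeps pages in the order in which they first failed.
    private final Map<String, List<String>> reasonsByPage =
            new LinkedHashMap<String, List<String>>();

    /** Records one failure reason against the page it occurred on. */
    void add(String pageUrl, String reason) {
        List<String> reasons = reasonsByPage.get(pageUrl);
        if (reasons == null) {
            reasons = new ArrayList<String>();
            reasonsByPage.put(pageUrl, reasons);
        }
        reasons.add(reason);
    }

    /** Number of distinct pages with at least one failure. */
    int pageCount() {
        return reasonsByPage.size();
    }

    /** Renders one block per page: the URL, then an indented line per reason. */
    String format() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : reasonsByPage.entrySet()) {
            out.append(entry.getKey()).append('\n');
            for (String reason : entry.getValue()) {
                out.append("  - ").append(reason).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        FailureReport report = new FailureReport();
        report.add("http://example.com/a", "title check failed");
        report.add("http://example.com/b", "link could not be followed");
        report.add("http://example.com/a", "text check failed");
        System.out.print(report.format());
    }
}
```

Grouping by page keeps the report length proportional to the number of broken pages rather than the number of individual errors.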
Best
dna
--
A Perl module would prefer that you stayed out of its living room
because you weren't invited, not because it has a shotgun.
-- Larry Wall & Tom Christiansen in the Camel
[Attachment: Spider.java]
/*
* Copyright (c) 2005 Canoo Engineering. All Rights Reserved.
*/
package com.canoo.webtest.extension.spidertwo;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import org.apache.log4j.Logger;
import org.apache.oro.text.perl.Perl5Util;
import com.canoo.webtest.engine.CallBlock;
import com.canoo.webtest.engine.Context;
import com.canoo.webtest.engine.RegExStringVerifier;
import com.canoo.webtest.engine.StepFailedException;
import com.canoo.webtest.steps.Step;
import com.canoo.webtest.steps.request.AbstractTargetAction;
import com.canoo.webtest.steps.request.ClickLink;
import com.canoo.webtest.steps.request.TargetHelper;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
/**
 * Spider class to crawl through a webpage, visiting each link, but not visiting each page more
 * than once.
 *
 * @author Denis N. Antonioli, Jon Poulton
 */
public class Spider {
    private static final Logger LOG = Logger.getLogger(Spider.class);

    private static final Perl5Util PERL;
    static {
        PERL = new Perl5Util();
    }

    /** Set of failed links. */
    private final Set fFailedVisits = new HashSet();

    /** Set of visited urls. */
    private final Set fVisitedUrls = new HashSet();

    /** The step sequence to run for each page. */
    private List fStepSequence;

    /** File name to report to. */
    private String fFileName;

    /** Depth parameter. Only set to make sure we bottom out (don't run forever). */
    private int fDepth;

    /** Whether to fail on error or not. */
    private boolean fFailOnError;

    /** Starting context. */
    private Context fContext;

    /** Pattern of links that must be checked. */
    private String fIncludesPattern;

    /** Pattern of links to ignore. */
    private String fExcludesPattern;

    /** Used to keep follow() happy. */
    private AbstractTargetAction fDummyStep = new ClickLink();

    /** TargetHelper using the dummy step above. */
    private TargetHelper fTargetHelper = new TargetHelper(fDummyStep);

    public Set getFailedVisits() {
        return fFailedVisits;
    }

    public void setIncludes(String includes) {
        fIncludesPattern = includes;
    }

    public void setExcludes(String excludes) {
        fExcludesPattern = excludes;
    }

    public void setStepSequence(List stepSequence) {
        fStepSequence = stepSequence;
    }

    public void setFailOnError(boolean failOnError) {
        fFailOnError = failOnError;
    }

    public void setDepth(final int depth) {
        fDepth = depth;
    }

    public void setContext(final Context context) {
        fContext = context;
    }

    /**
     * Start the spider.
     *
     * @param context the starting context for the spider.
     */
    public void execute(Context context) {
        // Validate the spider has been set up correctly
        validate();

        // Set the starting context
        setContext(context);
        // Call the recursive visit method. No cast to HtmlPage here: visit() checks
        // the page type itself and skips non-HTML responses.
        visit(fContext.getCurrentResponse(), fDepth);
    }

    /**
     * Validate the spider has been set up correctly before execution.
     * <p>
     * Called at the start of <code>execute()</code>.
     */
    void validate() {
        if (fDepth < 0) {
            throw new IllegalArgumentException("depth must be >= 0");
        }
    }

    /**
     * Visit the Html page specified by the current response and spider through all
     * of its (unvisited and valid) links.
     *
     * @param currentResponse the current response
     * @param depth how many more levels of links may be followed from this page
     */
    void visit(final Page currentResponse, final int depth) {

        // Check the current response is an HTML page, rather than a text file, or image etc.
        if (fContext.getCurrentResponse() instanceof HtmlPage) {

            // First validate the contents of the page from the current response
            LOG.info("Validating contents of page: " + currentResponse.getWebResponse().getUrl());
            for (Iterator iter = fStepSequence.iterator(); iter.hasNext();) {
                Step step = (Step) iter.next();
                step.setContext(fContext);
                try {
                    step.execute();
                } catch (StepFailedException e) {
                    // Log the failure for report at end.
                    LOG.info("Step failed exception thrown: " + e.getMessage());
                    FailedPage failure = new FailedPage("Step failed exception thrown: " + e.getMessage(),
                            currentResponse.getWebResponse().getUrl());
                    fFailedVisits.add(failure);
                    // Only fail on error if the flag for it is set
                    if (fFailOnError) {
                        throw e;
                    }
                }
            }

            HtmlPage response = (HtmlPage) fContext.getCurrentResponse();
            LOG.info("Looking for links in " + response);

            // Iterate through all the anchors in the current response.
            for (Iterator iter = response.getAnchors().iterator(); iter.hasNext();) {
                HtmlAnchor link = (HtmlAnchor) iter.next();

                // If we have not exceeded our depth limit, have not visited the link yet,
                // and it is valid to visit, then process the link.
                if (depth > 0 && needsReport(response, link)) {
                    try {
                        follow(link);
                    } catch (Throwable t) {
                        LOG.info("Error following link: " + link + " : message was: " + t.getMessage());
                        FailedPage failure = new FailedPage("Error following link: " + link
                                + " : message was: " + t.getMessage(), response.getWebResponse().getUrl());
                        fFailedVisits.add(failure);
                        // Only fail on error if the flag for it is set
                        if (fFailOnError) {
                            throw new LinkFollowException("Error following link: " + link);
                        }
                        // The link could not be followed, so there is no new page to visit.
                        continue;
                    }

                    visit(fContext.getCurrentResponse(), depth - 1);
                }
            }
        } else {
            LOG.info("Ignoring contents of current response (not HTML page): "
                    + fContext.getCurrentResponse().getWebResponse().getUrl());
        }
    }

    /**
     * Follow the specified link.
     * <p>
     * This method should only be called if the link has not yet been followed by the spider.
     *
     * @param link the link to follow
     */
    void follow(final HtmlAnchor link) {
        fTargetHelper.protectedGoto("Link href=" + link.getHrefAttribute(), new CallBlock() {
            public void call() throws Exception {
                LOG.debug("Clicking on link with href: " + link.getHrefAttribute());
                link.click();
            }
        }, fDummyStep, "spider");
    }

    /**
     * Determine if this link needs reporting or not.
     * <p>
     * A link needs reporting if: it matches the regular expression (if any) for links to check,
     * it is an HTTP protocol link, the link has not been visited before.
     * @param response the response
     * @param link the candidate link to check
     * @return true if the link should be evaluated, false otherwise
     */
    boolean needsReport(final HtmlPage response, final HtmlAnchor link) {

        // Check to see if we've visited before.
        if (fVisitedUrls.contains(link.getHrefAttribute())) {
            return false;
        }

        URL url = null;
        try {
            url = response.getFullyQualifiedUrl(link.getHrefAttribute());

            // Only ever follow HTTP links
            if (!url.getProtocol().equalsIgnoreCase("http")) {
                LOG.info("Skipped:" + link.getHrefAttribute() + ": ignoring all non-HTTP protocols.");
                return false;
            }

            // Check the host is the same host we started on
            if (!response.getWebResponse().getUrl().getHost().equalsIgnoreCase(url.getHost())) {
                LOG.info("Skipped:" + link.getHrefAttribute() + ": host is not an on-site host.");
                return false;
            }

        } catch (MalformedURLException e) {
            // Record the details of broken/malformed links and return false.
            // url is still null at this point, so report the raw href instead.
            FailedPage failure = new FailedPage("Malformed URL: " + link.getHrefAttribute(),
                    response.getWebResponse().getUrl());
            fFailedVisits.add(failure);
            LOG.info("Skipped:" + link.getHrefAttribute() + ": url is malformed.");
            return false;
        }

        RegExStringVerifier verifier = new RegExStringVerifier();

        // If we have an includes pattern make sure it matches it
        if (fIncludesPattern != null && fIncludesPattern.length() > 0) {
            if (!verifier.verifyStrings(fIncludesPattern, url.toString())) {
                LOG.info("Skipped: " + url + ": does not match 'includes' regex: " + fIncludesPattern);
                return false;
            }
        }
        // If we have an excludes pattern make sure it DOESN'T match it
        if (fExcludesPattern != null && fExcludesPattern.length() > 0) {
            if (verifier.verifyStrings(fExcludesPattern, url.toString())) {
                LOG.info("Skipped: " + url + ": matches 'excludes' regex: " + fExcludesPattern);
                return false;
            }
        }

        // If we've made it this far, then we've passed all the above checks,
        // so return true and add to visited URLs
        fVisitedUrls.add(link.getHrefAttribute());
        return true;
    }
}

/**
 * Utility data holder
 */
class FailedPage {
    private String fFailReason;
    private URL fFailedPage;

    FailedPage(final String failReason, final URL failedUrl) {
        fFailReason = failReason;
        fFailedPage = failedUrl;
    }

    public String getFailedReason() {
        return fFailReason;
    }

    public URL getFailedUrl() {
        return fFailedPage;
    }
}

/**
 * Exception to flag that following a link didn't work
 */
class LinkFollowException extends RuntimeException {
    LinkFollowException(String str) {
        super(str);
    }
}
[Attachment: SiteReportStep.java]
/*
* Copyright (c) 2005 Canoo Engineering. All Rights Reserved.
*/
package com.canoo.webtest.extension.spidertwo;
import java.util.Iterator;
import java.util.Set;
import org.apache.log4j.Logger;
import com.canoo.webtest.engine.Context;
import com.canoo.webtest.engine.StepFailedException;
import com.canoo.webtest.steps.AbstractStepContainer;
/**
 * For all pages on the target site, run the same set of steps, as specified within this
 * step container.
 * <p>
 * This class will navigate spider-like through all pages on the site, performing the same
 * set of test steps on each page.
 * @author Jon Poulton
 * @webtest.step category="Extension"
 *   name="siteReport"
 *   description="This step is used to test a complete site."
 */
public class SiteReportStep extends AbstractStepContainer {
    private static final Logger LOG = Logger.getLogger(SiteReportStep.class);
    public static final String DEFAULT_STEPTYPE = "siteReport";
    private int fDepth;
    private String fExcludes = "";
    private String fIncludes = "";

    /**
     * @webtest.parameter required="no"
     *   default="<empty>"
     *   description="If <em>excludes</em> is set then each link found is compared to the defined string (via regexp), if it matches then the link is not followed."
     */
    public void setExcludes(String regex) {
        fExcludes = regex;
    }

    public String getExcludes() {
        return fExcludes;
    }

    /**
     * Set the includes regex on the step.
     * @webtest.parameter required="no"
     *   default="<all>"
     *   description="If <em>includes</em> is set then each link found is compared to the defined string (via regexp), if it matches then the link is processed, others are ignored."
     */
    public void setIncludes(String regex) {
        fIncludes = regex;
    }

    /**
     * Get the includes regex
     * @return the includes regex as a String
     */
    public String getIncludes() {
        return fIncludes;
    }

    /* (non-Javadoc)
     * @see com.canoo.webtest.steps.Step#doExecute()
     */
    public void doExecute() throws CloneNotSupportedException {

        // Execute this step, and all the steps it contains
        final Context context = getContext();
        setStepType(DEFAULT_STEPTYPE);

        // Create a new Spider to crawl the site
        Spider spider = new Spider();

        // Set a decent level of depth on the spider.
        spider.setDepth(100);

        // Set the set of steps to perform on each page on the Spider
        spider.setStepSequence(getSteps());

        // Set any regular expressions
        spider.setExcludes(fExcludes);
        spider.setIncludes(fIncludes);

        // Set fail on error flag
        spider.setFailOnError(false);

        // Execute
        spider.execute(context);

        // Get any failures
        Set failures = spider.getFailedVisits();
        for (Iterator iter = failures.iterator(); iter.hasNext();) {
            FailedPage page = (FailedPage) iter.next();
            LOG.error("Failure reason: " + page.getFailedReason() + " on page: " + page.getFailedUrl());
        }

        if (failures.size() > 0) {
            throw new StepFailedException("Site spider found failed pages.");
        }
    }
}