org.apache.cocoon.components.crawler
Class SimpleCocoonCrawlerImpl

java.lang.Object
  |
  +--org.apache.avalon.framework.logger.AbstractLoggable
        |
        +--org.apache.cocoon.components.crawler.SimpleCocoonCrawlerImpl
All Implemented Interfaces:
CocoonCrawler, org.apache.avalon.framework.component.Component, org.apache.avalon.framework.configuration.Configurable, org.apache.avalon.framework.activity.Disposable, org.apache.avalon.framework.logger.Loggable, org.apache.avalon.excalibur.pool.Poolable, org.apache.avalon.excalibur.pool.Recyclable

public class SimpleCocoonCrawlerImpl
extends org.apache.avalon.framework.logger.AbstractLoggable
implements CocoonCrawler, org.apache.avalon.framework.configuration.Configurable, org.apache.avalon.framework.activity.Disposable, org.apache.avalon.excalibur.pool.Recyclable

A simple cocoon crawler.

Version:
CVS $Id: SimpleCocoonCrawlerImpl.java,v 1.9.2.2 2002/08/17 04:10:12 vgritsenko Exp $
Author:
Bernhard Huber

Inner Class Summary
static class SimpleCocoonCrawlerImpl.CocoonCrawlerIterator
          Helper class implementing an Iterator. This Iterator implementation calculates the links of a URL before returning it from the next() method.
 
Field Summary
static java.lang.String ACCEPT_CONFIG
          Config element name specifying the value of the HTTP Accept header.
static java.lang.String ACCEPT_DEFAULT
          Default value of accept configuration value.
static java.lang.String EXCLUDE_CONFIG
          Config element name specifying excluding regular expression pattern.
static java.lang.String INCLUDE_CONFIG
          Config element name specifying including regular expression pattern.
static java.lang.String LINK_CONTENT_TYPE_CONFIG
          Config element name specifying the expected link content-type.
 java.lang.String LINK_CONTENT_TYPE_DEFAULT
          Default value of link-content-type configuration value.
static java.lang.String LINK_VIEW_QUERY_CONFIG
          Config element name specifying the query string appended when requesting the links of a URL.
static java.lang.String LINK_VIEW_QUERY_DEFAULT
          Default value of link-view-query configuration value.
static java.lang.String USER_AGENT_CONFIG
          Config element name specifying the value of the HTTP User-Agent header.
static java.lang.String USER_AGENT_DEFAULT
          Default value of user-agent configuration value.
 
Fields inherited from interface org.apache.cocoon.components.crawler.CocoonCrawler
ROLE
 
Constructor Summary
SimpleCocoonCrawlerImpl()
          Constructor for the SimpleCocoonCrawlerImpl object
 
Method Summary
 void configure(org.apache.avalon.framework.configuration.Configuration configuration)
          Configure the crawler component.
 void crawl(java.net.URL url)
          Start crawling a URL.
 void dispose()
          Dispose at the end of the life cycle, releasing all resources.
 java.util.Iterator iterator()
          Return iterator, iterating over all links of the currently crawled URL.
 void recycle()
          Recycle this object, releasing resources.
 
Methods inherited from class org.apache.avalon.framework.logger.AbstractLoggable
getLogger, setLogger, setupLogger, setupLogger, setupLogger
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LINK_CONTENT_TYPE_CONFIG

public static final java.lang.String LINK_CONTENT_TYPE_CONFIG
Config element name specifying the expected link content-type.

Its value is link-content-type.

Since:
 

LINK_CONTENT_TYPE_DEFAULT

public final java.lang.String LINK_CONTENT_TYPE_DEFAULT
Default value of link-content-type configuration value.

Its value is application/x-cocoon-links.

Since:
 

LINK_VIEW_QUERY_CONFIG

public static final java.lang.String LINK_VIEW_QUERY_CONFIG
Config element name specifying the query string appended when requesting the links of a URL.

Its value is link-view-query.

Since:
 

LINK_VIEW_QUERY_DEFAULT

public static final java.lang.String LINK_VIEW_QUERY_DEFAULT
Default value of link-view-query configuration value.

Its value is ?cocoon-view=links.
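
As a rough illustration of how a link-view query such as the default above could be combined with a page URL, here is a minimal sketch. The class and method names (LinkViewUrlSketch, linkViewUrl) are hypothetical and not part of the Cocoon API, and plain string concatenation is an assumption; it does not handle URLs that already carry a query string.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

/** Hypothetical helper for illustration; not part of the Cocoon API. */
public class LinkViewUrlSketch {

    // Append the configured link-view query (for example "?cocoon-view=links")
    // to the page URL. Assumption: the page URL has no query string of its own.
    static URL linkViewUrl(String pageUrl, String linkViewQuery) {
        try {
            return URI.create(pageUrl + linkViewQuery).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + pageUrl, e);
        }
    }
}
```

Calling linkViewUrl("http://foo/bar", "?cocoon-view=links") yields a URL whose query part is cocoon-view=links.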

Since:
 

EXCLUDE_CONFIG

public static final java.lang.String EXCLUDE_CONFIG
Config element name specifying excluding regular expression pattern.

Its value is exclude.

Since:
 

INCLUDE_CONFIG

public static final java.lang.String INCLUDE_CONFIG
Config element name specifying including regular expression pattern.

Its value is include.

Since:
 

USER_AGENT_CONFIG

public static final java.lang.String USER_AGENT_CONFIG
Config element name specifying the value of the HTTP User-Agent header.

Its value is user-agent.

Since:
 

USER_AGENT_DEFAULT

public static final java.lang.String USER_AGENT_DEFAULT
Default value of user-agent configuration value.
See Also:
Constants.COMPLETE_NAME
Since:
 

ACCEPT_CONFIG

public static final java.lang.String ACCEPT_CONFIG
Config element name specifying the value of the HTTP Accept header.

Its value is accept.

Since:
 

ACCEPT_DEFAULT

public static final java.lang.String ACCEPT_DEFAULT
Default value of accept configuration value.

Its value is */*.
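
The user-agent and accept settings describe HTTP request headers. The following sketch shows one way such values could be applied to a request using only java.net. The class name CrawlerRequestSketch, the method prepare, and the sample user-agent string are hypothetical; how the crawler itself issues its requests is not described on this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical helper for illustration; not part of the Cocoon API. */
public class CrawlerRequestSketch {

    // Create a connection object (no network traffic happens yet) and set
    // the two request headers named by the user-agent and accept settings.
    static URLConnection prepare(String url, String userAgent, String accept) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setRequestProperty("Accept", accept);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The headers are only sent once the caller connects, for example by calling getInputStream() on the returned connection.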

Since:
 
Constructor Detail

SimpleCocoonCrawlerImpl

public SimpleCocoonCrawlerImpl()
Constructor for the SimpleCocoonCrawlerImpl object
Since:
 
Method Detail

configure

public void configure(org.apache.avalon.framework.configuration.Configuration configuration)
               throws org.apache.avalon.framework.configuration.ConfigurationException
Configure the crawler component.

The configuration can specify which URIs to include in crawling and which URIs to exclude. Patterns are given as regular expressions.

Moreover, you can configure the required content-type of the crawling request and the query string appended to each crawling request.


 <include>.*\.html?</include> or <include>.*\.html?, .*\.xsp</include>
 <exclude>.*\.gif</exclude> or <exclude>.*\.gif, .*\.jpe?g</exclude>
 <link-content-type> application/x-cocoon-links </link-content-type>
 <link-view-query> ?cocoon-view=links </link-view-query>
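
To make the include and exclude semantics concrete, here is a minimal sketch of a filter built from comma-separated regular expressions like those in the example above. It rests on several assumptions: the class UriPatternFilter is hypothetical, java.util.regex stands in for whatever regular-expression library the component uses, a pattern must match the whole URI, and exclude patterns take precedence over include patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the Cocoon API. */
public class UriPatternFilter {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    public UriPatternFilter(String includeConfig, String excludeConfig) {
        this.includes = compile(includeConfig);
        this.excludes = compile(excludeConfig);
    }

    // Split a comma-separated element value into trimmed, compiled patterns.
    static List<Pattern> compile(String config) {
        List<Pattern> patterns = new ArrayList<>();
        if (config == null) {
            return patterns;
        }
        for (String part : config.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    // Assumption: a URI is accepted if no exclude pattern matches it and,
    // when include patterns are configured, at least one of them matches.
    public boolean accepts(String uri) {
        for (Pattern p : excludes) {
            if (p.matcher(uri).matches()) {
                return false;
            }
        }
        if (includes.isEmpty()) {
            return true;
        }
        for (Pattern p : includes) {
            if (p.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With the include value ".*\.html?, .*\.xsp" and the exclude value ".*\.gif, .*\.jpe?g", this sketch accepts http://foo/index.html and rejects http://foo/logo.gif.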
 
Specified by:
configure in interface org.apache.avalon.framework.configuration.Configurable
Parameters:
configuration - XML configuration of this avalon component.
Throws:
org.apache.avalon.framework.configuration.ConfigurationException - thrown if the configuration is invalid.
Since:
 

dispose

public void dispose()
Dispose at the end of the life cycle, releasing all resources.
Specified by:
dispose in interface org.apache.avalon.framework.activity.Disposable
Since:
 

recycle

public void recycle()
Recycle this object, releasing resources.
Specified by:
recycle in interface org.apache.avalon.excalibur.pool.Recyclable
Since:
 

crawl

public void crawl(java.net.URL url)
Start crawling a URL.

Use this method to start crawling. Retrieve this URL and all of its children by using iterator(). The Iterator object will return URL objects.

You may use the crawl(), and iterator() methods the following way:


   SimpleCocoonCrawlerImpl scci = ....;
   scci.crawl( new URL( "http://foo/bar" ) );
   Iterator i = scci.iterator();
   while (i.hasNext()) {
     URL url = (URL)i.next();
     ...
   }
 

The i.next() method returns a URL, and calculates the links of that URL before returning it.
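
The lazy behaviour described above, where the links of a URL are computed inside next() just before the URL is returned, can be sketched as follows. This is an illustration under stated assumptions, not the CocoonCrawlerIterator implementation: the class LazyLinkIterator and its linkSource parameter are hypothetical, URLs are represented as plain strings, and the links of a page are supplied by a caller-provided function instead of an HTTP request.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch; names are illustrative, not the Cocoon classes. */
public class LazyLinkIterator implements Iterator<String> {
    private final Deque<String> toVisit = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private final Function<String, List<String>> linkSource;

    public LazyLinkIterator(String start, Function<String, List<String>> linkSource) {
        this.linkSource = linkSource;
        toVisit.add(start);
        seen.add(start);
    }

    @Override
    public boolean hasNext() {
        return !toVisit.isEmpty();
    }

    @Override
    public String next() {
        if (toVisit.isEmpty()) {
            throw new NoSuchElementException();
        }
        String current = toVisit.removeFirst();
        // Compute the links of the current URL before returning it, and
        // queue every link that has not been seen yet.
        for (String link : linkSource.apply(current)) {
            if (seen.add(link)) {
                toVisit.addLast(link);
            }
        }
        return current;
    }
}
```

Because links are queued in discovery order and each URL is recorded in the seen set, every reachable URL is returned exactly once, in breadth-first order, even when pages link back to each other.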

Specified by:
crawl in interface CocoonCrawler
Parameters:
url - Crawl this URL, getting all links from this URL.
Since:
 

iterator

public java.util.Iterator iterator()
Return iterator, iterating over all links of the currently crawled URL.

The Iterator object will return URL objects from its next() method.

Specified by:
iterator in interface CocoonCrawler
Returns:
An Iterator over all links from the crawled URL.
Since:
 


Copyright © 1999-2002 Apache Software Foundation. All Rights Reserved.