java.lang.Object
  |
  +--org.apache.avalon.framework.logger.AbstractLoggable
        |
        +--org.apache.cocoon.components.crawler.SimpleCocoonCrawlerImpl
A simple cocoon crawler.
Inner Class Summary | |
static class |
SimpleCocoonCrawlerImpl.CocoonCrawlerIterator
Helper class implementing an Iterator. This Iterator implementation calculates the links of a URL before returning it from the next() method. |
Field Summary | |
static java.lang.String |
ACCEPT_CONFIG
Config element name specifying the HTTP header value for Accept. |
static java.lang.String |
ACCEPT_DEFAULT
Default value of accept configuration value. |
static java.lang.String |
EXCLUDE_CONFIG
Config element name specifying excluding regular expression pattern. |
static java.lang.String |
INCLUDE_CONFIG
Config element name specifying including regular expression pattern. |
static java.lang.String |
LINK_CONTENT_TYPE_CONFIG
Config element name specifying the expected link content-type. |
java.lang.String |
LINK_CONTENT_TYPE_DEFAULT
Default value of link-content-type configuration value. |
static java.lang.String |
LINK_VIEW_QUERY_CONFIG
Config element name specifying the query string appended when requesting the links of a URL. |
static java.lang.String |
LINK_VIEW_QUERY_DEFAULT
Default value of link-view-query configuration value. |
static java.lang.String |
USER_AGENT_CONFIG
Config element name specifying the HTTP header value for User-Agent. |
static java.lang.String |
USER_AGENT_DEFAULT
Default value of user-agent configuration value. |
Fields inherited from interface org.apache.cocoon.components.crawler.CocoonCrawler |
ROLE |
Constructor Summary | |
SimpleCocoonCrawlerImpl()
Constructor for the SimpleCocoonCrawlerImpl object |
Method Summary | |
void |
configure(org.apache.avalon.framework.configuration.Configuration configuration)
Configure the crawler component. |
void |
crawl(java.net.URL url)
Start crawling a URL. |
void |
dispose()
Dispose at end of life cycle, releasing all resources. |
java.util.Iterator |
iterator()
Return an iterator over all links of the currently crawled URL. |
void |
recycle()
Recycle this object, releasing resources. |
Methods inherited from class org.apache.avalon.framework.logger.AbstractLoggable |
getLogger, setLogger, setupLogger, setupLogger, setupLogger |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
public static final java.lang.String LINK_CONTENT_TYPE_CONFIG
Its value is link-content-type.
public final java.lang.String LINK_CONTENT_TYPE_DEFAULT
Default value of link-content-type configuration value.
Its value is application/x-cocoon-links.
public static final java.lang.String LINK_VIEW_QUERY_CONFIG
Its value is link-view-query.
public static final java.lang.String LINK_VIEW_QUERY_DEFAULT
Default value of link-view-query configuration value.
Its value is ?cocoon-view=links.
public static final java.lang.String EXCLUDE_CONFIG
Its value is exclude.
public static final java.lang.String INCLUDE_CONFIG
Its value is include.
public static final java.lang.String USER_AGENT_CONFIG
Its value is user-agent.
public static final java.lang.String USER_AGENT_DEFAULT
Default value of user-agent configuration value.
See Also: Constants.COMPLETE_NAME
public static final java.lang.String ACCEPT_CONFIG
Its value is accept.
public static final java.lang.String ACCEPT_DEFAULT
Default value of accept configuration value.
Its value is */*.
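The following is a minimal sketch of how the defaults listed above could shape a links-view request. It is not taken from the Cocoon source: the class LinkViewRequestSketch and its methods are hypothetical, the query string is assumed to be appended to a URL that has no query of its own, and the link content type is assumed to be compared against the response's Content-Type header.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical helper, not part of Cocoon: prepares a links-view request. */
final class LinkViewRequestSketch {

    // Values copied from the Field Detail entries above.
    static final String LINK_VIEW_QUERY_DEFAULT = "?cocoon-view=links";
    static final String ACCEPT_DEFAULT = "*/*";
    static final String LINK_CONTENT_TYPE_DEFAULT = "application/x-cocoon-links";

    /** Appends the link-view query to a page URL (assumes the URL has no query yet). */
    static URL linkViewUrl(String page, String linkViewQuery) {
        try {
            return new URL(page + linkViewQuery);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + page, e);
        }
    }

    /** Creates a connection object carrying the configured headers; nothing is sent yet. */
    static URLConnection prepare(URL linkView, String userAgent, String accept) {
        try {
            URLConnection connection = linkView.openConnection();
            connection.setRequestProperty("User-Agent", userAgent);
            connection.setRequestProperty("Accept", accept);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if a Content-Type header value starts with the expected link content type. */
    static boolean isLinkContentType(String contentType, String expected) {
        if (contentType == null) {
            return false;
        }
        String trimmed = contentType.trim();
        // Case-insensitive prefix match, so "type; charset=..." is accepted too.
        return trimmed.regionMatches(true, 0, expected, 0, expected.length());
    }
}
```

A caller would pass the result of linkViewUrl("http://foo/bar", LINK_VIEW_QUERY_DEFAULT) to prepare(...), then check the response's content type with isLinkContentType before reading any links.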
Constructor Detail |
public SimpleCocoonCrawlerImpl()
Method Detail |
public void configure(org.apache.avalon.framework.configuration.Configuration configuration) throws org.apache.avalon.framework.configuration.ConfigurationException
The configuration specifies which URIs to include in crawling and which to exclude. Patterns are given as regular expressions.
Moreover, you can configure the required content type of the crawling request and the query string appended to each crawling request.
  <include>.*\.html?</include> or <include>.*\.html?, .*\.xsp</include>
  <exclude>.*\.gif</exclude> or <exclude>.*\.gif, .*\.jpe?g</exclude>
  <link-content-type> application/x-cocoon-links </link-content-type>
  <link-view-query> ?cocoon-view=links </link-view-query>
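To illustrate how comma-separated include and exclude patterns like those above might be applied to a URI, here is a minimal sketch. It makes several assumptions that the documentation does not confirm: it uses java.util.regex as a stand-in for whatever regular expression library the crawler uses, it requires a pattern to match the whole URI, it lets exclude patterns win over include patterns, and it treats an empty include list as "accept everything". The class CrawlFilterSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of include/exclude filtering; not the Cocoon implementation. */
final class CrawlFilterSketch {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    CrawlFilterSketch(String includeConfig, String excludeConfig) {
        this.includes = parse(includeConfig);
        this.excludes = parse(excludeConfig);
    }

    /** Splits a comma-separated element value into compiled patterns. */
    static List<Pattern> parse(String commaSeparated) {
        List<Pattern> patterns = new ArrayList<>();
        if (commaSeparated == null) {
            return patterns;
        }
        // Note: splitting on commas means a pattern may not itself contain a comma.
        for (String token : commaSeparated.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    /** Assumed policy: excludes win; an empty include list accepts everything. */
    boolean accepts(String uri) {
        for (Pattern exclude : excludes) {
            if (exclude.matcher(uri).matches()) {
                return false;
            }
        }
        if (includes.isEmpty()) {
            return true;
        }
        for (Pattern include : includes) {
            if (include.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With the include value ".*\.html?, .*\.xsp" and the exclude value ".*\.gif, .*\.jpe?g", this sketch accepts http://foo/bar/index.html and rejects http://foo/bar/logo.gif.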
Specified by:
configure in interface org.apache.avalon.framework.configuration.Configurable
Parameters:
configuration - XML configuration of this Avalon component.
Throws:
org.apache.avalon.framework.configuration.ConfigurationException - thrown if the configuration is invalid.
public void dispose()
Specified by:
dispose in interface org.apache.avalon.framework.activity.Disposable
public void recycle()
Specified by:
recycle in interface org.apache.avalon.excalibur.pool.Recyclable
public void crawl(java.net.URL url)
Use this method to start crawling.
Get this URL and all its children by using iterator().
The Iterator object will return URL objects.
You may use the crawl() and iterator() methods in the following way:
  SimpleCocoonCrawlerImpl scci = ....;
  scci.crawl(new URL("http://foo/bar"));
  Iterator i = scci.iterator();
  while (i.hasNext()) {
    URL url = (URL) i.next();
    ...
  }
The i.next() method returns a URL and calculates the links of that URL before returning it.
Specified by:
crawl in interface CocoonCrawler
Parameters:
url - Crawl this URL, getting all links from this URL.
public java.util.Iterator iterator()
The Iterator object will return URL objects from its next() method.
Specified by:
iterator in interface CocoonCrawler
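The behavior described for crawl() and iterator(), where next() calculates the links of a URL before returning it, can be sketched with a small standalone iterator. This is an illustration only, not the CocoonCrawlerIterator source: the class LinkCrawlIteratorSketch is hypothetical, pages are represented as plain strings rather than URL objects, and an in-memory map stands in for the links view that a real crawler would request over HTTP.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

/** Hypothetical sketch: an Iterator that resolves a page's links before returning it. */
final class LinkCrawlIteratorSketch implements Iterator<String> {
    // Stand-in for the links view: maps a page to the links found on it.
    private final Map<String, List<String>> linkSource;
    private final Deque<String> toCrawl = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    LinkCrawlIteratorSketch(String start, Map<String, List<String>> linkSource) {
        this.linkSource = linkSource;
        toCrawl.add(start);
        seen.add(start);
    }

    @Override
    public boolean hasNext() {
        return !toCrawl.isEmpty();
    }

    @Override
    public String next() {
        if (toCrawl.isEmpty()) {
            throw new NoSuchElementException();
        }
        String current = toCrawl.removeFirst();
        // Calculate the links of the current page before returning it,
        // queueing only pages that have not been seen yet.
        for (String link : linkSource.getOrDefault(current, Collections.emptyList())) {
            if (seen.add(link)) {
                toCrawl.addLast(link);
            }
        }
        return current;
    }
}
```

Because links are resolved inside next(), the queue grows as iteration proceeds, and the set of seen pages keeps circular links from causing an endless crawl.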