When true, allows the crawl to interact with HTTPS sites that have SSL certificates signed by untrusted signers.
Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.
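The exclusion test is a substring match, so a shorter URL excludes everything beneath it. A minimal sketch of that behavior, using a hypothetical isExcluded helper (not part of the SDK):

```typescript
// Illustrative only: mirrors the documented substring-based exclusion.
// A link is skipped if its URL contains any of the excluded strings.
function isExcluded(link: string, exclusions: string[]): boolean {
  return exclusions.some((pattern) => link.includes(pattern));
}

// Listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery:
isExcluded('https://ibm.com/watson/discovery', ['https://ibm.com/watson']); // true
```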
The number of concurrent URLs to fetch. gentle means one URL is fetched at a time, with a delay between each call. normal means as many as two URLs are fetched concurrently, with a short delay between fetch calls. aggressive means that up to ten URLs are fetched concurrently, with a short delay between fetch calls.
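As a rough summary of the three speeds, here is a hypothetical TypeScript mapping; the type and constant names are illustrative, and the concurrency limits come from the description above:

```typescript
// Illustrative summary of the documented crawl speeds; not part of the SDK.
type CrawlSpeed = 'gentle' | 'normal' | 'aggressive';

const maxConcurrentFetches: Record<CrawlSpeed, number> = {
  gentle: 1,      // one URL at a time, with a delay between calls
  normal: 2,      // up to two URLs concurrently, short delay between calls
  aggressive: 10, // up to ten URLs concurrently, short delay between calls
};
```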
When true, crawls of the specified URL are limited to the host part of the url field.
The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
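The hop counting can be pictured as a breadth-first traversal. The sketch below is only an illustration of that counting, not how the service is implemented; fetchLinks is a hypothetical helper that returns the links found on a page:

```typescript
// Sketch of hop counting: the starting URL is hop 0, its links are hop 1,
// and so on. Pages beyond maximumHops from the start are never visited.
async function crawl(
  startUrl: string,
  maximumHops: number,
  fetchLinks: (url: string) => Promise<string[]>
): Promise<Set<string>> {
  const seen = new Set<string>([startUrl]);
  let frontier: string[] = [startUrl];
  for (let hop = 0; hop < maximumHops && frontier.length > 0; hop++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of await fetchLinks(url)) {
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return seen; // every page within maximumHops of the starting URL
}
```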
When true, the crawler ignores any robots.txt it encounters. This should only ever be done when crawling a web site the user owns. This must be set to true when a gateway_id is specified in the credentials.
The maximum number of milliseconds to wait for a response from the web server.
The starting URL to crawl.
Object defining which URL to crawl and how to crawl it.
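Putting the fields together, a crawl configuration might look like the sketch below. All property names and values here are assumptions inferred from the field descriptions on this page and may not match the SDK's generated interface exactly; the example.com values are placeholders:

```typescript
// Hypothetical web crawl configuration assembled from the fields above.
const webCrawl = {
  url: 'https://example.com/docs',            // starting URL to crawl
  limit_to_starting_hosts: true,              // stay on the starting host
  crawl_speed: 'normal',                      // gentle | normal | aggressive
  allow_untrusted_certificate: false,         // reject untrusted SSL signers
  maximum_hops: 2,                            // crawl links of links, no further
  request_timeout: 30000,                     // max milliseconds per response
  override_robots_txt: false,                 // honor robots.txt
  blacklist: ['https://example.com/private'], // URLs to exclude (name assumed)
};
```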