Class SourceOptionsWebCrawl
- All Implemented Interfaces:
com.ibm.cloud.sdk.core.service.model.ObjectModel
public class SourceOptionsWebCrawl
extends com.ibm.cloud.sdk.core.service.model.GenericModel
-
Nested Class Summary
Nested Classes
Modifier and Type    Class    Description
static class    SourceOptionsWebCrawl.Builder    Builder.
static interface    SourceOptionsWebCrawl.CrawlSpeed    The number of concurrent URLs to fetch.
-
Method Summary
Modifier and Type    Method    Description
Boolean    allowUntrustedCertificate()    Gets the allowUntrustedCertificate.
List<String>    blacklist()    Gets the blacklist.
String    crawlSpeed()    Gets the crawlSpeed.
Boolean    limitToStartingHosts()    Gets the limitToStartingHosts.
Long    maximumHops()    Gets the maximumHops.
SourceOptionsWebCrawl.Builder    newBuilder()    New builder.
Boolean    overrideRobotsTxt()    Gets the overrideRobotsTxt.
Long    requestTimeout()    Gets the requestTimeout.
String    url()    Gets the url.
Methods inherited from class com.ibm.cloud.sdk.core.service.model.GenericModel
equals, hashCode, toString
-
Method Details
-
newBuilder
New builder.
- Returns:
- a SourceOptionsWebCrawl builder
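For context, a minimal sketch of constructing this model with its builder. The import path and the setter-style builder methods shown here are assumptions inferred from the field names documented below (url, limitToStartingHosts, maximumHops, requestTimeout), not confirmed signatures.

import com.ibm.watson.discovery.v1.model.SourceOptionsWebCrawl; // assumed package path

public class WebCrawlOptionsSketch {
  public static void main(String[] args) {
    // Sketch only: assumes the Builder exposes setter-style methods named
    // after the model's fields plus a build() method, as is typical for
    // GenericModel-based SDK models.
    SourceOptionsWebCrawl webCrawl = new SourceOptionsWebCrawl.Builder()
        .url("https://example.com")   // starting URL to crawl (example value)
        .limitToStartingHosts(true)   // stay on the starting host
        .maximumHops(2L)              // follow links at most two hops from the start page
        .requestTimeout(30000L)       // wait up to 30 seconds per request
        .build();

    // GenericModel supplies equals, hashCode, and toString.
    System.out.println(webCrawl);
  }
}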
-
url
Gets the url. The starting URL to crawl.
- Returns:
- the url
-
limitToStartingHosts
Gets the limitToStartingHosts. When `true`, crawls of the specified URL are limited to the host part of the **url** field.
- Returns:
- the limitToStartingHosts
-
crawlSpeed
Gets the crawlSpeed. The number of concurrent URLs to fetch. `gentle` means one URL is fetched at a time with a delay between each call. `normal` means as many as two URLs are fetched concurrently with a short delay between fetch calls. `aggressive` means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
- Returns:
- the crawlSpeed
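As an illustration, the crawl speed could be set through the builder using the nested CrawlSpeed interface. This fragment continues the builder sketch above; the constant names (GENTLE, NORMAL, AGGRESSIVE) are assumed to mirror the documented string values and are not confirmed here.

// Sketch only: CrawlSpeed is assumed to define String constants for the
// documented values "gentle", "normal", and "aggressive".
SourceOptionsWebCrawl webCrawl = new SourceOptionsWebCrawl.Builder()
    .url("https://example.com")
    .crawlSpeed(SourceOptionsWebCrawl.CrawlSpeed.GENTLE) // one URL at a time, with a delay
    .build();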
-
allowUntrustedCertificate
Gets the allowUntrustedCertificate. When `true`, allows the crawl to interact with HTTPS sites whose SSL certificates are signed by untrusted authorities.
- Returns:
- the allowUntrustedCertificate
-
maximumHops
Gets the maximumHops. The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within **maximum_hops** of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
- Returns:
- the maximumHops
-
requestTimeout
Gets the requestTimeout. The maximum number of milliseconds to wait for a response from the web server.
- Returns:
- the requestTimeout
-
overrideRobotsTxt
Gets the overrideRobotsTxt. When `true`, the crawler ignores any `robots.txt` file it encounters. This should only be done when crawling a website the user owns. This must be set to `true` when a **gateway_id** is specified in the **credentials**.
- Returns:
- the overrideRobotsTxt
-
blacklist
Gets the blacklist. Array of URLs to exclude while crawling. The crawler does not follow links that contain any of these strings. For example, listing `https://ibm.com/watson` also excludes `https://ibm.com/watson/discovery`.
- Returns:
- the blacklist
-
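To make the substring-matching behavior of the blacklist concrete, a hedged fragment continuing the builder sketch above; passing the blacklist as a List<String> to a `blacklist` builder method is an assumption inferred from the List<String> return type of blacklist().

// Sketch only: excluding "https://ibm.com/watson" also excludes any URL that
// contains that string, such as "https://ibm.com/watson/discovery".
SourceOptionsWebCrawl webCrawl = new SourceOptionsWebCrawl.Builder()
    .url("https://ibm.com")
    .blacklist(java.util.Arrays.asList("https://ibm.com/watson"))
    .build();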