public class SourceOptionsWebCrawl
extends com.ibm.cloud.sdk.core.service.model.GenericModel
| Modifier and Type | Class and Description |
|---|---|
| `static class` | `SourceOptionsWebCrawl.Builder`: Builder. |
| `static interface` | `SourceOptionsWebCrawl.CrawlSpeed`: The number of concurrent URLs to fetch. |
| Modifier and Type | Method and Description |
|---|---|
| `Boolean` | `allowUntrustedCertificate()`: Gets the allowUntrustedCertificate. |
| `List<String>` | `blacklist()`: Gets the blacklist. |
| `String` | `crawlSpeed()`: Gets the crawlSpeed. |
| `Boolean` | `limitToStartingHosts()`: Gets the limitToStartingHosts. |
| `Long` | `maximumHops()`: Gets the maximumHops. |
| `SourceOptionsWebCrawl.Builder` | `newBuilder()`: New builder. |
| `Boolean` | `overrideRobotsTxt()`: Gets the overrideRobotsTxt. |
| `Long` | `requestTimeout()`: Gets the requestTimeout. |
| `String` | `url()`: Gets the url. |
public SourceOptionsWebCrawl.Builder newBuilder()
New builder.
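The model is constructed through its nested builder. The following is a minimal construction sketch, assuming `SourceOptionsWebCrawl.Builder` exposes setter-style methods named after the getters documented below plus a `build()` method; the import path and the example URL are assumptions, not part of this reference.

```java
import com.ibm.watson.discovery.v1.model.SourceOptionsWebCrawl; // package path assumed from the SDK layout

// Configure a web crawl starting from a single URL (the URL is illustrative).
// Builder setter names are assumed to mirror the getters documented below.
SourceOptionsWebCrawl webCrawl = new SourceOptionsWebCrawl.Builder()
    .url("https://example.com/docs")  // starting URL to crawl
    .limitToStartingHosts(true)       // stay on the host of the starting URL
    .crawlSpeed("normal")             // documented values: gentle, normal, aggressive
    .maximumHops(2L)                  // follow links up to two hops from the start page
    .requestTimeout(30000L)           // wait at most 30,000 ms for a server response
    .build();
```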
public String url()
The starting URL to crawl.
public Boolean limitToStartingHosts()
When `true`, crawls of the specified URL are limited to the host part of the **url** field.
public String crawlSpeed()
The number of concurrent URLs to fetch. `gentle` means one URL is fetched at a time with a delay between each call. `normal` means as many as two URLs are fetched concurrently with a short delay between fetch calls. `aggressive` means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
public Boolean allowUntrustedCertificate()
When `true`, allows the crawl to interact with HTTPS sites that present SSL certificates signed by untrusted signers.
public Long maximumHops()
The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within **maximum_hops** of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
public Long requestTimeout()
The maximum number of milliseconds to wait for a response from the web server.
public Boolean overrideRobotsTxt()
When `true`, the crawler ignores any `robots.txt` that it encounters. Do this only when crawling a web site that the user owns. This must be set to `true` when a **gateway_id** is specified in the **credentials**.
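Because instances expose `newBuilder()`, an existing configuration can be copied and adjusted rather than rebuilt from scratch. The following is a minimal sketch continuing the earlier example, assuming `newBuilder()` returns a `Builder` pre-populated with the instance's current values (the usual convention for models extending `GenericModel`).

```java
// Derive a more polite variant of an existing configuration without rebuilding it.
// Assumes builder setters mirror the getters documented above.
SourceOptionsWebCrawl politeCrawl = webCrawl.newBuilder()
    .crawlSpeed("gentle")      // one URL at a time, with a delay between calls
    .overrideRobotsTxt(false)  // respect robots.txt on sites the user does not own
    .build();

// The getters documented above read the configuration back.
System.out.println(politeCrawl.url() + " at crawl speed " + politeCrawl.crawlSpeed());
```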
Copyright © 2021 IBM Cloud. All rights reserved.