Configuring

Configure the scheduler, filters, and security from the respective tabs on the Create a New Content Source page.

Configuring the scheduler

To configure the schedule, click the Scheduler tab to display the following options:

  • Define Schedule

    In this box, set the date, time, and interval, and then click Create to add a new schedule.

  • Scheduled updates

    This box lists the scheduled crawl updates.

Note

The time interval between crawler runs must be longer than the maximum execution time. A crawler cannot be started while it is already running. If a crawler job is triggered while the crawler is running, that trigger is ignored and the crawler starts again only at the next scheduled time.
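The scheduling rule in this note can be sketched in a few lines of Python. This is only an illustration of the described behavior, not product code: `CrawlerState`, `on_schedule_tick`, and `interval_is_safe` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class CrawlerState:
    """Hypothetical stand-in for a crawler's run state (not a product API)."""
    running: bool = False
    runs_started: int = 0


def on_schedule_tick(state: CrawlerState) -> bool:
    """Handle one scheduled start time. Returns True if a run was started."""
    if state.running:
        # The previous run is still in progress, so this trigger is ignored
        # and the crawler waits for the next scheduled time.
        return False
    state.running = True
    state.runs_started += 1
    return True


def interval_is_safe(interval_minutes: int, max_execution_minutes: int) -> bool:
    """The interval between runs must be longer than the maximum execution time."""
    return interval_minutes > max_execution_minutes
```

For example, a 60-minute interval with a 45-minute maximum execution time passes the check, while an interval equal to the maximum execution time does not.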

Configuring the filters

The crawler filters control the crawler progress and the type of documents that are indexed and cataloged. Crawler filters are divided into the following two types:

  • URL filters

    These filters control which documents are crawled and indexed, based on the URL where the documents are found.

  • Type filters

    These filters control which documents are crawled and indexed, based on the document type.

If you define no filters, all documents from a content source are fetched and crawled. If you define include filters, only those documents that pass the include filters are crawled and indexed. If you define exclude filters, they override the include filters. If you define exclude filters but no include filters, the exclude filters alone limit which documents are crawled and indexed. More specifically, if a document passes one of the include filters but also passes one of the exclude filters, it is not crawled, indexed, or cataloged.
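The precedence between include and exclude filters can be sketched as follows. This is a minimal illustration that assumes filter patterns behave like shell-style wildcards; the actual matching implementation is not described here, and `should_crawl` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def should_crawl(url, include_patterns, exclude_patterns):
    """Decide whether a URL passes the filters (illustrative sketch only)."""
    # Exclude filters override include filters.
    if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
        return False
    # With include filters defined, the URL must pass at least one of them.
    if include_patterns:
        return any(fnmatchcase(url, pattern) for pattern in include_patterns)
    # No include filters: everything that was not excluded is crawled.
    return True
```

With the include pattern `*/docs/*` and the exclude pattern `*.pdf`, a URL such as `https://example.com/docs/a.pdf` (a placeholder address) is rejected because the exclude filter wins.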

To configure filters, click the Filters tab. The defined filters are listed in the Filtering Rules box.

  • Define Filter Rules

    You can define new filters in the Define Filter Rules box.

    • Rule Name

      Provide a Rule Name in this mandatory field.

Specify when to apply the rule, the rule type, and the rule basis, and then click Create. The defined filters are displayed in the Filtering Rules box.

When you configure the following settings:

  • Apply rule while: Collecting documents
  • Rule type: Include

Make sure that the URL in the field Collects documents that are linked from this URL: in the General Parameters tab fits the specified rule; otherwise, no documents are collected. For example, crawling the URL https://www.hcltechsw.com/wps/portal/products with the URL filter */products/* does not give any results because the rule has a trailing slash, but the URL does not. However, crawling https://www.hcltechsw.com/wps/portal/products/ with the URL filter */products/* (both with the trailing slash) works, as does crawling https://www.hcltechsw.com/wps/portal/products with the URL filter */products* (no trailing slash in either).
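The trailing-slash behavior can be reproduced with a small Python check. This sketch assumes the filter wildcard `*` behaves like a shell-style glob, which matches the outcomes described above but is not confirmed as the product's actual matching logic; `start_url_fits_rule` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def start_url_fits_rule(start_url, rule):
    """Return True if the start URL matches the include rule (glob-style assumption)."""
    return fnmatchcase(start_url, rule)


BASE = "https://www.hcltechsw.com/wps/portal/products"

# The rule requires "/products/" but the URL ends in "/products": no match.
no_slash_url_slash_rule = start_url_fits_rule(BASE, "*/products/*")

# URL and rule both carry the trailing slash: match.
slash_url_slash_rule = start_url_fits_rule(BASE + "/", "*/products/*")

# Neither URL nor rule carries the trailing slash: match.
no_slash_url_open_rule = start_url_fits_rule(BASE, "*/products*")
```

Running a check like this against the start URL before saving an include rule can catch the mismatch early, since a non-matching start URL means that no documents are collected.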

Configuring security for a content source

You can configure the security for indexing secured content sources and repositories that require authentication. Click the Security tab to display the following two boxes:

  • Define security realm

    This box is used to add new secured content sources.

    In the Define security realm box, complete the following fields and click the Create icon.

    • User name

      Enter the user ID with which the crawler can access the secured content source or repository.

    • Password

      Enter the password for the user ID that you entered in the User name field.

    • Host name

      Enter the name of the server. For portal sites and seedlist providers, this entry is not required. If you leave it blank, the host name is inferred from the provided root URL.

    • Realm

      Enter the realm of the secured content source or repository.

  • Security realms

    This box displays a list of existing security realms. You can edit or delete a security realm as needed.
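The fallback for a blank Host name field can be pictured with the following sketch. It only illustrates the idea of inferring the host from the root URL; how the product performs this inference is not described here, and `effective_host_name` is a hypothetical function.

```python
from urllib.parse import urlparse


def effective_host_name(host_name, root_url):
    """Use the entered host name, or fall back to the host of the root URL."""
    if host_name and host_name.strip():
        return host_name.strip()
    # Blank field: take the host part of the root URL (an empty string if
    # the URL has no host part).
    return urlparse(root_url).hostname or ""
```

For the root URL `https://www.hcltechsw.com/wps/portal/products` and a blank host name, this sketch yields `www.hcltechsw.com`.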