Web crawler job

Create Web Crawler Job

Creates a web crawler job that crawls the provided URLs and sitemaps and generates the corresponding webpages as artifacts.

POST /web-crawler-jobs.create

Authorization: Bearer <token>

All DevRev APIs require a token to authenticate the user. Provide Authorization: Bearer <TOKEN> as a header on every API request.
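As a minimal sketch, the header described above could be assembled in Python as follows. The helper name auth_headers is an illustrative assumption and not part of the DevRev API; only the Authorization: Bearer <TOKEN> format comes from this page.

```python
def auth_headers(token: str) -> dict[str, str]:
    """Return HTTP headers for an authenticated JSON request.

    Hypothetical helper: only the "Authorization: Bearer <TOKEN>" format
    is taken from the reference; the rest is illustration.
    """
    token = token.strip()
    if not token:
        raise ValueError("a non-empty token is required")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

Stripping surrounding whitespace guards against a token copied with a trailing newline, which would otherwise produce a malformed header.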

In: header

accept_regex? string (deprecated)

The regex a URL must satisfy to be crawled.

Format: text

accept_regexes? array<string>

The list of regexes a URL must satisfy to be crawled.

Items: items <= 100

applies_to_parts array<string>

The parts to which the webpages/articles created during this crawler job will be linked.

Items: 1 <= items <= 1

description? string

The description of the job.

Format: text

domain_names? array<string>

The list of allowed domain names to crawl.

frequency? integer

Number of days between re-sync job runs. If 0, the job runs only once.

Format: int32

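To illustrate the frequency semantics, here is a small sketch that computes when a re-sync would be due. The function name next_run and the choice to count days from the previous run are assumptions; the reference only states that the value is the number of days between runs and that 0 means a single run.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_run(last_run: datetime, frequency_days: int) -> Optional[datetime]:
    """Return the time the next re-sync would be due, or None for a one-off job.

    Assumption: the interval is counted from the previous run. The actual
    server-side scheduling is not documented on this page.
    """
    if frequency_days < 0:
        raise ValueError("frequency must be zero or a positive number of days")
    if frequency_days == 0:
        # Per the reference, a frequency of 0 means the job runs only once.
        return None
    return last_run + timedelta(days=frequency_days)


last = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
weekly = next_run(last, 7)   # seven days after the previous run
one_off = next_run(last, 0)  # None: no re-sync is scheduled
```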
max_depth? integer

The maximum depth to crawl.

Format: int32

notify_on_complete? boolean

Whether to notify the user when the job is complete. Default is true.

reject_regex? string (deprecated)

A URL that matches this regex is rejected. If a URL matches both the accept and reject regexes, it is rejected.

Format: text

reject_regexes? array<string>

The list of regexes that cause a URL to be rejected when it matches. If a URL matches both the accept and reject regexes, it is rejected.

Items: items <= 100

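The precedence rule between accept and reject regexes can be sketched locally, for example to preview which URLs a job would keep. Only the rule "reject wins when both match" comes from this reference. The function name should_crawl, the use of Python's re.search (unanchored matching), the "any pattern matches" interpretation, and treating an empty accept list as "accept everything" are all assumptions; the crawler's actual regex dialect and matching mode are not documented here.

```python
import re
from typing import Sequence


def should_crawl(
    url: str,
    accept_regexes: Sequence[str],
    reject_regexes: Sequence[str],
) -> bool:
    """Approximate the accept/reject filtering described in the reference."""
    # Documented rule: a URL matching a reject regex is rejected,
    # even if it also matches an accept regex.
    if any(re.search(pattern, url) for pattern in reject_regexes):
        return False
    # Assumption: with no accept regexes, every non-rejected URL is crawled.
    if not accept_regexes:
        return True
    # Assumption: matching any one accept regex is sufficient.
    return any(re.search(pattern, url) for pattern in accept_regexes)
```

For instance, with accept_regexes=["/docs/"] and reject_regexes=["/private/"], a URL containing both /docs/ and /private/ is rejected under this sketch.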
sitemap_index_urls? array<string>

The list of sitemap index URLs to crawl.

Items: items <= 2

sitemap_urls? array<string>

The list of sitemap URLs to crawl.

Items: items <= 2

urls? array<string>

The list of URLs to crawl.

Items: items <= 50

user_agent? string

User agent to use for crawling websites in this job.

Format: text
Length: length <= 1024
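The limits listed above can be checked on the client before a request is sent. The following is a sketch using only the Python standard library; build_payload and build_request are hypothetical helper names, and the checks mirror the item and length limits stated in this reference rather than the server's actual validation.

```python
import json
import urllib.request
from typing import Any

API_URL = "https://api.devrev.ai/web-crawler-jobs.create"

# Maximum list lengths, copied from the parameter reference above.
MAX_ITEMS = {
    "accept_regexes": 100,
    "reject_regexes": 100,
    "sitemap_index_urls": 2,
    "sitemap_urls": 2,
    "urls": 50,
}
MAX_USER_AGENT_LENGTH = 1024


def build_payload(applies_to_parts: list[str], **optional: Any) -> dict[str, Any]:
    """Validate the documented limits and return the JSON body as a dict."""
    if len(applies_to_parts) != 1:
        raise ValueError("applies_to_parts must contain exactly one part")
    for field, limit in MAX_ITEMS.items():
        if len(optional.get(field, [])) > limit:
            raise ValueError(f"{field} accepts at most {limit} items")
    if len(optional.get("user_agent", "")) > MAX_USER_AGENT_LENGTH:
        raise ValueError("user_agent must be at most 1024 characters")
    return {"applies_to_parts": list(applies_to_parts), **optional}


def build_request(token: str, payload: dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) the POST request for the create endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result of build_request to urllib.request.urlopen would send the request; keeping construction separate from sending makes the payload easy to inspect or test first.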

Example Request

curl -X POST "https://api.devrev.ai/web-crawler-jobs.create" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "applies_to_parts": [
      "PROD-12345"
    ]
  }'

Response Body

{
  "web_crawler_job": {
    "created_by": {
      "type": "dev_user",
      "display_id": "string",
      "id": "string",
      "display_name": "string",
      "display_picture": {
        "display_id": "string",
        "id": "string",
        "file": {
          "type": "string",
          "name": "string",
          "size": 0
        }
      },
      "email": "string",
      "full_name": "string",
      "state": "active"
    },
    "created_date": "2023-01-01T12:00:00.000Z",
    "display_id": "string",
    "id": "string",
    "modified_by": {
      "type": "dev_user",
      "display_id": "string",
      "id": "string",
      "display_name": "string",
      "display_picture": {
        "display_id": "string",
        "id": "string",
        "file": {
          "type": "string",
          "name": "string",
          "size": 0
        }
      },
      "email": "string",
      "full_name": "string",
      "state": "active"
    },
    "modified_date": "2023-01-01T12:00:00.000Z",
    "accept_regexs": [
      "string"
    ],
    "description": "string",
    "domain_names": [
      "string"
    ],
    "frequency": 0,
    "max_depth": 0,
    "no_parent": true,
    "notify_on_complete": true,
    "num_bytes": 0,
    "num_timeout_urls": 0,
    "num_urls_scraped": 0,
    "reject_regexs": [
      "string"
    ],
    "sitemap_index_urls": [
      "string"
    ],
    "sitemap_urls": [
      "string"
    ],
    "state": "aborted",
    "urls": [
      "string"
    ],
    "user_agent": "string"
  }
}
{
  "detail": "string",
  "message": "string",
  "type": "artifact_already_attached_to_a_parent",
  "existing_parent": "string",
  "is_same": true
}
{
  "detail": "string",
  "message": "string",
  "type": "unauthenticated"
}
{
  "detail": "string",
  "message": "string",
  "type": "forbidden"
}
{
  "detail": "string",
  "message": "string",
  "type": "too_many_requests",
  "retry_after": 0
}
{
  "detail": "string",
  "message": "string",
  "type": "internal_error",
  "reference_id": "string"
}
{
  "detail": "string",
  "message": "string",
  "type": "service_unavailable"
}
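The error bodies above share the detail, message, and type fields, so a client can branch on type. The sketch below decides whether and when to retry. The helper name retry_delay, the default delay, the choice of which types count as retryable, and the reading of retry_after as a number of seconds are all assumptions; this page does not state the unit of retry_after.

```python
import json
from typing import Optional

# Assumption: these two error types are transient and worth retrying.
RETRYABLE_TYPES = {"too_many_requests", "service_unavailable"}


def retry_delay(body: str, default_delay: float = 5.0) -> Optional[float]:
    """Return how long to wait before retrying, or None if a retry is pointless.

    `body` is the raw JSON text of an error response.
    """
    error = json.loads(body)
    error_type = error.get("type")
    if error_type not in RETRYABLE_TYPES:
        # For example unauthenticated or forbidden: retrying unchanged will not help.
        return None
    if error_type == "too_many_requests":
        # Assumption: retry_after is expressed in seconds.
        return float(error.get("retry_after", default_delay))
    return default_delay
```

A caller would typically sleep for the returned delay and resend the request, giving up after a bounded number of attempts.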