Domain-specific web content crawler with depth control and text extraction
This is an automation workflow from the Content Creation / Multimodal AI domain that contains 18 nodes. It mainly uses nodes such as If, Set, Code, Html, and Merge to crawl a specific domain's web content with depth control and text extraction.
- HTTP Webhook endpoint (generated automatically by n8n)
- Authentication credentials may be required for the target API
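Before importing, it may help to see the data contract the workflow implements. The following is a minimal sketch in plain JavaScript, not part of the workflow itself: the request shape comes from the Webhook/Init Crawl Params nodes and the response shape from the Combine & Chunk node in the JSON below; the URLs and content values are placeholders.

// Illustrative shapes only; derived from the workflow JSON below.

// Request body expected by the Webhook node (POST):
const exampleRequest = {
  url: 'https://example.com' // crawl seed; same-site links are followed up to maxDepth 3
};

// One item of the chunked output emitted by "Combine & Chunk":
const exampleResponseItem = {
  batchIndex: 0,                 // index of this chunk
  pages: [                       // at most 5 pages and ~12000 characters per chunk
    { url: 'https://example.com/about', depth: 1, content: '...extracted body text...' }
  ],
  combinedContent: 'URL: ...\nDepth: ...\nContent: ...', // all crawled pages joined into one string
  accId: undefined               // read from static data; nothing in this workflow sets it
};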
{
"meta": {
"instanceId": "9a562c06a632241f66aadd52a495ad98e76b760ef5cfce9c319a4759c47cd94e"
},
"nodes": [
{
"id": "ed429607-b22c-494c-b767-7dc2eca5a561",
"name": "Nota adhesiva",
"type": "n8n-nodes-base.stickyNote",
"position": [
-2160,
-112
],
"parameters": {
"width": 720,
"height": 592,
"content": "# n8n Workflow Explanation: Web Crawler\n\nThis workflow implements a web crawler in n8n that scrapes website pages starting from a given URL, up to a maximum depth of 3. It fetches HTML content, extracts links and body text, deduplicates URLs, limits crawling to the same domain, excludes non-HTML files like PDFs, and collects page data for output via webhook.\n\n## Key Features\n- **Depth-Limited Crawling**: Stops at maxDepth to prevent infinite loops.\n- **Deduplication**: Tracks visited and queued URLs using global static data to avoid re-fetching.\n- **Same-Site Only**: Only follows links within the initial domain (apex or www variants).\n- **Link Filtering**: Ignores mailto, tel, javascript, anchors (#), and file types like PDF, DOCX, etc.\n- **State Management**: Uses n8n's static data for pending count, visited list, queued dict, and accumulated pages across iterations.\n- **Batching and Chunking**: Processes links in batches; chunks collected content by character limits for efficient output.\n- **Error Handling**: Nodes like Fetch HTML and Queue & Dedup have onError: continueRegularOutput to skip failures.\n- **Output**: Combines all page contents (URL, depth, text) into a single string, optionally appending extra JSON, and responds via webhook.\n\n"
},
"typeVersion": 1
},
{
"id": "26230b6f-528a-41fa-b9f0-9597659e2f23",
"name": "Nota adhesiva1",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1376,
-112
],
"parameters": {
"width": 800,
"height": 1136,
"content": "## Step-by-Step Detailed Breakdown\n\n1. **Webhook**: Entry point. Receives JSON payload with 'url' key. Triggers on POST to specific path.\n\n2. **Init Crawl Params (Set Node)**: Processes input. Sets 'url' and 'domain' from body.url, maxDepth=3, depth=0. Keeps only these fields.\n\n3. **Init Globals (Code Node)**: Initializes global static data: pending=1, visited=[], queued={}, pages=[]. Normalizes domain from URL, handling malformed cases without URL().\n\n4. **Seed Root Crawl Item (Merge Node)**: Combines initial params with globals output. Mode: combine by position, prefer last on clashes, include unpaired.\n\n5. **Fetch HTML Page (HTTP Request Node)**: GETs the current URL. Timeout 5s, continues on error. Outputs raw HTML body.\n\n6. **Attach URL/Depth to HTML (Code Node)**: Attaches url and depth from seed to the HTML response item.\n\n7. **Extract Body & Links (HTML Node)**: Operation: extractHtmlContent. Gets 'content' from body selector (trimmed, cleaned), 'links' as array of href attributes from a[href].\n\n8. **Queue & Dedup Links (Code Node)**: Core logic. Normalizes URLs (absolute, no trailing /). Extracts hosts ignoring www/protocol/path. Marks current as visited, dequeues it. Filters links: same-site, unvisited, not queued, depth < max. Queues new links with depth+1. Decrements pending, adds new to pending. Outputs new link items + current page item with content.\n\n9. **IF Crawl Depth OK? (IF Node)**: Checks if type='link' and depth <= maxDepth. True: requeue for fetch. False: store page.\n\n10. **Requeue Link Item (Code Node)**: Removes 'type', returns item for looping back to fetch.\n\n11. **Loop Links (Batches) (SplitInBatches Node)**: Batch size 1, no reset. Loops through queued links one by one, feeding back to Seed Root Crawl Item for next fetch.\n\n12. **Store Page Data (Set Node)**: Keeps url, content, depth from page item.\n\n13. **Collect Pages & Emit When Done (Code Node)**: Appends page to global pages[]. If pending <=0, emits combined content string (URL/depth/content per page, separated). Else, empty output.\n\n14. **Merge Web Pages (Merge Node)**: Combines collected pages from loop with initial globals (for extras?).\n\n15. **Combine & Chunk (Code Node)**: Merges stored/incoming pages, normalizes. Appends extra JSON if present. Builds full combinedContent. Chunks pages by max chars (12000) then subgroups of 5. Outputs batch items with index, pages subset, full combinedContent, accId.\n\n16. **Respond to Webhook (RespondToWebhook Node)**: Sends the chunked output as response.\n\n## Additional Notes\n- **Loop Mechanism**: Uses SplitInBatches to iterate queue, feeding back to merge for recursive crawling.\n- **Termination**: Pending counter ensures emission only when all pages processed (no more queue).\n- **Limitations**: No external domains, basic link cleaning, assumes HTTP/HTTPS, no auth/cookies.\n- **Usage**: Trigger via webhook with {\"url\": \"https://example.com\"}. Output: JSON with batched page data."
},
"typeVersion": 1
},
{
"id": "c3ea4128-8963-4000-af38-e7f2be48bb7e",
"name": "Disparador Webhook",
"type": "n8n-nodes-base.webhook",
"position": [
-2128,
-336
],
"webhookId": "603a09ed-516c-4c7d-bad3-b05b030503a2",
"parameters": {
"path": "603a09ed-516c-4c7d-bad3-b05b030503a2",
"options": {
"rawBody": false
},
"httpMethod": "POST",
"responseMode": "responseNode"
},
"typeVersion": 2.1
},
{
"id": "a35808cb-d2ea-4797-86a6-a36670377560",
"name": "Loop Links (Batches)",
"type": "n8n-nodes-base.splitInBatches",
"notes": "Iterates through the queue of links to be crawled one at a time.",
"position": [
48,
-480
],
"parameters": {
"options": {
"reset": false
},
"batchSize": 1
},
"executeOnce": false,
"typeVersion": 1
},
{
"id": "798444a5-0df4-4727-818f-657901ad60a1",
"name": "Si Crawl Depth OK?",
"type": "n8n-nodes-base.if",
"notes": "Validates whether the current depth is below the maximum depth allowed.",
"onError": "continueRegularOutput",
"position": [
-352,
-464
],
"parameters": {
"conditions": {
"number": [
{
"value1": "={{ $json.depth }}",
"value2": "={{ $json.maxDepth}} ",
"operation": "smallerEqual"
}
],
"string": [
{
"value1": "={{ $json.type }}",
"value2": "link"
}
]
}
},
"typeVersion": 1
},
{
"id": "ecc2707f-0605-4c88-98eb-8c8ea234e9ff",
"name": "Extract Body & Links",
"type": "n8n-nodes-base.html",
"notes": "Parses HTML content and extracts body text and anchor href links.",
"position": [
-784,
-464
],
"parameters": {
"options": {
"trimValues": true,
"cleanUpText": true
},
"operation": "extractHtmlContent",
"extractionValues": {
"values": [
{
"key": "links",
"attribute": "href",
"cssSelector": "a[href]",
"returnArray": true,
"returnValue": "attribute"
},
{
"key": "content",
"cssSelector": "body"
}
]
}
},
"typeVersion": 1
},
{
"id": "d4dfda4a-e20a-4014-b024-c0fde8f41aed",
"name": "Attach URL/Depth to HTML",
"type": "n8n-nodes-base.code",
"position": [
-976,
-464
],
"parameters": {
"mode": "runOnceForEachItem",
"jsCode": " return {\n json: {\n url:$('Seed Root Crawl Item').item.json.url,\n depth: $('Seed Root Crawl Item').item.json.depth,\n ...item.json // Preserve original HTML response (optional)\n }\n };\n"
},
"typeVersion": 2
},
{
"id": "239040b9-3c08-47d9-a188-18776817df23",
"name": "Fetch HTML Page",
"type": "n8n-nodes-base.httpRequest",
"notes": "Makes HTTP request to fetch the content of the current URL.",
"onError": "continueRegularOutput",
"position": [
-1200,
-464
],
"parameters": {
"url": "={{ $json.url }}",
"options": {
"timeout": 5000,
"response": {
"response": {}
}
}
},
"typeVersion": 4.2
},
{
"id": "3d960fb8-2224-4f50-becf-b2f03bd7de6e",
"name": "Seed Root Crawl Item",
"type": "n8n-nodes-base.merge",
"position": [
-1408,
-464
],
"parameters": {
"mode": "combine",
"options": {
"clashHandling": {
"values": {
"resolveClash": "preferLast",
"overrideEmpty": true
}
},
"includeUnpaired": true
},
"combineBy": "combineByPosition"
},
"typeVersion": 3.2
},
{
"id": "3e02f965-84f5-40da-90d4-ae91bbf0434e",
"name": "Collect Pages & Emit When Done",
"type": "n8n-nodes-base.code",
"position": [
32,
-288
],
"parameters": {
"jsCode": "const s = $getWorkflowStaticData('global');\nif (!s.pages) s.pages = [];\ns.pages.push({\n url: $json.url,\n depth: $json.depth,\n content: $json.content\n});\nconsole.log(s.pending)\nif (s.pending <= 0) {\n const pages = s.pages || [];\n let combinedContent = pages.map(page => `URL: ${page.url}\\nDepth: ${page.depth}\\nContent: ${page.content}\\n`).join('\\n-----------------\\n');\n return { json: { content: combinedContent } };\n} else {\n return [];\n}"
},
"typeVersion": 2
},
{
"id": "63f581a0-4794-4908-be22-dda1136e7593",
"name": "Store Page Data",
"type": "n8n-nodes-base.set",
"notes": "Captures the URL, page content, and depth for storage or export.",
"position": [
-128,
-304
],
"parameters": {
"values": {
"number": [
{
"name": "depth",
"value": "={{ $json.depth || 0 }}"
}
],
"string": [
{
"name": "url",
"value": "={{ $json.url || '' }}"
},
{
"name": "content",
"value": "={{ $json.content || '' }}"
}
]
},
"options": {},
"keepOnlySet": true
},
"typeVersion": 2
},
{
"id": "c3cf4541-c31f-4257-8729-44f8ed211bcd",
"name": "Fusionar Web Pages",
"type": "n8n-nodes-base.merge",
"position": [
208,
-176
],
"parameters": {},
"typeVersion": 3.2
},
{
"id": "a7d480bc-ef4b-4cad-989f-0eda36a26a00",
"name": "Combine & Chunk",
"type": "n8n-nodes-base.code",
"position": [
400,
-176
],
"parameters": {
"jsCode": "/* Combine static pages + extra JSON, then chunk pages for model calls */\nconst s = $getWorkflowStaticData('global');\nif (!s.pages) s.pages = [];\n\nfunction normPage(p = {}) {\n return {\n url: p.url || '',\n depth: p.depth ?? null,\n content: typeof p.content === 'string' ? p.content : ''\n };\n}\n\nconst incomingPageItems = items\n .filter(i => typeof i.json.content === 'string')\n .map(i => normPage(i.json));\n\nconst storedPages = (s.pages || []).map(normPage);\nconst pages = storedPages.length ? storedPages : incomingPageItems;\n\nconst extraJson = items\n .filter(i => typeof i.json.content !== 'string')\n .map(i => i.json);\n\nlet combinedContent = pages\n .map(p => `URL: ${p.url}\\nDepth: ${p.depth}\\nContent:\\n${p.content}\\n`)\n .join('\\n-----------------\\n');\n\nif (extraJson.length) {\n combinedContent += `\\n\\nLINKEDIN_DATA::\\n\\n${JSON.stringify(extraJson)}`;\n}\n\nconst CHUNK_SIZE = 5;\nconst MAX_CHARS_PER_BATCH = 12000;\n\nfunction chunkByChars(arr, maxChars) {\n const batches = [];\n let current = [];\n let chars = 0;\n for (const it of arr) {\n const len = (it.content || '').length;\n if (current.length && chars + len > maxChars) {\n batches.push(current);\n current = [];\n chars = 0;\n }\n current.push(it);\n chars += len;\n }\n if (current.length) batches.push(current);\n return batches;\n}\n\nconst charBatches = chunkByChars(pages, MAX_CHARS_PER_BATCH);\nconst groups = [];\nfor (const batch of charBatches) {\n for (let i = 0; i < batch.length; i += CHUNK_SIZE) {\n groups.push(batch.slice(i, i + CHUNK_SIZE));\n }\n}\n\nreturn groups.length\n ? groups.map((g, idx) => ({ json: { batchIndex: idx, pages: g, combinedContent,accId:s.accountId } }))\n : [{ json: { batchIndex: 0, pages: [], combinedContent } }];\n"
},
"typeVersion": 2
},
{
"id": "1e36bc72-2db7-4ce7-a42e-51609a0c9065",
"name": "Respond to Disparador Webhook",
"type": "n8n-nodes-base.respondToWebhook",
"position": [
608,
-176
],
"parameters": {
"options": {}
},
"typeVersion": 1.4
},
{
"id": "99f16b20-3398-45a9-a652-7b51351283b2",
"name": "Init Globals",
"type": "n8n-nodes-base.code",
"notes": "Initializes the pending count in static data for crawl completion tracking.",
"position": [
-1632,
-336
],
"parameters": {
"mode": "runOnceForEachItem",
"jsCode": "const s = $getWorkflowStaticData('global');\ns.pending = 1;\ns.visited = [];\ns.queued = {};\ns.pages = [];\n\n// Ensure url has a scheme so URL() won't throw\nconst ensureUrl = u => (/^https?:\\/\\//i.test(u) ? u : `https://${u}`);\n\ntry {\n $json.domain = new URL(ensureUrl($json.url)).hostname; // => \"www.crmaiinsight.com\"\n} catch (e) {\n // Fallback if url is malformed\n $json.domain = String($json.url || '')\n .replace(/^[a-z]+:\\/\\//i, '')\n .replace(/\\/.*$/, '')\n .replace(/:\\d+$/, '');\n}\n\nreturn $json;\n"
},
"typeVersion": 2
},
{
"id": "e56c711e-c7eb-4024-bd31-66680514d62c",
"name": "Init Crawl Params",
"type": "n8n-nodes-base.set",
"notes": "Defines the root URL, domain name, and max crawl depth.",
"position": [
-1856,
-336
],
"parameters": {
"values": {
"number": [
{
"name": "maxDepth",
"value": 3
},
{
"name": "depth"
}
],
"string": [
{
"name": "url",
"value": "={{ $json.body.url }}"
},
{
"name": "domain",
"value": "={{ $json.body.url }}"
}
]
},
"options": {},
"keepOnlySet": true
},
"typeVersion": 2
},
{
"id": "29bf5f0a-97dc-4631-a485-f7ef9bcfd852",
"name": "Requeue Link Item",
"type": "n8n-nodes-base.code",
"notes": "Removes internal 'type' field and re-enqueues the link for next crawl.",
"position": [
-144,
-480
],
"parameters": {
"mode": "runOnceForEachItem",
"jsCode": "const s = $getWorkflowStaticData('global');\n\ndelete $json.type\nreturn item;"
},
"typeVersion": 2
},
{
"id": "3f81f588-a041-4ae9-92b5-2f79ae855355",
"name": "Queue & Dedup Links",
"type": "n8n-nodes-base.code",
"notes": "Cleans and deduplicates links. Tracks visited URLs. Prepares next crawl queue.",
"onError": "continueRegularOutput",
"position": [
-560,
-464
],
"parameters": {
"jsCode": "const staticData = $getWorkflowStaticData('global');\nif (!Array.isArray(staticData.visited)) staticData.visited = [];\nif (typeof staticData.pending !== 'number') staticData.pending = 0;\nif (!staticData.queued || typeof staticData.queued !== 'object') staticData.queued = {};\n\nconst currentUrl = $('Attach URL/Depth to HTML').item.json.url.replace(/\\/+$/, '');\nconst currentDepth = $('Attach URL/Depth to HTML').item.json.depth || 0;\nconst maxDepth = $('Seed Root Crawl Item').first().json.maxDepth;\nconst domainParamRaw = ($('Init Crawl Params').first().json.domain || '').toString();\nconst content = typeof $json.content === 'string' ? $json.content : '';\n\nconst PROTO_RE = /^[a-zA-Z][a-zA-Z0-9+.-]*:\\/\\//;\n\n// Normalize a host string: strip protocol, path, and leading \"www.\"\nfunction hostOf(u) {\n if (!u) return '';\n let s = u.toString();\n if (PROTO_RE.test(s)) s = s.replace(PROTO_RE, '');\n const i = s.indexOf('/');\n if (i !== -1) s = s.slice(0, i);\n return s.toLowerCase().replace(/^www\\./, '');\n}\n\n// Build absolute URL from href + base without using URL()\nfunction toAbsolute(href, base) {\n if (!href) return '';\n const h = href.trim();\n if (PROTO_RE.test(h)) return h.replace(/\\/+$/, '');\n if (h.startsWith('//')) {\n const proto = (base.match(PROTO_RE) || ['https://'])[0];\n return (proto + h.slice(2)).replace(/\\/+$/, '');\n }\n if (h.startsWith('/')) {\n const baseHost = base.replace(PROTO_RE, '').split('/')[0];\n const proto = (base.match(PROTO_RE) || ['https://'])[0];\n return (proto + baseHost + h).replace(/\\/+$/, '');\n }\n // relative path\n let dir = base;\n if (!dir.endsWith('/')) {\n const cut = dir.lastIndexOf('/');\n dir = cut > (dir.indexOf('://') + 2) ? dir.slice(0, cut + 1) : (dir + '/');\n }\n return (dir + h).replace(/\\/+$/, '');\n}\n\nfunction extractHostname(abs) {\n let s = abs.replace(PROTO_RE, '');\n const i = s.indexOf('/');\n const host = (i === -1 ? s : s.slice(0, i)).toLowerCase();\n return host.replace(/^www\\./, '');\n}\n\nconst allowedHost = hostOf(domainParamRaw) || hostOf(currentUrl);\nconst currentHost = hostOf(currentUrl);\n\n// mark current as visited & dequeue\nif (!staticData.visited.includes(currentUrl)) staticData.visited.push(currentUrl);\ndelete staticData.queued[currentUrl];\n\nconst links = Array.isArray($json.links) ? $json.links : [];\nconst newLinks = [];\nconst queuedLocal = new Set();\n\nfor (const link of links) {\n if (!link) continue;\n const l = String(link).trim();\n if (!l || l.startsWith('mailto:') || l.startsWith('tel:') || l.startsWith('javascript:')) continue;\n if (l.includes('#')) continue;\n if (/\\.(pdf|docx?|xlsx?|pptx?)($|\\?)/i.test(l)) continue;\n\n const absolute = toAbsolute(l, currentUrl);\n const host = extractHostname(absolute);\n\n // treat apex and www as same-site\n const sameSite = (host === allowedHost) || (host === currentHost);\n\n if (\n sameSite &&\n !staticData.visited.includes(absolute) &&\n !staticData.queued[absolute] &&\n !queuedLocal.has(absolute) &&\n currentDepth < maxDepth\n ) {\n newLinks.push({\n json: { url: absolute, depth: currentDepth + 1, type: 'link', maxDepth }\n });\n queuedLocal.add(absolute);\n staticData.queued[absolute] = true;\n }\n}\n\nstaticData.pending += newLinks.length;\nstaticData.pending--; // this page done\n\nreturn newLinks.concat({\n json: { url: currentUrl, depth: currentDepth, content, type: 'page', maxDepth }\n});\n"
},
"typeVersion": 2
}
],
"pinData": {},
"connections": {
"Webhook": {
"main": [
[
{
"node": "e56c711e-c7eb-4024-bd31-66680514d62c",
"type": "main",
"index": 0
}
]
]
},
"99f16b20-3398-45a9-a652-7b51351283b2": {
"main": [
[
{
"node": "3d960fb8-2224-4f50-becf-b2f03bd7de6e",
"type": "main",
"index": 0
},
{
"node": "Merge Web Pages",
"type": "main",
"index": 1
}
]
]
},
"a7d480bc-ef4b-4cad-989f-0eda36a26a00": {
"main": [
[
{
"node": "Respond to Webhook",
"type": "main",
"index": 0
}
]
]
},
"239040b9-3c08-47d9-a188-18776817df23": {
"main": [
[
{
"node": "d4dfda4a-e20a-4014-b024-c0fde8f41aed",
"type": "main",
"index": 0
}
]
]
},
"Merge Web Pages": {
"main": [
[
{
"node": "a7d480bc-ef4b-4cad-989f-0eda36a26a00",
"type": "main",
"index": 0
}
]
]
},
"63f581a0-4794-4908-be22-dda1136e7593": {
"main": [
[
{
"node": "3e02f965-84f5-40da-90d4-ae91bbf0434e",
"type": "main",
"index": 0
}
]
]
},
"e56c711e-c7eb-4024-bd31-66680514d62c": {
"main": [
[
{
"node": "99f16b20-3398-45a9-a652-7b51351283b2",
"type": "main",
"index": 0
}
]
]
},
"29bf5f0a-97dc-4631-a485-f7ef9bcfd852": {
"main": [
[
{
"node": "a35808cb-d2ea-4797-86a6-a36670377560",
"type": "main",
"index": 0
}
]
]
},
"IF Crawl Depth OK?": {
"main": [
[
{
"node": "29bf5f0a-97dc-4631-a485-f7ef9bcfd852",
"type": "main",
"index": 0
}
],
[
{
"node": "63f581a0-4794-4908-be22-dda1136e7593",
"type": "main",
"index": 0
}
]
]
},
"3f81f588-a041-4ae9-92b5-2f79ae855355": {
"main": [
[
{
"node": "IF Crawl Depth OK?",
"type": "main",
"index": 0
}
]
]
},
"ecc2707f-0605-4c88-98eb-8c8ea234e9ff": {
"main": [
[
{
"node": "3f81f588-a041-4ae9-92b5-2f79ae855355",
"type": "main",
"index": 0
}
]
]
},
"a35808cb-d2ea-4797-86a6-a36670377560": {
"main": [
[
{
"node": "3d960fb8-2224-4f50-becf-b2f03bd7de6e",
"type": "main",
"index": 1
}
]
]
},
"3d960fb8-2224-4f50-becf-b2f03bd7de6e": {
"main": [
[
{
"node": "239040b9-3c08-47d9-a188-18776817df23",
"type": "main",
"index": 0
}
]
]
},
"d4dfda4a-e20a-4014-b024-c0fde8f41aed": {
"main": [
[
{
"node": "ecc2707f-0605-4c88-98eb-8c8ea234e9ff",
"type": "main",
"index": 0
}
]
]
},
"3e02f965-84f5-40da-90d4-ae91bbf0434e": {
"main": [
[
{
"node": "Merge Web Pages",
"type": "main",
"index": 0
}
]
]
}
}
}
How do I use this workflow?
Copy the JSON configuration above, create a new workflow in your n8n instance, select "Import from JSON", paste the configuration, and then adjust the credential settings as needed.
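Once the workflow is imported and activated, it can be triggered with a plain HTTP POST. The following is a minimal sketch, assuming Node.js 18+ (global fetch) and the production webhook URL of your own instance; the host below is a placeholder, while the path matches the Webhook node in the JSON above.

// Trigger the crawler and read the chunked result.
// Replace the host with your n8n instance; use the URL shown on the Webhook node.
const WEBHOOK_URL =
  'https://your-n8n-host/webhook/603a09ed-516c-4c7d-bad3-b05b030503a2';

async function runCrawl(startUrl) {
  const res = await fetch(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: startUrl }) // the workflow reads body.url
  });
  if (!res.ok) throw new Error(`Webhook returned ${res.status}`);
  return res.json(); // chunked page data produced by "Combine & Chunk"
}

runCrawl('https://example.com')
  .then(data => console.log(data))
  .catch(err => console.error(err));

Note that the webhook only responds after the crawl finishes (the pending counter reaches zero), so large sites can take a while before the response arrives.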
Which scenarios is this workflow suitable for?
Advanced - Content Creation, Multimodal AI
Is it paid?
This workflow is completely free; you can import and use it directly. Note, however, that third-party services used in the workflow (such as the OpenAI API) may require payment on your own account.
Le Nguyen (@leeseifer)
Salesforce Architect with 10+ years of experience in CRM, integrations, and automation. Skilled in Apex, LWC, REST APIs, and full-stack dev (JavaScript, .NET). I build secure, scalable workflows in n8n, connecting Salesforce, Stripe, and more. Passionate about lead scoring, data sync, and secure field masking. Certified Application Architect with deep expertise in platform, integration, and data architecture.