Load the content of every page on a website and store embeddings in Pinecone using Gemini
This is an automation workflow in the Document Extraction and AI RAG categories that contains 16 nodes. It mainly uses nodes such as Xml, Code, Html, Wait, and Merge. It fetches the content of every page on a website and stores it in Pinecone using Gemini embeddings.
- Authentication credentials may be required for the target API
- Pinecone API key
{
"nodes": [
{
"id": "5ad6a510-3c4a-47e4-b8ff-c0e565e25d25",
"name": "Nota adhesiva",
"type": "n8n-nodes-base.stickyNote",
"position": [
368,
944
],
"parameters": {
"width": 832,
"height": 816,
"content": "This n8n workflow builds a Pinecone knowledge base from website content, handling both sitemap and direct URL inputs.\n---\n### 1. URL Input & Consolidation\n\nThis section gathers and refines the URLs to be processed.\n* **Input Sitemap or page URLs (Form Trigger):** Start by providing a sitemap URL or a list of specific page URLs.\n* **Switch:** Routes input based on whether a sitemap or individual URLs are provided.\n* **Split Pages URL (Code):** Parses and cleans individual page URLs.\n* **Fetch Sitemap (HTTP Request):** Downloads the sitemap XML.\n* **XML Conversion (XML):** Converts sitemap XML to JSON.\n* **Extract Page URLs (Code):** Pulls page URLs from the JSON sitemap.\n* **Merge URLs (Merge):** Combines all URLs into one list.\n* **Remove Duplicate URLs (Remove Duplicates):** Eliminates any duplicate URLs.\n---\n\n### 2. Content Extraction\n\nThis section fetches and cleans content from each unique URL.\n* **Loop Over Page URLs (Split In Batches):** Processes URLs in batches.\n* **Fetch Page HTML For content (HTTP Request):** Downloads HTML for each page.\n* **Wait 5 sec (Wait):** Adds a 5-second delay to avoid overwhelming websites.\n* **Extract Content (HTML):** Extracts main text content from the HTML, skipping images and cleaning the text.\n\n---\n\n### 3. Embedding & Pinecone Storage\n\nThe final stage transforms content into vector embeddings and stores them in Pinecone.\n* **Gemini Embeddings (Embeddings):** Converts extracted text into 3076-dimensional vector embeddings using the `models/gemini-embedding-001` model.\n* **Data Loader (Document Loader):** Prepares content as documents for the vector store.\n* **Pinecone KnowledgeBase (Vector Store):** Inserts the generated embeddings and content into the \"supportbot\" Pinecone index, clearing existing data in the namespace first."
},
"typeVersion": 1
},
{
"id": "3ff777b7-24bd-420c-af38-62a395f52a1a",
"name": "Extraer URLs de páginas",
"type": "n8n-nodes-base.code",
"position": [
1936,
1392
],
"parameters": {
"jsCode": "const items = []\nfor (const item of $input.first().json.urlset.url) {\n items.push({ url: item.loc })\n}\n\nreturn items;"
},
"typeVersion": 2
},
{
"id": "6176e651-cef5-44e8-abed-0f6f6b81517b",
"name": "XML Conversion",
"type": "n8n-nodes-base.xml",
"position": [
1792,
1392
],
"parameters": {
"options": {}
},
"typeVersion": 1
},
{
"id": "cca1e7e7-32f6-42fd-b23c-3c2586344a50",
"name": "Obtener mapa del sitio",
"type": "n8n-nodes-base.httpRequest",
"position": [
1632,
1392
],
"parameters": {
"url": "={{ $json['Sitemap URL'] }}",
"options": {}
},
"typeVersion": 4.2
},
{
"id": "520e131d-b5f2-4857-aebd-5724da2a8083",
"name": "Dividir URLs de páginas",
"type": "n8n-nodes-base.code",
"position": [
1792,
1216
],
"parameters": {
"jsCode": "function addTrailingSlash(str) {\n if (typeof str !== 'string') {\n return str; // Or throw an error, handle non-string inputs\n }\n if (!str.endsWith('/')) {\n return str + '/';\n }\n return str;\n}\n\nconst urls = []\nfor (const item of $input.first().json['Page URLs'].split(',')) {\n urls.push({ url: addTrailingSlash(item).trim()})\n}\n\nreturn urls;"
},
"typeVersion": 2
},
{
"id": "7e7fe528-8748-470b-b627-a0c79b5aface",
"name": "Combinar URLs",
"type": "n8n-nodes-base.merge",
"position": [
2128,
1232
],
"parameters": {},
"typeVersion": 3.2
},
{
"id": "a0517aaf-6ccd-481d-b97e-b183d305451b",
"name": "Eliminar URLs duplicadas",
"type": "n8n-nodes-base.removeDuplicates",
"position": [
2272,
1232
],
"parameters": {
"options": {}
},
"typeVersion": 2
},
{
"id": "72c85ccf-a9d6-42b1-85a7-76800ba831e5",
"name": "Iterar sobre URLs de páginas",
"type": "n8n-nodes-base.splitInBatches",
"position": [
2480,
1232
],
"parameters": {
"options": {}
},
"typeVersion": 3
},
{
"id": "73aebd19-60ae-40d1-a747-0b9537d9d67c",
"name": "Extraer contenido",
"type": "n8n-nodes-base.html",
"position": [
2672,
1136
],
"parameters": {
"options": {
"cleanUpText": true
},
"operation": "extractHtmlContent",
"extractionValues": {
"values": [
{
"key": "content",
"cssSelector": "body",
"skipSelectors": "img"
}
]
}
},
"typeVersion": 1.2
},
{
"id": "0dbf70c1-cb57-4691-916f-2a2aa9a4cec0",
"name": "Obtener HTML de página para contenido",
"type": "n8n-nodes-base.httpRequest",
"position": [
2672,
1328
],
"parameters": {
"url": "={{ $json.url }}",
"options": {}
},
"typeVersion": 4.2
},
{
"id": "fa1c18c6-6c29-4e71-905e-0945909af99b",
"name": "Esperar 5 seg",
"type": "n8n-nodes-base.wait",
"position": [
2832,
1328
],
"webhookId": "9d87e60f-9df8-4a13-9c22-e3e5a5bb9c0e",
"parameters": {},
"typeVersion": 1.1
},
{
"id": "2bf3ad7f-a2fd-44f9-b6af-5a500ef80591",
"name": "Cargador de datos",
"type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
"position": [
3264,
1344
],
"parameters": {
"options": {}
},
"typeVersion": 1.1
},
{
"id": "a86d4c2e-559c-4942-ac0d-2ddcc7eb7f39",
"name": "Gemini Embeddings",
"type": "@n8n/n8n-nodes-langchain.embeddingsGoogleGemini",
"position": [
3072,
1344
],
"parameters": {
"modelName": "models/gemini-embedding-001"
},
"typeVersion": 1
},
{
"id": "f46188bd-c0a2-4d49-9b67-0937f891ae36",
"name": "Pinecone KnowledgeBase",
"type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
"position": [
3072,
1136
],
"parameters": {
"mode": "insert",
"options": {
"clearNamespace": true
}
},
"typeVersion": 1.3
},
{
"id": "4f5dc6e3-8f75-46ab-b3e1-49deb7695469",
"name": "Ingresar mapa del sitio o URLs de página",
"type": "n8n-nodes-base.formTrigger",
"position": [
1296,
1376
],
"webhookId": "ab54a2cd-2eda-4cf7-b822-8fb49ecb257e",
"parameters": {
"options": {},
"formTitle": "Agent Knowledge Base Input",
"formFields": {
"values": [
{
"fieldLabel": "Sitemap URL",
"placeholder": "https://website.com/page-sitemap.xml"
},
{
"fieldType": "textarea",
"fieldLabel": "Page URLs",
"placeholder": "https://website.com/about, https://website.com/contact"
}
]
},
"formDescription": "This form is to input the page sitemap or pages of your website"
},
"typeVersion": 2.2
},
{
"id": "67f6e98a-946c-4460-93d4-707511deb4f5",
"name": "Switch",
"type": "n8n-nodes-base.switch",
"position": [
1440,
1376
],
"parameters": {
"rules": {
"values": [
{
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "2af7e15b-2e56-40e5-addc-74bd0b4de214",
"operator": {
"type": "string",
"operation": "notEmpty",
"singleValue": true
},
"leftValue": "={{ $json['Page URLs'] }}",
"rightValue": ""
}
]
}
},
{
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "02899ab6-0c0b-4c0f-89ad-ec5787da36eb",
"operator": {
"type": "string",
"operation": "endsWith"
},
"leftValue": "={{ $json['Sitemap URL'] }}",
"rightValue": "xml"
}
]
}
}
]
},
"options": {
"allMatchingOutputs": true
}
},
"typeVersion": 3.2
}
],
"connections": {
"67f6e98a-946c-4460-93d4-707511deb4f5": {
"main": [
[
{
"node": "520e131d-b5f2-4857-aebd-5724da2a8083",
"type": "main",
"index": 0
}
],
[
{
"node": "cca1e7e7-32f6-42fd-b23c-3c2586344a50",
"type": "main",
"index": 0
}
]
]
},
"7e7fe528-8748-470b-b627-a0c79b5aface": {
"main": [
[
{
"node": "a0517aaf-6ccd-481d-b97e-b183d305451b",
"type": "main",
"index": 0
}
]
]
},
"fa1c18c6-6c29-4e71-905e-0945909af99b": {
"main": [
[
{
"node": "72c85ccf-a9d6-42b1-85a7-76800ba831e5",
"type": "main",
"index": 0
}
]
]
},
"2bf3ad7f-a2fd-44f9-b6af-5a500ef80591": {
"ai_document": [
[
{
"node": "f46188bd-c0a2-4d49-9b67-0937f891ae36",
"type": "ai_document",
"index": 0
}
]
]
},
"cca1e7e7-32f6-42fd-b23c-3c2586344a50": {
"main": [
[
{
"node": "6176e651-cef5-44e8-abed-0f6f6b81517b",
"type": "main",
"index": 0
}
]
]
},
"6176e651-cef5-44e8-abed-0f6f6b81517b": {
"main": [
[
{
"node": "3ff777b7-24bd-420c-af38-62a395f52a1a",
"type": "main",
"index": 0
}
]
]
},
"73aebd19-60ae-40d1-a747-0b9537d9d67c": {
"main": [
[
{
"node": "f46188bd-c0a2-4d49-9b67-0937f891ae36",
"type": "main",
"index": 0
}
]
]
},
"520e131d-b5f2-4857-aebd-5724da2a8083": {
"main": [
[
{
"node": "7e7fe528-8748-470b-b627-a0c79b5aface",
"type": "main",
"index": 0
}
]
]
},
"3ff777b7-24bd-420c-af38-62a395f52a1a": {
"main": [
[
{
"node": "7e7fe528-8748-470b-b627-a0c79b5aface",
"type": "main",
"index": 1
}
]
]
},
"a86d4c2e-559c-4942-ac0d-2ddcc7eb7f39": {
"ai_embedding": [
[
{
"node": "f46188bd-c0a2-4d49-9b67-0937f891ae36",
"type": "ai_embedding",
"index": 0
}
]
]
},
"72c85ccf-a9d6-42b1-85a7-76800ba831e5": {
"main": [
[
{
"node": "73aebd19-60ae-40d1-a747-0b9537d9d67c",
"type": "main",
"index": 0
}
],
[
{
"node": "0dbf70c1-cb57-4691-916f-2a2aa9a4cec0",
"type": "main",
"index": 0
}
]
]
},
"a0517aaf-6ccd-481d-b97e-b183d305451b": {
"main": [
[
{
"node": "72c85ccf-a9d6-42b1-85a7-76800ba831e5",
"type": "main",
"index": 0
}
]
]
},
"4f5dc6e3-8f75-46ab-b3e1-49deb7695469": {
"main": [
[
{
"node": "67f6e98a-946c-4460-93d4-707511deb4f5",
"type": "main",
"index": 0
}
]
]
},
"0dbf70c1-cb57-4691-916f-2a2aa9a4cec0": {
"main": [
[
{
"node": "fa1c18c6-6c29-4e71-905e-0945909af99b",
"type": "main",
"index": 0
}
]
]
}
}
}
How do I use this workflow?
Copy the JSON configuration above, create a new workflow in your n8n instance, choose "Import from JSON", paste the configuration, and then adjust the credential settings as needed.
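Once imported, the form accepts a Sitemap URL, a list of Page URLs, or both. The sketch below models how the Switch node in the JSON above appears to route a submission. It is a plain JavaScript approximation, not n8n's own implementation: the function name `routeFormInput` is hypothetical, and the exact semantics of n8n's `notEmpty` operator are assumed to match a simple empty-string check.

```javascript
// Hypothetical model of the Switch node's two rules; the function name is illustrative.
function routeFormInput(form) {
  const outputs = [];
  const pageUrls = form['Page URLs'];
  const sitemapUrl = form['Sitemap URL'];

  // Rule 1: "Page URLs" is a non-empty string -> "Split Pages URL" branch.
  // Assumption: notEmpty behaves like a plain empty-string check.
  if (typeof pageUrls === 'string' && pageUrls !== '') {
    outputs.push('Split Pages URL');
  }

  // Rule 2: "Sitemap URL" ends with "xml" (case-sensitive) -> "Fetch Sitemap" branch.
  if (typeof sitemapUrl === 'string' && sitemapUrl.endsWith('xml')) {
    outputs.push('Fetch Sitemap');
  }

  // allMatchingOutputs is true, so every matching rule fires, not only the first.
  return outputs;
}

const bothBranches = routeFormInput({
  'Sitemap URL': 'https://website.com/page-sitemap.xml',
  'Page URLs': 'https://website.com/about, https://website.com/contact',
});
console.log(bothBranches);
```

Because the comparison is case-sensitive, a sitemap URL ending in `.XML` would not match the second rule in this model, so it may be worth entering sitemap URLs in lowercase.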
Which scenarios is this workflow suited for?
Advanced - Document Extraction, AI RAG
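For the document-extraction part of that scenario, the URL consolidation stage in the JSON above (Extract Page URLs, Split Pages URL, Merge URLs, Remove Duplicate URLs) can be sketched in plain JavaScript. This is a minimal sketch under stated assumptions, not the workflow's actual code: the helper names are hypothetical, and it assumes the XML node may return `urlset.url` as a single object when the sitemap contains only one `<url>` entry.

```javascript
// Hypothetical helpers mirroring the URL-consolidation stage; names are illustrative.

// Read page URLs from the converted sitemap JSON.
// Assumption: `urlset.url` is an array, or a single object for a one-entry sitemap.
function extractSitemapUrls(sitemapJson) {
  const entries = sitemapJson?.urlset?.url ?? [];
  return [].concat(entries).map((entry) => ({ url: entry.loc }));
}

// Split the comma-separated "Page URLs" field, trimming before adding the slash.
function splitPageUrls(raw) {
  return raw
    .split(',')
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => ({ url: part.endsWith('/') ? part : part + '/' }));
}

// Merge any number of lists and keep the first occurrence of each URL.
function mergeUnique(...lists) {
  const seen = new Set();
  return lists.flat().filter(({ url }) => {
    if (seen.has(url)) return false;
    seen.add(url);
    return true;
  });
}

const fromForm = splitPageUrls(' https://website.com/about , https://website.com/contact/,');
const fromSitemap = extractSitemapUrls({
  urlset: { url: { loc: 'https://website.com/about/' } },
});
const allUrls = mergeUnique(fromForm, fromSitemap);
console.log(allUrls);
```

Trimming each entry before appending the trailing slash matters here: otherwise `https://website.com/about` from the form and `https://website.com/about/` from the sitemap could survive as two separate entries and be embedded twice.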
Is it paid?
This workflow is completely free; you can import and use it directly. Note, however, that the third-party services used in the workflow (such as the Gemini API and Pinecone) may require payment on your own account.
Zain Khan
@zain
I partner with businesses to streamline processes and accelerate growth through intelligent AI automation and web/mobile development. Leveraging deep expertise in GPT-4, LangChain, and n8n, I develop AI-powered agents and sophisticated LLM pipelines.