Extract all website page content and store it in Pinecone with Gemini embeddings
This is an automation workflow in the Document Extraction, AI RAG category, containing 16 nodes. It mainly uses nodes such as XML, Code, HTML, Wait, and Merge.
- May require authentication credentials for the target API
- Pinecone API key
{
"nodes": [
{
"id": "5ad6a510-3c4a-47e4-b8ff-c0e565e25d25",
"name": "Sticky Note",
"type": "n8n-nodes-base.stickyNote",
"position": [
368,
944
],
"parameters": {
"width": 832,
"height": 816,
"content": "This n8n workflow builds a Pinecone knowledge base from website content, handling both sitemap and direct URL inputs.\n---\n### 1. URL Input & Consolidation\n\nThis section gathers and refines the URLs to be processed.\n* **Input Sitemap or page URLs (Form Trigger):** Start by providing a sitemap URL or a list of specific page URLs.\n* **Switch:** Routes input based on whether a sitemap or individual URLs are provided.\n* **Split Pages URL (Code):** Parses and cleans individual page URLs.\n* **Fetch Sitemap (HTTP Request):** Downloads the sitemap XML.\n* **XML Conversion (XML):** Converts sitemap XML to JSON.\n* **Extract Page URLs (Code):** Pulls page URLs from the JSON sitemap.\n* **Merge URLs (Merge):** Combines all URLs into one list.\n* **Remove Duplicate URLs (Remove Duplicates):** Eliminates any duplicate URLs.\n---\n\n### 2. Content Extraction\n\nThis section fetches and cleans content from each unique URL.\n* **Loop Over Page URLs (Split In Batches):** Processes URLs in batches.\n* **Fetch Page HTML For content (HTTP Request):** Downloads HTML for each page.\n* **Wait 5 sec (Wait):** Adds a 5-second delay to avoid overwhelming websites.\n* **Extract Content (HTML):** Extracts main text content from the HTML, skipping images and cleaning the text.\n\n---\n\n### 3. Embedding & Pinecone Storage\n\nThe final stage transforms content into vector embeddings and stores them in Pinecone.\n* **Gemini Embeddings (Embeddings):** Converts extracted text into 3072-dimensional vector embeddings using the `models/gemini-embedding-001` model.\n* **Data Loader (Document Loader):** Prepares content as documents for the vector store.\n* **Pinecone KnowledgeBase (Vector Store):** Inserts the generated embeddings and content into the \"supportbot\" Pinecone index, clearing existing data in the namespace first."
},
"typeVersion": 1
},
{
"id": "3ff777b7-24bd-420c-af38-62a395f52a1a",
"name": "Extract Page URLs",
"type": "n8n-nodes-base.code",
"position": [
1936,
1392
],
"parameters": {
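"jsCode": "// A sitemap with a single <url> entry may parse to an object rather than an array.\nconst entries = [].concat($input.first().json.urlset.url);\nconst items = [];\nfor (const item of entries) {\n items.push({ url: item.loc });\n}\n\nreturn items;"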
},
"typeVersion": 2
},
{
"id": "6176e651-cef5-44e8-abed-0f6f6b81517b",
"name": "XML Conversion",
"type": "n8n-nodes-base.xml",
"position": [
1792,
1392
],
"parameters": {
"options": {}
},
"typeVersion": 1
},
{
"id": "cca1e7e7-32f6-42fd-b23c-3c2586344a50",
"name": "Fetch Sitemap",
"type": "n8n-nodes-base.httpRequest",
"position": [
1632,
1392
],
"parameters": {
"url": "={{ $json['Sitemap URL'] }}",
"options": {}
},
"typeVersion": 4.2
},
{
"id": "520e131d-b5f2-4857-aebd-5724da2a8083",
"name": "Split Pages URL",
"type": "n8n-nodes-base.code",
"position": [
1792,
1216
],
"parameters": {
"jsCode": "// Ensure the URL ends with a slash.\nfunction addTrailingSlash(str) {\n if (typeof str !== 'string') {\n return str; // Or throw an error, handle non-string inputs\n }\n if (!str.endsWith('/')) {\n return str + '/';\n }\n return str;\n}\n\nconst urls = [];\nfor (const item of $input.first().json['Page URLs'].split(',')) {\n const trimmed = item.trim(); // Trim first so whitespace cannot end up before the slash\n if (trimmed) {\n urls.push({ url: addTrailingSlash(trimmed) });\n }\n}\n\nreturn urls;"
},
"typeVersion": 2
},
{
"id": "7e7fe528-8748-470b-b627-a0c79b5aface",
"name": "Merge URLs",
"type": "n8n-nodes-base.merge",
"position": [
2128,
1232
],
"parameters": {},
"typeVersion": 3.2
},
{
"id": "a0517aaf-6ccd-481d-b97e-b183d305451b",
"name": "Remove Duplicate URLs",
"type": "n8n-nodes-base.removeDuplicates",
"position": [
2272,
1232
],
"parameters": {
"options": {}
},
"typeVersion": 2
},
{
"id": "72c85ccf-a9d6-42b1-85a7-76800ba831e5",
"name": "Loop Over Page URLs",
"type": "n8n-nodes-base.splitInBatches",
"position": [
2480,
1232
],
"parameters": {
"options": {}
},
"typeVersion": 3
},
{
"id": "73aebd19-60ae-40d1-a747-0b9537d9d67c",
"name": "Extract Content",
"type": "n8n-nodes-base.html",
"position": [
2672,
1136
],
"parameters": {
"options": {
"cleanUpText": true
},
"operation": "extractHtmlContent",
"extractionValues": {
"values": [
{
"key": "content",
"cssSelector": "body",
"skipSelectors": "img"
}
]
}
},
"typeVersion": 1.2
},
{
"id": "0dbf70c1-cb57-4691-916f-2a2aa9a4cec0",
"name": "Fetch Page HTML For content",
"type": "n8n-nodes-base.httpRequest",
"position": [
2672,
1328
],
"parameters": {
"url": "={{ $json.url }}",
"options": {}
},
"typeVersion": 4.2
},
{
"id": "fa1c18c6-6c29-4e71-905e-0945909af99b",
"name": "Wait 5 sec",
"type": "n8n-nodes-base.wait",
"position": [
2832,
1328
],
"webhookId": "9d87e60f-9df8-4a13-9c22-e3e5a5bb9c0e",
"parameters": {},
"typeVersion": 1.1
},
{
"id": "2bf3ad7f-a2fd-44f9-b6af-5a500ef80591",
"name": "Data Loader",
"type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
"position": [
3264,
1344
],
"parameters": {
"options": {}
},
"typeVersion": 1.1
},
{
"id": "a86d4c2e-559c-4942-ac0d-2ddcc7eb7f39",
"name": "Gemini Embeddings",
"type": "@n8n/n8n-nodes-langchain.embeddingsGoogleGemini",
"position": [
3072,
1344
],
"parameters": {
"modelName": "models/gemini-embedding-001"
},
"typeVersion": 1
},
{
"id": "f46188bd-c0a2-4d49-9b67-0937f891ae36",
"name": "Pinecone KnowledgeBase",
"type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
"position": [
3072,
1136
],
"parameters": {
"mode": "insert",
"options": {
"clearNamespace": true
}
},
"typeVersion": 1.3
},
{
"id": "4f5dc6e3-8f75-46ab-b3e1-49deb7695469",
"name": "Input Sitemap or page urls",
"type": "n8n-nodes-base.formTrigger",
"position": [
1296,
1376
],
"webhookId": "ab54a2cd-2eda-4cf7-b822-8fb49ecb257e",
"parameters": {
"options": {},
"formTitle": "Agent Knowledge Base Input",
"formFields": {
"values": [
{
"fieldLabel": "Sitemap URL",
"placeholder": "https://website.com/page-sitemap.xml"
},
{
"fieldType": "textarea",
"fieldLabel": "Page URLs",
"placeholder": "https://website.com/about, https://website.com/contact"
}
]
},
"formDescription": "This form is to input the page sitemap or pages of your website"
},
"typeVersion": 2.2
},
{
"id": "67f6e98a-946c-4460-93d4-707511deb4f5",
"name": "Switch",
"type": "n8n-nodes-base.switch",
"position": [
1440,
1376
],
"parameters": {
"rules": {
"values": [
{
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "2af7e15b-2e56-40e5-addc-74bd0b4de214",
"operator": {
"type": "string",
"operation": "notEmpty",
"singleValue": true
},
"leftValue": "={{ $json['Page URLs'] }}",
"rightValue": ""
}
]
}
},
{
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "02899ab6-0c0b-4c0f-89ad-ec5787da36eb",
"operator": {
"type": "string",
"operation": "endsWith"
},
"leftValue": "={{ $json['Sitemap URL'] }}",
"rightValue": "xml"
}
]
}
}
]
},
"options": {
"allMatchingOutputs": true
}
},
"typeVersion": 3.2
}
],
"connections": {
"Switch": {
"main": [
[
{
"node": "Split Pages URL",
"type": "main",
"index": 0
}
],
[
{
"node": "Fetch Sitemap",
"type": "main",
"index": 0
}
]
]
},
"Merge URLs": {
"main": [
[
{
"node": "Remove Duplicate URLs",
"type": "main",
"index": 0
}
]
]
},
"Wait 5 sec": {
"main": [
[
{
"node": "Loop Over Page URLs",
"type": "main",
"index": 0
}
]
]
},
"Data Loader": {
"ai_document": [
[
{
"node": "Pinecone KnowledgeBase",
"type": "ai_document",
"index": 0
}
]
]
},
"Fetch Sitemap": {
"main": [
[
{
"node": "XML Conversion",
"type": "main",
"index": 0
}
]
]
},
"XML Conversion": {
"main": [
[
{
"node": "Extract Page URLs",
"type": "main",
"index": 0
}
]
]
},
"Extract Content": {
"main": [
[
{
"node": "Pinecone KnowledgeBase",
"type": "main",
"index": 0
}
]
]
},
"Split Pages URL": {
"main": [
[
{
"node": "Merge URLs",
"type": "main",
"index": 0
}
]
]
},
"Extract Page URLs": {
"main": [
[
{
"node": "Merge URLs",
"type": "main",
"index": 1
}
]
]
},
"Gemini Embeddings": {
"ai_embedding": [
[
{
"node": "Pinecone KnowledgeBase",
"type": "ai_embedding",
"index": 0
}
]
]
},
"Loop Over Page URLs": {
"main": [
[
{
"node": "Extract Content",
"type": "main",
"index": 0
}
],
[
{
"node": "Fetch Page HTML For content",
"type": "main",
"index": 0
}
]
]
},
"Remove Duplicate URLs": {
"main": [
[
{
"node": "Loop Over Page URLs",
"type": "main",
"index": 0
}
]
]
},
"Input Sitemap or page urls": {
"main": [
[
{
"node": "Switch",
"type": "main",
"index": 0
}
]
]
},
"Fetch Page HTML For content": {
"main": [
[
{
"node": "Wait 5 sec",
"type": "main",
"index": 0
}
]
]
}
}
}
How do I use this workflow?
Copy the JSON configuration above, create a new workflow in your n8n instance, select "Import from JSON", paste the configuration, and adjust the authentication settings as needed.
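To get a feel for what the URL consolidation stage of the workflow does before importing it, the following is a minimal plain-JavaScript sketch of the "Split Pages URL", "Extract Page URLs", "Merge URLs", and "Remove Duplicate URLs" steps. It runs outside n8n; the function names (`splitPageUrls`, `extractSitemapUrls`, `mergeUnique`) and the sample URLs are hypothetical, and the sitemap object shape is an assumption based on what the "Extract Page URLs" Code node reads.

```javascript
// Sketch only: plain-JavaScript approximations of the workflow's URL stage.
// The function names below are hypothetical and are not part of n8n.

// Approximates "Split Pages URL": split a comma-separated string into
// trimmed URLs that each end with a slash, skipping empty entries.
function splitPageUrls(pageUrls) {
  return pageUrls
    .split(',')
    .map((u) => u.trim())
    .filter((u) => u.length > 0)
    .map((u) => (u.endsWith('/') ? u : u + '/'));
}

// Approximates "Extract Page URLs": read the loc values from a parsed
// sitemap, assumed to look like { urlset: { url: [{ loc: '...' }] } }.
// [].concat also covers the case where `url` is a single object.
function extractSitemapUrls(sitemap) {
  return [].concat(sitemap.urlset.url).map((entry) => entry.loc);
}

// Approximates "Merge URLs" followed by "Remove Duplicate URLs".
function mergeUnique(...lists) {
  return [...new Set(lists.flat())];
}

const fromForm = splitPageUrls(
  'https://website.com/about, https://website.com/contact/'
);
const fromSitemap = extractSitemapUrls({
  urlset: {
    url: [
      { loc: 'https://website.com/about/' },
      { loc: 'https://website.com/pricing/' },
    ],
  },
});
const allUrls = mergeUnique(fromForm, fromSitemap);
console.log(allUrls);
```

In this sketch the `about` page appears in both inputs and survives only once, which is the effect the Merge and Remove Duplicates nodes are meant to have before the page loop starts.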
Which scenarios is this workflow suited for?
Advanced - Document extraction, AI RAG
Is it paid?
This workflow is completely free and can be used directly. Note that the third-party services used in the workflow (such as the Google Gemini API and Pinecone) may require payment on your part.
Zain Khan
@zain
I partner with businesses to streamline processes and accelerate growth through intelligent AI automation and web/mobile development. Leveraging deep expertise in GPT-4, LangChain, and n8n, I develop AI-powered agents and sophisticated LLM pipelines.