Fetch all website content and store it in Pinecone with Gemini embeddings

Expert

This is an automation workflow in the Document Extraction and AI RAG categories with 16 nodes. It mainly uses XML, Code, HTML, Wait, Merge, and other nodes. It fetches the content of every page of a website, embeds it with Gemini, and stores the embeddings in Pinecone.

Prerequisites
  • Credentials for the target API may be required
  • Pinecone API key
Export workflow
Copy the following JSON configuration and import it into n8n
{
  "nodes": [
    {
      "id": "5ad6a510-3c4a-47e4-b8ff-c0e565e25d25",
      "name": "Sticky Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        368,
        944
      ],
      "parameters": {
        "width": 832,
        "height": 816,
        "content": "This n8n workflow builds a Pinecone knowledge base from website content, handling both sitemap and direct URL inputs.\n---\n### 1. URL Input & Consolidation\n\nThis section gathers and refines the URLs to be processed.\n* **Input Sitemap or page URLs (Form Trigger):** Start by providing a sitemap URL or a list of specific page URLs.\n* **Switch:** Routes input based on whether a sitemap or individual URLs are provided.\n* **Split Pages URL (Code):** Parses and cleans individual page URLs.\n* **Fetch Sitemap (HTTP Request):** Downloads the sitemap XML.\n* **XML Conversion (XML):** Converts sitemap XML to JSON.\n* **Extract Page URLs (Code):** Pulls page URLs from the JSON sitemap.\n* **Merge URLs (Merge):** Combines all URLs into one list.\n* **Remove Duplicate URLs (Remove Duplicates):** Eliminates any duplicate URLs.\n---\n\n### 2. Content Extraction\n\nThis section fetches and cleans content from each unique URL.\n* **Loop Over Page URLs (Split In Batches):** Processes URLs in batches.\n* **Fetch Page HTML For content (HTTP Request):** Downloads HTML for each page.\n* **Wait 5 sec (Wait):** Adds a 5-second delay to avoid overwhelming websites.\n* **Extract Content (HTML):** Extracts main text content from the HTML, skipping images and cleaning the text.\n\n---\n\n### 3. Embedding & Pinecone Storage\n\nThe final stage transforms content into vector embeddings and stores them in Pinecone.\n* **Gemini Embeddings (Embeddings):** Converts extracted text into 3072-dimensional vector embeddings using the `models/gemini-embedding-001` model.\n* **Data Loader (Document Loader):** Prepares content as documents for the vector store.\n* **Pinecone KnowledgeBase (Vector Store):** Inserts the generated embeddings and content into the \"supportbot\" Pinecone index, clearing existing data in the namespace first."
      },
      "typeVersion": 1
    },
    {
      "id": "3ff777b7-24bd-420c-af38-62a395f52a1a",
      "name": "Extract Page URLs",
      "type": "n8n-nodes-base.code",
      "position": [
        1936,
        1392
      ],
      "parameters": {
        "jsCode": "const items = []\nfor (const item of $input.first().json.urlset.url) {\n  items.push({ url: item.loc })\n}\n\nreturn items;"
      },
      "typeVersion": 2
    },
    {
      "id": "6176e651-cef5-44e8-abed-0f6f6b81517b",
      "name": "XML Conversion",
      "type": "n8n-nodes-base.xml",
      "position": [
        1792,
        1392
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 1
    },
    {
      "id": "cca1e7e7-32f6-42fd-b23c-3c2586344a50",
      "name": "Fetch Sitemap",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        1632,
        1392
      ],
      "parameters": {
        "url": "={{ $json['Sitemap URL'] }}",
        "options": {}
      },
      "typeVersion": 4.2
    },
    {
      "id": "520e131d-b5f2-4857-aebd-5724da2a8083",
      "name": "Split Pages URL",
      "type": "n8n-nodes-base.code",
      "position": [
        1792,
        1216
      ],
      "parameters": {
        "jsCode": "function addTrailingSlash(str) {\n  if (typeof str !== 'string') {\n    return str; // Or throw an error, handle non-string inputs\n  }\n  if (!str.endsWith('/')) {\n    return str + '/';\n  }\n  return str;\n}\n\nconst urls = []\nfor (const item of $input.first().json['Page URLs'].split(',')) {\n  // Trim first so the slash is appended to the URL, not to trailing whitespace\n  const url = item.trim()\n  // Skip empty entries such as those left by a trailing comma\n  if (url) urls.push({ url: addTrailingSlash(url) })\n}\n\nreturn urls;"
      },
      "typeVersion": 2
    },
    {
      "id": "7e7fe528-8748-470b-b627-a0c79b5aface",
      "name": "Merge URLs",
      "type": "n8n-nodes-base.merge",
      "position": [
        2128,
        1232
      ],
      "parameters": {},
      "typeVersion": 3.2
    },
    {
      "id": "a0517aaf-6ccd-481d-b97e-b183d305451b",
      "name": "Remove Duplicate URLs",
      "type": "n8n-nodes-base.removeDuplicates",
      "position": [
        2272,
        1232
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 2
    },
    {
      "id": "72c85ccf-a9d6-42b1-85a7-76800ba831e5",
      "name": "Loop Over Page URLs",
      "type": "n8n-nodes-base.splitInBatches",
      "position": [
        2480,
        1232
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 3
    },
    {
      "id": "73aebd19-60ae-40d1-a747-0b9537d9d67c",
      "name": "Extract Content",
      "type": "n8n-nodes-base.html",
      "position": [
        2672,
        1136
      ],
      "parameters": {
        "options": {
          "cleanUpText": true
        },
        "operation": "extractHtmlContent",
        "extractionValues": {
          "values": [
            {
              "key": "content",
              "cssSelector": "body",
              "skipSelectors": "img"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "0dbf70c1-cb57-4691-916f-2a2aa9a4cec0",
      "name": "Fetch Page HTML For content",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        2672,
        1328
      ],
      "parameters": {
        "url": "={{ $json.url }}",
        "options": {}
      },
      "typeVersion": 4.2
    },
    {
      "id": "fa1c18c6-6c29-4e71-905e-0945909af99b",
      "name": "Wait 5 sec",
      "type": "n8n-nodes-base.wait",
      "position": [
        2832,
        1328
      ],
      "webhookId": "9d87e60f-9df8-4a13-9c22-e3e5a5bb9c0e",
      "parameters": {},
      "typeVersion": 1.1
    },
    {
      "id": "2bf3ad7f-a2fd-44f9-b6af-5a500ef80591",
      "name": "Data Loader",
      "type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
      "position": [
        3264,
        1344
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 1.1
    },
    {
      "id": "a86d4c2e-559c-4942-ac0d-2ddcc7eb7f39",
      "name": "Gemini Embeddings",
      "type": "@n8n/n8n-nodes-langchain.embeddingsGoogleGemini",
      "position": [
        3072,
        1344
      ],
      "parameters": {
        "modelName": "models/gemini-embedding-001"
      },
      "typeVersion": 1
    },
    {
      "id": "f46188bd-c0a2-4d49-9b67-0937f891ae36",
      "name": "Pinecone KnowledgeBase",
      "type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
      "position": [
        3072,
        1136
      ],
      "parameters": {
        "mode": "insert",
        "options": {
          "clearNamespace": true
        }
      },
      "typeVersion": 1.3
    },
    {
      "id": "4f5dc6e3-8f75-46ab-b3e1-49deb7695469",
      "name": "Input Sitemap or page urls",
      "type": "n8n-nodes-base.formTrigger",
      "position": [
        1296,
        1376
      ],
      "webhookId": "ab54a2cd-2eda-4cf7-b822-8fb49ecb257e",
      "parameters": {
        "options": {},
        "formTitle": "Agent Knowledge Base Input",
        "formFields": {
          "values": [
            {
              "fieldLabel": "Sitemap URL",
              "placeholder": "https://website.com/page-sitemap.xml"
            },
            {
              "fieldType": "textarea",
              "fieldLabel": "Page URLs",
              "placeholder": "https://website.com/about, https://website.com/contact"
            }
          ]
        },
        "formDescription": "This form is to input the page sitemap or pages of your website"
      },
      "typeVersion": 2.2
    },
    {
      "id": "67f6e98a-946c-4460-93d4-707511deb4f5",
      "name": "Switch",
      "type": "n8n-nodes-base.switch",
      "position": [
        1440,
        1376
      ],
      "parameters": {
        "rules": {
          "values": [
            {
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "2af7e15b-2e56-40e5-addc-74bd0b4de214",
                    "operator": {
                      "type": "string",
                      "operation": "notEmpty",
                      "singleValue": true
                    },
                    "leftValue": "={{ $json['Page URLs'] }}",
                    "rightValue": ""
                  }
                ]
              }
            },
            {
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "02899ab6-0c0b-4c0f-89ad-ec5787da36eb",
                    "operator": {
                      "type": "string",
                      "operation": "endsWith"
                    },
                    "leftValue": "={{ $json['Sitemap URL'] }}",
                    "rightValue": "xml"
                  }
                ]
              }
            }
          ]
        },
        "options": {
          "allMatchingOutputs": true
        }
      },
      "typeVersion": 3.2
    }
  ],
  "connections": {
    "Switch": {
      "main": [
        [
          {
            "node": "Split Pages URL",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Fetch Sitemap",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Merge URLs": {
      "main": [
        [
          {
            "node": "Remove Duplicate URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Wait 5 sec": {
      "main": [
        [
          {
            "node": "Loop Over Page URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Data Loader": {
      "ai_document": [
        [
          {
            "node": "Pinecone KnowledgeBase",
            "type": "ai_document",
            "index": 0
          }
        ]
      ]
    },
    "Fetch Sitemap": {
      "main": [
        [
          {
            "node": "XML Conversion",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "XML Conversion": {
      "main": [
        [
          {
            "node": "Extract Page URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract Content": {
      "main": [
        [
          {
            "node": "Pinecone KnowledgeBase",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Split Pages URL": {
      "main": [
        [
          {
            "node": "Merge URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract Page URLs": {
      "main": [
        [
          {
            "node": "Merge URLs",
            "type": "main",
            "index": 1
          }
        ]
      ]
    },
    "Gemini Embeddings": {
      "ai_embedding": [
        [
          {
            "node": "Pinecone KnowledgeBase",
            "type": "ai_embedding",
            "index": 0
          }
        ]
      ]
    },
    "Loop Over Page URLs": {
      "main": [
        [
          {
            "node": "Extract Content",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Fetch Page HTML For content",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Remove Duplicate URLs": {
      "main": [
        [
          {
            "node": "Loop Over Page URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Input Sitemap or page urls": {
      "main": [
        [
          {
            "node": "Switch",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Fetch Page HTML For content": {
      "main": [
        [
          {
            "node": "Wait 5 sec",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
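The URL input stage of the workflow above (Split Pages URL, Extract Page URLs, Merge URLs, Remove Duplicate URLs) can be sketched in plain JavaScript outside n8n. This is a minimal illustration, not the workflow's own code: the function names `splitPageUrls`, `extractSitemapUrls`, and `mergeUnique` are hypothetical, and the sitemap object is assumed to have the `urlset.url[].loc` shape that the Extract Page URLs node reads.

```javascript
// Hypothetical helper names; none of these functions are part of n8n.

// Append a trailing slash so ".../about" and ".../about/" collapse to one entry.
function addTrailingSlash(url) {
  return url.endsWith('/') ? url : url + '/';
}

// Mirrors the "Split Pages URL" step: split on commas, trim, drop empty entries.
function splitPageUrls(pageUrls) {
  return pageUrls
    .split(',')
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map(addTrailingSlash);
}

// Mirrors the "Extract Page URLs" step. An XML-to-JSON conversion may yield a
// single object instead of an array when the sitemap lists only one <url>,
// so the value is normalized to an array first (an assumption about the converter).
function extractSitemapUrls(sitemapJson) {
  const entries = sitemapJson?.urlset?.url ?? [];
  const list = Array.isArray(entries) ? entries : [entries];
  return list.map((entry) => entry.loc).filter((loc) => typeof loc === 'string');
}

// Mirrors "Merge URLs" followed by "Remove Duplicate URLs".
function mergeUnique(...lists) {
  return [...new Set(lists.flat())];
}

const fromForm = splitPageUrls(' https://website.com/about , https://website.com/contact/ ,');
const fromSitemap = extractSitemapUrls({
  urlset: { url: { loc: 'https://website.com/about/' } },
});
const allUrls = mergeUnique(fromForm, fromSitemap);
console.log(allUrls);
```

Trimming before appending the slash matters: otherwise an entry followed by a space would end in " /" and no longer match its duplicate.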
Frequently asked questions

How do I use this workflow?

Copy the JSON above, create a new workflow in your n8n instance, and choose "Import from JSON". Paste the configuration and adjust the credentials as needed.
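Before importing, it can help to check that every name used in `connections` matches the `name` of a node, since a mismatch may leave nodes unconnected after import. The sketch below assumes that n8n keys connections by node name; the function `findUnknownConnectionNames` and the sample object are hypothetical and only illustrate the check.

```javascript
// Returns the names used in "connections" that do not match any node's "name".
// Assumption: connections are keyed by node name, and each target's "node"
// field is also a node name.
function findUnknownConnectionNames(workflow) {
  const nodeNames = new Set(workflow.nodes.map((node) => node.name));
  const unknown = new Set();

  for (const [source, outputsByType] of Object.entries(workflow.connections ?? {})) {
    if (!nodeNames.has(source)) unknown.add(source);
    // outputsByType looks like { main: [[{ node, type, index }], ...] }
    for (const outputs of Object.values(outputsByType)) {
      for (const targets of outputs) {
        for (const target of targets ?? []) {
          if (!nodeNames.has(target.node)) unknown.add(target.node);
        }
      }
    }
  }
  return [...unknown];
}

// Tiny hand-made sample: "Wait 5 sec" and "Loop Over Page URLs" are referenced
// in connections but are missing from the node list.
const sample = {
  nodes: [{ name: 'Switch' }, { name: 'Fetch Sitemap' }],
  connections: {
    Switch: { main: [[{ node: 'Fetch Sitemap', type: 'main', index: 0 }]] },
    'Wait 5 sec': { main: [[{ node: 'Loop Over Page URLs', type: 'main', index: 0 }]] },
  },
};
const unknownNames = findUnknownConnectionNames(sample);
console.log(unknownNames);
```

To run the check on a real export, parse the copied JSON with `JSON.parse` and pass the result to the function; an empty array means every connection resolves to a node.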

Which scenarios is this workflow suited for?

Expert - Document Extraction, AI RAG

Does it cost anything?

This workflow is completely free. Note, however, that the third-party services it uses (such as the Google Gemini API and Pinecone) may charge fees.

Workflow information
Difficulty
Expert
Number of nodes: 16
Categories: 2
Node types: 14
Difficulty description

For advanced users; complex workflows with 16+ nodes

Author
Zain Khan

@zain

I partner with businesses to streamline processes and accelerate growth through intelligent AI automation and Web/mobile Development. Leveraging deep expertise in GPT-4, LangChain, and n8n, I develop AI-powered agents and sophisticated LLM pipelines.
