PDF-Inhalte parsen, normieren, extrahieren und in Pinecone für RAG speichern

Name: PDF-Inhalte parsen, normieren, extrahieren und in Pinecone für RAG speichern
Rating: 4.5 (10 reviews)
Author: Alok Kumar

Experte

Dies ist ein AI RAG, Multimodal AI-Bereich Automatisierungsworkflow mit 18 Nodes. Hauptsächlich werden If, Code, Wait, GoogleDrive, HttpRequest und andere Nodes verwendet. PDF-Frage-Antwort-System mit LlamaIndex, OpenAI-Einbettungen und Pinecone-Vektordatenbank bauen

Voraussetzungen

•Google Drive API-Anmeldedaten
•Möglicherweise sind Ziel-API-Anmeldedaten erforderlich
•OpenAI API Key
•Pinecone API Key

Verwendete Nodes (18)

DocumentDefaultDataLoader

TextSplitterRecursiveCharacterTextSplitter

Kategorie

KI RAG

Multimodales KI

Workflow-Vorschau

Visualisierung der Node-Verbindungen, mit Zoom und Pan

Google Drive Trigger

Datei herunterladen

Default Data Loader

Warten

Wait2

Upload zu Llama Cloud

Parsing-Status prüfen

Markdown aus Llama Cloud extrahieren

Text normalisieren

Text chunking

Embeddings generieren

In Pinecone speichern

React Flow

Workflow exportieren

Kopieren Sie die folgende JSON-Konfiguration und importieren Sie sie in n8n

{
  "id": "xDiuqZUZnShKpPzX",
  "meta": {
    "instanceId": "70273a2379644db63ce659827cfd8abac2d0b189210eafa02dd5376e3a62cd1d",
    "templateCredsSetupCompleted": true
  },
  "name": "Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone",
  "tags": [],
  "nodes": [
    {
      "id": "19b009db-a418-458c-a216-bdcc9af6fd2f",
      "name": "Google Drive Trigger",
      "type": "n8n-nodes-base.googleDriveTrigger",
      "position": [
        -1504,
        2080
      ],
      "parameters": {
        "event": "fileCreated",
        "options": {},
        "pollTimes": {
          "item": [
            {
              "mode": "everyMinute"
            }
          ]
        },
        "triggerOn": "specificFolder",
        "folderToWatch": {
          "__rl": true,
          "mode": "list",
          "value": ""
        }
      },
      "credentials": {
        "googleDriveOAuth2Api": {
          "id": "aU33fzddE6s3ZQw6",
          "name": "LearnBy-Google-Drive"
        }
      },
      "typeVersion": 1
    },
    {
      "id": "ff933f76-d719-40b5-b193-8a29e5fa2197",
      "name": "Datei herunterladen",
      "type": "n8n-nodes-base.googleDrive",
      "position": [
        -1248,
        2096
      ],
      "parameters": {
        "fileId": {
          "__rl": true,
          "mode": "id",
          "value": "={{ $json.id }}"
        },
        "options": {},
        "operation": "download"
      },
      "credentials": {
        "googleDriveOAuth2Api": {
          "id": "aU33fzddE6s3ZQw6",
          "name": "LearnBy-Google-Drive"
        }
      },
      "typeVersion": 3
    },
    {
      "id": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
      "name": "Default Data Loader",
      "type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
      "position": [
        528,
        2192
      ],
      "parameters": {
        "options": {},
        "textSplittingMode": "custom"
      },
      "typeVersion": 1.1
    },
    {
      "id": "0316c7d4-449f-4275-a9b1-8848545beba8",
      "name": "Haftnotiz",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        336,
        1712
      ],
      "parameters": {
        "width": 736,
        "height": 832,
        "content": "## Save to Vector DB"
      },
      "typeVersion": 1
    },
    {
      "id": "b06702b5-c322-4a5a-949a-855c8b97dadc",
      "name": "Haftnotiz1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -1088,
        1720
      ],
      "parameters": {
        "color": 4,
        "width": 1392,
        "height": 656,
        "content": "## Prepare data - Parse and Normalize\n"
      },
      "typeVersion": 1
    },
    {
      "id": "05034e35-f6bf-45a6-860e-94f4da566daf",
      "name": "Warten",
      "type": "n8n-nodes-base.wait",
      "position": [
        -720,
        2088
      ],
      "webhookId": "a0518843-31f8-44f9-bd8e-1189e16de0f1",
      "parameters": {
        "amount": 30
      },
      "typeVersion": 1.1
    },
    {
      "id": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
      "name": "If",
      "type": "n8n-nodes-base.if",
      "position": [
        -272,
        2016
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "version": 2,
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "strict"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "7a07aec1-fc5f-4b76-94d9-6fa8f509ac8e",
              "operator": {
                "name": "filter.operator.equals",
                "type": "string",
                "operation": "equals"
              },
              "leftValue": "={{ $json.status }}",
              "rightValue": "SUCCESS"
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "28654aff-e603-4080-9e7a-e706aaee47c4",
      "name": "Wait2",
      "type": "n8n-nodes-base.wait",
      "position": [
        -48,
        2184
      ],
      "webhookId": "8da5da31-1ebd-4c82-8c6a-476d5d277cdd",
      "parameters": {
        "amount": 60
      },
      "typeVersion": 1.1
    },
    {
      "id": "ba935542-d2c0-4781-b6f9-5e1e007a9740",
      "name": "Haftnotiz2",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        0,
        1584
      ],
      "parameters": {
        "width": 288,
        "height": 352,
        "content": "## Normalized Content\n\n* Removes noise\n* Reduces duplication\n* Improves retrieval quality \n* Preserves context \n* Consistent format \n* Prevents wasted tokens \n\n**Note : Update the code based in your requirement**"
      },
      "typeVersion": 1
    },
    {
      "id": "45efcdf9-89a9-4638-a9ed-cac39506270f",
      "name": "Haftnotiz3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -2080,
        1472
      ],
      "parameters": {
        "width": 464,
        "height": 1200,
        "content": "## Try It Out!  \n### This n8n template demonstrates how to normalize, index, and query insurance PDFs using AI and Pinecone for a full **RAG (Retrieval-Augmented Generation)** workflow.  \n\n### Use cases include: creating **chatbots or Q&A systems** for structured documents, extracting insights from insurance policies, or managing compliance/legal PDFs efficiently.  \n\n---\n\n## How it works\n* New PDFs are automatically detected from a **Google Drive** folder.  \n* PDFs are sent to **LlamaIndex Cloud** for parsing → returns clean Markdown text.  \n* Text is normalized to remove headers, footers, page numbers, and formatting artifacts.  \n* The normalized text is split into chunks (~1200 characters with 150-character overlap) for better embedding.  \n* **OpenAI embeddings** are generated for each chunk.  \n* Chunks and metadata are stored in **Pinecone** for semantic search.  \n* A **Chat Agent** queries Pinecone to retrieve answers from your document vector database.  \n\n---\n\n### How to use\n* Update the folder name in google drive trigger node. \n* Place a pdf file in the same folder in google drive.\n* Customize the `Normalized Content` function node to adjust regex for headers/footers specific to your documents.  \n* Adjust chunk size or metadata namespace in the Pinecone node to fit your project needs.  \n\n---\n\n### Requirements\n* Google Drive account for PDF source files.  \n* **LlamaIndex Cloud** account (parsing API key).  \n* **Pinecone** account for vector storage.  \n* **OpenAI** account for model and embeddings.   \n\n---\n\n### Need Help?  \nask in the [n8n Forum](https://community.n8n.io/)!  \n\nHappy Automating! 🚀\n"
      },
      "typeVersion": 1
    },
    {
      "id": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
      "name": "Upload zu Llama Cloud",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        -944,
        2088
      ],
      "parameters": {
        "url": "https://api.cloud.llamaindex.ai/api/v1/parsing/upload",
        "method": "POST",
        "options": {},
        "sendBody": true,
        "contentType": "multipart-form-data",
        "sendHeaders": true,
        "authentication": "genericCredentialType",
        "bodyParameters": {
          "parameters": [
            {
              "name": "file",
              "parameterType": "formBinaryData",
              "inputDataFieldName": "data"
            }
          ]
        },
        "genericAuthType": "httpBearerAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "accept",
              "value": "application/json"
            },
            {
              "name": "Content-Type",
              "value": "multipart/form-data"
            }
          ]
        }
      },
      "credentials": {
        "httpBearerAuth": {
          "id": "FlAAm17M7G6as02l",
          "name": "learnby_llama_cloud"
        }
      },
      "executeOnce": false,
      "retryOnFail": true,
      "typeVersion": 4.2,
      "alwaysOutputData": false
    },
    {
      "id": "1199e4ff-1952-4225-b655-1f63875f8903",
      "name": "Parsing-Status prüfen",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        -496,
        2088
      ],
      "parameters": {
        "url": "=https://api.cloud.llamaindex.ai/api/parsing/job/{{ $('Upload to Llama Cloud').item.json.id }}",
        "options": {},
        "sendHeaders": true,
        "authentication": "genericCredentialType",
        "genericAuthType": "httpBearerAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "accept",
              "value": "application/json"
            }
          ]
        }
      },
      "credentials": {
        "httpBearerAuth": {
          "id": "FlAAm17M7G6as02l",
          "name": "learnby_llama_cloud"
        }
      },
      "retryOnFail": true,
      "typeVersion": 4.2
    },
    {
      "id": "cfaf9e10-0297-423a-b3f4-c25561c92078",
      "name": "Markdown aus Llama Cloud extrahieren",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        -48,
        1968
      ],
      "parameters": {
        "url": "=https://api.cloud.llamaindex.ai/api/v1/parsing/job/{{ $json.id }}/result/markdown",
        "options": {},
        "sendHeaders": true,
        "authentication": "genericCredentialType",
        "genericAuthType": "httpBearerAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "accept",
              "value": "application/json"
            }
          ]
        }
      },
      "credentials": {
        "httpBearerAuth": {
          "id": "FlAAm17M7G6as02l",
          "name": "learnby_llama_cloud"
        }
      },
      "retryOnFail": true,
      "typeVersion": 4.2
    },
    {
      "id": "564a0930-e80b-4db6-a62a-5224248e5cd9",
      "name": "Text normalisieren",
      "type": "n8n-nodes-base.code",
      "position": [
        176,
        1968
      ],
      "parameters": {
        "mode": "runOnceForEachItem",
        "jsCode": "// Get the input text from the previous node\nconst input = $json.markdown || $json.text || \"\";\n\nlet text = input.replace(/Car Insurance Policy\\s*\\d+/gi, \"\");\n\n// Remove \"Page X\" markers\ntext = text.replace(/Page\\s*\\d+/gi, \"\");\n\n// Replace --- dividers with a single newline\ntext = text.replace(/-{3,}/g, \"\\n\");\n\n// Decode & cleanup artifacts\ntext = text.replace(/&#x26;/g, \"&\");   // fix HTML entities\ntext = text.replace(/[ⓤ]/g, \"-\");      // replace bullet symbols with dashes\n\n// Collapse whitespace\ntext = text.replace(/\\n{2,}/g, \"\\n\\n\"); // keep paragraph breaks\ntext = text.replace(/[ \\t]+/g, \" \");    // collapse spaces\n\n// Step 5: Trim\ntext = text.trim();\n\n// Output for next node\nreturn { json: { normalizedText: text } };\n"
      },
      "typeVersion": 2
    },
    {
      "id": "fe32694a-2cbc-4ad4-88aa-4eb3dba0256c",
      "name": "Text chunking",
      "type": "@n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter",
      "position": [
        608,
        2400
      ],
      "parameters": {
        "options": {
          "splitCode": "markdown"
        },
        "chunkSize": 1200,
        "chunkOverlap": 150
      },
      "typeVersion": 1
    },
    {
      "id": "b399c9fa-03d7-4126-9026-674d091b9ddf",
      "name": "Embeddings generieren",
      "type": "@n8n/n8n-nodes-langchain.embeddingsOpenAi",
      "position": [
        400,
        2192
      ],
      "parameters": {
        "options": {}
      },
      "credentials": {
        "openAiApi": {
          "id": "Yj4Rt75fspowAEru",
          "name": "nextweb-openai"
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "1da27207-b77e-41d0-a249-5096ec8ac259",
      "name": "In Pinecone speichern",
      "type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
      "position": [
        432,
        1968
      ],
      "parameters": {
        "mode": "insert",
        "options": {
          "pineconeNamespace": "rag"
        },
        "pineconeIndex": {
          "__rl": true,
          "mode": "id",
          "value": "demo"
        }
      },
      "credentials": {
        "pineconeApi": {
          "id": "uo1lZDPNWTsMAeOC",
          "name": "learnby-PineconeApi-account"
        }
      },
      "notesInFlow": false,
      "typeVersion": 1.3
    },
    {
      "id": "4f65ec8b-f936-41cb-b05c-0cc710df1c9e",
      "name": "Haftnotiz4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -1568,
        1728
      ],
      "parameters": {
        "color": 6,
        "width": 464,
        "height": 640,
        "content": "## Extract Data"
      },
      "typeVersion": 1
    }
  ],
  "active": false,
  "pinData": {},
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "5ec0ee83-34cd-423d-8bd5-41400bde4a4a",
  "connections": {
    "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6": {
      "main": [
        [
          {
            "node": "cfaf9e10-0297-423a-b3f4-c25561c92078",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "28654aff-e603-4080-9e7a-e706aaee47c4",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "05034e35-f6bf-45a6-860e-94f4da566daf": {
      "main": [
        [
          {
            "node": "1199e4ff-1952-4225-b655-1f63875f8903",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "28654aff-e603-4080-9e7a-e706aaee47c4": {
      "main": [
        [
          {
            "node": "1199e4ff-1952-4225-b655-1f63875f8903",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "fe32694a-2cbc-4ad4-88aa-4eb3dba0256c": {
      "ai_textSplitter": [
        [
          {
            "node": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
            "type": "ai_textSplitter",
            "index": 0
          }
        ]
      ]
    },
    "ff933f76-d719-40b5-b193-8a29e5fa2197": {
      "main": [
        [
          {
            "node": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "564a0930-e80b-4db6-a62a-5224248e5cd9": {
      "main": [
        [
          {
            "node": "1da27207-b77e-41d0-a249-5096ec8ac259",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "1da27207-b77e-41d0-a249-5096ec8ac259": {
      "main": [
        []
      ]
    },
    "127b41ed-ad45-4234-b87f-4f3c2b6ea531": {
      "ai_document": [
        [
          {
            "node": "1da27207-b77e-41d0-a249-5096ec8ac259",
            "type": "ai_document",
            "index": 0
          }
        ]
      ]
    },
    "b399c9fa-03d7-4126-9026-674d091b9ddf": {
      "ai_embedding": [
        [
          {
            "node": "1da27207-b77e-41d0-a249-5096ec8ac259",
            "type": "ai_embedding",
            "index": 0
          }
        ]
      ]
    },
    "1199e4ff-1952-4225-b655-1f63875f8903": {
      "main": [
        [
          {
            "node": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "19b009db-a418-458c-a216-bdcc9af6fd2f": {
      "main": [
        [
          {
            "node": "ff933f76-d719-40b5-b193-8a29e5fa2197",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1": {
      "main": [
        [
          {
            "node": "05034e35-f6bf-45a6-860e-94f4da566daf",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "cfaf9e10-0297-423a-b3f4-c25561c92078": {
      "main": [
        [
          {
            "node": "564a0930-e80b-4db6-a62a-5224248e5cd9",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}

Häufig gestellte Fragen

Wie verwende ich diesen Workflow?

Kopieren Sie den obigen JSON-Code, erstellen Sie einen neuen Workflow in Ihrer n8n-Instanz und wählen Sie "Aus JSON importieren". Fügen Sie die Konfiguration ein und passen Sie die Anmeldedaten nach Bedarf an.

Für welche Szenarien ist dieser Workflow geeignet?

Experte - KI RAG, Multimodales KI

Ist es kostenpflichtig?

Dieser Workflow ist völlig kostenlos. Beachten Sie jedoch, dass Drittanbieterdienste (wie OpenAI API), die im Workflow verwendet werden, möglicherweise kostenpflichtig sind.