Beyond Basic RAG: Mastering Hybrid Search and Document Parsing in Data 360
Retrieval-Augmented Generation (RAG) acts as an intelligent data layer, allowing Agentforce agents to bypass their training limitations and deliver highly contextual, accurate answers. However, while setting up a basic RAG pipeline is straightforward, standard semantic search often struggles when confronted with messy real-world enterprise data, complex PDF layouts, or highly specific industry jargon. Effective RAG systems thrive on well-prepared, relevant data, making data preprocessing and unstructured document curation foundational for your agent’s success.
For architects and developers looking to achieve pinpoint retrieval accuracy, Data 360 offers advanced techniques to control exactly how your unstructured data is parsed, indexed, and searched.
Perfecting extraction with the Intelligent Context workspace
Historically, configuring unstructured data ingestion felt like a “black box” process. To solve this, developers can now use Intelligent Context, an AI-powered workspace within Data 360 designed to let you interactively test and refine how unstructured data is processed before building a full search index.
Instead of relying on rigid, pre-defined rules, you can apply prompt-based customization to fix specific extraction issues using natural language. For example, you can instruct the parser to “Extract the drivetrain table as markdown” or “Ignore the legal disclaimer footer”. The workspace also supports customizable “lenses,” allowing you to extract different types of information from the exact same document, such as a Sales Lens for pricing data and a Service Lens for troubleshooting steps.
The workflow is highly iterative: you upload a small sample set of documents, configure the parsing strategy, and ask questions in an agent chat window to verify that the resulting chunks and vectors retrieve the correct information. Once the logic is perfected, you publish the configuration to an Unstructured Data Model Object (UDMO) to scale across millions of live records.
Choosing your parsing engine: LLM vs. Docling
When configuring your indexing strategy in Intelligent Context or the Advanced Search Index setup, you must select the underlying parsing method. This choice fundamentally dictates how the system reconstructs table structures and document layouts.
- LLM-Based parsing: This method uses multimodal Large Language Models (like GPT-4o) to visually “read” the document and interpret its meaning. It is highly adaptable and best suited for “gnarly,” noisy, or highly irregular documents where semantic understanding is required to extract the structure. The trade-off is that LLM parsing is inference-heavy, so it is slower and typically incurs higher API costs.
- Docling parsing: Powered by IBM, Docling uses specialized computer vision and OCR models optimized specifically for document layout analysis. It is the superior choice for structured, “born-digital” PDFs, manuals, and reports. Docling is exceptionally fast and highly effective at maintaining logical reading orders and reconstructing complex table structures into clean Markdown or JSON.
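The guidance above can be condensed into a small decision helper. This is only an illustrative sketch: `DocProfile`, `choose_parser`, and the returned labels are hypothetical names invented for this example, not part of any Data 360 API, and the actual selection is made in the Intelligent Context or Advanced Search Index setup.

```python
from dataclasses import dataclass


@dataclass
class DocProfile:
    """Hypothetical summary of a document set (not a Data 360 type)."""
    born_digital: bool      # has a real text layer, i.e. not a scan or photo
    irregular_layout: bool  # noisy or highly irregular structure


def choose_parser(profile: DocProfile) -> str:
    """Return a suggested parsing method based on the trade-offs above."""
    # Structured, born-digital documents: prefer the faster layout-analysis parser.
    if profile.born_digital and not profile.irregular_layout:
        return "docling"
    # Noisy or irregular documents: accept higher cost for semantic understanding.
    return "llm"


manual = DocProfile(born_digital=True, irregular_layout=False)
scanned_form = DocProfile(born_digital=False, irregular_layout=True)
print(choose_parser(manual), choose_parser(scanned_form))
```

In practice, the sample-and-verify loop in Intelligent Context is the better arbiter: a heuristic like this only suggests which parser to try first.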
Supercharging recall with Enriched Indexing
Even with perfect parsing, retrieving the right chunk based on a user’s conversational query can be difficult if the text is dense. Enriched Indexing is a strategy that generates additional chunks during the indexing phase to significantly improve search recall and precision.
When enabled, the system automatically generates three distinct types of chunks for every passage:
- PLAIN chunks: The raw content extracted directly from the original document.
- QUESTION chunks: A set of LLM-generated synthetic questions that the original chunk answers. Because users typically interact with agents by asking questions, vectorizing these synthetic questions minimizes the semantic mismatch between the user’s intent and the document’s declarative text.
- METADATA chunks: LLM-generated metadata based on the plain chunk, including up to 10 keywords, key entities, top topics, sentiment, a concise title, and a brief summary.
This automated chunk enrichment acts as an alternative to intensive manual content curation, drastically improving the system’s ability to identify the right chunks without relying on developers to manually annotate files. Although a search may match on the vectors of QUESTION chunks, prompt augmentation automatically uses the corresponding PLAIN chunks, so the LLM receives the actual source text rather than the synthetic questions.
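To make the relationship between the three chunk types concrete, here is a minimal sketch of how enriched chunks might point back to their PLAIN source. The `Chunk` class, `build_enriched`, and `resolve_to_plain` are hypothetical names for illustration; the real generation and resolution happen inside Data 360, and the questions and metadata would come from an LLM rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    kind: str       # "PLAIN", "QUESTION", or "METADATA"
    text: str
    source_id: str  # chunk_id of the PLAIN chunk this one was derived from


def build_enriched(plain_id, plain_text, questions, metadata_text):
    """Create one PLAIN chunk plus its derived QUESTION and METADATA chunks."""
    chunks = [Chunk(plain_id, "PLAIN", plain_text, plain_id)]
    for i, question in enumerate(questions):
        chunks.append(Chunk(f"{plain_id}-q{i}", "QUESTION", question, plain_id))
    chunks.append(Chunk(f"{plain_id}-m", "METADATA", metadata_text, plain_id))
    return chunks


def resolve_to_plain(hits, index):
    """Map hits of any kind to unique PLAIN texts, keeping rank order."""
    seen, context = set(), []
    for hit in hits:
        if hit.source_id not in seen:
            seen.add(hit.source_id)
            context.append(index[hit.source_id].text)
    return context


chunks = build_enriched(
    "c1",
    "Hold the reset button for 10 seconds to restore factory settings.",
    ["How do I reset the device?"],
    "title: Factory reset; keywords: reset, factory settings",
)
index = {c.chunk_id: c for c in chunks}

# Suppose the search matched the QUESTION chunk and the METADATA chunk.
context = resolve_to_plain([chunks[1], chunks[2]], index)
print(context)
```

Both hits resolve to the same PLAIN chunk, so the prompt receives the source passage once rather than the synthetic question or metadata text.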
Standard semantic search vs. Hybrid Search
Finally, architects must decide how the search index is queried. While vector search is excellent for semantic similarity (for example, understanding that “How to log in” and “How can I sign on” mean the same thing), it often fails to distinguish highly specific keywords, numbers, or domain terms. If a user searches for “LaserPrinter TX 400”, standard semantic search might mistakenly retrieve documentation for the “LaserPrinter TX 440”.
To solve this, developers should implement Hybrid Search, which combines the strengths of vector search and keyword search into a single search call. By evaluating both semantic meaning and lexical exact matches, hybrid search ensures that the highest-ranked chunks are accurate for industry jargon, specific brand names, and SKUs.
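The effect of combining the two result lists can be sketched with reciprocal rank fusion (RRF), a common way to merge rankings. This is an assumption for illustration only: the article does not say which fusion or reranking method Data 360 uses, and the document IDs and rankings below are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    k=60 is the constant commonly used with RRF.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "LaserPrinter TX 400".
# Vector search ranks the near-identical TX 440 manual first...
vector_ranking = ["tx440-manual", "tx400-manual", "tx400-faq"]
# ...while keyword search favors exact matches on "TX 400".
keyword_ranking = ["tx400-manual", "tx400-faq", "tx440-manual"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)  # ['tx400-manual', 'tx440-manual', 'tx400-faq']
```

The TX 400 manual ranks near the top of both lists, so it overtakes the TX 440 manual, which only the vector search favored.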
However, this precision comes with an architectural trade-off. Because hybrid search processes queries across both vector and keyword indexes and then reranks the results, it consumes roughly twice as many Data 360 credits and increases run-time latency. Therefore, standard semantic search may be sufficient for general knowledge bases, but hybrid search is strongly recommended when specific terminology and exact product numbers are key to retrieval quality.