Ingestion
Ingestion is the process of pushing documentation into the knowledge base so it can be searched and retrieved. There are four methods: direct Markdown, URL fetch, OpenAPI spec, and Confluence bulk import. All methods are idempotent — re-ingesting unchanged content is a no-op.
How ingestion works
Every document goes through the same pipeline regardless of method:
- Content hash check — if the document exists and the content hash matches, the job is skipped
- Source upsert — the namespace record is created if it does not exist
- Document upsert — title, slug, metadata, and content hash are stored
- Chunk deletion — old chunks for this document are removed
- Hierarchical splitting — Markdown is split at heading boundaries into parent chunks, then parent chunks into child chunks
- Batch embedding — all child chunks are embedded via Gemini in one batch call
- Chunk upsert — chunks and embeddings are written to PostgreSQL with pgvector
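To make the splitting step concrete, here is a minimal TypeScript sketch of hierarchical splitting. It is an illustration only, not the actual implementation: the heading regex, lack of overlap, and the 800-character child size are assumptions.

```typescript
// Illustrative sketch of hierarchical splitting (not the product's splitter).
interface ChildChunk {
  parentHeading: string; // first line of the parent section this child came from
  text: string;
}

function splitHierarchically(markdown: string, childSize = 800): ChildChunk[] {
  // Parent chunks: split at Markdown heading boundaries.
  const parents = markdown.split(/\n(?=#{1,6}\s)/);
  const children: ChildChunk[] = [];
  for (const parent of parents) {
    const heading = parent.split("\n", 1)[0];
    // Child chunks: fixed-size slices of each parent; these are what get embedded in one batch.
    for (let i = 0; i < parent.length; i += childSize) {
      children.push({ parentHeading: heading, text: parent.slice(i, i + childSize) });
    }
  }
  return children;
}
```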
Methods
Markdown (direct)
The simplest method. POST raw Markdown with a title and namespace.
curl -X POST https://your-host/api/v1/ingest/md \
-H "Authorization: Bearer cape_..." \
-H "Content-Type: application/json" \
-d '{
"title": "Uploading assets",
"namespace": "user_docs",
"slug": "uploading-assets",
"content": "# Uploading assets\n\nTo upload..."
}'
Use slug to give the document a stable identifier for deduplication. If omitted, the title is used as the slug.
URL fetch
Pass a URL — the system fetches the page, strips HTML if needed, and ingests the result as Markdown.
curl -X POST https://your-host/api/v1/ingest/url \
-H "Authorization: Bearer cape_..." \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.cape.io/getting-started",
"namespace": "user_docs"
}'
Returns 422 if the URL cannot be fetched.
OpenAPI spec
POST an OpenAPI/Swagger spec as JSON or YAML. The system creates one document per API operation, keyed by operationId.
curl -X POST https://your-host/api/v1/ingest/openapi \
-H "Authorization: Bearer cape_..." \
-H "Content-Type: application/yaml" \
--data-binary @openapi.yaml
Operations without an operationId are skipped. Re-posting the same spec only re-embeds changed operations.
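If you want to catch skipped operations before posting, a small pre-flight check works. The snippet below is a hypothetical helper, not part of the product; it assumes a JSON spec file named openapi.json (a YAML spec would need to be parsed first) and only reads the standard paths and operationId fields.

```typescript
// Hypothetical pre-flight check: list operations that lack an operationId
// and would therefore be skipped by the ingester.
import { readFileSync } from "node:fs";

const spec = JSON.parse(readFileSync("openapi.json", "utf8"));
const methods = ["get", "post", "put", "patch", "delete", "head", "options"];

for (const [path, item] of Object.entries<Record<string, any>>(spec.paths ?? {})) {
  for (const method of methods) {
    const op = item[method];
    if (op && !op.operationId) {
      console.warn(`missing operationId: ${method.toUpperCase()} ${path}`);
    }
  }
}
```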
Confluence space
Bulk-import an entire Confluence space. This runs asynchronously — the endpoint returns a job ID immediately.
curl -X POST https://your-host/api/v1/ingest/confluence \
-H "Authorization: Bearer cape_..." \
-H "Content-Type: application/json" \
-d '{ "spaceKey": "ENG", "namespace": "confluence" }'
Required environment variables:
CONFLUENCE_BASE_URL=https://yourorg.atlassian.net
CONFLUENCE_EMAIL=ci@yourorg.com
CONFLUENCE_API_TOKEN=...
Monitor the job via GET /api/v1/ingest/jobs.
CI/CD integration
The typical pattern is to trigger ingestion from a deployment pipeline after a docs build:
# GitHub Actions example
- name: Ingest documentation
  run: |
    shopt -s globstar   # make docs/**/*.md match nested directories
    for file in docs/**/*.md; do
      curl -s -X POST "$CAPE_DOCS_URL/api/v1/ingest/md" \
        -H "Authorization: Bearer $CAPE_API_KEY" \
        -H "Content-Type: application/json" \
        -d "$(jq -n \
          --arg title "$(head -1 "$file" | sed 's/^# //')" \
          --arg content "$(cat "$file")" \
          --arg namespace "user_docs" \
          --arg slug "$(basename "$file" .md)" \
          '{title:$title,content:$content,namespace:$namespace,slug:$slug}')"
    done
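If you prefer a Node step over the shell loop, the same pattern can be sketched in TypeScript (run with tsx or ts-node). This is a sketch rather than a supported script: it assumes CAPE_DOCS_URL and CAPE_API_KEY are exported and mirrors the payload above.

```typescript
// Sketch of the same CI ingestion loop in TypeScript.
import { readFileSync, readdirSync } from "node:fs";
import { basename, join } from "node:path";

const base = process.env.CAPE_DOCS_URL!;
const key = process.env.CAPE_API_KEY!;

const files = readdirSync("docs", { recursive: true })
  .map(String)
  .filter((f) => f.endsWith(".md"));

for (const rel of files) {
  const content = readFileSync(join("docs", rel), "utf8");
  const title = content.split("\n", 1)[0].replace(/^#\s*/, ""); // first heading as title
  const res = await fetch(`${base}/api/v1/ingest/md`, {
    method: "POST",
    headers: { Authorization: `Bearer ${key}`, "Content-Type": "application/json" },
    body: JSON.stringify({ title, content, namespace: "user_docs", slug: basename(rel, ".md") }),
  });
  if (!res.ok) console.error(`${rel}: HTTP ${res.status}`);
}
```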
Or use the bundled script for local ingestion:
npx tsx scripts/ingest-folder.ts ./docs --limit=50
The script auto-detects namespace: files named technical.md go to tech_docs; everything else goes to user_docs.
Bulk folder script
npx tsx scripts/ingest-folder.ts <directory> [options]
| Option | Description |
|---|---|
| --dry-run | Parse and log without writing to the database |
| --limit=N | Stop after N files |
Progress output: + means ingested, . means skipped (unchanged).
Monitoring ingestion jobs
Long-running jobs (Confluence imports) are tracked in the database. View them in the Ingestion Jobs section of the admin panel or via the API:
curl https://your-host/api/v1/ingest/jobs \
-H "Authorization: Bearer cape_..."
Job statuses:
| Status | Meaning |
|---|---|
| pending | Queued, not yet started |
| running | Actively fetching and embedding |
| done | Completed successfully |
| failed | Error occurred; check the error field |
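In CI, you can poll the jobs endpoint until the import settles. The sketch below assumes the endpoint returns a JSON array of job objects with id, status, and error fields; adjust it to the actual response shape.

```typescript
// Poll ingestion jobs until none are pending or running.
// Assumes a JSON array of { id, status, error } objects; field names may differ.
const base = process.env.CAPE_DOCS_URL!;
const key = process.env.CAPE_API_KEY!;

async function waitForIngestJobs(pollMs = 5000): Promise<void> {
  for (;;) {
    const res = await fetch(`${base}/api/v1/ingest/jobs`, {
      headers: { Authorization: `Bearer ${key}` },
    });
    const jobs: Array<{ id: string; status: string; error?: string }> = await res.json();
    if (!jobs.some((j) => j.status === "pending" || j.status === "running")) {
      for (const j of jobs) {
        if (j.status === "failed") console.error(`job ${j.id} failed: ${j.error}`);
      }
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}

await waitForIngestJobs();
```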