Documentation

Uploading Documents

Upload, process, and search documents with AI-powered text extraction and semantic search

One Resource for Every Media Type

Documents share the unified /api/v2/files/* surface with images and videos — same upload endpoints, same listing endpoint, same deletion. The server detects the file type from content and routes the document through extraction, chunking, and embedding automatically.

Supported Formats

.pdf, .docx, .txt, and .md. Streaming uploads accept up to 100 MB per file. For anything larger, use the presigned upload flow under /api/v2/files/uploads.

Size gradient: the streaming upload endpoint accepts up to 100 MB per file, and the presigned flow accepts up to 5 TB. Text extraction — the step that builds chunks and embeddings for semantic search and RAG — is capped at 500 MB per file. Documents between 100 MB and 500 MB must use the presigned upload flow; they will be fully extracted. Documents over 500 MB upload and store fine but are marked text_extraction_status: "failed" — still listable and downloadable, just not text-indexed.
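The tiers above can be sketched as a small decision helper. The thresholds come from this page; the function itself is illustrative, not part of the SDK.

```python
# Size tiers from the docs; this helper is a sketch, not SDK code.
STREAMING_MAX = 100 * 1024**2   # 100 MB streaming upload cap
EXTRACTION_MAX = 500 * 1024**2  # 500 MB text-extraction cap
PRESIGNED_MAX = 5 * 1024**4     # 5 TB presigned upload cap

def upload_plan(size_bytes: int) -> tuple[str, bool]:
    """Return (upload method, whether the file will be text-indexed)."""
    if size_bytes > PRESIGNED_MAX:
        raise ValueError("exceeds the 5 TB presigned upload cap")
    method = "streaming" if size_bytes <= STREAMING_MAX else "presigned"
    indexed = size_bytes <= EXTRACTION_MAX
    return method, indexed
```

For example, a 200 MB PDF maps to the presigned flow but is still fully extracted; a 600 MB PDF is stored without text indexing.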

Choosing an Upload Method

The Python SDK's client.files.upload() picks the right method automatically. If you're calling REST directly, use this table:

| Use case | Endpoint | Size cap | Roundtrips |
| --- | --- | --- | --- |
| One document | POST /files/upload | 100 MB | 1 |
| Multiple documents | POST /files/upload/batch | 100 MB / file | 1 |
| Large document (>100 MB) | POST /files/uploads → S3 → /complete | 5 TB | 3 |
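The three roundtrips of the presigned flow can be sketched as follows. The endpoint paths are from the table above, but the request and response field names (`upload_id`, `upload_url`, `file_id`) are assumptions for illustration, not the documented schema; `post` and `put` stand in for your HTTP client.

```python
# Hypothetical sketch of the 3-roundtrip presigned upload flow.
# Field names are assumptions; consult the API reference for the real schema.
def presigned_upload(post, put, file_bytes: bytes, filename: str) -> str:
    """Run the presigned flow; `post`/`put` are HTTP callables."""
    # Roundtrip 1: initiate the upload session
    session = post("/api/v2/files/uploads",
                   {"filename": filename, "size": len(file_bytes)})
    # Roundtrip 2: PUT the bytes directly to the returned storage URL
    put(session["upload_url"], file_bytes)
    # Roundtrip 3: tell the server the object is in place
    done = post(f"/api/v2/files/uploads/{session['upload_id']}/complete", {})
    return done["file_id"]
```

In practice the Python SDK's `client.files.upload()` performs these steps for you when the file is over 100 MB.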

Quick Start

Upload a Document

python
async with Scopix(api_key="scopix_...") as client:
    result = await client.files.upload("report.pdf")
    print(f"File ID: {result.image_id}")
    print(f"Filename: {result.filename}")
    # Poll the unified processing status for extraction details
    status = await client.files.get_processing_status(result.image_id)
    print(f"Extraction: {status.text_extraction_status}")
    print(f"Pages: {status.page_count}")
    print(f"Chunks: {status.chunk_count}")

Upload Options (SDK)

python
result = await client.files.upload(
    "report.pdf",
    folder_id=None,               # Optional folder UUID
    project_id=None,              # Optional project workspace
    storage_target="default",     # "default" or "custom" (BYOB)
    skip_duplicates=True,         # Return existing file_id on hash match
    content_category="document",  # Tailors AI processing
)

Batch Upload

What does "batch" mean here?

"Batch" means multiple files uploaded in one HTTP request — the endpoint groups them into a tracked upload session. It is not a job queue. All document processing (chunking, embedding, indexing) happens automatically in the background; you don't submit processing jobs separately.

Upload Multiple Documents

python
# upload_batch sends one multipart/form-data request with all files.
# Returns BatchUploadResults — a list subclass holding UploadResult items;
# iterate it directly.
results = await client.files.upload_batch([
    "report1.pdf",
    "report2.docx",
    "notes.txt",
])
print(f"Uploaded {len(results)} documents")
for r in results:
    print(f"  {r.filename}: {r.image_id}")
# Helper methods for batch inspection
if results.has_failures:
    for r in results.failed():
        print(f"FAIL {r.filename}: {r.description_error}")
print(results.summary())  # e.g. "3 succeeded"

Processing Status

Check Processing Status

python
status = await client.files.get_processing_status("550e8400-...")
print(f"Extraction: {status.text_extraction_status}") # pending | processing | completed | failed
print(f"Pages: {status.page_count}")
print(f"Chunks: {status.chunk_count}")
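Extraction runs in the background, so a single status check may still show pending or processing. A polling loop like the sketch below waits for a terminal state; `get_status` stands in for `client.files.get_processing_status` and the dict access is an assumption for illustration.

```python
import asyncio

# Illustrative polling helper; `get_status` is a stand-in for the SDK's
# get_processing_status, injected so the helper stays self-contained.
async def wait_for_extraction(get_status, file_id, interval=2.0, timeout=300.0):
    """Poll until text_extraction_status reaches completed or failed."""
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        status = await get_status(file_id)
        if status["text_extraction_status"] in ("completed", "failed"):
            return status
        if asyncio.get_running_loop().time() > deadline:
            raise TimeoutError(f"extraction still running for {file_id}")
        await asyncio.sleep(interval)
```

A couple of seconds between polls is usually enough; extraction time scales with page count.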

Per-Page Digitization

For PDF documents, the digitization pipeline returns per-page structural elements (headings, paragraphs, tables, key-value pairs) with normalized bounding boxes.

python
# Lightweight status (no element data)
status = await client.files.get_digitization_status("550e8400-...")
print(status["status"])  # pending | processing | completed | failed
# Full digitization (all pages)
result = await client.files.get_digitization("550e8400-...")
for page in result["pages"]:
    print(f"Page {page['page_number']}: {page['element_count']} elements")
# Single page
page = await client.files.get_digitization_page("550e8400-...", page_number=2)
for el in page["elements"]:
    print(f"  {el['type']}: {el['content'][:60]}...")

Semantic Search

AI-Powered Search

Search uses semantic similarity — search by meaning, not just keywords. "damaged equipment" will find content about "broken machinery" even if those exact words aren't present.

Search Documents

python
results = await client.files.search(
    query="safety inspection requirements",
    limit=20,
    similarity_threshold=0.3,
)
for chunk in results.items:
    print(f"Document: {chunk.document_filename}")
    print(f"Score: {chunk.score:.2f}")
    print(f"Content: {chunk.content[:200]}...")

Search Specific Documents

python
results = await client.files.search(
    query="compliance requirements",
    document_ids=["doc_abc123", "doc_def456"],
    limit=10,
)
for chunk in results.items:
    print(f"{chunk.document_filename}: {chunk.content[:100]}...")

Document Management

Documents are managed through the same unified files resource as images and videos.

List, Get, Download, Delete

python
# List documents only — filter by media_type
files = await client.files.list(media_types=["document"], limit=20)
print(f"Total: {files.total_count}")
for f in files.items:
print(f" {f.filename} ({f.document_type})")
# Get document details
doc = await client.files.get("550e8400-...")
print(f"Filename: {doc.filename}, Pages: {doc.page_count}")
# Get extracted text
text_result = await client.files.get_text("550e8400-...")
print(f"Text length: {len(text_result['text'])} characters")
# Get chunks (for RAG debugging)
chunks_result = await client.files.get_chunks("550e8400-...")
print(f"Total chunks: {chunks_result['total_chunks']}")
# Get a temporary download URL for the original file
download_url = await client.files.download_url("550e8400-...")
# Delete document and all chunks
await client.files.delete("550e8400-...")

Documents in Chat (RAG)

Automatic Document Access

The chat system's document search agent has access to all your uploaded documents automatically. There is no need to explicitly attach documents — the AI searches relevant documents based on your query.

Chat with Documents

python
async with Scopix(api_key="scopix_...") as client:
    async with client.chat_session() as session:
        response = await session.send(
            "What are the key safety requirements mentioned in the documents?"
        )
        print(response.content)
        response2 = await session.send("Which section covers equipment maintenance?")
        print(response2.content)

Deduplication

Documents are deduplicated by SHA-256 content hash — safe to retry failed uploads.

python
r1 = await client.files.upload("report.pdf", skip_duplicates=True)
r2 = await client.files.upload("report.pdf", skip_duplicates=True)
# r2.image_id references the same document as r1.
# r2.deduplicated flags the hash match; note it is populated on
# presigned/multipart completions and may be None on streaming uploads.

Limits & Quotas

  • Streaming max file size: 100 MB per document (use presigned upload for larger)
  • Batch size: 1–100 documents per batch (tier-dependent)
  • Concurrent batches: 200 active batches per tenant
  • Supported formats: PDF, DOCX, TXT, MD
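A client-side pre-check against these limits can fail fast before any bytes are sent. This sketch encodes the format and streaming-size limits from the list above; the function itself is illustrative, not part of the SDK.

```python
from pathlib import Path

# Client-side pre-check of the documented limits; illustrative, not SDK code.
SUPPORTED_EXTS = {".pdf", ".docx", ".txt", ".md"}
STREAMING_MAX = 100 * 1024**2  # 100 MB per document for streaming uploads

def validate_for_streaming(path: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means OK to stream-upload."""
    problems = []
    if Path(path).suffix.lower() not in SUPPORTED_EXTS:
        problems.append(f"unsupported format: {Path(path).suffix}")
    if size_bytes > STREAMING_MAX:
        problems.append("over 100 MB: use the presigned upload flow")
    return problems
```

The server remains the source of truth for limits (batch caps are tier-dependent); a check like this just avoids a doomed roundtrip.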