Uploading Documents
Upload, process, and search documents with AI-powered text extraction and semantic search
One Resource for Every Media Type
Documents share the unified /api/v2/files/* surface with images and videos — same upload endpoints, same listing endpoint, same deletion. The server detects the file type from content and routes the document through extraction, chunking, and embedding automatically.
Supported Formats
.pdf, .docx, .txt, and .md. Streaming uploads accept up to 100 MB per file. For anything larger, use the presigned upload flow under /api/v2/files/uploads.
Size gradient:

- Streaming upload endpoint: up to 100 MB per file.
- Presigned upload flow: up to 5 TB per file.
- Text extraction — the step that builds chunks and embeddings for semantic search and RAG — is capped at 500 MB per file.

Documents between 100 MB and 500 MB must use the presigned upload flow; they will be fully extracted. Documents over 500 MB upload and store fine but are marked text_extraction_status: "failed" — still listable and downloadable, but not text-indexed.
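As a sketch, the gradient can be expressed as a small decision helper. The function name and the decimal-unit assumption (10**6 bytes per MB) are illustrative, not part of the API:

```python
MB = 10**6   # decimal units assumed; the API may count differently
TB = 10**12

def upload_plan(size_bytes: int) -> tuple[str, bool]:
    """Return (upload_path, will_be_text_indexed) for a document size."""
    if size_bytes <= 100 * MB:
        return ("streaming", True)
    if size_bytes <= 500 * MB:
        return ("presigned", True)    # fully extracted
    if size_bytes <= 5 * TB:
        return ("presigned", False)   # stored, but extraction marked "failed"
    raise ValueError("exceeds the 5 TB presigned cap")
```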
Choosing an Upload Method
The Python SDK's client.files.upload() picks the right method automatically. If you're calling REST directly, use this table:
| Use case | Endpoint | Size cap | Roundtrips |
|---|---|---|---|
| One document | POST /files/upload | 100 MB | 1 |
| Multiple documents | POST /files/upload/batch | 100 MB / file | 1 |
| Large document (>100 MB) | POST /files/uploads → S3 → /complete | 5 TB | 3 |
Quick Start
Upload a Document
```python
async with Scopix(api_key="scopix_...") as client:
    result = await client.files.upload("report.pdf")
    print(f"File ID: {result.image_id}")
    print(f"Filename: {result.filename}")

    # Poll the unified processing status for extraction details
    status = await client.files.get_processing_status(result.image_id)
    print(f"Extraction: {status.text_extraction_status}")
    print(f"Pages: {status.page_count}")
    print(f"Chunks: {status.chunk_count}")
```

Upload Options (SDK)

```python
result = await client.files.upload(
    "report.pdf",
    folder_id=None,               # Optional folder UUID
    project_id=None,              # Optional project workspace
    storage_target="default",     # "default" or "custom" (BYOB)
    skip_duplicates=True,         # Return existing file_id on hash match
    content_category="document",  # Tailors AI processing
)
```

Batch Upload
What does "batch" mean here?
"Batch" means multiple files uploaded in one HTTP request — the endpoint groups them into a tracked upload session. It is not a job queue. All document processing (chunking, embedding, indexing) happens automatically in the background; you don't submit processing jobs separately.
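Because processing runs in the background, a caller typically polls until extraction settles. A minimal sketch — the injected `fetch_status` stands in for `lambda: client.files.get_processing_status(file_id)`, and the interval/timeout values are illustrative, not API-mandated:

```python
import asyncio

# Extraction states the docs describe as terminal
TERMINAL = {"completed", "failed"}

async def wait_for_extraction(fetch_status, interval=2.0, timeout=300.0):
    """Poll an awaitable status fetcher until text extraction settles."""
    elapsed = 0.0
    while True:
        status = await fetch_status()
        if status.text_extraction_status in TERMINAL:
            return status
        if elapsed >= timeout:
            raise TimeoutError("text extraction did not finish in time")
        await asyncio.sleep(interval)
        elapsed += interval
```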
Upload Multiple Documents
```python
# upload_batch sends one multipart/form-data request with all files.
# Returns BatchUploadResults — a list subclass of UploadResult, iterate directly.
results = await client.files.upload_batch([
    "report1.pdf",
    "report2.docx",
    "notes.txt",
])

print(f"Uploaded {len(results)} documents")
for r in results:
    print(f"  {r.filename}: {r.image_id}")

# Helper methods for batch inspection
if results.has_failures:
    for r in results.failed():
        print(f"FAIL {r.filename}: {r.description_error}")
print(results.summary())  # e.g. "3 succeeded"
```

Processing Status
Check Processing Status
```python
status = await client.files.get_processing_status("550e8400-...")

print(f"Extraction: {status.text_extraction_status}")  # pending | processing | completed | failed
print(f"Pages: {status.page_count}")
print(f"Chunks: {status.chunk_count}")
```

Per-Page Digitization
For PDF documents, the digitization pipeline returns per-page structural elements (headings, paragraphs, tables, key-value pairs) with normalized bounding boxes.
```python
# Lightweight status (no element data)
status = await client.files.get_digitization_status("550e8400-...")
print(status["status"])  # pending | processing | completed | failed

# Full digitization (all pages)
result = await client.files.get_digitization("550e8400-...")
for page in result["pages"]:
    print(f"Page {page['page_number']}: {page['element_count']} elements")

# Single page
page = await client.files.get_digitization_page("550e8400-...", page_number=2)
for el in page["elements"]:
    print(f"  {el['type']}: {el['content'][:60]}...")
```

Semantic Search
AI-Powered Search
Search uses semantic similarity — search by meaning, not just keywords. "damaged equipment" will find content about "broken machinery" even if those exact words aren't present.
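Semantic scoring of this kind typically compares embedding vectors with cosine similarity, which is what a similarity_threshold filters on. A toy illustration (the 3-d vectors below are made up; real embeddings come from the service's model and have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

damaged_equipment = [0.90, 0.10, 0.20]
broken_machinery  = [0.85, 0.15, 0.25]   # near-synonym lands nearby
quarterly_revenue = [0.10, 0.90, 0.10]   # unrelated topic lands far away
```

Near-synonyms score close to 1.0 and pass a 0.3 threshold; unrelated text falls below it.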
Search Documents
```python
results = await client.files.search(
    query="safety inspection requirements",
    limit=20,
    similarity_threshold=0.3,
)

for chunk in results.items:
    print(f"Document: {chunk.document_filename}")
    print(f"Score: {chunk.score:.2f}")
    print(f"Content: {chunk.content[:200]}...")
```

Search Specific Documents
```python
results = await client.files.search(
    query="compliance requirements",
    document_ids=["doc_abc123", "doc_def456"],
    limit=10,
)

for chunk in results.items:
    print(f"{chunk.document_filename}: {chunk.content[:100]}...")
```

Document Management
Documents are managed through the same unified files resource as images and videos.
List, Get, Download, Delete
```python
# List documents only — filter by media_type
files = await client.files.list(media_types=["document"], limit=20)
print(f"Total: {files.total_count}")
for f in files.items:
    print(f"  {f.filename} ({f.document_type})")

# Get document details
doc = await client.files.get("550e8400-...")
print(f"Filename: {doc.filename}, Pages: {doc.page_count}")

# Get extracted text
text_result = await client.files.get_text("550e8400-...")
print(f"Text length: {len(text_result['text'])} characters")

# Get chunks (for RAG debugging)
chunks_result = await client.files.get_chunks("550e8400-...")
print(f"Total chunks: {chunks_result['total_chunks']}")

# Get a temporary download URL for the original file
download_url = await client.files.download_url("550e8400-...")

# Delete document and all chunks
await client.files.delete("550e8400-...")
```

Documents in Chat (RAG)
Automatic Document Access
The chat system's document search agent has access to all your uploaded documents automatically. There is no need to explicitly attach documents — the AI searches relevant documents based on your query.
Chat with Documents
```python
async with Scopix(api_key="scopix_...") as client:
    async with client.chat_session() as session:
        response = await session.send(
            "What are the key safety requirements mentioned in the documents?"
        )
        print(response.content)

        response2 = await session.send("Which section covers equipment maintenance?")
        print(response2.content)
```

Deduplication
Documents are deduplicated by SHA-256 content hash — safe to retry failed uploads.
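Because dedupe keys on the content hash, a client can predict a hit locally before uploading. A sketch, assuming the server hashes the raw file bytes (the docs imply but do not state this):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of raw file bytes, matching the assumed dedupe key."""
    return hashlib.sha256(data).hexdigest()
```

Two byte-identical files produce the same digest, so the second upload with skip_duplicates=True should return the existing file_id.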
```python
r1 = await client.files.upload("report.pdf", skip_duplicates=True)
r2 = await client.files.upload("report.pdf", skip_duplicates=True)
# r2.deduplicated == True; r2.image_id references the same document as r1
# (populated on presigned/multipart completions; may be None on streaming uploads)
```

Limits & Quotas
- Streaming max file size: 100 MB per document (use presigned upload for larger)
- Batch size: 1–100 documents per batch (tier-dependent)
- Concurrent batches: 200 active batches per tenant
- Supported formats: PDF, DOCX, TXT, MD
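A client-side preflight check against these limits might look like the following. The helper name is hypothetical and the 100 MB cap assumes decimal units; it catches unsupported formats and oversized files before any bytes leave the machine:

```python
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".txt", ".md"}
STREAMING_CAP = 100 * 10**6  # bytes, decimal MB assumed

def validate_for_streaming(path: str) -> None:
    """Raise ValueError if the file cannot go through the streaming endpoint."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED:
        raise ValueError(f"unsupported format: {p.suffix or '(none)'}")
    if p.stat().st_size > STREAMING_CAP:
        raise ValueError("over 100 MB: use the presigned upload flow")
```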

