The Repository Reads Itself: Rethinking Content Repositories for the Agent Era
Traditional content repositories could store, version, route, and audit content — but never read it. Agents finally can. That changes what a repository has to be.
TL;DR. Content repositories were built to make content usable by systems, but the operating model was broken: humans had to create and maintain the metadata, structure, taxonomies, and classifications that made the repository useful. Agents change that. They can read content, help generate the data layer from it, and use repositories at machine speed. That changes the repository itself: metadata becomes system-maintained, search becomes iterative, retrieval becomes bursty, elasticity becomes mandatory, and configuration has to be AI-authorable. Vertesia Content Repository is built for that shift: a repository that reads its own content and gives agents and processes one governed place to read, act, transform, and preserve lineage.
The forever issue with content repositories
Content repositories were never really about storing files. They were about making content organized and accessible in a system of record, used by humans through software.
That was the point of metadata, taxonomies, schemas, facets, lifecycle states, permissions, search indexes. All of it was an attempt to turn documents into something software could find, route, govern, reuse, and compute against.
The idea was right.
The operating model was wrong.
Someone had to create the metadata. Someone had to maintain it. Someone had to keep it aligned with the content as the business changed. In practice, that meant sparse fields, stale taxonomies, inconsistent classification, brittle search, and a repository that stored the content but still depended on humans to understand it.
That is the part agents change. Not because metadata suddenly matters — it always did. Because software can finally read the content well enough to help create and maintain the data layer from the content itself.
So the question I wanted to revisit wasn’t “how do we add an AI search box to a content system?” or “how do we wrap a chat assistant around a file store?”
The more basic question, from first principles: if software can finally read content, what does a content repository need to be?
What LLMs unlocked
GPT-3 didn’t make the answer obvious — it made the question impossible to ignore.
I had spent twenty years building content systems, and I already knew the file was not the interesting part. The interesting part was the data layer around it: metadata, extracted fields, relationships, lifecycle, permissions, provenance. That was what made content usable by software. The problem was that nobody had enough of it, and nobody could keep it clean.
When I saw LLMs, I did not think “great, now we can chat with documents.” I thought: we can finally generate the data layer from the content itself.
And once you can do that, the content can be what it always was: the source of truth. Contracts, policies, claims, statements, and case files carry the legal authority, but the business has rarely run on them directly.
Extracting structured data from content was hard enough that enterprises built a parallel world: pull out whatever was tractable, copy it into a CRM, an ERP, a case-management system, run the business on the copy, and treat the content as the static record behind the operational data.
LLMs collapse that gap. The data layer can be generated from the content, kept aligned with the content, and the content stops being a downstream artifact of a parallel system.
That pushed me to experiment. Those experiments became Vertesia.
Later, as we built agents on top of the platform, the need became obvious in practice: content was the input to almost everything they did.
Which makes sense. Content is the primary material of human knowledge work in the enterprise — what people read, judge, draft, decide, summarize, route. Agents doing real enterprise work inherit that. Content isn’t an input some agents need; it’s the natural input for most enterprise agents. That dependency points straight at the repository.
I did not set out to build another content repository. We had to build one because the agents needed a repository they could actually work with.
What models changed
The repository scaled in storage. The understanding layer didn’t scale at all. Not because the people doing the reading were slow — they did exactly what they were supposed to do — but because human attention was the limiting factor on what the repository could be used for. There was no way to read every contract every quarter looking for a hidden clause pattern. No way to compare every claim file against every prior settlement. No way to audit every policy renewal against every applicable regulation. Most of the questions you’d actually want to ask of an enterprise content corpus were never economically viable for humans to answer at all.
So those questions didn’t get asked.
Now agents read — and see. Visual processing matters as much as text extraction. For slides, image-heavy PDFs, charts, diagrams, scanned forms, the agent needs to look at the page, not just parse the extracted text.
And they can author. Drafts, slide decks, spreadsheet models, structured tables, diagrams, code artifacts — content tools that produce records the next agent can read. Reading was the unlock; authoring is what makes agents participants in the work, not just observers of it.
But that requires preparation ECM never built. Old content prep was raw text extraction for full-text search: flatten the bytes into a searchable string, hand it to Solr or Elasticsearch, query with keywords. That works for human users browsing. It does not work for agents that need to ask for “the indemnification clause” or “table 3” without being handed the whole document, and it works even less well for visual artifacts that have little extractable text. The shape of the prep had to change. We had to build it.
For a large class of enterprise content, the wall finally moves.
In many ways, models finally let us deliver on ECM’s original promise. The category sold single source of truth, automatic classification, semantic findability, lifecycle governance, content reuse, and compliance for thirty years — and quietly punted the actual understanding work to humans, because nothing else could do it. Now software can. The promises can finally be kept.
I’ve spent the last few years at Vertesia building agents in production against real enterprise content: contracts, invoices, claims, case files, policies, the same list I’ve been watching the category chase for twenty years. Those agents could read documents, classify them into types we never explicitly modeled, extract structure from messy PDFs, summarize complex submissions, and judge edge cases. The judgment work the old repository pushed back to humans — agents could do it at machine speed, across the scale the repository already operates at.
That was the design tension.
Traditional repositories preserve content but cannot understand it.
Agents can understand content, but do not, by themselves, preserve the lifecycle, lineage, governance, and audit a content system needs.
The best version is not a repository with an AI feature bolted on, and not an agent with a folder of files. It is a repository that is readable, typed, governed, and addressable by agents, tools, and processes.
The question became: what does a content repository look like when agents are inside it, not bolted on top?
A content repository should now read its own content. Everything else follows from that.
A new content repository, built from first principles
The repository is no longer passive storage. It becomes active infrastructure — something agents and processes operate on directly, not merely a place they read from and write to.
That is the best-of-both-worlds version: agents bring the understanding the repository never had; the repository keeps the durability, lineage, permissions, and audit trail agents on their own can’t preserve. Schemas that evolve with the data, intake that reads on arrival, content tools that act on the records, and one repository that knows what it holds.
This was not a new problem for me. In 2015, I wrote about part of it as Deep Content: richer, more structured, more software-addressable content.
The direction was right.
The missing piece was the operating model.
We still depended on humans to create and maintain the structure. Agents change that.
This essay is what came out of taking that question seriously. The result is Vertesia Content Repository: a governed content layer designed from first principles for the moment software can read content — with agents and content tools inside, addressable from processes, governed under the same contract as the rest of the platform.
Three pillars fell out of that design: intake that reads, schemas that evolve with the data, and content tools that operate inside the repository instead of sitting beside it.
The repository is not an agent. It is the durable system of record that lets agents read, act, and derive without losing lineage, permissions, and audit.
A content repository is not just a file store with metadata. It is the operating layer for work that depends on knowing what content is, what it means, where it came from, what’s allowed to happen to it, and what gets derived from it. Once agents become part of that work, the repository has to change. It has to know what shape the content has, what it’s about, who can see it, what can be derived from it, and how it composes with everything else. This is not an AI feature bolted onto a content store. The primitives changed because agents are load-bearing actors in the content layer.
The Workload Changes
Agents do not just change what the repository can know.
They change how the repository is used.
Traditional systems were designed around human behavior: one query, a few results, sequential reading, occasional refinement. The workload was bounded by human speed.
Agents behave differently.
They query in parallel. They retrieve in batches. They scan quickly. They refine. They repeat.
Metadata becomes maintainable
Metadata was always a machine primitive — Dublin Core, RDF, taxonomies, facets, structured schemas. The problem was that humans had to create and maintain it. So it became incomplete, stale, inconsistent, or ignored.
Agents change that. They generate, validate, and use metadata continuously as part of the work. Metadata becomes a compressed representation of the content that the system maintains, not something users reluctantly fill out.
The unit of human work becomes review, not data entry. The metadata layer the category designed thirty years ago finally has an operating model that works.
Search becomes iterative
Traditional search is single-shot: type → read → refine.
Agents probe and refine:
issue multiple queries in parallel
inspect results
refine the query
repeat
The question is no longer “how fast is one query?”
It is: how many queries can the system sustain, in parallel, under load?
A human might search “unpaid invoices South America.” An agent asks, in parallel: unpaid invoices for reseller accounts in South America, invoices above a risk threshold, totals grouped by reseller and country, invoices matching prior disputed-payment patterns — then refines, aggregates, and continues.
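The probe-and-refine loop above can be sketched in a few lines. This is a minimal illustration, not Vertesia's query API: the in-memory INVOICES list, the search() stand-in, and the probe() function are all hypothetical names invented for the example, and the simulated latency replaces real repository I/O.

```python
import asyncio

# Hypothetical in-memory data standing in for a repository index.
INVOICES = [
    {"id": "inv-1", "region": "South America", "reseller": "Acme", "amount": 12000, "paid": False},
    {"id": "inv-2", "region": "South America", "reseller": "Acme", "amount": 800, "paid": False},
    {"id": "inv-3", "region": "South America", "reseller": "Sol", "amount": 45000, "paid": True},
    {"id": "inv-4", "region": "Europe", "reseller": "Nord", "amount": 30000, "paid": False},
]


def unpaid_in_south_america(invoice):
    return invoice["region"] == "South America" and not invoice["paid"]


async def search(predicate):
    """Stand-in for one repository query; the sleep simulates I/O latency."""
    await asyncio.sleep(0.01)
    return [inv for inv in INVOICES if predicate(inv)]


async def probe(risk_threshold=10_000):
    # Round 1: issue independent queries concurrently instead of one at a time.
    unpaid, risky = await asyncio.gather(
        search(unpaid_in_south_america),
        search(lambda inv: inv["amount"] >= risk_threshold),
    )
    # Inspect: keep resellers with an unpaid invoice that crosses the threshold.
    risky_ids = {inv["id"] for inv in risky}
    resellers = sorted({inv["reseller"] for inv in unpaid if inv["id"] in risky_ids})
    # Round 2 (refine): fan out one follow-up query per flagged reseller.
    follow_ups = await asyncio.gather(
        *(
            search(lambda inv, r=r: unpaid_in_south_america(inv) and inv["reseller"] == r)
            for r in resellers
        )
    )
    # Aggregate: total unpaid exposure per flagged reseller.
    return {r: sum(inv["amount"] for inv in hits) for r, hits in zip(resellers, follow_ups)}
```

With the sample data, asyncio.run(probe()) returns {"Acme": 12800}. The point of the sketch is the shape of the load: each reasoning step turns into a batch of concurrent queries rather than a single request.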
Retrieval becomes machine-scale
Humans read sequentially. Agents retrieve and scan dozens of documents in parallel, extract what matters, and iterate again.
This creates a different load profile:
bursty spikes of activity
high parallelism
repeated refinement loops
structured extraction and aggregation
The repository is no longer serving browsing. It is serving continuous machine retrieval.
Elasticity becomes mandatory
I’ve seen parts of this before. At Nuxeo, we pushed the repository toward flexibility and performance. It still wasn’t enough — because the workload itself changed.
Human-driven repository usage is naturally smoothed by human speed. Even when many users are active, the system is serving human-paced interaction.
Agents do not behave that way.
An agent issues dozens of queries, retrieves hundreds of records, scans documents in parallel, runs extraction, aggregates results, refines its strategy, and then goes quiet.
The pattern is burst, reason, burst, reason, burst — then silence.
If the repository is built for steady human browsing, it either falls over under agent load or gets overprovisioned for peaks that only appear during agent work. Neither is good enough.
An agent-era repository has to be elastic by design:
scale retrieval under bursty parallel access
serve structured and semantic queries together
handle spikes of extraction and embedding
support high-concurrency document reads
absorb agent-driven aggregation workloads
scale back down when the reasoning step goes quiet
This is not “better search.” It is a different class of workload.
“Sub-second search” is a human UX metric. Agents turn retrieval into an elasticity problem.
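The cost argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the per-worker capacity, the worker limits, and the load trace are invented numbers, and desired_workers() is a hypothetical sizing rule rather than how any particular platform scales.

```python
import math


def desired_workers(in_flight, per_worker=8, min_workers=1, max_workers=64):
    """Size the retrieval tier from the number of queries currently in flight."""
    needed = math.ceil(in_flight / per_worker)
    return max(min_workers, min(max_workers, needed))


# Assumed trace of in-flight queries per interval: burst, reason, burst, silence.
trace = [2, 150, 3, 400, 0, 0]

elastic_plan = [desired_workers(q) for q in trace]

# Worker-intervals consumed when scaling with the load...
elastic_cost = sum(elastic_plan)
# ...versus holding peak capacity for the whole period.
static_cost = max(elastic_plan) * len(trace)
```

For this trace the elastic plan is [1, 19, 1, 50, 1, 1], which costs 73 worker-intervals against 300 for static peak provisioning. A tier sized for the quiet intervals would instead be far short of the 50 workers the second burst needs.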
The System Has to Be Configurable by AI
There is another shift that is easy to miss.
A lot of enterprise software was designed around low-code configuration. That made sense. Admins could configure types, fields, rules, workflows, screens, retention policies, connectors, automations. The product became configurable without every change becoming a code project.
But low-code for humans is not the same thing as configuration for AI.
Most of those systems were built around admin UIs. Configuration lives in forms, files, database rows, hidden validators, proprietary conventions, and product-specific state. A human can click through that. An assistant cannot safely change it unless the system was designed for that from the beginning.
That matters now.
A human admin might configure a content type by clicking through fields, validation rules, lifecycle states, extraction settings, and screen layouts. An assistant cannot safely work through that unless the underlying configuration is typed, inspectable, diffable, and validated. Otherwise the assistant is just writing into a black box.
If an assistant is going to help configure a repository, it needs definitions it can inspect, reason about, modify, validate, diff, and explain. Content types, schemas, intake rules, extraction policies, lifecycle states, tool bindings, retrieval profiles — all of it has to be represented as something the assistant can work with safely.
That is not “AI in the admin UI.” It is a different design constraint.
The configuration layer has to be: typed, explicit, validated, versioned, diffable, inspectable, and safe for an assistant to propose changes to.
A concrete picture. The intake assistant sees a stream of recent contracts and notices that several include a data residency clause that doesn’t fit any existing field. It proposes a schema change to the Master Agreement type: add data_residency as a list of identified regions. The proposal lands as a typed diff with attribution: which documents triggered it, which existing records would be re-classified against the new field, and what, if anything, breaks in current queries. A human reviewer accepts, modifies, or rejects. The repository versions the schema change with provenance: who proposed it, when, why, and which records prompted it.
That is what AI-authorability looks like in practice. The assistant proposes, the engine validates, the human reviews, the repository tracks every change. None of that works if the configuration lives in admin-UI forms and proprietary state. It only works if the definition is something the assistant can actually inspect and modify safely.
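One way to picture "the assistant proposes, the engine validates, the human reviews, the repository tracks" is to treat a schema change as plain data. The following is a sketch under stated assumptions: AddField, SchemaRegistry, the field-type vocabulary, and the document ids are hypothetical and do not describe Vertesia's actual configuration format.

```python
from dataclasses import dataclass, field

# Assumed field-type vocabulary for this sketch only.
ALLOWED_FIELD_TYPES = {"string", "number", "date", "list[string]"}


@dataclass(frozen=True)
class AddField:
    """A proposed schema change, expressed as data rather than UI clicks."""
    type_name: str
    field_name: str
    field_type: str
    proposed_by: str
    reason: str
    triggering_docs: tuple


@dataclass
class SchemaRegistry:
    types: dict  # type name -> {field name: field type}
    history: list = field(default_factory=list)

    def validate(self, change):
        """Return a list of problems; an empty list means the change is valid."""
        errors = []
        if change.type_name not in self.types:
            errors.append(f"unknown type {change.type_name!r}")
        elif change.field_name in self.types[change.type_name]:
            errors.append(f"field {change.field_name!r} already exists")
        if change.field_type not in ALLOWED_FIELD_TYPES:
            errors.append(f"unsupported field type {change.field_type!r}")
        if not change.triggering_docs:
            errors.append("a proposal must cite the documents that prompted it")
        return errors

    def apply(self, change, approved_by):
        """Apply an approved change and record provenance; returns the new version."""
        errors = self.validate(change)
        if errors:
            raise ValueError("; ".join(errors))
        self.types[change.type_name][change.field_name] = change.field_type
        version = len(self.history) + 1
        self.history.append({"version": version, "change": change, "approved_by": approved_by})
        return version


registry = SchemaRegistry(types={"master_agreement": {"parties": "list[string]"}})
proposal = AddField(
    type_name="master_agreement",
    field_name="data_residency",
    field_type="list[string]",
    proposed_by="intake-assistant",
    reason="several recent contracts include a data residency clause",
    triggering_docs=("doc-101", "doc-117"),
)
```

Because the proposal is an ordinary value, it can be validated before anyone sees it, shown to a reviewer as a diff, and stored in the history with its attribution intact. Calling registry.apply(proposal, approved_by=...) a second time raises ValueError, since the field already exists.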
This is one of the reasons we designed the new repository and process engine together. The assistant is not just a chat layer over the product. It is a configuration actor — proposing schema changes, wiring tools, adjusting intake behavior, generating process definitions, explaining the diff before a human approves it.
Low-code for humans is not the same as configuration for AI.
AI-authorability is not a UX feature. It is an architecture property.
The Two Failure Modes
Before the design, consider the shape of the two current wrong answers: AI bolted onto an old data model, and content treated as disposable input.
Much of the ECM market is bolting AI onto a thirty-year-old data model. They add a chat panel that searches the repository, an OCR upgrade, an “AI summary” button, a vector index. The seam shows the moment you ask the repository to know something it wasn’t designed to know. You can drop a contract into the chat panel and get a useful summary. You cannot ask the repository for every contract that auto-renews in Q4 with a supplier whose claim history shows three settlement events — because the data model never had auto_renew, renew_date or supplier_claim_history as fields the assistant could populate, and the assistant has no way to evolve the schema to add them. The AI is a search/summary surface; the data layer is unchanged. The repository still can’t reason about what it stores.
Modern AI tools treat content as disposable input. Drop a PDF into a chat window, get a summary. Useful for a one-off question; useless as the foundation for an enterprise. There is no lifecycle. No lineage. No permissions. No version history. No audit. No way for a process to address the content next month and trust what comes back. You get a search/chat surface, not a repository.
Real enterprise content work needs both planes at once. The durability and governance of an ECM. The understanding of an agent. In one place. Without the seams showing.
The market is already copying the vocabulary. Add extraction. Add embeddings. Add document chat. Add an assistant over content. Keep the same repository model underneath.
That is not the shift.
Document chat is not a repository. AI extraction is not a content model. Semantic search is not an agent substrate.
The change is deeper. The repository itself becomes the platform agents and processes work against.
The Thesis
The design sits on a single sentence:
Content endures. Understanding is generated. The repository must hold both — as one system.
A concrete picture. A contract lands in the repository. The intake assistant reads it, proposes the type (master_services_agreement), extracts fields like parties, effective_date, and total_value, generates a summary, detects PII, links related records, and routes it for review. The repository stores all of that — bytes, structured fields, embeddings, summary, lineage, who approved what — under one record. Six months later, a process needs every contract that auto-renews in Q4 from a specific supplier. The repository answers from typed fields and semantic similarity at once. None of that judgment came from a human writing rules. None of it was ad-hoc, either — the repository owns the schema, the lineage, the permissions, and the audit. That distinction is the system.
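Answering "from typed fields and semantic similarity at once" can be illustrated with a toy hybrid query. This sketch assumes a hypothetical ContractRecord shape with two-dimensional embeddings and a supplier field; real embeddings have hundreds of dimensions and a real repository would use an index rather than a linear scan.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContractRecord:
    id: str
    supplier: str
    auto_renew: bool
    renew_date: date
    embedding: tuple  # toy 2-d vector standing in for a real embedding


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def q4_auto_renewals(records, supplier, year, query_vec):
    """Filter on typed fields first, then rank the survivors by similarity."""
    hits = [
        r for r in records
        if r.supplier == supplier
        and r.auto_renew
        and r.renew_date.year == year
        and r.renew_date.month >= 10  # Q4 is October through December
    ]
    return sorted(hits, key=lambda r: cosine(r.embedding, query_vec), reverse=True)


RECORDS = [
    ContractRecord("c1", "Acme", True, date(2025, 11, 1), (1.0, 0.0)),
    ContractRecord("c2", "Acme", True, date(2025, 12, 15), (0.0, 1.0)),
    ContractRecord("c3", "Acme", False, date(2025, 10, 10), (1.0, 0.0)),
    ContractRecord("c4", "Beta", True, date(2025, 10, 5), (1.0, 0.0)),
    ContractRecord("c5", "Acme", True, date(2025, 6, 1), (1.0, 0.0)),
]

ranked = [r.id for r in q4_auto_renewals(RECORDS, "Acme", 2025, (0.1, 0.9))]
```

Here ranked is ["c2", "c1"]: the typed filter removes the non-renewing, wrong-supplier, and out-of-quarter records exactly, and similarity only orders what is left. That ordering of concerns is the design choice worth noticing, since precision comes from the fields and recall from the vectors.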
Every design decision follows. The repository owns schema, types, lineage, permissions, versioning, lifecycle. Agents operate on content under typed contracts. Content tools transform, generate, and manipulate the records the repository holds. The intake assistant proposes types and fields the repository validates before they become part of the model.
The rest of this essay is how that sentence turns into a repository.
First Pillar: AI-Powered Intake
The hardest moment in any content system is arrival.
A file arrives and the repository asks the user to do the work the repository cannot do: identify the type, fill in metadata, apply the taxonomy, set the lifecycle, and decide what should happen next.
People do this badly because the cost of doing it well is immediate and the cost of doing it poorly shows up months later. Half the metadata in any large ECM is wrong, blank, or stale.
AI changes the moment of arrival.
When a document arrives at the new repository, an intake assistant reads it. It:
proposes a type from the candidate types in the model,
proposes a new type with a schema sketch when no existing type fits,
extracts the structured metadata the type declares,
classifies into the taxonomy and flags edge cases that don’t fit,
generates a summary scoped to the type’s downstream uses,
detects PII and sensitive data,
chunks and embeds content for retrieval,
links to related records the repository already holds,
proposes lifecycle and retention.
All of this lands as a suggested record the human reviews, accepts, or corrects.
The human’s job changes from data entry to review. That’s an order-of-magnitude difference in throughput and quality.
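The propose-then-review flow can be sketched as follows. Everything here is a simplified stand-in: classify() uses cue phrases where a real intake assistant would call a model, the regular expressions cover only email addresses and ISO dates, and the Suggestion shape is a hypothetical one for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Assumed candidate types, each with a cue phrase used by the placeholder classifier.
CANDIDATE_TYPES = {
    "master_services_agreement": "master services agreement",
    "invoice": "invoice",
}


@dataclass
class Suggestion:
    doc_id: str
    proposed_type: Optional[str]
    fields: dict
    pii_found: bool
    needs_new_type: bool
    status: str = "pending_review"


def classify(text, candidate_types):
    """Placeholder for a model call: first candidate whose cue phrase appears."""
    lowered = text.lower()
    for type_name, cue in candidate_types.items():
        if cue in lowered:
            return type_name
    return None


def intake(doc_id, text, candidate_types=CANDIDATE_TYPES):
    """Read a document on arrival and produce a suggestion, never a final record."""
    proposed = classify(text, candidate_types)
    fields = {}
    date_match = ISO_DATE.search(text)
    if proposed is not None and date_match:
        fields["effective_date"] = date_match.group(0)
    return Suggestion(
        doc_id=doc_id,
        proposed_type=proposed,
        fields=fields,
        pii_found=bool(EMAIL.search(text)),
        needs_new_type=proposed is None,
    )


def review(suggestion, reviewer, corrections=None):
    """The human step: accept the suggestion, optionally overriding fields."""
    return {
        "doc_id": suggestion.doc_id,
        "type": suggestion.proposed_type,
        "fields": {**suggestion.fields, **(corrections or {})},
        "reviewed_by": reviewer,
        "status": "accepted",
    }
```

A document that matches no candidate type comes back with needs_new_type set, which is where a schema proposal would start. The reviewer sees a filled-in suggestion and only supplies corrections, which is the difference between review and data entry.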
Semantic document preparation
Intake also includes semantic docprep. Documents are not OCR’d into a flat text blob — they are visually processed into structured, semantically accurate text, with a generated table of contents and structural anchors that mirror the document’s actual shape: sections, clauses, tables, figures, footnotes, exhibits.
Long documents become navigable. An agent can ask the repository for “the indemnification clause” or “table 3” or “section 12.4 only” and retrieve that addressable part directly — not the whole document. That is what makes large-document retrieval actually work in production: full-document attention is wasteful and lossy; addressable parts are precise and efficient. The repository owns the structural map; the agent walks it.
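A structural map with addressable parts might look like the sketch below. The Part type, the anchor naming scheme ("section-12.4", "table-3"), and the sample contract are assumptions made for illustration, not the repository's actual representation.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    anchor: str  # stable address, e.g. "section-12.4" or "table-3"
    title: str
    text: str = ""
    children: list = field(default_factory=list)


def build_index(root):
    """Flatten the tree into an anchor -> part lookup for direct retrieval."""
    index, stack = {}, [root]
    while stack:
        part = stack.pop()
        index[part.anchor] = part
        stack.extend(part.children)
    return index


def table_of_contents(part, depth=0):
    """Render the structural map as indented lines, in document order."""
    lines = ["  " * depth + f"{part.anchor}: {part.title}"]
    for child in part.children:
        lines.extend(table_of_contents(child, depth + 1))
    return lines


contract = Part("doc", "Master Services Agreement", children=[
    Part("section-12", "Liability", children=[
        Part("section-12.4", "Indemnification", text="Each party shall indemnify the other..."),
    ]),
    Part("table-3", "Fee schedule", text="Tier | Monthly fee"),
])
index = build_index(contract)
```

An agent can read table_of_contents(contract) to decide where to look, then fetch index["section-12.4"].text and receive only that clause. The whole document never has to enter the model's context.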
Crucially, the intake assistant is inside the repository, not outside. It uses the repository’s schema, types, taxonomies, and permissions. When it proposes a new type, the proposal becomes a versioned schema change with attribution — the repository knows what it learned from what document on what day. Schemas evolve with the data they’re trying to describe.
This is what flexible schemas means in practice. Not “no schema” — that’s the chat-tool failure mode. Typed, validated, evolvable schemas that the assistant can extend with human approval and full lineage of why each field exists.
Second Pillar: Flexible, Schema-Aware Records
The data model was historically the slowest-moving part of any content platform. Adding a new field meant a migration project. Adding a new type meant a release. Changing a taxonomy meant a quarter of disruption.
The result in most ECM systems is predictable: the schema reflects what the business needed years ago, not what it needs now.
The new repository inverts that. Schemas are flexible by design — typed, validated, versioned, evolvable. The assistant continuously proposes refinements as it sees more content. New types, new fields, new constraints, all under version control with provenance. This is how the repository operates today, not an authoring fantasy.
Concretely, the repository provides:
Typed records with structured fields. Contracts have parties, term, value. Invoices have supplier, amount, gl_code. Validated, queryable, addressable.
Embeddings and structured extraction computed on intake. Every document is searchable both by exact field and semantic similarity: precision and recall at once.
Expressive query interface for agents. Agents query the repository through a governed search surface that supports full-text, structured filters, semantic similarity, aggregations, compound logic, and scoring boosts. The point is not to give agents a keyword box — it is to expose the retrieval power they need, under policy and cost controls.
Branchable, tree-aware versioning. Documents fork — translations, redactions, negotiation iterations, agent variants. Linear histories with separate copies lose the relationship; the repository tracks the version tree so you can walk it, compare branches, and merge changes back. “Branch a branch” is a first-class operation. That’s what makes it possible to ask “which redacted version of the Q3 contract did legal sign off on, and what changed against the negotiation thread?” — without losing either branch.
Fine-grained, dynamic permissions. Beyond static ACLs: permissions can be computed from content fields, user attributes, time, and context. An agent’s access to a contract can depend on the contract’s status, the user’s relationship to the supplier, and the policy in force at query time. Permissions evaluate at retrieval, not just at storage.
Lineage and provenance. Every derived artifact records what produced it, what it came from, and when. The repository can explain itself.
Lifecycle, retention, and audit — durable, the same way every serious ECM has done it for thirty years. None of that goes away.
The long tail of content shapes — structured records, unstructured documents, semi-structured data (tables, spreadsheets, slides), media, code — under the same model.
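The branchable versioning item in the list above is easiest to see as a data structure. This is a minimal sketch of a version tree, assuming hypothetical Version, path_to_root, and common_ancestor names; merging and content comparison are out of scope here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class Version:
    id: str
    label: str
    parent: Optional["Version"] = field(default=None, repr=False)
    children: list = field(default_factory=list, repr=False)

    def branch(self, id, label):
        """Fork a new version from this one; works on branches of branches too."""
        child = Version(id, label, parent=self)
        self.children.append(child)
        return child


def path_to_root(version):
    """Walk parent links from a version back to the original."""
    path = []
    while version is not None:
        path.append(version.id)
        version = version.parent
    return path


def common_ancestor(a, b):
    """Find the nearest version that both branches descend from."""
    seen = set(path_to_root(a))
    for version_id in path_to_root(b):
        if version_id in seen:
            return version_id
    return None


root = Version("v1", "signed draft")
negotiation = root.branch("v2", "negotiation round 2")
redacted = root.branch("v1-r", "redacted for external counsel")
redacted_fr = redacted.branch("v1-r-fr", "French translation of the redacted copy")
```

Because the relationship is stored rather than implied by file names, common_ancestor(negotiation, redacted_fr) resolves to "v1", which is the baseline a comparison between the two branches would start from.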
This is the boring, durable half of the design. It’s also what lets the AI half be trusted. The intake assistant proposes; the repository keeps the record. Agents read; the repository owns the truth.
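Permissions that "evaluate at retrieval, not just at storage" can be sketched as a predicate run on every read. The rules below (an embargo date and a draft-visibility rule) and the Caller and Contract shapes are invented for illustration; a real policy would be declared in the repository rather than hard-coded.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Caller:
    user_id: str
    managed_suppliers: frozenset


@dataclass(frozen=True)
class Contract:
    id: str
    supplier: str
    status: str  # "draft" or "active" in this sketch
    embargo_until: Optional[date] = None


def can_read(caller, contract, today):
    """Decide access from content fields, caller attributes, and time."""
    if contract.embargo_until is not None and today < contract.embargo_until:
        return False
    if contract.status == "draft":
        return contract.supplier in caller.managed_suppliers
    return True


def retrieve(caller, contracts, today):
    """Apply the policy at query time, so results track the current state."""
    return [c.id for c in contracts if can_read(caller, c, today)]


CONTRACTS = [
    Contract("c1", "Acme", "active"),
    Contract("c2", "Acme", "draft"),
    Contract("c3", "Beta", "draft"),
    Contract("c4", "Beta", "active", embargo_until=date(2025, 7, 1)),
]
analyst = Caller("u1", frozenset({"Acme"}))
```

The same caller running the same query gets ["c1", "c2"] on 2025-06-01 and ["c1", "c2", "c4"] on 2025-07-01, with no change to any stored ACL. A static access list cannot express that without someone editing it on the embargo date.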
Third Pillar: Agents and Tools Inside the Content
This is where most “AI-powered ECM” stops. They let you read content. They do not let you act on it.
And enterprise work is about acting on content, not just reading it. Generate the quarterly review deck from this set of records. Transform every claims PDF into structured data. Produce the regulatory packet from these contracts and these schedules. Modify a contract. Update a spreadsheet. Run code against a dataset.
The architectural claim isn’t that the repository ships those tools. Plenty of platforms ship some version of them. The claim is that one tool serves every caller from a single definition.
The same content tool can be invoked by an internal worker agent inside a bounded reasoning loop, called by a deterministic process node when the step is known, exposed to a third-party model via MCP, and consumed by another agent system over A2A. One definition. Four consumption modes. Same governance — scoping, lineage, permissions, audit — applied in every case. If the step is known, the process calls the tool directly; no model in the path. If it requires judgment, an agent uses the same tool inside a node bounded by writes.
Most platforms expose their tool layer to one consumption mode and rebuild it for the others: agent SDKs separate from process-engine SDKs separate from external integrations separate again from the consoles humans use. The repository’s tool layer was designed for all four from day one. MCP and A2A are how that coherence reaches outside the platform; the coherence itself starts inside.
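The "one definition, four consumption modes" idea can be shown with a single invocation path that every caller shares. This is a sketch under assumptions: Tool, ToolRuntime, the scope string, and the mode labels are hypothetical, and the "mcp" and "a2a" modes are only labels here, with no protocol code behind them.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    required_scope: str
    run: Callable[[dict], dict]


@dataclass
class ToolRuntime:
    tools: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def register(self, tool):
        self.tools[tool.name] = tool

    def invoke(self, name, args, caller, mode, scopes):
        """Single entry point: the same checks and audit apply to every caller."""
        tool = self.tools[name]
        allowed = tool.required_scope in scopes
        self.audit_log.append({"tool": name, "caller": caller, "mode": mode, "allowed": allowed})
        if not allowed:
            raise PermissionError(f"{caller} lacks scope {tool.required_scope!r}")
        return tool.run(args)


def sum_column(args):
    """Example content tool: total one numeric column of tabular rows."""
    rows, column = args["rows"], args["column"]
    return {"total": sum(row[column] for row in rows)}


runtime = ToolRuntime()
runtime.register(Tool("sum_column", "records:read", sum_column))

ROWS = [{"amount": 10}, {"amount": 32}]
results = [
    runtime.invoke("sum_column", {"rows": ROWS, "column": "amount"},
                   caller=f"{mode}-caller", mode=mode, scopes={"records:read"})
    for mode in ("agent", "process", "mcp", "a2a")
]
```

All four calls return {"total": 42} and leave four audit entries that differ only in caller and mode. A caller without the records:read scope gets PermissionError and is still logged, because the check and the audit live in the shared path rather than in each integration.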
The tools that participate in this coherence today are the ones enterprise content work actually needs:
slide generation and editing against the records the repository holds
spreadsheet manipulation against typed records (Excel, CSV, Sheets)
code execution on data the repository owns, results captured as artifacts with lineage preserved
PDF generation, signing, and form-filling
document transformation between formats, languages, and structural shapes
diagrams, charts, and structured visualizations from the data the repository holds
table-of-contents extraction, structured chunking, and schema inference for long-form documents
These are not chatbot widgets. They’re governed primitives — scoped to records, observable in run history, constrained by permissions. The list grows; the contract doesn’t change.
When Not to Reach for This
A content repository this rich is overkill when:
You need a file share. Use a file share.
You need a chat interface over a small set of documents. Use any modern AI workspace tool.
Your content has no lifecycle, no permissions story, no audit requirement, no relationships between records. Use object storage and call it done.
You need a CMS for a public website. Use a CMS.
Reach for the new repository when you have typed records with relationships, content that needs to be readable by software, lifecycle and lineage that matter, processes that operate on records, content tools that are part of the work, and a need for the repository itself to evolve as the business does. That’s the sweet spot. The same enterprise pattern that has always justified ECM — only this time the repository can read what it holds.
One Repository, Not Many Features
The differentiator is not that the platform has intake, schemas, search, embeddings, content tools, lineage, lifecycle, permissions, and an assistant. Other systems will assemble those parts.
The differentiator is that all of them sit under the same repository contract: typed records, declared schemas, scoped permissions, full lineage, immutable history, governed content tools, and shared retrieval. One contract. One trust anchor. One audit trail.
That matters even more once agents become consumers of the repository. If every agent platform brings its own document store, its own metadata model, and its own retrieval semantics, the enterprise gets another generation of content silos. The repository should be the shared content layer: one place where content is typed, searchable, permissioned, derived from, and acted on.
This is also why a thin AI layer over an old admin model is not enough. If the assistant cannot inspect and safely change the repository definition, it is not really part of the system. It is just another UI.
In Vertesia, the repository is designed so the assistant can participate in configuration itself: proposing schemas, intake rules, content tools, retrieval behavior, and process wiring — all under validation and human approval.
The same repository is addressable from any agent or process, with the same governance whether the caller is the intake assistant, a worker agent inside a node, a process tool, or an external system speaking MCP or A2A.
That is the synthesis. Without it, you have a feature list. With it, you have a content repository.
Why we built it
We didn’t set out to build a new content repository.
We set out to run agents in production against real enterprise content.
Every serious use case led to the same conclusion: agents could read, but the repository couldn’t. The gap between them is what kept the system from working.
The design followed.
AI-powered intake. Flexible schema-aware records. Machine-maintained metadata. High-throughput retrieval. Content tools as primitives. Lineage and governance preserved. Repository addressable from processes.
Not features.
Consequences.
Agents read. Repositories preserve. Tools transform. The repository is the contract that lets all three work together.
Thanks for reading. Next: the process side — why agents force us to rethink the process engine as the contract between deterministic control and probabilistic behavior.
A note: this isn’t a roadmap pitch. Vertesia Content Repository is shipping now in our platform, alongside the Vertesia Process Engine.



