Transparent AI Framework — Open Embeddings Schema
Open schema, signed provenance, and verifiable retrieval for a public semantic commons.
Version: 0.1.0 | Status: Draft | Author: Technology Shield | Last Updated: 2026-03-25
1. Purpose
Large language models and embedding systems are being built with billions of dollars by organisations whose incentives are not aligned with collective human flourishing. The data they train on is opaque. The transforms they apply are hidden. The rankings they produce are invisible. The biases they carry are undisclosed.
This document defines an Open Embeddings Schema — a structured, content-addressed, cryptographically signed format for embeddings and their provenance — as one layer of a broader Transparent Semantic Commons.
The goal is not to claim neutrality. It is to make every assumption legible, every transform auditable, every ranking challengeable, and every governance decision forkable.
2. Design Principles
| # | Principle | Description |
|---|---|---|
| 1 | Canonical Serialisation | Every object has exactly one deterministic representation for consistent hashing |
| 2 | Content-Addressed Identity | Object IDs are derived from content hashes, not database sequences |
| 3 | Signed Records | Every record is cryptographically signed by its creator |
| 4 | Append-Only History | Records are never silently edited or deleted; changes are made only through supersession and tombstoning |
| 5 | Typed Relations | Relationships between objects are first-class, typed, and signed |
| 6 | Multiple Embedding Spaces | No single model monopoly; every artefact can have embeddings from multiple models |
| 7 | Verifiable Retrieval | The rules used to retrieve and rank results are themselves public, signed objects |
| 8 | Forkable Governance | Any community can fork the dataset, embeddings, policies, and reputation without losing shared history |
3. Schema Objects
3.1 Artifact
A source object — the raw input to the commons.
{
"type": "Artifact",
"artifact_id": "cid:bafy...",
"content_type": "text/markdown",
"content_hash": "sha256:a1b2c3...",
"content_uri": "ipfs://Qm... | arweave://tx... | https://...",
"license": "CC-BY-4.0",
"created_at": "2026-03-25T01:00:00Z",
"submitted_by": "did:key:z6Mk...",
"jurisdiction_tags": ["global"],
"language": "en",
"sensitivity_tags": ["public"],
"parent_artifact_id": null,
"signature": "ed25519:..."
}
| Field | Type | Description |
|---|---|---|
| `artifact_id` | CID | Content-addressed ID derived from canonical serialisation |
| `content_type` | string | MIME type of the source content |
| `content_hash` | string | Cryptographic hash of the raw content |
| `content_uri` | string | Location of the content (IPFS, Arweave, HTTPS, or other durable storage) |
| `license` | string | SPDX licence identifier |
| `created_at` | ISO 8601 | Creation timestamp |
| `submitted_by` | DID | Decentralised identifier of the submitter |
| `jurisdiction_tags` | string[] | Jurisdictional scope |
| `language` | string | BCP 47 language tag |
| `sensitivity_tags` | string[] | Classification labels |
| `parent_artifact_id` | CID or null | Reference to a parent artefact (for derived works) |
| `signature` | string | Cryptographic signature of the submitter over the canonical record |
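As a minimal sketch of how the `content_hash` field might be produced and checked, assuming the `sha256:<hex digest>` encoding suggested by the example record. The function names are hypothetical and not part of the schema.

```python
import hashlib


def content_hash(raw: bytes) -> str:
    """Return a hash string in the 'sha256:<hex>' form shown in the example record."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


def matches_content_hash(raw: bytes, declared: str) -> bool:
    """Check fetched bytes against an Artifact's declared content_hash."""
    algorithm, _, _ = declared.partition(":")
    if algorithm != "sha256":
        # Assumption: only SHA-256 is accepted, since it is the only hash named here.
        return False
    return content_hash(raw) == declared


raw = "# Example artefact\n".encode("utf-8")
declared = content_hash(raw)
print(declared)
print(matches_content_hash(raw, declared))         # True
print(matches_content_hash(raw + b"x", declared))  # False
```

The same check applies regardless of where `content_uri` points, because the hash is computed over the raw bytes rather than over the location.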
3.2 Transformation
Any operation applied to an artefact.
{
"type": "Transformation",
"transform_id": "cid:bafy...",
"input_artifact_id": "cid:bafy...",
"transform_type": "embed",
"software_id": "github.com/org/tool",
"software_version": "2.1.0",
"model_id": "bge-m3",
"model_version": "2026-03-01",
"parameters": {
"max_tokens": 512,
"pooling": "cls",
"normalize": true
},
"output_hash": "sha256:d4e5f6...",
"performed_by": "did:key:z6Mk...",
"timestamp": "2026-03-25T01:05:00Z",
"signature": "ed25519:..."
}
| Field | Type | Description |
|---|---|---|
| `transform_id` | CID | Content-addressed ID |
| `input_artifact_id` | CID | The artefact this transform was applied to |
| `transform_type` | enum | chunk, translate, classify, summarize, embed, redact, normalize |
| `software_id` | string | Identifier for the software that performed the transform |
| `software_version` | string | Version of the software |
| `model_id` | string | Identifier for the model used (if applicable) |
| `model_version` | string | Version of the model |
| `parameters` | object | Configuration parameters used |
| `output_hash` | string | Hash of the transform output |
| `performed_by` | DID | Who performed the transformation |
| `timestamp` | ISO 8601 | When the transformation was performed |
| `signature` | string | Cryptographic signature |
3.3 EmbeddingRecord
The embedding itself — the vector representation of an artefact.
{
"type": "EmbeddingRecord",
"embedding_id": "cid:bafy...",
"artifact_id": "cid:bafy...",
"transform_id": "cid:bafy...",
"vector": [0.0182, -0.4421, 0.0931, "..."],
"dimension": 1024,
"numeric_format": "float32",
"distance_metric": "cosine",
"normalization": "l2",
"model_family": "bge-m3",
"model_version": "2026-03-01",
"tokenizer_version": "v4",
"scope_tags": ["public", "science", "english"],
"created_at": "2026-03-25T01:05:00Z",
"submitted_by": "did:key:z6Mk...",
"signature": "ed25519:..."
}
| Field | Type | Description |
|---|---|---|
| `embedding_id` | CID | Content-addressed ID |
| `artifact_id` | CID | The source artefact |
| `transform_id` | CID | The transformation that produced this embedding |
| `vector` | float[] | The embedding vector |
| `dimension` | integer | Vector dimensionality |
| `numeric_format` | string | float16, float32, float64, int8 |
| `distance_metric` | string | cosine, euclidean, dot_product |
| `normalization` | string | l2, none, unit |
| `model_family` | string | Model family name |
| `model_version` | string | Specific model version |
| `tokenizer_version` | string | Tokenizer version used |
| `scope_tags` | string[] | Domain and scope labels |
| `created_at` | ISO 8601 | Creation timestamp |
| `submitted_by` | DID | Creator's decentralised identifier |
| `signature` | string | Cryptographic signature |
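The fields above allow some basic consistency checks before an EmbeddingRecord is accepted. The sketch below is illustrative only: the function name, the tolerance, and the decision to treat a non-unit norm as a problem when `normalization` is `l2` are assumptions, not requirements stated by the schema.

```python
import math

ALLOWED_FORMATS = {"float16", "float32", "float64", "int8"}
ALLOWED_METRICS = {"cosine", "euclidean", "dot_product"}


def embedding_problems(record: dict, tolerance: float = 1e-3) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    vector = record["vector"]
    if len(vector) != record["dimension"]:
        problems.append(
            f"dimension is {record['dimension']} but vector has {len(vector)} values"
        )
    if record["numeric_format"] not in ALLOWED_FORMATS:
        problems.append(f"unknown numeric_format {record['numeric_format']!r}")
    if record["distance_metric"] not in ALLOWED_METRICS:
        problems.append(f"unknown distance_metric {record['distance_metric']!r}")
    if record["normalization"] == "l2":
        # Assumption: an l2-normalised vector should have Euclidean norm close to 1.
        norm = math.sqrt(sum(x * x for x in vector))
        if abs(norm - 1.0) > tolerance:
            problems.append(f"declared l2 normalisation but norm is {norm:.4f}")
    return problems


good = {"vector": [0.6, 0.8, 0.0], "dimension": 3, "numeric_format": "float32",
        "distance_metric": "cosine", "normalization": "l2"}
bad = dict(good, vector=[3.0, 4.0, 0.0], dimension=4)
print(embedding_problems(good))
print(embedding_problems(bad))
```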
Critical design choice: For every artefact, the schema supports multiple embeddings from different models. This prevents monoculture and priesthood formation around a single embedding space.
Artifact A
├── EmbeddingRecord (bge-m3)
├── EmbeddingRecord (e5-large-v2)
├── EmbeddingRecord (community-model-1)
└── EmbeddingRecord (domain-specific-medical)
3.4 Claim
A proposition about the world.
{
"type": "Claim",
"claim_id": "cid:bafy...",
"claim_text": "Global average temperatures increased by 1.1°C between 1850-1900 and 2011-2020.",
"claim_type": "fact",
"language": "en",
"asserted_by": "did:key:z6Mk...",
"timestamp": "2026-03-25T02:00:00Z",
"related_artifacts": ["cid:bafy..."],
"related_embeddings": ["cid:bafy..."],
"confidence_declared": 0.95,
"jurisdiction_scope": "global",
"signature": "ed25519:..."
}
| Field | Type | Description |
|---|---|---|
| `claim_type` | enum | fact, opinion, forecast, moral_stance, interpretation, definition |
| `confidence_declared` | float (0-1) | The submitter's declared confidence level |
| `related_artifacts` | CID[] | Source artefacts supporting the claim |
| `related_embeddings` | CID[] | Embeddings linked to the claim |
3.5 EvidenceLink
A typed, signed relationship between any two objects.
{
"type": "EvidenceLink",
"link_id": "cid:bafy...",
"from_object": "cid:bafy...",
"to_object": "cid:bafy...",
"relation_type": "supports",
"weight": 0.85,
"created_by": "did:key:z6Mk...",
"timestamp": "2026-03-25T02:10:00Z",
"signature": "ed25519:..."
}
| Relation Types | Description |
|---|---|
| `supports` | Source evidence supports the target claim |
| `contradicts` | Source evidence contradicts the target claim |
| `contextualizes` | Source provides context for the target |
| `supersedes` | Source replaces the target |
| `rebuts` | Source specifically argues against the target |
| `duplicates` | Source is a duplicate of the target |
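Because links are typed, a client can group them by target and relation type to see at a glance what supports or contradicts an object. The following is a rough sketch of such a grouping; the function name and the placeholder IDs (`cid:claim-1` and so on) are hypothetical, and signature checks are omitted.

```python
from collections import defaultdict

RELATION_TYPES = {"supports", "contradicts", "contextualizes",
                  "supersedes", "rebuts", "duplicates"}


def index_links_by_target(links):
    """Group EvidenceLink records by to_object, then by relation_type."""
    index = defaultdict(lambda: defaultdict(list))
    for link in links:
        if link["relation_type"] not in RELATION_TYPES:
            raise ValueError(f"unknown relation_type {link['relation_type']!r}")
        index[link["to_object"]][link["relation_type"]].append(link)
    return index


# Placeholder records with made-up IDs, for illustration only.
links = [
    {"from_object": "cid:study-1", "to_object": "cid:claim-1",
     "relation_type": "supports", "weight": 0.85},
    {"from_object": "cid:study-2", "to_object": "cid:claim-1",
     "relation_type": "contradicts", "weight": 0.40},
    {"from_object": "cid:note-1", "to_object": "cid:claim-1",
     "relation_type": "contextualizes", "weight": 0.50},
]

index = index_links_by_target(links)
for relation, group in index["cid:claim-1"].items():
    print(relation, [link["from_object"] for link in group])
```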
3.6 Challenge
A formal objection to any object in the commons.
{
"type": "Challenge",
"challenge_id": "cid:bafy...",
"target_object_id": "cid:bafy...",
"challenge_type": "misleading",
"reason": "The cited study was retracted in 2025 due to data fabrication.",
"supporting_artifacts": ["cid:bafy..."],
"opened_by": "did:key:z6Mk...",
"opened_at": "2026-03-25T03:00:00Z",
"status": "open",
"resolution_ref": null,
"signature": "ed25519:..."
}
| Challenge Types | Description |
|---|---|
| `poisoning` | Data or embedding has been deliberately corrupted |
| `malformed` | Object does not conform to schema |
| `copyright` | Content violates copyright |
| `fabricated` | Content is fabricated or synthetically generated without disclosure |
| `misleading` | Content is technically accurate but misleading in context |
| `duplicate` | Object is a duplicate of an existing entry |
| `hidden_transform` | A transformation was applied but not declared |
| `governance_abuse` | A governance decision was made improperly |
3.7 GovernanceDecision
The outcome of a challenge or governance action.
{
"type": "GovernanceDecision",
"decision_id": "cid:bafy...",
"target_object_id": "cid:bafy...",
"decision_type": "tombstone",
"decision_text": "Source artefact confirmed as retracted study. Embedding tombstoned. Claim marked as disputed.",
"voters_or_council": ["did:key:z6Mk...", "did:key:z6Mk..."],
"timestamp": "2026-03-25T04:00:00Z",
"supersedes": null,
"appeal_window": "P30D",
"signature": "ed25519:..."
}
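The `appeal_window` value in the example (`P30D`) is an ISO 8601 duration. A small sketch of how a client might work out when the window closes follows. It assumes the window starts at the decision's `timestamp` and handles only day-only durations; both are assumptions, and the function name is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone


def appeal_deadline(decision: dict) -> datetime:
    """Return the moment the appeal window closes (timezone-aware)."""
    match = re.fullmatch(r"P(\d+)D", decision["appeal_window"])
    if match is None:
        raise ValueError("this sketch only handles day-only durations such as 'P30D'")
    # Replace the trailing 'Z' so older Python versions can parse the timestamp.
    decided_at = datetime.fromisoformat(decision["timestamp"].replace("Z", "+00:00"))
    return decided_at + timedelta(days=int(match.group(1)))


decision = {"timestamp": "2026-03-25T04:00:00Z", "appeal_window": "P30D"}
deadline = appeal_deadline(decision)
print(deadline.isoformat())
print(datetime(2026, 4, 1, tzinfo=timezone.utc) < deadline)  # still appealable
```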
4. Provenance Chain
Every embedding carries a full provenance chain that can be verified:
Artifact (source content)
│ content_hash verifies raw content integrity
▼
Transformation (chunking, embedding, etc.)
│ signed by performer; parameters recorded
▼
EmbeddingRecord (the vector)
│ signed by creator; linked to artifact and transformation
▼
EvidenceLinks (relationships)
│ signed by link creator
▼
Claims (propositions)
│ signed by asserter; linked to evidence
▼
Challenges (objections)
│ signed by challenger
▼
GovernanceDecisions (outcomes)
signed by council/voters
Verification flow:
- Retrieve the EmbeddingRecord
- Verify its signature against the `submitted_by` DID
- Follow `transform_id` to the Transformation record
- Verify the Transformation signature and parameters
- Follow `input_artifact_id` to the Artifact
- Verify the Artifact's `content_hash` against the content at `content_uri`
- Check for any open Challenges against any object in the chain
- Check for any GovernanceDecisions that affect objects in the chain
If any link in the chain fails verification, the embedding is flagged as unverifiable.
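The verification flow can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: `records` is a hypothetical in-memory lookup by ID, `fetch_content` and `verify_signature` are caller-supplied callables (real DID resolution and Ed25519 checks are not shown), and parameter validation is omitted.

```python
import hashlib


def verify_embedding_chain(embedding, records, fetch_content, verify_signature,
                           challenges=(), decisions=()):
    """Walk EmbeddingRecord -> Transformation -> Artifact and report findings."""
    failures = []
    chain_ids = {embedding["embedding_id"]}

    # Signature of the EmbeddingRecord against its submitted_by DID.
    if not verify_signature(embedding, embedding["submitted_by"]):
        failures.append("embedding signature invalid")

    # Follow transform_id, verify the Transformation, then follow input_artifact_id.
    artifact = None
    transformation = records.get(embedding["transform_id"])
    if transformation is None:
        failures.append("transformation record missing")
    else:
        chain_ids.add(transformation["transform_id"])
        if not verify_signature(transformation, transformation["performed_by"]):
            failures.append("transformation signature invalid")
        artifact = records.get(transformation["input_artifact_id"])
        if artifact is None:
            failures.append("artifact record missing")

    # Compare the Artifact's content_hash with the bytes found at content_uri.
    if artifact is not None:
        chain_ids.add(artifact["artifact_id"])
        raw = fetch_content(artifact["content_uri"])
        if "sha256:" + hashlib.sha256(raw).hexdigest() != artifact["content_hash"]:
            failures.append("artifact content_hash mismatch")

    # Collect open Challenges and GovernanceDecisions touching the chain.
    open_challenges = [c["challenge_id"] for c in challenges
                       if c["status"] == "open" and c["target_object_id"] in chain_ids]
    affecting = [d["decision_id"] for d in decisions
                 if d["target_object_id"] in chain_ids]
    return {"verifiable": not failures, "failures": failures,
            "open_challenges": open_challenges, "decisions": affecting}


# Stand-in data and a stub verifier, for illustration only.
content = b"example content"
records = {
    "cid:artifact": {"artifact_id": "cid:artifact", "content_uri": "memory://a",
                     "content_hash": "sha256:" + hashlib.sha256(content).hexdigest()},
    "cid:transform": {"transform_id": "cid:transform",
                      "input_artifact_id": "cid:artifact",
                      "performed_by": "did:example:bob", "signature": "stub-valid"},
}
embedding = {"embedding_id": "cid:embedding", "transform_id": "cid:transform",
             "submitted_by": "did:example:alice", "signature": "stub-valid"}


def stub_verify(record, did):
    return record["signature"] == "stub-valid"


report = verify_embedding_chain(embedding, records, lambda uri: content, stub_verify)
print(report)
```

Open challenges and governance decisions are reported separately from failures here, on the assumption that a client may want to display a disputed but cryptographically intact embedding differently from one whose chain does not verify.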
5. Storage Architecture
5.1 On-Chain (Tamper-Evidence Layer)
Store on a durable, append-only ledger:
- Object IDs (content hashes)
- Signatures
- Timestamps
- Governance decisions
- Challenge records
- Provenance chain links
Candidates: Arweave (permanent storage), Ceramic (mutable event streams stored via IPFS and anchored to a blockchain), or a purpose-built chain.
5.2 Off-Chain (Content Layer)
Store in durable decentralised storage:
- Raw artefact content
- Embedding vectors
- Transformation outputs
- Full object records
Candidates: IPFS (content-addressed), Arweave (permanent), Filecoin (incentivised storage), or federated node operators.
5.3 Index Layer
Queryable indexes for retrieval:
- Vector similarity indexes
- Full-text search indexes
- Graph indexes (for evidence links and provenance chains)
Candidates: Any operator can run an index node. Having multiple competing indexes is healthy because it prevents capture by a single index operator.
6. Identity
All participants are identified using Decentralised Identifiers (DIDs).
| Requirement | Approach |
|---|---|
| Self-sovereign identity | did:key for simple cases; did:web or did:ion for organisations |
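| Key rotation | DID methods with updatable DID Documents (such as did:web) support key rotation without changing the identifier; did:key cannot rotate, because the identifier encodes the key itself |
| Pseudonymity | Participants can contribute under a pseudonymous DID |
| Reputation linkage | Reputation accrues to the DID, not to a platform account |
| Revocation | Compromised DIDs can be revoked through DID Document updates where the method supports them; a compromised did:key has to be abandoned and replaced |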
7. Serialisation and Hashing
Canonical Serialisation
All objects are serialised using DAG-JSON (deterministic JSON for content addressing):
- Keys sorted lexicographically
- No whitespace
- Numbers in canonical form
- CID links use the DAG-JSON CID format
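The first three rules can be approximated with Python's standard `json` module, as in the sketch below. This is not a full DAG-JSON encoder: it does not handle CID links or byte values, and number formatting follows Python's defaults, which may differ from the DAG-JSON specification in edge cases. The function name is hypothetical.

```python
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialise a record with sorted keys and no insignificant whitespace."""
    text = json.dumps(
        record,
        sort_keys=True,          # deterministic key order
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII text as UTF-8, not \u escapes
        allow_nan=False,         # NaN and Infinity are not valid JSON
    )
    return text.encode("utf-8")


a = {"type": "Artifact", "language": "en", "sensitivity_tags": ["public"]}
b = {"sensitivity_tags": ["public"], "type": "Artifact", "language": "en"}
print(canonical_bytes(a))
print(canonical_bytes(a) == canonical_bytes(b))  # key order no longer matters
```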
Hashing
- Content hashes: SHA-256
- Content IDs: CIDv1 with `dag-json` codec and `sha2-256` multihash
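To make the Content ID construction concrete, here is a sketch that assembles a CIDv1 by hand from canonical bytes, using the multicodec codes for `dag-json` (0x0129) and `sha2-256` (0x12) and the base32 multibase encoding. The function names are hypothetical; a production system would more likely rely on an existing CID library.

```python
import base64
import hashlib

DAG_JSON_CODEC = 0x0129  # multicodec code for dag-json
SHA2_256_CODE = 0x12     # multihash code for sha2-256


def varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def cid_v1_dag_json(canonical: bytes) -> str:
    """Build a base32 CIDv1 string over already-canonicalised bytes."""
    digest = hashlib.sha256(canonical).digest()
    multihash = varint(SHA2_256_CODE) + varint(len(digest)) + digest
    cid_bytes = varint(1) + varint(DAG_JSON_CODEC) + multihash
    # Multibase base32: lower-case RFC 4648 alphabet, no padding, 'b' prefix.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


cid = cid_v1_dag_json(b'{"type":"Artifact"}')
print(cid)
```

Identifiers built this way begin with `baguqeera`. The `bafy...` prefix used in the example records corresponds to the `dag-pb` codec, so real IDs produced under this section's rules would look different from those placeholders.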
Signatures
- Default: Ed25519 over the canonical serialisation
- Verification: Public key resolved from the signer's DID Document
8. Multi-Model Embedding Support
This is a critical anti-capture mechanism. The schema explicitly supports and encourages multiple embeddings per artefact:
| Embedding Category | Description |
|---|---|
| `native` | The original embedding submitted with the artefact |
| `alternate` | Alternative embeddings from different models |
| `community` | Embeddings contributed by community members |
| `domain_specific` | Embeddings from domain-specialised models (medical, legal, scientific) |
| `translation` | Embeddings of translated versions of the artefact |
Retrieval clients choose which embedding spaces to search:
- Single model-space search (for consistency)
- Cross-model ensemble search (for robustness)
- Contradiction-aware search (for integrity)
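One possible shape for a cross-model ensemble search is sketched below. The schema does not prescribe a fusion method; this example uses reciprocal rank fusion (RRF) as an assumed technique, ranks by cosine similarity within each model space, and expects the caller to supply one query vector per model. Model names, artefact IDs, and the constant `k=60` are all illustrative.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_in_space(query_vector, records):
    """Rank artifact IDs within a single model space, best match first."""
    scored = [(cosine(query_vector, r["vector"]), r["artifact_id"]) for r in records]
    return [artifact_id for _, artifact_id in sorted(scored, reverse=True)]


def ensemble_search(queries_by_model, records, k=60):
    """Fuse per-model rankings with reciprocal rank fusion."""
    by_model = defaultdict(list)
    for record in records:
        by_model[record["model_family"]].append(record)
    fused = defaultdict(float)
    for model, query_vector in queries_by_model.items():
        ranking = rank_in_space(query_vector, by_model[model])
        for rank, artifact_id in enumerate(ranking, start=1):
            fused[artifact_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Toy two-dimensional records in two hypothetical model spaces.
records = [
    {"artifact_id": "A", "model_family": "model-a", "vector": [1.0, 0.0]},
    {"artifact_id": "B", "model_family": "model-a", "vector": [0.0, 1.0]},
    {"artifact_id": "C", "model_family": "model-a", "vector": [0.7, 0.7]},
    {"artifact_id": "A", "model_family": "model-b", "vector": [0.9, 0.1]},
    {"artifact_id": "B", "model_family": "model-b", "vector": [0.8, 0.6]},
    {"artifact_id": "C", "model_family": "model-b", "vector": [0.0, 1.0]},
]
queries = {"model-a": [1.0, 0.1], "model-b": [1.0, 0.0]}
results = ensemble_search(queries, records)
print(results)
```

Rank-based fusion avoids comparing raw similarity scores across model spaces, which are generally not on a common scale.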
9. Contribution Economics
Consuming systems (models, applications, services) that read from the commons should contribute back. Contributions are recorded as signed receipts:
{
"type": "ContributionReceipt",
"receipt_id": "cid:bafy...",
"contributor": "did:key:z6Mk...",
"contribution_type": "new_embeddings",
"quantity": 1500,
"quality_score": 0.92,
"accepted_by": "did:key:z6Mk...",
"timestamp": "2026-03-25T05:00:00Z",
"signature": "ed25519:..."
}
| Contribution Types |
|---|
| New public-domain artefacts |
| Additional embeddings for existing artefacts |
| Challenge reviews |
| Contradiction links |
| Benchmark runs |
| Index rebuild compute |
| Storage funding |
| Moderation labour |
| Reputation stake |
Governance can tie access tiers or influence caps to contribution history, preventing pure extraction without reciprocity.
10. Relationship to the Protocol
This schema defines the data objects. The Protocol for Open AI System (separate document) defines:
- How these objects are published, discovered, and replicated
- How governance decisions are made
- How communities fork
- How retrieval policies work
- How anti-capture mechanisms operate
Together, the schema and protocol form the Transparent Semantic Commons.