Transparent AI Framework — Open Embeddings Schema

Open schema, signed provenance, and verifiable retrieval for a public semantic commons.

Version: 0.1.0
Status: Draft
Author: Technology Shield
Last Updated: 2026-03-25


1. Purpose

Large language models and embedding systems are being built with billions of dollars by organisations whose incentives are not aligned with collective human flourishing. The data they train on is opaque. The transforms they apply are hidden. The rankings they produce are invisible. The biases they carry are undisclosed.

This document defines an Open Embeddings Schema — a structured, content-addressed, cryptographically signed format for embeddings and their provenance — as one layer of a broader Transparent Semantic Commons.

The goal is not to claim neutrality. It is to make every assumption legible, every transform auditable, every ranking challengeable, and every governance decision forkable.


2. Design Principles

#  Principle                   Description
1  Canonical Serialisation     Every object has exactly one deterministic representation, so hashes are consistent
2  Content-Addressed Identity  Object IDs are derived from content hashes, not database sequences
3  Signed Records              Every record is cryptographically signed by its creator
4  Append-Only History         Records are never silently edited or deleted; changes happen only through supersession and tombstoning
5  Typed Relations             Relationships between objects are first-class, typed, and signed
6  Multiple Embedding Spaces   No single model monopoly; every artefact can have embeddings from multiple models
7  Verifiable Retrieval        The rules used to retrieve and rank results are themselves public, signed objects
8  Forkable Governance         Any community can fork the dataset, embeddings, policies, and reputation without losing shared history

3. Schema Objects

3.1 Artifact

A source object — the raw input to the commons.

{
  "type": "Artifact",
  "artifact_id": "cid:bafy...",
  "content_type": "text/markdown",
  "content_hash": "sha256:a1b2c3...",
  "content_uri": "ipfs://Qm... | arweave://tx... | https://...",
  "license": "CC-BY-4.0",
  "created_at": "2026-03-25T01:00:00Z",
  "submitted_by": "did:key:z6Mk...",
  "jurisdiction_tags": ["global"],
  "language": "en",
  "sensitivity_tags": ["public"],
  "parent_artifact_id": null,
  "signature": "ed25519:..."
}

Field               Type         Description
artifact_id         CID          Content-addressed ID derived from canonical serialisation
content_type        string       MIME type of the source content
content_hash        string       Cryptographic hash of the raw content
content_uri         string       Location of the content (IPFS, Arweave, HTTPS, or other durable storage)
license             string       SPDX licence identifier
created_at          ISO 8601     Creation timestamp
submitted_by        DID          Decentralised identifier of the submitter
jurisdiction_tags   string[]     Jurisdictional scope
language            string       BCP 47 language tag
sensitivity_tags    string[]     Classification labels
parent_artifact_id  CID or null  Reference to a parent artefact (for derived works)
signature           string       Cryptographic signature of the submitter over the canonical record
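As a minimal sketch of how a client might check content_hash against fetched bytes, the snippet below assumes the value is a lowercase hex SHA-256 digest behind a "sha256:" prefix, as the example record suggests. The function names are illustrative and not part of the schema.

```python
import hashlib

def content_hash(raw: bytes) -> str:
    """Hash raw content in the 'sha256:<hex>' form assumed by this sketch."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()

def matches_artifact(artifact: dict, raw: bytes) -> bool:
    """Compare fetched bytes against the artefact's declared content_hash."""
    return artifact["content_hash"] == content_hash(raw)

raw = b"# Hello commons\n"
artifact = {"type": "Artifact", "content_hash": content_hash(raw)}

print(matches_artifact(artifact, raw))         # True
print(matches_artifact(artifact, raw + b"x"))  # False
```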

3.2 Transformation

Any operation applied to an artefact.

{
  "type": "Transformation",
  "transform_id": "cid:bafy...",
  "input_artifact_id": "cid:bafy...",
  "transform_type": "embed",
  "software_id": "github.com/org/tool",
  "software_version": "2.1.0",
  "model_id": "bge-m3",
  "model_version": "2026-03-01",
  "parameters": {
    "max_tokens": 512,
    "pooling": "cls",
    "normalize": true
  },
  "output_hash": "sha256:d4e5f6...",
  "performed_by": "did:key:z6Mk...",
  "timestamp": "2026-03-25T01:05:00Z",
  "signature": "ed25519:..."
}

Field              Type      Description
transform_id       CID       Content-addressed ID
input_artifact_id  CID       The artefact this transform was applied to
transform_type     enum      One of chunk, translate, classify, summarize, embed, redact, normalize
software_id        string    Identifier for the software that performed the transform
software_version   string    Version of the software
model_id           string    Identifier for the model used (if applicable)
model_version      string    Version of the model
parameters         object    Configuration parameters used
output_hash        string    Hash of the transform output
performed_by       DID       Who performed the transformation
timestamp          ISO 8601  When the transformation was performed
signature          string    Cryptographic signature

3.3 EmbeddingRecord

The embedding itself — the vector representation of an artefact.

{
  "type": "EmbeddingRecord",
  "embedding_id": "cid:bafy...",
  "artifact_id": "cid:bafy...",
  "transform_id": "cid:bafy...",
  "vector": [0.0182, -0.4421, 0.0931, "..."],
  "dimension": 1024,
  "numeric_format": "float32",
  "distance_metric": "cosine",
  "normalization": "l2",
  "model_family": "bge-m3",
  "model_version": "2026-03-01",
  "tokenizer_version": "v4",
  "scope_tags": ["public", "science", "english"],
  "created_at": "2026-03-25T01:05:00Z",
  "submitted_by": "did:key:z6Mk...",
  "signature": "ed25519:..."
}

Field              Type      Description
embedding_id       CID       Content-addressed ID
artifact_id        CID       The source artefact
transform_id       CID       The transformation that produced this embedding
vector             float[]   The embedding vector
dimension          integer   Vector dimensionality
numeric_format     string    One of float16, float32, float64, int8
distance_metric    string    One of cosine, euclidean, dot_product
normalization      string    One of l2, none, unit
model_family       string    Model family name
model_version      string    Specific model version
tokenizer_version  string    Tokenizer version used
scope_tags         string[]  Domain and scope labels
created_at         ISO 8601  Creation timestamp
submitted_by       DID       Creator's decentralised identifier
signature          string    Cryptographic signature

Critical design choice: For every artefact, the schema supports multiple embeddings from different models. This prevents monoculture and priesthood formation around a single embedding space.

Artifact A
  ├── EmbeddingRecord (bge-m3)
  ├── EmbeddingRecord (e5-large-v2)
  ├── EmbeddingRecord (community-model-1)
  └── EmbeddingRecord (domain-specific-medical)
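A consumer would likely want to sanity-check a record before indexing it. The sketch below checks that dimension matches the vector length and, assuming that "l2" normalization means the vector has unit L2 norm, that the declared normalization holds. The tolerance and the function name are assumptions for illustration only.

```python
import math

def check_embedding(record: dict, tol: float = 1e-3) -> list[str]:
    """Return a list of consistency problems; an empty list means none were found."""
    problems = []
    vec = record["vector"]
    if len(vec) != record["dimension"]:
        problems.append("dimension does not match vector length")
    if record["normalization"] == "l2":
        # Assumption: "l2" declares that the stored vector has unit L2 norm.
        norm = math.sqrt(sum(x * x for x in vec))
        if abs(norm - 1.0) > tol:
            problems.append("declared l2 normalization but norm is not 1")
    return problems

good = {"vector": [0.6, 0.8], "dimension": 2, "normalization": "l2"}
bad = {"vector": [0.6, 0.8, 1.0], "dimension": 2, "normalization": "l2"}

print(check_embedding(good))  # []
print(len(check_embedding(bad)))  # 2
```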

3.4 Claim

A proposition about the world.

{
  "type": "Claim",
  "claim_id": "cid:bafy...",
  "claim_text": "Global average temperatures increased by 1.1°C between 1850-1900 and 2011-2020.",
  "claim_type": "fact",
  "language": "en",
  "asserted_by": "did:key:z6Mk...",
  "timestamp": "2026-03-25T02:00:00Z",
  "related_artifacts": ["cid:bafy..."],
  "related_embeddings": ["cid:bafy..."],
  "confidence_declared": 0.95,
  "jurisdiction_scope": "global",
  "signature": "ed25519:..."
}

Field                Type         Description
claim_type           enum         One of fact, opinion, forecast, moral_stance, interpretation, definition
confidence_declared  float (0-1)  The submitter's declared confidence level
related_artifacts    CID[]        Source artefacts supporting the claim
related_embeddings   CID[]        Embeddings linked to the claim

3.5 EvidenceLink

A typed, signed relationship between any two objects.

{
  "type": "EvidenceLink",
  "link_id": "cid:bafy...",
  "from_object": "cid:bafy...",
  "to_object": "cid:bafy...",
  "relation_type": "supports",
  "weight": 0.85,
  "created_by": "did:key:z6Mk...",
  "timestamp": "2026-03-25T02:10:00Z",
  "signature": "ed25519:..."
}

Relation Type   Description
supports        Source evidence supports the target claim
contradicts     Source evidence contradicts the target claim
contextualizes  Source provides context for the target
supersedes      Source replaces the target
rebuts          Source specifically argues against the target
duplicates      Source is a duplicate of the target
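One way a client might use these links is to group everything that points at a given object by relation type. The sketch below does that over an in-memory list; the short IDs are hypothetical placeholders rather than real CIDs, and the schema does not prescribe this grouping.

```python
from collections import defaultdict

def links_into(links: list[dict], target_id: str) -> dict:
    """Group the sources of links pointing at target_id by relation_type."""
    grouped = defaultdict(list)
    for link in links:
        if link["to_object"] == target_id:
            grouped[link["relation_type"]].append(link["from_object"])
    return dict(grouped)

# Hypothetical placeholder IDs for illustration.
links = [
    {"from_object": "cid:study-1", "to_object": "cid:claim-1", "relation_type": "supports"},
    {"from_object": "cid:study-2", "to_object": "cid:claim-1", "relation_type": "contradicts"},
    {"from_object": "cid:study-3", "to_object": "cid:claim-1", "relation_type": "supports"},
    {"from_object": "cid:study-4", "to_object": "cid:claim-2", "relation_type": "supports"},
]

print(links_into(links, "cid:claim-1"))
```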

3.6 Challenge

A formal objection to any object in the commons.

{
  "type": "Challenge",
  "challenge_id": "cid:bafy...",
  "target_object_id": "cid:bafy...",
  "challenge_type": "misleading",
  "reason": "The cited study was retracted in 2025 due to data fabrication.",
  "supporting_artifacts": ["cid:bafy..."],
  "opened_by": "did:key:z6Mk...",
  "opened_at": "2026-03-25T03:00:00Z",
  "status": "open",
  "resolution_ref": null,
  "signature": "ed25519:..."
}

Challenge Type    Description
poisoning         Data or embedding has been deliberately corrupted
malformed         Object does not conform to schema
copyright         Content violates copyright
fabricated        Content is fabricated or synthetically generated without disclosure
misleading        Content is technically accurate but misleading in context
duplicate         Object is a duplicate of an existing entry
hidden_transform  A transformation was applied but not declared
governance_abuse  A governance decision was made improperly

3.7 GovernanceDecision

The outcome of a challenge or governance action.

{
  "type": "GovernanceDecision",
  "decision_id": "cid:bafy...",
  "target_object_id": "cid:bafy...",
  "decision_type": "tombstone",
  "decision_text": "Source artefact confirmed as retracted study. Embedding tombstoned. Claim marked as disputed.",
  "voters_or_council": ["did:key:z6Mk...", "did:key:z6Mk..."],
  "timestamp": "2026-03-25T04:00:00Z",
  "supersedes": null,
  "appeal_window": "P30D",
  "signature": "ed25519:..."
}
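The appeal_window value in the example is an ISO 8601 duration. As a rough sketch, and assuming the window runs from the decision's timestamp (the schema does not say so explicitly), the appeal deadline could be computed as below. Only whole-day durations such as "P30D" are handled here; a full ISO 8601 duration parser is out of scope.

```python
import re
from datetime import datetime, timedelta

def appeal_deadline(decision: dict) -> datetime:
    """Return the end of the appeal window, assuming it starts at `timestamp`."""
    match = re.fullmatch(r"P(\d+)D", decision["appeal_window"])
    if match is None:
        raise ValueError("this sketch only handles whole-day durations such as 'P30D'")
    # Replace the trailing 'Z' so fromisoformat works on older Python versions.
    decided = datetime.fromisoformat(decision["timestamp"].replace("Z", "+00:00"))
    return decided + timedelta(days=int(match.group(1)))

decision = {"timestamp": "2026-03-25T04:00:00Z", "appeal_window": "P30D"}
print(appeal_deadline(decision).isoformat())  # 2026-04-24T04:00:00+00:00
```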

4. Provenance Chain

Every embedding carries a full provenance chain that can be verified:

Artifact (source content)
  │  content_hash verifies raw content integrity
  ▼
Transformation (chunking, embedding, etc.)
  │  signed by performer; parameters recorded
  ▼
EmbeddingRecord (the vector)
  │  signed by creator; linked to artifact and transformation
  ▼
EvidenceLinks (relationships)
  │  signed by link creator
  ▼
Claims (propositions)
  │  signed by asserter; linked to evidence
  ▼
Challenges (objections)
  │  signed by challenger
  ▼
GovernanceDecisions (outcomes)
      signed by council/voters

Verification flow:

  1. Retrieve the EmbeddingRecord
  2. Verify its signature against the submitted_by DID
  3. Follow transform_id to the Transformation record
  4. Verify the Transformation signature and parameters
  5. Follow input_artifact_id to the Artifact
  6. Verify the Artifact's content_hash against the content at content_uri
  7. Check for any open Challenges against any object in the chain
  8. Check for any GovernanceDecisions that affect objects in the chain

If any link in the chain fails verification, the embedding is flagged as unverifiable.
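The verification flow above can be sketched as a single function. This is an illustration under several assumptions: objects live in an in-memory dict keyed by ID, `fetch` and `verify_signature` are caller-supplied callables (signature checking and DID resolution are not implemented here), content hashes are hex SHA-256 with a "sha256:" prefix, and open challenges or governance decisions are reported as notices rather than as verification failures. Parameter validation from step 4 is not covered.

```python
import hashlib

def verify_chain(embedding_id, store, fetch, verify_signature):
    """Walk the chain for one embedding; return failures and notices."""
    report = {"failures": [], "notices": []}
    embedding = store[embedding_id]                                  # step 1
    if not verify_signature(embedding, embedding["submitted_by"]):   # step 2
        report["failures"].append("embedding signature")
    transform = store.get(embedding["transform_id"])                 # step 3
    if transform is None:
        report["failures"].append("transformation missing")
        return report
    if not verify_signature(transform, transform["performed_by"]):   # step 4
        report["failures"].append("transformation signature")
    artifact = store.get(transform["input_artifact_id"])             # step 5
    if artifact is None:
        report["failures"].append("artifact missing")
        return report
    raw = fetch(artifact["content_uri"])                             # step 6
    if "sha256:" + hashlib.sha256(raw).hexdigest() != artifact["content_hash"]:
        report["failures"].append("content hash")
    chain = {embedding_id, embedding["transform_id"], transform["input_artifact_id"]}
    for obj in store.values():                                       # steps 7 and 8
        if obj.get("target_object_id") not in chain:
            continue
        if obj["type"] == "Challenge" and obj["status"] == "open":
            report["notices"].append("open challenge: " + obj["challenge_type"])
        elif obj["type"] == "GovernanceDecision":
            report["notices"].append("governance decision: " + obj["decision_type"])
    return report

# Hypothetical in-memory store with placeholder IDs and DIDs.
content = b"hello"
store = {
    "art": {"type": "Artifact", "content_uri": "mem://art",
            "content_hash": "sha256:" + hashlib.sha256(content).hexdigest()},
    "tr": {"type": "Transformation", "input_artifact_id": "art",
           "performed_by": "did:example:a"},
    "emb": {"type": "EmbeddingRecord", "transform_id": "tr",
            "submitted_by": "did:example:a"},
    "ch": {"type": "Challenge", "target_object_id": "art",
           "status": "open", "challenge_type": "misleading"},
}
report = verify_chain("emb", store, fetch=lambda uri: content,
                      verify_signature=lambda record, did: True)
print(report)
```

An embedding would be flagged as unverifiable whenever `report["failures"]` is non-empty; how notices affect ranking is left to retrieval policy.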


5. Storage Architecture

5.1 On-Chain (Tamper-Evidence Layer)

Store on a durable, append-only ledger:

  • Object IDs (content hashes)
  • Signatures
  • Timestamps
  • Governance decisions
  • Challenge records
  • Provenance chain links

Candidates: Arweave (permanent storage), Ceramic (mutable streams stored via IPFS and anchored to a blockchain), or a purpose-built chain.

5.2 Off-Chain (Content Layer)

Store in durable decentralised storage:

  • Raw artefact content
  • Embedding vectors
  • Transformation outputs
  • Full object records

Candidates: IPFS (content-addressed), Arweave (permanent), Filecoin (incentivised storage), or federated node operators.

5.3 Index Layer

Queryable indexes for retrieval:

  • Vector similarity indexes
  • Full-text search indexes
  • Graph indexes (for evidence links and provenance chains)

Candidates: Any operator can run an index node. Having multiple competing indexes is healthy because it prevents capture by a single index operator.


6. Identity

All participants are identified using Decentralised Identifiers (DIDs).

Requirement              Approach
Self-sovereign identity  did:key for simple cases; did:web or did:ion for organisations
Key rotation             DID methods with updatable DID Documents (for example did:web or did:ion) support key rotation without changing the identifier; did:key does not, because the identifier is derived from the key itself
Pseudonymity             Participants can contribute under a pseudonymous DID
Reputation linkage       Reputation accrues to the DID, not to a platform account
Revocation               Compromised keys can be revoked through DID Document updates, where the DID method supports updates

7. Serialisation and Hashing

7.1 Canonical Serialisation

All objects are serialised using DAG-JSON (deterministic JSON for content addressing):

  • Keys sorted lexicographically
  • No whitespace
  • Numbers in canonical form
  • CID links use the DAG-JSON CID format

7.2 Hashing

  • Content hashes: SHA-256
  • Content IDs: CIDv1 with dag-json codec and sha2-256 multihash

7.3 Signatures

  • Default: Ed25519 over the canonical serialisation
  • Verification: Public key resolved from the signer's DID Document
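The following sketch shows one way to derive canonical bytes and a CIDv1 from a record using only the Python standard library. It approximates DAG-JSON with sorted keys and no whitespace, and does not handle DAG-JSON's rules for floats, bytes, or CID links. It also assumes that the object's own ID field and its signature are excluded before hashing, which this document does not state. The same canonical bytes would be the natural input to an Ed25519 signature, but signing is omitted because it needs a third-party library.

```python
import base64
import hashlib
import json

def canonical_bytes(record: dict, exclude: tuple = ()) -> bytes:
    """Serialise with sorted keys and no whitespace (an approximation of DAG-JSON)."""
    body = {k: v for k, v in record.items() if k not in exclude}
    text = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")

def cid_v1_dag_json(data: bytes) -> str:
    """Build a base32 CIDv1: version 1, dag-json codec (0x0129), sha2-256 multihash."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version, 0xA9 0x02 = varint of 0x0129, 0x12 0x20 = sha2-256, 32 bytes
    raw = bytes([0x01, 0xA9, 0x02, 0x12, 0x20]) + digest
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")

record = {"type": "Artifact", "license": "CC-BY-4.0", "language": "en",
          "signature": "ed25519:..."}
# Assumption: the ID and signature fields are left out of the hashed payload.
payload = canonical_bytes(record, exclude=("artifact_id", "signature"))
print(payload.decode())
print(cid_v1_dag_json(payload))
```

Because key order and whitespace are fixed, two records with the same fields in a different order produce the same bytes and therefore the same ID.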

8. Multi-Model Embedding Support

This is a critical anti-capture mechanism. The schema explicitly supports and encourages multiple embeddings per artefact:

Embedding Category  Description
native              The original embedding submitted with the artefact
alternate           Alternative embeddings from different models
community           Embeddings contributed by community members
domain_specific     Embeddings from domain-specialised models (medical, legal, scientific)
translation         Embeddings of translated versions of the artefact

Retrieval clients choose which embedding spaces to search:

  • Single model-space search (for consistency)
  • Cross-model ensemble search (for robustness)
  • Contradiction-aware search (for integrity)
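To make the first two modes concrete, here is a small sketch of single model-space search by cosine similarity, followed by a cross-model ensemble. The ensemble uses reciprocal rank fusion, which is a technique chosen for this illustration and not something the schema specifies. Record IDs, model names, and the constant k are hypothetical, and the query is assumed to have been embedded separately by each model.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search_space(query_vec, records, model_family, top_k=3):
    """Rank artefacts within one model space by cosine similarity."""
    scored = [(cosine(query_vec, r["vector"]), r["artifact_id"])
              for r in records if r["model_family"] == model_family]
    scored.sort(reverse=True)
    return [artifact_id for _, artifact_id in scored[:top_k]]

def ensemble(rankings, k=60):
    """Fuse per-model rankings with reciprocal rank fusion (illustrative choice)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, artifact_id in enumerate(ranking, start=1):
            scores[artifact_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda a: (-scores[a], a))

records = [
    {"artifact_id": "A", "model_family": "m1", "vector": [1.0, 0.0]},
    {"artifact_id": "B", "model_family": "m1", "vector": [0.0, 1.0]},
    {"artifact_id": "A", "model_family": "m2", "vector": [0.9, 0.1, 0.0]},
    {"artifact_id": "B", "model_family": "m2", "vector": [0.0, 0.2, 1.0]},
]
queries = {"m1": [1.0, 0.1], "m2": [1.0, 0.0, 0.0]}  # one query vector per model space

per_model = [search_space(vec, records, model) for model, vec in queries.items()]
print(ensemble(per_model))  # ['A', 'B']
```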

9. Contribution Economics

Consuming systems (models, applications, services) that read from the commons should contribute back. Contributions are recorded as signed receipts:

{
  "type": "ContributionReceipt",
  "receipt_id": "cid:bafy...",
  "contributor": "did:key:z6Mk...",
  "contribution_type": "new_embeddings",
  "quantity": 1500,
  "quality_score": 0.92,
  "accepted_by": "did:key:z6Mk...",
  "timestamp": "2026-03-25T05:00:00Z",
  "signature": "ed25519:..."
}

Contribution Types

  • New public-domain artefacts
  • Additional embeddings for existing artefacts
  • Challenge reviews
  • Contradiction links
  • Benchmark runs
  • Index rebuild compute
  • Storage funding
  • Moderation labour
  • Reputation stake

Governance can tie access tiers or influence caps to contribution history, preventing pure extraction without reciprocity.
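As a purely illustrative sketch of tying access tiers to contribution history, the snippet below sums quality-weighted quantities per contributor and applies a two-tier threshold. The weighting, the tier names, and the threshold are all assumptions made for this example; actual rules would be set by governance.

```python
from collections import defaultdict

def contribution_totals(receipts: list[dict]) -> dict:
    """Sum quantity * quality_score per contributor DID."""
    totals = defaultdict(float)
    for receipt in receipts:
        totals[receipt["contributor"]] += receipt["quantity"] * receipt["quality_score"]
    return dict(totals)

def access_tier(total: float, threshold: float = 1000.0) -> str:
    # Hypothetical two-tier rule; real thresholds would come from governance.
    return "contributor" if total >= threshold else "reader"

# Placeholder DIDs using the reserved "example" method.
receipts = [
    {"contributor": "did:example:alice", "quantity": 1500, "quality_score": 0.92},
    {"contributor": "did:example:bob", "quantity": 200, "quality_score": 0.5},
]
totals = contribution_totals(receipts)
tiers = {did: access_tier(total) for did, total in totals.items()}
print(tiers)
```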


10. Relationship to the Protocol

This schema defines the data objects. The Protocol for Open AI System (separate document) defines:

  • How these objects are published, discovered, and replicated
  • How governance decisions are made
  • How communities fork
  • How retrieval policies work
  • How anti-capture mechanisms operate

Together, the schema and protocol form the Transparent Semantic Commons.