Transparent AI Framework — Open Embeddings Schema

Open schema, signed provenance, and verifiable retrieval for a public semantic commons.

Version: 0.1.0
Status: Draft
Author: Technology Shield
Last Updated: 2026-03-25

1. Purpose

Large language models and embedding systems are being built with billions of dollars by organisations whose incentives are not aligned with collective human flourishing. The data they train on is opaque. The transforms they apply are hidden. The rankings they produce are invisible. The biases they carry are undisclosed.

This document defines an Open Embeddings Schema — a structured, content-addressed, cryptographically signed format for embeddings and their provenance — as one layer of a broader Transparent Semantic Commons.

The goal is not to claim neutrality. It is to make every assumption legible, every transform auditable, every ranking challengeable, and every governance decision forkable.

2. Design Principles

#	Principle	Description
1	Canonical Serialisation	Every object has exactly one deterministic representation for consistent hashing
2	Content-Addressed Identity	Object IDs are derived from content hashes, not database sequences
3	Signed Records	Every record is cryptographically signed by its creator
4	Append-Only History	Records are never silently edited or deleted; they change only through supersession or tombstoning
5	Typed Relations	Relationships between objects are first-class, typed, and signed
6	Multiple Embedding Spaces	No single-model monopoly; every artefact can have embeddings from multiple models
7	Verifiable Retrieval	The rules used to retrieve and rank results are themselves public, signed objects
8	Forkable Governance	Any community can fork the dataset, embeddings, policies, and reputation without losing shared history

3. Schema Objects

3.1 Artifact

A source object — the raw input to the commons.

{
  "type": "Artifact",
  "artifact_id": "cid:bafy...",
  "content_type": "text/markdown",
  "content_hash": "sha256:a1b2c3...",
  "content_uri": "ipfs://Qm... | arweave://tx... | https://...",
  "license": "CC-BY-4.0",
  "created_at": "2026-03-25T01:00:00Z",
  "submitted_by": "did:key:z6Mk...",
  "jurisdiction_tags": ["global"],
  "language": "en",
  "sensitivity_tags": ["public"],
  "parent_artifact_id": null,
  "signature": "ed25519:..."
}

Field	Type	Description
`artifact_id`	CID	Content-addressed ID derived from canonical serialisation
`content_type`	string	MIME type of the source content
`content_hash`	string	Cryptographic hash of the raw content
`content_uri`	string	Location of the content (IPFS, Arweave, HTTPS, or other durable storage)
`license`	string	SPDX licence identifier
`created_at`	ISO 8601	Creation timestamp
`submitted_by`	DID	Decentralised identifier of the submitter
`jurisdiction_tags`	string[]	Jurisdictional scope
`language`	string	BCP 47 language tag
`sensitivity_tags`	string[]	Classification labels
`parent_artifact_id`	CID or null	Reference to a parent artefact (for derived works)
`signature`	string	Cryptographic signature of the submitter over the canonical record
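As an illustration of the `content_hash` field, the following minimal sketch computes and checks the hash of raw content. It assumes the value is the string `sha256:` followed by the lowercase hex digest, which matches the look of the example above but is not stated by the schema; the function names are hypothetical.

```python
import hashlib


def content_hash(raw: bytes) -> str:
    """Return the hash in the assumed 'sha256:<hex digest>' form."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


def verify_artifact_content(artifact: dict, raw: bytes) -> bool:
    """True when the fetched bytes match the hash recorded on the Artifact."""
    return artifact.get("content_hash") == content_hash(raw)


# Illustrative record; only the fields needed for the check are shown.
raw = "# Hello commons\n".encode("utf-8")
artifact = {
    "type": "Artifact",
    "content_type": "text/markdown",
    "content_hash": content_hash(raw),
}
```

A client would fetch the bytes from `content_uri` and pass them to `verify_artifact_content` before trusting anything derived from the artefact.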

3.2 Transformation

Any operation applied to an artefact.

{
  "type": "Transformation",
  "transform_id": "cid:bafy...",
  "input_artifact_id": "cid:bafy...",
  "transform_type": "embed",
  "software_id": "github.com/org/tool",
  "software_version": "2.1.0",
  "model_id": "bge-m3",
  "model_version": "2026-03-01",
  "parameters": {
    "max_tokens": 512,
    "pooling": "cls",
    "normalize": true
  },
  "output_hash": "sha256:d4e5f6...",
  "performed_by": "did:key:z6Mk...",
  "timestamp": "2026-03-25T01:05:00Z",
  "signature": "ed25519:..."
}

Field	Type	Description
`transform_id`	CID	Content-addressed ID
`input_artifact_id`	CID	The artefact this transform was applied to
`transform_type`	enum	`chunk`, `translate`, `classify`, `summarize`, `embed`, `redact`, `normalize`
`software_id`	string	Identifier for the software that performed the transform
`software_version`	string	Version of the software
`model_id`	string	Identifier for the model used (if applicable)
`model_version`	string	Version of the model
`parameters`	object	Configuration parameters used
`output_hash`	string	Hash of the transform output
`performed_by`	DID	Who performed the transformation
`timestamp`	ISO 8601	When the transformation was performed
`signature`	string	Cryptographic signature
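A structural check on a Transformation record might look like the sketch below. The set of required fields is an assumption (the table marks only `model_id` as "if applicable", so `model_id` and `model_version` are treated as optional here), and `validate_transformation` is a hypothetical helper name.

```python
TRANSFORM_TYPES = {
    "chunk", "translate", "classify", "summarize", "embed", "redact", "normalize",
}

# Assumption: everything except the model fields is mandatory.
REQUIRED_FIELDS = (
    "transform_id", "input_artifact_id", "transform_type", "software_id",
    "software_version", "parameters", "output_hash", "performed_by",
    "timestamp", "signature",
)


def validate_transformation(record: dict) -> list:
    """Return a list of human-readable problems; empty means structurally valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    kind = record.get("transform_type")
    if kind is not None and kind not in TRANSFORM_TYPES:
        errors.append(f"unknown transform_type: {kind}")
    if "parameters" in record and not isinstance(record["parameters"], dict):
        errors.append("parameters must be an object")
    return errors


example = {
    "type": "Transformation",
    "transform_id": "cid:bafy...",
    "input_artifact_id": "cid:bafy...",
    "transform_type": "embed",
    "software_id": "github.com/org/tool",
    "software_version": "2.1.0",
    "parameters": {"max_tokens": 512, "pooling": "cls", "normalize": True},
    "output_hash": "sha256:d4e5f6...",
    "performed_by": "did:key:z6Mk...",
    "timestamp": "2026-03-25T01:05:00Z",
    "signature": "ed25519:...",
}
```

This checks shape only; signature verification is a separate step.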

3.3 EmbeddingRecord

The embedding itself — the vector representation of an artefact.

{
  "type": "EmbeddingRecord",
  "embedding_id": "cid:bafy...",
  "artifact_id": "cid:bafy...",
  "transform_id": "cid:bafy...",
  "vector": [0.0182, -0.4421, 0.0931, "..."],
  "dimension": 1024,
  "numeric_format": "float32",
  "distance_metric": "cosine",
  "normalization": "l2",
  "model_family": "bge-m3",
  "model_version": "2026-03-01",
  "tokenizer_version": "v4",
  "scope_tags": ["public", "science", "english"],
  "created_at": "2026-03-25T01:05:00Z",
  "submitted_by": "did:key:z6Mk...",
  "signature": "ed25519:..."
}

Field	Type	Description
`embedding_id`	CID	Content-addressed ID
`artifact_id`	CID	The source artefact
`transform_id`	CID	The transformation that produced this embedding
`vector`	float[]	The embedding vector
`dimension`	integer	Vector dimensionality
`numeric_format`	string	`float16`, `float32`, `float64`, `int8`
`distance_metric`	string	`cosine`, `euclidean`, `dot_product`
`normalization`	string	`l2`, `none`, `unit`
`model_family`	string	Model family name
`model_version`	string	Specific model version
`tokenizer_version`	string	Tokenizer version used
`scope_tags`	string[]	Domain and scope labels
`created_at`	ISO 8601	Creation timestamp
`submitted_by`	DID	Creator's decentralised identifier
`signature`	string	Cryptographic signature
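The numeric fields of an EmbeddingRecord can be cross-checked against the vector itself. The sketch below assumes that `normalization: "l2"` means the vector has unit L2 norm, which the schema does not spell out; the tolerance and the name `check_embedding` are illustrative.

```python
import math

NUMERIC_FORMATS = {"float16", "float32", "float64", "int8"}
DISTANCE_METRICS = {"cosine", "euclidean", "dot_product"}


def check_embedding(record: dict, tol: float = 1e-3) -> list:
    """Return a list of inconsistencies between the vector and its metadata."""
    problems = []
    vector = record["vector"]
    if len(vector) != record["dimension"]:
        problems.append(f"dimension is {record['dimension']} but vector has {len(vector)} entries")
    if record["numeric_format"] not in NUMERIC_FORMATS:
        problems.append(f"unknown numeric_format: {record['numeric_format']}")
    if record["distance_metric"] not in DISTANCE_METRICS:
        problems.append(f"unknown distance_metric: {record['distance_metric']}")
    if record["normalization"] == "l2":
        # Assumption: "l2" declares a unit-length vector.
        norm = math.sqrt(sum(x * x for x in vector))
        if abs(norm - 1.0) > tol:
            problems.append(f"declared l2-normalised but norm is {norm:.4f}")
    return problems


record = {
    "vector": [0.6, 0.8],
    "dimension": 2,
    "numeric_format": "float32",
    "distance_metric": "cosine",
    "normalization": "l2",
}
```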

Critical design choice: For every artefact, the schema supports multiple embeddings from different models. This guards against monoculture and against a priesthood forming around a single embedding space.

Artifact A
  ├── EmbeddingRecord (bge-m3)
  ├── EmbeddingRecord (e5-large-v2)
  ├── EmbeddingRecord (community-model-1)
  └── EmbeddingRecord (domain-specific-medical)
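The tree above can be recovered from a flat collection of records by grouping on the model fields. This is a minimal sketch; `embeddings_by_model` is a hypothetical helper and the IDs are placeholders.

```python
from collections import defaultdict


def embeddings_by_model(records: list, artifact_id: str) -> dict:
    """Map (model_family, model_version) to the embedding IDs for one artefact."""
    spaces = defaultdict(list)
    for rec in records:
        if rec.get("type") == "EmbeddingRecord" and rec["artifact_id"] == artifact_id:
            spaces[(rec["model_family"], rec["model_version"])].append(rec["embedding_id"])
    return dict(spaces)


records = [
    {"type": "EmbeddingRecord", "embedding_id": "e1", "artifact_id": "A",
     "model_family": "bge-m3", "model_version": "2026-03-01"},
    {"type": "EmbeddingRecord", "embedding_id": "e2", "artifact_id": "A",
     "model_family": "e5-large-v2", "model_version": "1"},
    {"type": "EmbeddingRecord", "embedding_id": "e3", "artifact_id": "B",
     "model_family": "bge-m3", "model_version": "2026-03-01"},
]
spaces_for_a = embeddings_by_model(records, "A")
```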

3.4 Claim

A proposition about the world.

{
  "type": "Claim",
  "claim_id": "cid:bafy...",
  "claim_text": "Global average temperatures increased by 1.1°C between 1850-1900 and 2011-2020.",
  "claim_type": "fact",
  "language": "en",
  "asserted_by": "did:key:z6Mk...",
  "timestamp": "2026-03-25T02:00:00Z",
  "related_artifacts": ["cid:bafy..."],
  "related_embeddings": ["cid:bafy..."],
  "confidence_declared": 0.95,
  "jurisdiction_scope": "global",
  "signature": "ed25519:..."
}

Field	Type	Description
`claim_type`	enum	`fact`, `opinion`, `forecast`, `moral_stance`, `interpretation`, `definition`
`confidence_declared`	float (0-1)	The submitter's declared confidence level
`related_artifacts`	CID[]	Source artefacts supporting the claim
`related_embeddings`	CID[]	Embeddings linked to the claim
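The constraints in the table (an enumerated `claim_type` and a confidence between 0 and 1) are straightforward to enforce. The sketch below is one possible check; the non-empty `claim_text` rule is an added assumption, and `validate_claim` is a hypothetical name.

```python
CLAIM_TYPES = {
    "fact", "opinion", "forecast", "moral_stance", "interpretation", "definition",
}


def validate_claim(claim: dict) -> list:
    """Return a list of problems with a Claim record; empty means it passes."""
    errors = []
    if claim.get("claim_type") not in CLAIM_TYPES:
        errors.append(f"unknown claim_type: {claim.get('claim_type')}")
    conf = claim.get("confidence_declared")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0 <= conf <= 1:
        errors.append("confidence_declared must be a number between 0 and 1")
    # Assumption: an empty proposition is not a meaningful claim.
    if not str(claim.get("claim_text", "")).strip():
        errors.append("claim_text is empty")
    return errors


claim = {
    "type": "Claim",
    "claim_text": "Global average temperatures increased by 1.1°C between 1850-1900 and 2011-2020.",
    "claim_type": "fact",
    "confidence_declared": 0.95,
}
```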

3.5 EvidenceLink

A typed, signed relationship between any two objects.

{
  "type": "EvidenceLink",
  "link_id": "cid:bafy...",
  "from_object": "cid:bafy...",
  "to_object": "cid:bafy...",
  "relation_type": "supports",
  "weight": 0.85,
  "created_by": "did:key:z6Mk...",
  "timestamp": "2026-03-25T02:10:00Z",
  "signature": "ed25519:..."
}

Relation Types	Description
`supports`	Source evidence supports the target claim
`contradicts`	Source evidence contradicts the target claim
`contextualizes`	Source provides context for the target
`supersedes`	Source replaces the target
`rebuts`	Source specifically argues against the target
`duplicates`	Source is a duplicate of the target
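One way a client might use typed links is to total the weight of supporting and contradicting evidence for a claim. The schema does not define how `weight` should be interpreted or combined, so the simple sum below is only an illustrative assumption, as is the name `evidence_balance`.

```python
def evidence_balance(links: list, claim_id: str) -> dict:
    """Sum link weights pointing at one claim, split by relation type."""
    totals = {"supports": 0.0, "contradicts": 0.0}
    for link in links:
        if link["to_object"] == claim_id and link["relation_type"] in totals:
            totals[link["relation_type"]] += link["weight"]
    return totals


links = [
    {"from_object": "art-1", "to_object": "claim-1", "relation_type": "supports", "weight": 0.85},
    {"from_object": "art-2", "to_object": "claim-1", "relation_type": "supports", "weight": 0.5},
    {"from_object": "art-3", "to_object": "claim-1", "relation_type": "contradicts", "weight": 0.25},
    {"from_object": "art-4", "to_object": "claim-2", "relation_type": "supports", "weight": 1.0},
]
balance = evidence_balance(links, "claim-1")
```

Other relation types (`contextualizes`, `supersedes`, `rebuts`, `duplicates`) are ignored here and would need their own handling.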

3.6 Challenge

A formal objection to any object in the commons.

{
  "type": "Challenge",
  "challenge_id": "cid:bafy...",
  "target_object_id": "cid:bafy...",
  "challenge_type": "misleading",
  "reason": "The cited study was retracted in 2025 due to data fabrication.",
  "supporting_artifacts": ["cid:bafy..."],
  "opened_by": "did:key:z6Mk...",
  "opened_at": "2026-03-25T03:00:00Z",
  "status": "open",
  "resolution_ref": null,
  "signature": "ed25519:..."
}

Challenge Types	Description
`poisoning`	Data or embedding has been deliberately corrupted
`malformed`	Object does not conform to schema
`copyright`	Content violates copyright
`fabricated`	Content is fabricated or synthetically generated without disclosure
`misleading`	Content is technically accurate but misleading in context
`duplicate`	Object is a duplicate of an existing entry
`hidden_transform`	A transformation was applied but not declared
`governance_abuse`	A governance decision was made improperly
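Clients will often need to know whether any object they depend on is under an open challenge. A minimal sketch, assuming `status` takes the value `open` as in the example above (other status values are not defined in this document) and using the hypothetical helper name `open_challenges`:

```python
CHALLENGE_TYPES = {
    "poisoning", "malformed", "copyright", "fabricated",
    "misleading", "duplicate", "hidden_transform", "governance_abuse",
}


def open_challenges(challenges: list, object_ids) -> list:
    """Return open, well-typed challenges that target any of the given object IDs."""
    wanted = set(object_ids)
    return [
        c for c in challenges
        if c["status"] == "open"
        and c["challenge_type"] in CHALLENGE_TYPES
        and c["target_object_id"] in wanted
    ]


challenges = [
    {"challenge_id": "c1", "target_object_id": "obj-1", "challenge_type": "misleading", "status": "open"},
    {"challenge_id": "c2", "target_object_id": "obj-2", "challenge_type": "copyright", "status": "resolved"},
    {"challenge_id": "c3", "target_object_id": "obj-9", "challenge_type": "poisoning", "status": "open"},
]
hits = open_challenges(challenges, ["obj-1", "obj-2"])
```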

3.7 GovernanceDecision

The outcome of a challenge or governance action.

{
  "type": "GovernanceDecision",
  "decision_id": "cid:bafy...",
  "target_object_id": "cid:bafy...",
  "decision_type": "tombstone",
  "decision_text": "Source artefact confirmed as retracted study. Embedding tombstoned. Claim marked as disputed.",
  "voters_or_council": ["did:key:z6Mk...", "did:key:z6Mk..."],
  "timestamp": "2026-03-25T04:00:00Z",
  "supersedes": null,
  "appeal_window": "P30D",
  "signature": "ed25519:..."
}
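The `appeal_window` value in the example is an ISO 8601 duration. The sketch below computes an appeal deadline under two assumptions that this document does not state: the window starts at the decision `timestamp`, and only whole-day durations of the form `P<n>D` are used. `appeal_deadline` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta, timezone

# Only the day-based subset of ISO 8601 durations is handled in this sketch.
_DAYS_ONLY = re.compile(r"^P(\d+)D$")


def appeal_deadline(decision: dict) -> datetime:
    """Return the UTC instant at which the appeal window closes."""
    match = _DAYS_ONLY.match(decision["appeal_window"])
    if match is None:
        raise ValueError(f"unsupported duration: {decision['appeal_window']}")
    decided_at = datetime.strptime(
        decision["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return decided_at + timedelta(days=int(match.group(1)))


decision = {
    "type": "GovernanceDecision",
    "decision_type": "tombstone",
    "timestamp": "2026-03-25T04:00:00Z",
    "appeal_window": "P30D",
}
deadline = appeal_deadline(decision)
```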

4. Provenance Chain

Every embedding carries a full provenance chain that can be verified:

Artifact (source content)
  │  content_hash verifies raw content integrity
  ▼
Transformation (chunking, embedding, etc.)
  │  signed by performer; parameters recorded
  ▼
EmbeddingRecord (the vector)
  │  signed by creator; linked to artifact and transformation
  ▼
EvidenceLinks (relationships)
  │  signed by link creator
  ▼
Claims (propositions)
  │  signed by asserter; linked to evidence
  ▼
Challenges (objections)
  │  signed by challenger
  ▼
GovernanceDecisions (outcomes)
      signed by council/voters

Verification flow:

Retrieve the EmbeddingRecord
Verify its signature against the submitted_by DID
Follow transform_id to the Transformation record
Verify the Transformation signature and parameters
Follow input_artifact_id to the Artifact
Verify the Artifact's content_hash against the content at content_uri
Check for any open Challenges against any object in the chain
Check for any GovernanceDecisions that affect objects in the chain

If any link in the chain fails verification, the embedding is flagged as unverifiable.
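The verification flow above can be sketched as a single function. Signature checking and content fetching are passed in as callables because this document does not fix their implementation; `verify_sig`, `fetch`, the `store` dictionary, and the shape of the returned report are all assumptions made for illustration. Parameter verification is not shown, since the schema does not define what it involves.

```python
import hashlib


def verify_chain(embedding, store, verify_sig, fetch, challenges=(), decisions=()):
    """Walk embedding -> transformation -> artefact and report what was found.

    store:      dict mapping object ID to record (hypothetical lookup)
    verify_sig: callable(record, did) -> bool
    fetch:      callable(content_uri) -> bytes
    """
    report = {"failures": [], "open_challenges": [], "decisions": []}
    ids = {embedding["embedding_id"]}

    if not verify_sig(embedding, embedding["submitted_by"]):
        report["failures"].append("embedding signature")

    transform = store.get(embedding["transform_id"])
    if transform is None:
        report["failures"].append("transformation not found")
    else:
        ids.add(transform["transform_id"])
        if not verify_sig(transform, transform["performed_by"]):
            report["failures"].append("transformation signature")
        artifact = store.get(transform["input_artifact_id"])
        if artifact is None:
            report["failures"].append("artifact not found")
        else:
            ids.add(artifact["artifact_id"])
            digest = "sha256:" + hashlib.sha256(fetch(artifact["content_uri"])).hexdigest()
            if digest != artifact["content_hash"]:
                report["failures"].append("artifact content hash")

    # Steps 7 and 8: surface challenges and decisions touching the chain.
    report["open_challenges"] = [
        c for c in challenges if c["status"] == "open" and c["target_object_id"] in ids
    ]
    report["decisions"] = [d for d in decisions if d["target_object_id"] in ids]
    report["verifiable"] = not report["failures"]
    return report
```

In this sketch, open challenges and governance decisions are reported rather than treated as verification failures; whether they should downgrade an embedding is a policy question left to the client.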

5. Storage Architecture

5.1 On-Chain (Tamper-Evidence Layer)

Store on a durable, append-only ledger:

Object IDs (content hashes)
Signatures
Timestamps
Governance decisions
Challenge records
Provenance chain links

Candidates: Arweave (permanent storage), Ceramic (mutable streams with IPFS anchoring), or a purpose-built chain.

5.2 Off-Chain (Content Layer)

Store in durable decentralised storage:

Raw artefact content
Embedding vectors
Transformation outputs
Full object records

Candidates: IPFS (content-addressed), Arweave (permanent), Filecoin (incentivised storage), or federated node operators.
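The split between the two layers could be sketched as a function that derives the small ledger entry from a full record. This is one reading of the lists above, not a defined format: it assumes Challenge and GovernanceDecision records go on the ledger in full, while other objects are anchored by ID, timestamp, and signature only. The field name `object_id` and the helper `ledger_entry` are hypothetical.

```python
# Assumption: these record types are small enough to store on the ledger whole.
LEDGER_FULL_TYPES = {"Challenge", "GovernanceDecision"}


def ledger_entry(record: dict, object_id: str) -> dict:
    """Return what would be written to the tamper-evidence layer for a record."""
    if record["type"] in LEDGER_FULL_TYPES:
        return dict(record)
    return {
        "object_id": object_id,
        "type": record["type"],
        # Records use either `timestamp` or `created_at` depending on type.
        "timestamp": record.get("timestamp") or record.get("created_at"),
        "signature": record["signature"],
    }


embedding = {
    "type": "EmbeddingRecord",
    "vector": [0.0182, -0.4421, 0.0931],
    "created_at": "2026-03-25T01:05:00Z",
    "signature": "ed25519:...",
}
anchor = ledger_entry(embedding, "cid:bafy...")
```

The full record, including the vector, would still be written to the content layer.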

5.3 Index Layer

Queryable indexes for retrieval:

Vector similarity indexes
Full-text search indexes
Graph indexes (for evidence links and provenance chains)

Candidates: Any operator can run an index node. Multiple competing indexes are healthy because they guard against capture by a single index operator.

6. Identity

All participants are identified using Decentralised Identifiers (DIDs).

Requirement	Approach
Self-sovereign identity	`did:key` for simple cases; `did:web` or `did:ion` for organisations
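Key rotation	DID methods with updatable DID Documents (such as `did:web`) support key rotation without changing the identifier; a `did:key` identifier encodes the key itself, so rotating the key means adopting a new DID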
Pseudonymity	Participants can contribute under a pseudonymous DID
Reputation linkage	Reputation accrues to the DID, not to a platform account
Revocation	Compromised DIDs can be revoked through DID Document updates, for DID methods that support updates (`did:key` does not)
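For the simple `did:key` case, the public key is carried in the identifier itself. The sketch below encodes and decodes an Ed25519 `did:key` using only the standard library, following the did:key convention of a base58btc multibase string (prefix `z`) over the two-byte multicodec header `0xed 0x01` plus the 32-byte key. The function names are hypothetical, and other key types are not handled.

```python
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
ED25519_MULTICODEC = bytes([0xED, 0x01])  # varint-encoded ed25519-pub code


def b58encode(data: bytes) -> str:
    number = int.from_bytes(data, "big")
    out = ""
    while number:
        number, rem = divmod(number, 58)
        out = B58_ALPHABET[rem] + out
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + out


def b58decode(text: str) -> bytes:
    number = 0
    for ch in text:
        number = number * 58 + B58_ALPHABET.index(ch)
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + number.to_bytes((number.bit_length() + 7) // 8, "big")


def did_key_from_public_key(public_key: bytes) -> str:
    if len(public_key) != 32:
        raise ValueError("Ed25519 public keys are 32 bytes")
    return "did:key:z" + b58encode(ED25519_MULTICODEC + public_key)


def public_key_from_did_key(did: str) -> bytes:
    if not did.startswith("did:key:z"):
        raise ValueError("not a base58btc did:key")
    raw = b58decode(did[len("did:key:z"):])
    if len(raw) != 34 or raw[:2] != ED25519_MULTICODEC:
        raise ValueError("not an Ed25519 did:key")
    return raw[2:]


# Placeholder key bytes for illustration only, not a real key.
example_key = bytes(range(32))
example_did = did_key_from_public_key(example_key)
```

Methods such as `did:web` require resolving a DID Document over the network instead, which is outside the scope of this sketch.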

7. Serialisation and Hashing

7.1 Canonical Serialisation

All objects are serialised using DAG-JSON (deterministic JSON for content addressing):

Keys sorted lexicographically
No whitespace
Numbers in canonical form
CID links use the DAG-JSON CID format

7.2 Hashing

Content hashes: SHA-256
Content IDs: CIDv1 with dag-json codec and sha2-256 multihash

7.3 Signatures

Default: Ed25519 over the canonical serialisation
Verification: Public key resolved from the signer's DID Document
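The serialisation and hashing rules in this section can be approximated with the standard library. The sketch below is not a full DAG-JSON encoder: `json.dumps` with sorted keys and no whitespace covers the first two rules for plain records, but canonical float formatting and CID link encoding are not handled. It also assumes that an object's ID is computed over the record with its own ID field and `signature` removed, which this document does not specify. Ed25519 signing is omitted because it needs a third-party library.

```python
import base64
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Approximate canonical form: sorted keys, no whitespace, UTF-8."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def cid_v1_dag_json(payload: bytes) -> str:
    """Build a base32 CIDv1 (dag-json codec, sha2-256 multihash) for payload."""
    digest = hashlib.sha256(payload).digest()
    header = bytes([
        0x01,        # CID version 1
        0xA9, 0x02,  # varint encoding of the dag-json codec code 0x0129
        0x12, 0x20,  # multihash: sha2-256, 32-byte digest
    ])
    encoded = base64.b32encode(header + digest).decode("ascii").lower().rstrip("=")
    return "b" + encoded  # "b" is the multibase prefix for lowercase base32


def object_id(record: dict, id_field: str) -> str:
    """Assumption: the ID covers every field except the ID itself and the signature."""
    body = {k: v for k, v in record.items() if k not in (id_field, "signature")}
    return cid_v1_dag_json(canonical_bytes(body))


record_a = {"type": "Artifact", "language": "en", "license": "CC-BY-4.0"}
record_b = {"license": "CC-BY-4.0", "type": "Artifact", "language": "en"}
cid_a = object_id(record_a, "artifact_id")
cid_b = object_id(record_b, "artifact_id")
```

Because keys are sorted before hashing, `record_a` and `record_b` yield the same ID even though their fields were written in a different order.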

8. Multi-Model Embedding Support

This is a critical anti-capture mechanism. The schema explicitly supports and encourages multiple embeddings per artefact:

Embedding Category	Description
`native`	The original embedding submitted with the artefact
`alternate`	Alternative embeddings from different models
`community`	Embeddings contributed by community members
`domain_specific`	Embeddings from domain-specialised models (medical, legal, scientific)
`translation`	Embeddings of translated versions of the artefact

Retrieval clients choose which embedding spaces to search:

Single model-space search (for consistency)
Cross-model ensemble search (for robustness)
Contradiction-aware search (for integrity)
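The first two retrieval modes can be sketched as follows. Single-space search ranks by cosine similarity within one model family. For the ensemble, this example uses reciprocal rank fusion, which is a choice made here for illustration and not something the schema prescribes; the constant `rrf_k=60` and all function names are likewise assumptions. Contradiction-aware search is not shown.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_space(records, model_family, query, k=3):
    """Rank artefacts by cosine similarity within a single embedding space."""
    scored = [
        (cosine(query, r["vector"]), r["artifact_id"])
        for r in records if r["model_family"] == model_family
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [artifact_id for _, artifact_id in scored[:k]]


def ensemble_search(records, queries, k=3, rrf_k=60):
    """Fuse per-space rankings; `queries` maps model_family -> query vector."""
    fused = {}
    for family, query in queries.items():
        ranking = search_space(records, family, query, k=len(records))
        for rank, artifact_id in enumerate(ranking, start=1):
            fused[artifact_id] = fused.get(artifact_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(fused, key=fused.get, reverse=True)[:k]


records = [
    {"artifact_id": "A", "model_family": "bge-m3", "vector": [1.0, 0.0]},
    {"artifact_id": "B", "model_family": "bge-m3", "vector": [0.0, 1.0]},
    {"artifact_id": "A", "model_family": "e5-large-v2", "vector": [0.9, 0.1]},
    {"artifact_id": "B", "model_family": "e5-large-v2", "vector": [0.1, 0.9]},
]
single = search_space(records, "bge-m3", [1.0, 0.0])
combined = ensemble_search(records, {"bge-m3": [1.0, 0.0], "e5-large-v2": [1.0, 0.0]})
```

Note that each space needs its own query vector, produced by the same model that produced the stored embeddings; vectors from different models are never compared directly.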

9. Contribution Economics

Consuming systems (models, applications, services) that read from the commons should contribute back. Contributions are recorded as signed receipts:

{
  "type": "ContributionReceipt",
  "receipt_id": "cid:bafy...",
  "contributor": "did:key:z6Mk...",
  "contribution_type": "new_embeddings",
  "quantity": 1500,
  "quality_score": 0.92,
  "accepted_by": "did:key:z6Mk...",
  "timestamp": "2026-03-25T05:00:00Z",
  "signature": "ed25519:..."
}

Contribution Types
New public-domain artefacts
Additional embeddings for existing artefacts
Challenge reviews
Contradiction links
Benchmark runs
Index rebuild compute
Storage funding
Moderation labour
Reputation stake

Governance can tie access tiers or influence caps to contribution history, discouraging extraction without reciprocity.
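As a sketch of how receipts might feed into access tiers, the example below totals quality-weighted quantities per contributor and maps the total to a tier. The weighting formula, the tier names, and the thresholds are all hypothetical; this document leaves such policy to governance.

```python
from collections import defaultdict


def contribution_scores(receipts: list) -> dict:
    """Sum quantity * quality_score per contributor DID (illustrative weighting)."""
    scores = defaultdict(float)
    for receipt in receipts:
        scores[receipt["contributor"]] += receipt["quantity"] * receipt["quality_score"]
    return dict(scores)


# Hypothetical thresholds, highest first; real values would be set by governance.
TIERS = ((1000, "full"), (100, "standard"))


def access_tier(score: float) -> str:
    for minimum, tier in TIERS:
        if score >= minimum:
            return tier
    return "basic"


receipts = [
    {"type": "ContributionReceipt", "contributor": "did:key:alice",
     "contribution_type": "new_embeddings", "quantity": 1500, "quality_score": 0.92},
    {"type": "ContributionReceipt", "contributor": "did:key:bob",
     "contribution_type": "new_embeddings", "quantity": 50, "quality_score": 0.5},
]
scores = contribution_scores(receipts)
tiers = {did: access_tier(score) for did, score in scores.items()}
```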

10. Relationship to the Protocol

This schema defines the data objects. The Protocol for Open AI System (separate document) defines:

How these objects are published, discovered, and replicated
How governance decisions are made
How communities fork
How retrieval policies work
How anti-capture mechanisms operate

Together, the schema and protocol form the Transparent Semantic Commons.