Migration Runbook
Source Cluster → Target Cluster
End-to-end operational guide for migrating the production kea deployment from Source Cluster to Target Cluster, then upgrading to swift on Target Cluster. Audience: platform engineers and technical management.
Scope & audience
This runbook covers the full migration of user-owned data (teams, agents, and knowledge-base documents) from a production kea deployment on Source Cluster to a fresh Target Cluster (cloud 2), followed by an in-place upgrade from kea to swift on Target Cluster. Conversation sessions are treated as ephemeral and are not migrated (see §1.3).
Source Cluster and Target Cluster share no infrastructure. Every data store is explicitly exported, transferred, and verified. The guide is written so that management can track progress and risk at chapter level, while engineers have the detail needed to execute each step.
Two-phase strategy
Chapter 1: data copy · same schema
Chapter 2: schema transform · same platform
Chapter 1 moves all data between two identical kea deployments. The schema is the same on both sides — no transformation is required, only transfer and verification. Users are cut over to Target Cluster kea at the end of this chapter.
Chapter 2 runs entirely on Target Cluster. Swift is deployed in a separate namespace alongside live kea. Migration scripts transform kea data into the swift schema. Users stay on kea until swift passes full validation; cutover is a single ingress rule change with no DNS propagation delay.
Transfer logistics
Source Cluster and Target Cluster share no network, no storage, and no identity infrastructure. Every byte of user data must be explicitly exported from Source Cluster, physically transported or routed through an authorized intermediary, and imported into Target Cluster. This section describes the transfer approach and sizes the data so the right method can be chosen.
The two-platform problem
Because both platforms are highly secured, the only viable transfer intermediary is an authorized workstation that is permitted to connect to services on both sides. There is no direct network path between Source Cluster and Target Cluster.
Option A (preferred): one workstation with simultaneous sessions. The operator runs migration scripts from one laptop that holds a VPN or bastion session to both Source Cluster and Target Cluster at the same time. Structured data (Postgres, Keycloak, OpenFGA) flows through the laptop's memory and is never written to disk unencrypted. Object storage is synced via rclone or mc mirror using credentials for both endpoints.
Requires: confirmation that security policy permits one workstation to hold simultaneous authenticated sessions to both platforms. See open item below.
Option B (sequential encrypted archive): all data is exported to encrypted archives on the operator's laptop, transported through an approved secure channel (encrypted USB, secure file transfer system, or equivalent), then imported on the Target Cluster side. No simultaneous connectivity to both platforms is required.
Drawback: for large object storage volumes (> 5 GB) this becomes slow and operationally cumbersome. The secure channel must support the full transfer size.
Data volume estimate — 500 users, 10 teams, ~3 agents per user
Extrapolated from direct measurement of a local kea instance (real byte counts per table row). The structured data is negligibly small. Object storage is the only unknown that matters.
| Store | Basis | Estimated raw | Compressed |
|---|---|---|---|
| PostgreSQL — agent | 12 rows = 90 KB → scale ×125 (1,500 user agents) | ~11 MB | ~3 MB |
| PostgreSQL — metadata | 4 rows = 120 KB → scale ×3,000 documents | ~45 MB | ~10 MB |
| PostgreSQL — all other tables | tag, resource, users, teammetadata | ~25 MB | ~5 MB |
| Keycloak realm export | 500 users × ~10 KB/user (credentials, groups, roles) | ~5 MB | ~2 MB |
| OpenFGA tuples | ~3,000 tuples × 200 B | ~600 KB | <1 MB |
| Total structured data | | ~87 MB | ~21 MB |
| Object storage (MinIO / GCS) | Unknown, must be measured on Source Cluster. Light use (20 docs/team, 2 MB avg): ~400 MB. Moderate use (100 docs/team, 3 MB avg): ~3 GB. Heavy use (500 docs/team, 5 MB avg): ~25 GB. | | |
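The three object-storage scenarios follow from one multiplication: teams × documents per team × average document size. A minimal sketch of that arithmetic, using only the runbook's planning assumptions (10 teams and the per-scenario figures from the table); the function and constant names are illustrative:

```python
# Back-of-envelope object storage sizing. Every number here is a planning
# assumption from the estimate table, not a measurement.
TEAMS = 10

# scenario name -> (documents per team, average document size in MB)
SCENARIOS = {
    "light": (20, 2),
    "moderate": (100, 3),
    "heavy": (500, 5),
}


def object_storage_mb(teams: int, docs_per_team: int, avg_mb: float) -> float:
    """Estimated bucket size in MB for one usage scenario."""
    return teams * docs_per_team * avg_mb


def estimate_all(teams: int = TEAMS) -> dict[str, float]:
    """Estimated bucket size in MB for every scenario."""
    return {
        name: object_storage_mb(teams, docs, size)
        for name, (docs, size) in SCENARIOS.items()
    }
```

Once the real bucket size has been measured on Source Cluster, the measured value replaces these scenarios.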
rclone sync (server-side, no laptop bandwidth involved). This avoids routing gigabytes through a VPN tunnel unnecessarily.
Key findings & open items
All schema and ownership questions were resolved empirically by running a local kea instance and inspecting live data across every store. One item remains open.
Kea stores agents as payload_json blobs. The definition_ref field (e.g. "v2.react.basic") is the template key that maps directly to swift's source_agent_id. Team ownership lives entirely in OpenFGA, not in Postgres. Field map: §2.3.
Team IDs are Keycloak group IDs; they appear in OpenFGA tuples and in session.team_id. A realm export with ID preservation keeps them stable across platforms. The teammetadata Postgres table is effectively empty in kea: teams live in Keycloak and OpenFGA only.
The mcp-server table contains platform-level deployment config (knowledge-flow vector search, tabular, opensearch ops…). It is re-seeded by deployment on every environment. No migration needed.
Source Cluster OpenFGA seeds each user in two formats: user:alice (username) and user:<keycloak-uuid>. The import must use only the UUID format on the target to match how swift resolves identities.
Kea uses a single shared team:personal string as a pseudo-team. Swift assigns each user a distinct personal team derived deterministically from their Keycloak UUID. Chapter 2 must create one personal team record per migrated user.
Acceptable downtime for the Chapter 1 cutover must be agreed before scheduling. This determines whether the final data sync runs live (short freeze at the end) or requires a full service stop.
Confirm that one authorized laptop can hold simultaneous authenticated sessions to both Source Cluster and Target Cluster (VPN or bastion access to both platforms at the same time). This determines whether the transfer uses Option A (preferred) or Option B (sequential encrypted archive). See § Transfer logistics for full context.
Measure actual sizes on the Source Cluster production instance before planning the transfer. Required numbers:
- MinIO bucket total size: mc du minio/<bucket> or equivalent
- Postgres fred database size: SELECT pg_size_pretty(pg_database_size('fred'))
- Number of rows in agent, metadata, resource, tag
- Number of active users (to validate the 500-user assumption)
- Number of teams (to validate the 10-team assumption)
Cutover from kea to swift requires a validated set of reference questions run against equivalent agents on both platforms, with automated quality metrics. This is a hard gate on Chapter 2 cutover and a standing requirement for any future model, prompt, or retrieval change — not specific to this migration.
kea side: lightweight JSON batch endpoint to be added (WebSocket protocol makes automated replay impractical from outside the UI).
swift side: evaluation UI + script already in progress; deepeval-based open-source tooling under development.
Five stores are transferred from Source Cluster to Target Cluster. Because the application version is the same on both sides, no schema mapping is needed — only transfer and verification. At the end of this chapter, kea runs on Target Cluster with all user data intact and Source Cluster is placed in read-only mode.
1.1 Keycloak
Keycloak is the identity source for all users and groups. The full realm is exported from Source Cluster and imported into Target Cluster. Keycloak's export format preserves all internal UUIDs — including group IDs that serve as team IDs throughout the system — so no ID remapping is needed after import.
Export from Source Cluster with a full realm export that includes users, for example
kc.sh export --file <file> --realm <realm> on Quarkus-based Keycloak distributions.
The Admin REST API partial-export endpoint
(POST /admin/realms/{realm}/partial-export?exportClients=true&exportGroupsAndRoles=true)
is not sufficient on its own: it omits users and masks secrets. The full export contains all users,
hashed credentials, group memberships, roles, and client configurations.
Import to Target Cluster via kc.sh import --file <file> or the equivalent realm import
mechanism for the deployed Keycloak version. Verify that group IDs in the imported realm match those
in the Source Cluster OpenFGA export; they must be identical.
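The group-ID comparison can be scripted rather than checked by eye. A minimal sketch, assuming the realm export is standard Keycloak JSON with a top-level groups array (nested groups under subGroups) and the OpenFGA export is a list of {user, relation, object} dictionaries. The function names and the ignore parameter are illustrative, and treating the literal "personal" pseudo-team as not backed by a Keycloak group is an assumption:

```python
def collect_group_ids(groups: list[dict]) -> set[str]:
    """Recursively collect group ids from a realm export 'groups' array."""
    ids: set[str] = set()
    for group in groups:
        ids.add(group["id"])
        ids |= collect_group_ids(group.get("subGroups", []))
    return ids


def team_ids_in_tuples(tuples: list[dict]) -> set[str]:
    """Collect every id referenced as 'team:<id>' in a user or object field."""
    ids: set[str] = set()
    for t in tuples:
        for field in ("user", "object"):
            value = t.get(field, "")
            if value.startswith("team:"):
                # Strip the type prefix and any '#relation' userset suffix.
                ids.add(value.split(":", 1)[1].split("#", 1)[0])
    return ids


def missing_team_groups(
    realm: dict,
    tuples: list[dict],
    ignore: frozenset[str] = frozenset({"personal"}),  # assumed pseudo-team
) -> set[str]:
    """Team ids used in OpenFGA that have no matching Keycloak group id."""
    known = collect_group_ids(realm.get("groups", []))
    return team_ids_in_tuples(tuples) - known - ignore
```

An empty result means every team referenced in the tuple export has a group with the same ID in the imported realm.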
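Hashed passwords migrate transparently. Users can log in immediately without a password reset, provided Target Cluster Keycloak supports the hashing algorithm recorded with the exported credentials. Keycloak's built-in defaults are PBKDF2 variants or Argon2 depending on the version, not bcrypt, so check the realm's password policy on both sides before the export.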
MFA devices (TOTP / WebAuthn) must be checked before executing. TOTP secrets migrate within the realm export. WebAuthn / passkey credentials are hardware-bound and cannot be exported; affected users will need to re-enrol their devices on Target Cluster after login.
1.2 OpenFGA
OpenFGA holds all authorisation tuples: who is owner / manager / member of which team or organisation, and which user or team owns which agent, tag, or document. The tuple set is the authoritative access control state — if it is wrong, users will see incorrect team memberships and agent ownership.
Export from Source Cluster by reading all tuples from the store via the
OpenFGA Read API with an empty filter (returns all tuples paginated).
Collect the full list to a JSON file.
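The paginated export amounts to a loop over continuation tokens. A minimal sketch, assuming the Read response body carries a tuples array (each entry wrapping the tuple under key) and a continuation_token that is empty on the last page. The read_page callable is a hypothetical transport hook supplied by the caller, for example an HTTP POST to the store's read endpoint:

```python
from typing import Callable, Optional

# Hypothetical transport: takes a continuation token (None for the first
# page) and returns the decoded JSON body of one Read response.
ReadPage = Callable[[Optional[str]], dict]


def read_all_tuples(read_page: ReadPage) -> list[dict]:
    """Follow continuation tokens until the whole store has been read."""
    keys: list[dict] = []
    token: Optional[str] = None
    while True:
        body = read_page(token)
        # Keep only the {user, relation, object} key of each entry.
        keys.extend(item["key"] for item in body.get("tuples", []))
        token = body.get("continuation_token") or None
        if token is None:
            return keys
```

Keeping the transport separate lets the same loop run against a recorded response set when testing the script locally.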
Deduplication required. Source Cluster seeds each user in two formats:
user:alice (username) and user:<keycloak-uuid>.
Keep only the UUID format — Target Cluster swift resolves identities by UUID exclusively.
Filter out any tuple whose user field has the user: type and an identifier that is not a UUID.
Tuples whose subject is a team: or organization: object are unaffected.
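The deduplication rule can be written as a small filter. A minimal sketch, assuming tuples are {user, relation, object} dictionaries and that only subjects of type user: are candidates for removal (team: and organization: subjects are kept as they are). If the authorization model uses a user:* wildcard, it would need an explicit exception that this sketch does not include:

```python
import re

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def keep_tuple(t: dict) -> bool:
    """Keep non-user subjects; keep user subjects only when the id is a UUID."""
    subject = t["user"]
    if not subject.startswith("user:"):
        return True
    return bool(UUID_RE.match(subject[len("user:"):]))


def filter_tuples(tuples: list[dict]) -> list[dict]:
    """Drop username-format tuples and exact duplicates, preserving order."""
    seen: set[tuple[str, str, str]] = set()
    kept: list[dict] = []
    for t in tuples:
        key = (t["user"], t["relation"], t["object"])
        if keep_tuple(t) and key not in seen:
            seen.add(key)
            kept.append(t)
    return kept
```

Matching the full UUID shape is stricter than a "contains hyphens" test, which would wrongly keep usernames such as jean-luc.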
Import to Target Cluster by creating a new OpenFGA store, applying the
same authorization model (FGA schema), then bulk-writing the filtered tuples via
the Write API in batches. The Target Cluster store ID will be different from Source Cluster's —
update the application configuration accordingly.
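Batching the bulk write is a plain chunking step. A minimal sketch, assuming a per-request limit of 100 tuples (to be checked against the deployed OpenFGA version) and the Write request shape with a writes.tuple_keys array. The helper names are illustrative, and sending the request is left to the caller:

```python
from typing import Iterator


def batched(tuples: list[dict], size: int = 100) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` tuples."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(tuples), size):
        yield tuples[start:start + size]


def write_body(batch: list[dict], authorization_model_id: str) -> dict:
    """Build the JSON body for one Write request."""
    return {
        "writes": {"tuple_keys": batch},
        "authorization_model_id": authorization_model_id,
    }
```

A Write request is applied as a whole, so logging the index of each batch makes it possible to resume from the first failed batch instead of restarting the import.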
Verify by spot-checking several users: confirm they appear as members of the expected teams and that their agent ownership tuples are present.
1.3 PostgreSQL
The fred database is transferred table by table. Schema is identical
between kea instances. Only user-data tables are exported; platform-seeded and
ephemeral tables are skipped.
| Table | Action | Notes |
|---|---|---|
| agent | copy | All 12 rows including system agents — kept intact as the source for the Chapter 2 transformation |
| tag | copy | Knowledge-base tag hierarchy — identical schema |
| metadata | copy | Document metadata with tag_id references |
| resource | copy | Resource records with author and doc JSON |
| teammetadata | copy if non-empty | Team descriptions and banner keys — check row count on Source Cluster first; often empty |
| users | copy if non-empty | GCU acceptance records — only non-empty when users explicitly accepted GCU on Source Cluster |
| mcp-server | skip | Platform deployment config — re-seeded automatically by the Target Cluster kea deployment |
| session, session_history, session_attachments, session_purge_queue | skip | Ephemeral — users resume fresh sessions on Target Cluster |
| feedbacks | skip | Operational data — not user-facing |
| tasks, sched_workflow_tasks | skip | Scheduler tasks restart cleanly on Target Cluster |
| v2_langgraph_checkpoint* | skip | In-flight execution state only — not meaningful after migration |
Run pg_dump -t <table> --data-only
per table on Source Cluster, transfer the dump files to Target Cluster, and restore with
psql -d fred < dump.sql. Run Alembic migrations on the Target Cluster database
first to create the schema, then load data. Import order: tag →
metadata → resource → teammetadata →
users → agent.
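The per-table dump and the ordered restore can be generated from one list, so the import order is defined once. A minimal sketch that only builds the argument lists. Connection settings are assumed to come from the usual PG* environment variables, and the per-table file naming and the ON_ERROR_STOP setting are illustrative choices, not kea conventions:

```python
# Import order from the runbook: referenced tables first, agent last.
IMPORT_ORDER = ["tag", "metadata", "resource", "teammetadata", "users", "agent"]


def dump_command(table: str, database: str = "fred") -> list[str]:
    """pg_dump invocation for one table, data only, written to <table>.sql."""
    return ["pg_dump", "--data-only", "-t", table, "-d", database, "-f", f"{table}.sql"]


def restore_command(table: str, database: str = "fred") -> list[str]:
    """psql invocation that loads <table>.sql and stops at the first error."""
    return ["psql", "-v", "ON_ERROR_STOP=1", "-d", database, "-f", f"{table}.sql"]


def restore_plan(database: str = "fred") -> list[list[str]]:
    """Restore commands in the required import order."""
    return [restore_command(table, database) for table in IMPORT_ORDER]
```

Each list can be passed to subprocess.run(cmd, check=True) on the relevant side, so a failed restore stops the sequence before later tables are loaded.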
1.4 Object storage
All content files and team banners are synced from Source Cluster MinIO to Target Cluster object
storage. The key path structure must be preserved exactly so that existing database
references (banner_object_storage_key, resource paths) remain valid
without any update to Postgres records.
Use mc mirror for MinIO-to-MinIO,
or rclone sync for cross-provider transfer (MinIO → GCS or S3).
Configure rclone with both source and target credentials before the
maintenance window; the sync itself can run as a pre-step with a final
incremental pass during the freeze.
Run a full sync before the maintenance window to transfer the bulk of data.
During the freeze, run a final incremental sync to capture any last-minute uploads.
Verify that the Target Cluster bucket names match what the kea application configuration
expects (control-plane-content, app-content, etc.).
If the bucket names differ between Source Cluster and Target Cluster, update the kea application configuration on Target Cluster — do not rename files inside the buckets.
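Key-path preservation can be verified by comparing object listings from both sides after the final incremental sync. A minimal sketch, assuming each listing has been exported as rclone lsjson -R style JSON (entries with Path, Size and IsDir fields); the function names are illustrative:

```python
def listing_from_lsjson(items: list[dict]) -> dict[str, int]:
    """Map object key -> size in bytes, skipping directory entries."""
    return {i["Path"]: i["Size"] for i in items if not i.get("IsDir", False)}


def diff_listings(source: dict[str, int], target: dict[str, int]) -> dict[str, list[str]]:
    """Report keys missing on target, keys with a size mismatch, and extras."""
    return {
        "missing": sorted(k for k in source if k not in target),
        "size_mismatch": sorted(
            k for k in source if k in target and source[k] != target[k]
        ),
        "extra": sorted(k for k in target if k not in source),
    }
```

Empty missing and size_mismatch lists indicate that every source key exists under the same path on the target, which is what the Postgres references rely on.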
1.5 OpenSearch
The Source Cluster vector index is not copied; it is rebuilt on Target Cluster. Re-vectorization is triggered after the kea application starts on Target Cluster. The knowledge-flow backend processes each document from object storage and populates the OpenSearch vector index. Duration is proportional to the number and size of documents ingested on Source Cluster.
The Source Cluster log store and KPI metrics index are also left behind — both are operational time-series data. Target Cluster starts a fresh observability baseline.
1.6 Validation checklist
Run all checks on Target Cluster kea before flipping traffic from Source Cluster. Do not cut over with any item failing.
1.7 Cutover procedure
The cutover transfers live traffic from Source Cluster to Target Cluster. It requires a maintenance window whose duration depends on the final incremental sync size (see open item in § Key findings).
Swift is deployed in a separate Kubernetes namespace on Target Cluster alongside the live kea
deployment. Migration scripts read data from the kea database (fred)
and write transformed records into the swift database (fred_swift).
Users remain on kea throughout; the cutover to swift is a single ingress rule
change with instant effect and no DNS propagation.
New in swift: the agent_instance table (replaces kea's opaque
agent blob), the prompt library (starts empty; users
build it in swift), and per-user personal teams (created by the migration script).
2.1 PostgreSQL schema transforms
The swift database (fred_swift) is created fresh by Alembic migrations.
Migration scripts then copy and transform rows from fred into
fred_swift. Both databases share the same Postgres instance and the
same fred user.
| kea (fred) | swift (fred_swift) | Action | Notes |
|---|---|---|---|
| tag | tag | direct copy | Identical schema — no mapping needed |
| metadata | metadata | direct copy | Verify tag_ids array format is compatible |
| resource | resource | direct copy | No structural changes |
| users | users | copy + defaults | Add current_resources_storage_size = 0 for any rows missing it |
| teammetadata | teammetadata | copy + drop columns | Exclude kea's storage-size fields; banner_object_storage_key is unchanged |
| agent (UUID ids only) | agent_instance | full transform — see §2.3 | Different table name, different structure, ownership resolved from OpenFGA |
| — | prompt, default_prompt_usage | start empty | No kea equivalent; users build the prompt library in swift |
| session* | session_metadata | skip | Ephemeral — not migrated in either chapter |
| mcp-server | — | skip | Platform config — re-seeded by swift deployment |
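The two non-trivial copies in the table (users and teammetadata) are simple per-row transforms. A minimal sketch operating on rows as dictionaries. The names of the kea storage-size columns are not listed in this runbook, so they are passed in by the caller, and the column name used in the usage note is a placeholder:

```python
def transform_users_row(row: dict) -> dict:
    """Copy a users row, defaulting current_resources_storage_size to 0."""
    out = dict(row)
    if out.get("current_resources_storage_size") is None:
        out["current_resources_storage_size"] = 0
    return out


def transform_teammetadata_row(row: dict, dropped_columns: set[str]) -> dict:
    """Copy a teammetadata row without the kea-only storage-size columns."""
    return {k: v for k, v in row.items() if k not in dropped_columns}
```

For example, transform_teammetadata_row(row, {"storage_size"}) keeps banner_object_storage_key untouched; "storage_size" stands in for whatever the kea column names turn out to be.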
2.2 Personal team creation
Swift assigns every user a personal team whose ID is derived deterministically
from the user's Keycloak UUID via personal_team_id(user_id).
Kea has no equivalent — it uses a single shared literal string
"personal" as a pseudo-team. The migration script must bridge this gap.
Without a personal team row in teammetadata and the corresponding
OpenFGA tuples, any user who logs in to swift will find no personal space and
the application will return errors.
For each Keycloak user in the migrated realm, the script must:
- Compute personal_team_id(user_keycloak_uuid)
- Insert a row in fred_swift.teammetadata with that ID
- Write two OpenFGA tuples to the Target Cluster store: user:<uuid> owner team:<personal_team_id> and organization:fred organization team:<personal_team_id>
The personal_team_id function is defined in the swift codebase
(libs/fred-core/fred_core/teams/) and must be called from the
migration script to guarantee consistency.
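The three per-user steps can be expressed as one pure function that returns the row and tuples to write. A minimal sketch: the real personal_team_id must be imported from the swift codebase and passed in. stand_in_personal_team_id below is a placeholder for local experiments only and does not reproduce swift's derivation, and the teammetadata column name id is an assumption:

```python
import uuid
from typing import Callable


def stand_in_personal_team_id(user_id: str) -> str:
    """PLACEHOLDER ONLY: deterministic, but not swift's real derivation."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"personal-team:{user_id}"))


def personal_team_records(
    user_id: str,
    personal_team_id: Callable[[str], str],
) -> tuple[dict, list[dict]]:
    """Return the teammetadata row and the two OpenFGA tuples for one user."""
    team_id = personal_team_id(user_id)
    row = {"id": team_id}  # assumed column name
    tuples = [
        {"user": f"user:{user_id}", "relation": "owner", "object": f"team:{team_id}"},
        {"user": "organization:fred", "relation": "organization", "object": f"team:{team_id}"},
    ]
    return row, tuples
```

Because the team ID is deterministic, re-running the script for the same user yields the same records, which keeps the step repeatable as long as the inserts are written as upserts.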
2.3 Agent migration
Kea stores agents as payload_json blobs with no team_id
in Postgres — ownership lives in OpenFGA. Swift's agent_instance is
fully structured and team-scoped. The field mapping is fully known from live data
inspection.
payload_json → agent_instance field map
| kea source | swift agent_instance column | Notes |
|---|---|---|
| payload_json.id | agent_instance_id | Preserve UUID as-is |
| payload_json.name | display_name | Direct copy |
| payload_json.definition_ref | source_agent_id | e.g. "v2.react.basic" |
| payload_json.enabled | enabled | Direct copy |
| payload_json.tuning (minus fields) | tuning_json | Strip the fields array — it is a template definition, not instance config |
| OpenFGA lookup: team:X owner agent:Y | team_id | → team_id = X |
| OpenFGA lookup: user:X owner agent:Y | team_id | → team_id = personal_team_id(X) |
| OpenFGA owner (user UUID) | created_by | Keycloak UUID of the agent's OpenFGA owner |
| Known at migration time | source_runtime_id | The runtime ID serving definition_ref on Target Cluster swift |
| Derived | template_id | source_runtime_id + ":" + source_agent_id |
Platform-seeded system agents have short string ids (e.g. "BankTransfer"). Only UUID agents are user
data. System agents are re-seeded by the swift runtime catalog.
SQL filter: WHERE id ~ '^[0-9a-f]{8}-[0-9a-f]{4}-'.
v2.react.basic agents. This template exists in swift's runtime catalog.
The agent is recreated with the same name, role, and description from
tuning.role and tuning.description. No conversation
history is lost — sessions are not migrated regardless.
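The field map above translates into a single transform function. A minimal sketch, assuming the OpenFGA owner of each agent has already been looked up as a "team:<id>" or "user:<uuid>" string and that personal_team_id is the function from the swift codebase, passed in by the caller. Leaving created_by empty for team-owned agents is an assumption, since the field map only defines it for user owners:

```python
import re
from typing import Callable, Optional

# Same prefix test as the SQL filter: UUID ids are user agents.
UUID_PREFIX = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-")


def is_user_agent(agent_id: str) -> bool:
    """True for UUID ids; short string ids are platform-seeded system agents."""
    return bool(UUID_PREFIX.match(agent_id))


def to_agent_instance(
    payload: dict,
    owner: str,
    source_runtime_id: str,
    personal_team_id: Callable[[str], str],
) -> dict:
    """Map one kea payload_json blob to a swift agent_instance row."""
    kind, _, ident = owner.partition(":")
    created_by: Optional[str]
    if kind == "team":
        team_id, created_by = ident, None  # assumption: no user owner known
    elif kind == "user":
        team_id, created_by = personal_team_id(ident), ident
    else:
        raise ValueError(f"unsupported owner subject: {owner!r}")
    source_agent_id = payload["definition_ref"]
    # Drop the template-level 'fields' array; keep the instance tuning.
    tuning = {k: v for k, v in payload.get("tuning", {}).items() if k != "fields"}
    return {
        "agent_instance_id": payload["id"],
        "display_name": payload["name"],
        "source_agent_id": source_agent_id,
        "enabled": payload["enabled"],
        "tuning_json": tuning,
        "team_id": team_id,
        "created_by": created_by,
        "source_runtime_id": source_runtime_id,
        "template_id": f"{source_runtime_id}:{source_agent_id}",
    }
```

Rows for which is_user_agent returns False are skipped before the transform, matching the SQL filter.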
2.4 OpenSearch — no re-vectorization needed
Swift uses the same embedding model as kea. The Chapter 1 OpenSearch index is therefore fully compatible with swift's knowledge-flow backend. No re-indexing is required: swift is configured to point at the existing index and search is available from the first request.
If the embedding model ever changes in a future swift release, a full re-index from object storage will be required at that point — not as part of this migration.
2.5 Validation checklist
Run all checks on the swift namespace before switching ingress. The swift application must be reachable internally (not via public DNS) during this phase.
This check is not migration-specific — it is equally required whenever a new model version, prompt, or retrieval configuration is deployed. It is therefore treated as a standing validation capability, not a one-off migration step.
Tooling target: mid-July 2026. A replay harness will be available on both sides: a lightweight JSON batch endpoint on kea (pragmatic, given its WebSocket protocol), and a full evaluation UI + script on swift (deepeval-based). Until then, manual spot-checking via the respective UIs is the fallback. See open item in § Key findings.
2.6 Cutover procedure
Cutover from kea to swift on Target Cluster is a Kubernetes ingress rule change. Because both namespaces are on the same cluster, there is no DNS propagation delay — the switch takes effect in seconds.
fred database.
Appendix A — Store decision table
| Store / table | Chapter 1 | Chapter 2 | Notes |
|---|---|---|---|
| Keycloak realm | export / import | — | Preserve group UUIDs — they are team IDs everywhere |
| OpenFGA tuples | copy (UUID format only) | — | Deduplicate username vs UUID format |
| agent (UUID ids) | copy | → agent_instance (§2.3) | Short-string ids are platform-seeded — skip those |
| tag | copy | copy (identical schema) | |
| metadata | copy | copy | |
| resource | copy | copy | |
| teammetadata | copy if non-empty | drop storage_size fields | Often empty in practice |
| users | copy if non-empty | add storage_size default | Only non-empty if GCU was accepted on Source Cluster |
| Personal teams | — | create 1 per user (§2.2) | kea: shared literal "personal" · swift: per-user UUID-derived |
| prompt, default_prompt_usage | — | start empty | No kea equivalent |
| mcp-server | skip | skip | Platform config — re-seeded by deployment |
| session*, feedbacks, tasks*, checkpoints | skip | skip | Ephemeral or operational |
| Object storage (content + banners) | sync — preserve key paths | — | Final incremental sync during maintenance window |
| OpenSearch vectors | re-vectorize from object storage (§1.5) | reuse Chapter 1 index (§2.4) | Source Cluster index is not copied; swift uses the same embedding model as kea |
| OpenSearch logs / KPIs | skip | skip | Start fresh on Target Cluster |
Appendix B — Rollback strategy
Both chapters have a clean, fast rollback path. The key principle is that no source is destroyed until the 72-hour hold period expires.
Chapter 1 rollback — Source Cluster kea ← Target Cluster kea
Source Cluster remains in read-only mode for 72 hours after cutover. To roll back, re-enable writes on Source Cluster and update the DNS record or load-balancer rule to point back at Source Cluster. No data restoration is needed — Source Cluster's data was never modified.
Chapter 2 rollback — Target Cluster kea ← Target Cluster swift
The kea namespace remains running on Target Cluster for 72 hours after the swift cutover.
To roll back, revert the ingress rule to point at the kea namespace service.
The effect is instant. The fred database is untouched by the
Chapter 2 migration scripts (they write to fred_swift only).
Appendix C — Developer local setup
This appendix is for engineers who want to run the migration scripts locally to validate them before executing on production. It is not required reading for management.
Infrastructure: fred-deployment-factory
The ignored/fred-deployment-factory repository provides a full
Docker Compose stack with Postgres, Keycloak, MinIO, OpenSearch, OpenFGA,
and Temporal. A single command, make docker-up, brings up the entire infrastructure.
Two databases, one Postgres container
The factory creates both the kea database (fred) and the swift
database (fred_swift) in the same Postgres container. Both are
owned by the fred user. This mirrors the production topology where
the two namespaces share a Postgres service.
| Database | Owner | Used by | Config key |
|---|---|---|---|
fred | fred | kea (all backends) | database: fred |
fred_swift | fred | swift (all backends) | database: fred_swift |
To switch between running kea and running swift locally, change the
database: field in the backend's configuration_prod.yaml.
Both share the same Postgres port (5432) and credentials — only the database
name differs.
Creating the fred_swift schema
After make docker-up, fred_swift is an empty database.
Run swift's Alembic migrations (alembic upgrade head, or the project's wrapper for it)
to create the schema before running any migration scripts.
Checkpoints for repeatable migration testing
The factory supports named checkpoints — snapshots of all Docker volumes. Once you have created test data in kea (agents, teams, documents), save a checkpoint before running any migration scripts:
make checkpoint-save NAME=kea-source
To reset and replay the migration from scratch, restore the kea-source checkpoint.
Configuration reference
| Variable / key | kea value | swift value |
|---|---|---|
| Postgres database | fred | fred_swift |
| Postgres user | fred | fred |
| Postgres password (env) | FRED_POSTGRES_PASSWORD=<change-me> | |
| Postgres host (local) | localhost:5432 | |
| OpenFGA token (env) | OPENFGA_API_TOKEN=<change-me> | |
| MinIO access key (env) | MINIO_ACCESS_KEY=admin | |
| Factory env var for swift DB | POSTGRES_FRED_SWIFT_DB=fred_swift | |