How to Replace Bastion Hosts Without Breaking Developer Workflows

Bastion host migrations fail when they break developer muscle memory. This playbook replaces bastions in phases — preserving SSH habits, on-call workflows, and emergency access — while adding session recording, identity-based policy, and automatic key removal.

Why Bastion Migrations Fail

Every infrastructure team that has tried to retire bastion hosts knows the pattern: security proposes the migration, engineers nod along, the new platform is deployed, and six months later the bastions are still running because nobody can agree on how to handle the edge cases. The emergency SSH key. The on-call runbook that starts with ssh bastion-prod. The deploy script that tunnels through port 22 on the jump box. The contractor who only has access to one server through a hardcoded ProxyJump entry.

Bastion migrations do not fail because of technology limitations. They fail because of developer muscle memory. Engineers have built years of habits around their SSH config, their terminal aliases, and their mental model of how to reach production. Replacing the bastion without replacing these habits means the old box stays alive as a “fallback” that becomes permanent.

The solution is a phased approach that runs old and new access paths in parallel, migrates habits explicitly, and removes the bastion only after measurable adoption proves the new path works.

  • 72% of bastion migration attempts take over 6 months
  • 34: average SSH keys found per bastion during inventory
  • 4 weeks: typical pilot-to-cutover with phased approach

The Migration Timeline

A successful bastion migration follows six phases. Skipping phases — especially the inventory and parallel run — is the most common cause of failure. Each phase has a clear deliverable and a go/no-go gate before proceeding.

Figure: Bastion Replacement Migration Timeline

1. Inventory (Week 1–2): keys, users, scripts
  • Artifacts: SSH key inventory, user list + roles, script dependencies, emergency access procedures, ProxyJump configs, tunnel mappings
  • Gate: full asset map reviewed by SRE lead

2. Pilot (Week 3): 5 servers, 1 team
  • Artifacts: agent deployed, SSO auth tested, session recording verified, pilot team trained, feedback collected
  • Gate: pilot team can access all 5 hosts

3. Parallel Run (Week 4–5): both paths active
  • Artifacts: adoption metrics, bastion login logs, new path login logs, user friction reports, runbook updates
  • Gate: >80% of sessions use new path

4. Policy Map (Week 5–6): SSH config → policies
  • Artifacts: policy definitions, group-to-host maps, JIT policies created, on-call workflow documented, emergency runbook
  • Gate: all SSH config entries mapped

5. Cutover (Week 7): bastion read-only
  • Artifacts: bastion set to read-only mode, key auth disabled, remaining users migrated 1:1
  • Gate: zero bastion logins for 7 days

6. Decom (Week 8+): remove bastion
  • Artifacts: bastion powered off / terminated, DNS removed, firewall rules cleaned up, keys purged
  • Gate: no inbound traffic for 14 days

A six-phase migration timeline with clear go/no-go gates at each phase transition. Rushing past the parallel run phase is the most common cause of migration failure.

Phase 1: Inventory Audit

Before you can replace a bastion, you need to know exactly what it does. Most teams are surprised by the inventory: the bastion hosts more SSH keys, serves more users, and supports more automated scripts than anyone remembers. A thorough inventory covers four categories.

Category | What to Inventory | Where to Find It
SSH keys | Every public key in authorized_keys on the bastion and target hosts | ~/.ssh/authorized_keys, /etc/ssh/ on each host
Users | Every user account on the bastion, active and inactive | /etc/passwd, LDAP/AD group memberships
Scripts and automations | Cron jobs, deploy scripts, monitoring probes that SSH through the bastion | CI/CD configs, crontabs, Ansible inventories
Emergency procedures | Break-glass keys, on-call runbooks, incident response playbooks | Runbook wikis, PagerDuty escalation policies
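
For the SSH keys row, a minimal Python sketch can fingerprint every entry in an authorized_keys file so the same key can be recognized across accounts and hosts. The function names and sample inputs are illustrative assumptions, and the option handling is deliberately simplified (quoted options containing spaces are only tolerated, not parsed).

```python
import base64
import hashlib
from collections import defaultdict

# Common OpenSSH public key type names; extend as needed for your fleet.
KEY_TYPES = {
    "ssh-ed25519", "ssh-rsa", "ssh-dss",
    "ecdsa-sha2-nistp256", "ecdsa-sha2-nistp384", "ecdsa-sha2-nistp521",
}


def parse_authorized_keys(text):
    """Return a list of {type, fingerprint, comment} dicts, one per key line."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        # An options field (e.g. from="10.0.0.0/8") may precede the key type.
        idx = next((i for i, f in enumerate(fields) if f in KEY_TYPES), None)
        if idx is None or idx + 1 >= len(fields):
            continue
        try:
            blob = base64.b64decode(fields[idx + 1], validate=True)
        except ValueError:
            continue  # malformed key data; flag for manual review in practice
        # OpenSSH-style fingerprint: unpadded base64 of SHA-256 over the key blob.
        digest = hashlib.sha256(blob).digest()
        fingerprint = "SHA256:" + base64.b64encode(digest).decode().rstrip("=")
        entries.append({
            "type": fields[idx],
            "fingerprint": fingerprint,
            "comment": " ".join(fields[idx + 2:]),
        })
    return entries


def shared_keys(files_by_owner):
    """Map fingerprint -> sorted owners for keys present in more than one file."""
    owners = defaultdict(set)
    for owner, text in files_by_owner.items():
        for entry in parse_authorized_keys(text):
            owners[entry["fingerprint"]].add(owner)
    return {fp: sorted(names) for fp, names in owners.items() if len(names) > 1}
```

Running shared_keys over the files collected from the bastion and each target host gives a first list of keys that are reused across accounts, which are usually the ones nobody remembers owning.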

Phase 2: Choosing a Pilot Cohort

The pilot should be a team that is technically confident, not overly change-resistant, and does not own the most critical production workload. A good choice is the SRE team or a backend team that manages non-critical services. The pilot cohort should include 5–10 servers and 5–8 engineers.

During the pilot, the bastion remains fully operational. The goal is to prove that the new access path works, not to force adoption. Deploy the OnePAM agent to the pilot servers, configure SSO authentication, enable session recording, and train the pilot team on the new workflow.

Phase 3: Running Parallel Access Paths

This is the critical phase that most teams rush through. For 1–2 weeks, both the bastion and the new access path are available. Engineers can use either one. The goal is to measure adoption organically: track what percentage of sessions use the new path vs. the bastion, identify friction points, and fix them before the cutover.

Adoption Metrics to Track

  • New path sessions / total sessions: Target is 80%+ before moving to cutover.
  • Mean time to first access: Should be comparable to the bastion path.
  • Support tickets: Track any access failures or confusion.
  • Bastion-only users: Identify engineers who have not tried the new path and understand why.
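
The first and last of these metrics can be computed from combined session logs. The sketch below assumes each session has already been reduced to a (user, path) pair, where path is either "new" or "bastion"; that record shape and the function name are assumptions for illustration, not an export format of any particular tool.

```python
from collections import Counter


def adoption_report(sessions, target=0.80):
    """Summarize adoption from (user, path) pairs; path is "new" or "bastion"."""
    sessions = list(sessions)
    counts = Counter(path for _, path in sessions)
    total = counts["new"] + counts["bastion"]
    ratio = counts["new"] / total if total else 0.0
    new_users = {user for user, path in sessions if path == "new"}
    bastion_users = {user for user, path in sessions if path == "bastion"}
    return {
        "new_path_ratio": ratio,
        "meets_target": ratio >= target,
        # Engineers who used the bastion but never tried the new path.
        "bastion_only_users": sorted(bastion_users - new_users),
    }
```

The bastion_only_users list is the follow-up queue: each name is someone to ask about friction before the cutover gate is evaluated.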

Phase 4: Mapping SSH Config to Identity-Based Policies

Engineers rely on their ~/.ssh/config for quick access. Every Host entry, ProxyJump directive, and IdentityFile reference needs an equivalent in the new system. The SSH config generator can produce equivalent policy definitions from an existing SSH config file.

SSH Config Pattern | Identity-Based Equivalent
Host prod-* with ProxyJump bastion | Resource group “production” with SSO auth + JIT policy
Host staging-db with specific key | Named resource with role-based access and session recording
Shared IdentityFile for team access | IdP group membership grants access; no shared keys
Host emergency with break-glass key | Emergency access policy with auto-approval + recording
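
To see which entries in an existing ~/.ssh/config need an equivalent, the file can be reduced to a list of work items. This is a rough sketch and not the SSH config generator mentioned above: the function names and output shape are assumptions, and it ignores Match blocks, Include directives, and the Keyword=value syntax.

```python
def parse_ssh_config(text):
    """Return Host blocks as dicts: {"patterns": [...], "options": {name: [values]}}."""
    blocks, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "host":
            current = {"patterns": value.split(), "options": {}}
            blocks.append(current)
        elif current is not None:
            current["options"].setdefault(keyword, []).append(value)
    return blocks


def jump_hosts(value):
    """Extract bare host names from a ProxyJump value such as 'user@host:22,host2'."""
    hosts = []
    for hop in value.split(","):
        hop = hop.strip().rsplit("@", 1)[-1]  # drop an optional user@ prefix
        hosts.append(hop.split(":", 1)[0])    # drop an optional :port suffix
    return hosts


def migration_work_items(blocks, bastion_alias="bastion"):
    """List, per Host block, what needs an identity-based equivalent."""
    items = []
    for block in blocks:
        opts = block["options"]
        hops = [h for v in opts.get("proxyjump", []) for h in jump_hosts(v)]
        items.append({
            "hosts": block["patterns"],
            "via_bastion": bastion_alias in hops,
            "identity_files": opts.get("identityfile", []),
        })
    return items
```

Each item with via_bastion set or a non-empty identity_files list corresponds to a row in the table above that still needs a policy.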

Preserving Emergency and On-Call Workflows

On-call access is the single biggest blocker for bastion retirement. The existing workflow is muscle memory: wake up at 3 AM, ssh bastion, ssh prod-app-01, debug the issue. Any new system must be at least as fast and reliable as this workflow, or on-call engineers will refuse to use it.

  • Auto-approval for Severity-1 incidents: JIT requests linked to PagerDuty incidents get auto-approved with full recording
  • Browser-based terminal: No SSH client needed; works from any device including phones
  • Saved connection shortcuts: One-click access to frequently used hosts, equivalent to SSH config aliases
  • Break-glass override: Documented procedure for accessing infrastructure when the access platform itself is unavailable
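
The auto-approval rule in the first bullet can be written down as a small decision function. This is a hypothetical model for discussion, not the API of OnePAM or PagerDuty: the class names, the status strings, and the extra check that the requester is on the current on-call rotation are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    severity: int   # 1 is the most severe
    status: str     # assumed values: "triggered", "acknowledged", "resolved"


@dataclass
class JitRequest:
    user: str
    host: str
    incident: Optional[Incident] = None


def decide(request, on_call_users):
    """Return (decision, record_session) for a just-in-time access request."""
    incident = request.incident
    emergency = (
        incident is not None
        and incident.severity == 1
        and incident.status != "resolved"
    )
    if emergency and request.user in on_call_users:
        return ("auto_approve", True)
    # Everything else goes through the normal approval flow, still recorded.
    return ("needs_approval", True)
```

Keeping the rule this small matters for the 3 AM case: the engineer should be able to predict the outcome before submitting the request.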

Session Recording as an Improvement Over Bastion Logs

Bastion hosts provide minimal logging: who connected and when. They do not capture what happened during the session. Session recording is a significant improvement: every command, every output, every file transfer is recorded and can be replayed during incident review or compliance audits.

This is a strong argument for the migration even beyond the security benefits. When an incident occurs and the post-mortem needs to determine exactly what commands were run during the response window, a session recording provides definitive evidence. The bastion’s auth.log can only tell you that someone logged in. Learn more in the bastion comparison article.
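
To make the limitation concrete, the sketch below extracts everything a typical sshd "Accepted ..." line in auth.log offers: the authentication method, the user, and the source address. The sample log lines used to exercise it are made up, and the exact log layout can vary by distribution and sshd configuration.

```python
import re

# Matches the successful-login message written by OpenSSH's sshd.
ACCEPTED = re.compile(
    r"Accepted (?P<method>\S+) for (?P<user>\S+) "
    r"from (?P<source>\S+) port (?P<port>\d+)"
)


def bastion_logins(log_text):
    """Return one dict per successful login found in auth.log text."""
    logins = []
    for line in log_text.splitlines():
        match = ACCEPTED.search(line)
        if match:
            logins.append({
                "method": match["method"],
                "user": match["user"],
                "source": match["source"],
            })
    return logins
```

Nothing in the result describes commands, output, or file transfers, which is the gap session recording fills.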

Measuring Adoption and Key Removal

Adoption is measured by the ratio of new-path sessions to total sessions. Key removal is measured by the number of SSH keys removed from target hosts after the new access path is validated. Both metrics should be tracked weekly during the migration.

A healthy migration shows new-path adoption rising from 20% in week 1 to 80%+ by week 4, with a corresponding decrease in bastion sessions. Key removal follows adoption: as engineers confirm the new path works, their SSH keys can be removed from the target hosts. The goal is zero authorized keys on any production host by the end of the migration.
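
Key removal can be planned directly from the inventory once engineers confirm the new path. The sketch below assumes the inventory has been reduced to a mapping of host to (owner, fingerprint) pairs and that confirmed users are tracked as a set; both shapes and the function names are illustrative assumptions.

```python
def removable_keys(keys_by_host, confirmed_users):
    """Return {host: [fingerprints]} for keys owned by users confirmed on the new path."""
    plan = {}
    for host, keys in keys_by_host.items():
        removable = [fp for owner, fp in keys if owner in confirmed_users]
        if removable:
            plan[host] = removable
    return plan


def remaining_key_count(keys_by_host, plan):
    """Count keys that would still be authorized after applying the plan."""
    return sum(
        len(keys) - len(plan.get(host, []))
        for host, keys in keys_by_host.items()
    )
```

Tracking remaining_key_count weekly gives the second metric a single number that should trend toward zero.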

Handling CI/CD Pipeline Access

Bastions are not only used by humans. CI/CD pipelines often ProxyJump through the bastion to deploy code, run database migrations, or execute smoke tests on internal hosts. These automated workflows need a separate migration path because they cannot authenticate through SSO or approve JIT requests interactively.

The recommended pattern is service-account access with scoped API tokens: the pipeline authenticates with a machine identity, receives a short-lived credential valid only for the target host and the deployment action, and the session is recorded for audit purposes. This eliminates the long-lived SSH key that lives in a CI secret store—often the weakest link in the entire access chain because it is never rotated and grants broad access.

Start by cataloging every CI/CD pipeline that uses the bastion. For each pipeline, document the target hosts, the actions performed, the current credential type, and the rotation frequency. Then migrate them to scoped API access one pipeline at a time, validating that deployments succeed before removing the bastion path. Leave the bastion accessible to CI/CD as a fallback until all pipelines are migrated and tested through at least one full release cycle.
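
The catalog described above fits in a small record type. In this sketch the field names, the credential type labels, and the ordering rule (never-rotated or oldest credentials first) are assumptions chosen to match the paragraph, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class PipelineAccess:
    name: str
    target_hosts: List[str]
    actions: List[str]
    credential_type: str           # e.g. "ssh-key" or "scoped-token"
    last_rotated: Optional[date]   # None means the credential was never rotated
    migrated: bool = False


def migration_order(catalog, today):
    """Unmigrated pipelines, with never-rotated or oldest credentials first."""
    def credential_age(pipeline):
        if pipeline.last_rotated is None:
            return float("inf")
        return (today - pipeline.last_rotated).days

    pending = [p for p in catalog if not p.migrated]
    return sorted(pending, key=credential_age, reverse=True)


def bastion_fallback_needed(catalog):
    """The bastion path stays open to CI/CD until every pipeline is migrated."""
    return any(not p.migrated for p in catalog)
```

Ordering by credential age puts the riskiest long-lived keys at the front of the queue while keeping the one-pipeline-at-a-time approach.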

Firewall Rule Cleanup

Bastions introduce persistent firewall rules: port 22 open from the bastion’s IP to every target host, port 22 open from the Internet to the bastion. After decommissioning, these rules must be cleaned up. A common oversight is leaving the inbound rule on the bastion’s former IP address, which may be reassigned to a new instance that is not hardened as a bastion.
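
A quick way to find leftover rules is to scan an export of your security group or firewall rules for the bastion's address. The rule format below (dicts with id, source, destination, and port) is an assumed, simplified shape; real exports from a cloud provider will need to be mapped into it first.

```python
import ipaddress


def rules_referencing_bastion(rules, bastion_ip, port=22):
    """Return ids of rules whose source or destination is exactly the bastion IP.

    Pass port=None to check every port instead of only SSH.
    """
    bastion = ipaddress.ip_network(bastion_ip)  # single-host network
    stale = []
    for rule in rules:
        if port is not None and rule["port"] != port:
            continue
        source = ipaddress.ip_network(rule["source"], strict=False)
        destination = ipaddress.ip_network(rule["destination"], strict=False)
        if source == bastion or destination == bastion:
            stale.append(rule["id"])
    return stale
```

Both directions matter: the Internet-to-bastion rule and the bastion-to-target rules are each flagged, since either one becomes a risk if the address is reassigned.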

Decommission Checklist

  • Zero bastion logins for 14+ consecutive days
  • All SSH keys removed from bastion authorized_keys
  • All ProxyJump references removed from team SSH configs
  • Emergency access procedure updated to use new platform
  • On-call runbooks updated and tested by the on-call rotation
  • CI/CD pipelines updated to use direct or scoped access
  • Firewall rules for bastion IP removed from all security groups
  • DNS records pointing to bastion removed or updated
  • Bastion VM terminated and disk snapshots archived (30-day retention)
  • Migration retrospective completed and documented
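
The first checklist item can be verified mechanically from login timestamps. This sketch assumes the timestamps have already been extracted from the bastion's logs and that you know when log coverage begins; the function name and parameters are illustrative.

```python
from datetime import datetime, timedelta


def quiet_period_met(login_times, log_start, now, required_days=14):
    """True if no bastion login occurred in the last `required_days` days.

    The quiet period is measured from the most recent login, or from the start
    of log coverage when there are no logins, so missing logs never count as
    evidence of inactivity.
    """
    last_activity = max([log_start, *login_times])
    return now - last_activity >= timedelta(days=required_days)
```

Measuring from log_start when the list is empty is a deliberately conservative choice: a bastion with only a few days of retained logs does not pass the check.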

Use the browser-based SSH feature for the smoothest developer experience, and review the architecture documentation for agent deployment options across cloud providers and on-premises environments.

Replace Your Bastion in Weeks, Not Months

OnePAM provides browser-based SSH with SSO, session recording, JIT approvals, and on-call workflow integration. Run parallel access paths during migration and decommission the bastion when adoption proves the new path works.

OnePAM Team
Security & Infrastructure Team