Context and Problem Statement
Adoption of GenAI at Shield is lagging and uncoordinated. We have no standard tooling, while defense primes have deployed to 170,000+ employees over the past 3 years [1], [2], [3] and China's military has already deployed domestic AI models at scale [4], [5]. Meanwhile, Shield's developer onboarding doesn't mention AI, there is no standard GenAI software for new employees, and our engineers have no clear guidance on acceptable usage.
I propose not only that we align on a single AI platform, but that we adopt the models and tools offered by Anthropic. Anthropic has consistently led in innovation and novel problem solving over the past two years, which is exactly the type of work we do here at Shield.
Decision Drivers
Tools Lack Cross-Compatibility
Every tool has its own naming conventions, formats, and toolsets (MCP servers, agents, skills, rules). Those differences are solvable with symlinks and scripts. The real blocker is that prompts don’t transfer between models.
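As a sketch of the "solvable with symlinks" half of the problem: keep one canonical instruction file and link each tool's expected filename to it. The target filenames here (AGENTS.md for Codex, GEMINI.md for Gemini CLI) are my understanding of current conventions and may need adjusting per tool version.

```python
from pathlib import Path

# One canonical instruction file; each tool reads its own filename.
# Target filenames are assumptions based on current tool conventions.
CANONICAL = Path("CLAUDE.md")
TOOL_NAMES = ["AGENTS.md", "GEMINI.md"]  # Codex, Gemini CLI

def link_instructions(repo: Path) -> None:
    """Symlink per-tool instruction files to the canonical CLAUDE.md."""
    source = repo / CANONICAL
    if not source.exists():
        raise FileNotFoundError(f"{source} missing; create it first")
    for name in TOOL_NAMES:
        link = repo / name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace any stale copy with a fresh link
        link.symlink_to(CANONICAL)  # relative link within the repo
```

This keeps the mechanical drift in check; it does nothing for the deeper problem that the prompt *content* itself is tuned per model.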
Studies show that even minor formatting changes can cause up to a 76-point accuracy difference in model outputs [6], and these format preferences don't transfer between models. Automated prompt translation improves cross-model performance by 27% on SWE-Bench and 39% on Terminal-Bench [7], which underscores how differently models approach problems. Teams using different platforms are therefore investing in parallel but largely incompatible directions.
I have seen this first hand at Shield. Even with the latest frontier models, the upgrade tooling I created for Claude to perform HMSDK upgrades flat out didn't work in ChatGPT (November 2025). Even after multiple iterations of the tool in Codex, it could not grasp the task or how to use the tooling I had built alongside Claude.
Anthropic Dominates in Tooling Innovation
Anthropic has consistently been first to ship capabilities that matter for engineering:
| Innovation | Anthropic Shipped | Others Followed |
|---|---|---|
| Computer Use | Oct 2024 | OpenAI Operator (Jan 2025) |
| MCP (Tool Protocol) | Nov 2024 | OpenAI (Mar 2025), Google (Apr 2025), VS Code (Jul 2025) |
| Claude Code | Feb 2025 | OpenAI Codex (May 2025) |
| Hooks System | Jul 2025 | Windsurf Cascade Hooks (Dec 2025) |
| Plugin Marketplace | Oct 2025 | No competitors |
| Cowork | Jan 2026 | No competitors — agentic file operations for non-engineers |
OpenAI and Microsoft led on the older innovations (chat interface, code completion), but Anthropic has dominated the 2024-2025 wave of agentic tooling. MCP is now a Linux Foundation project; Anthropic created the standard everyone else adopted.
The real gap isn't just being first to ship; it's iteration velocity. When OpenAI adopted MCP in Mar 2025, Anthropic had already shipped Claude Code. When Codex shipped in May 2025, Claude Code was already iterating on plan mode and background tasks. By the time competitors release their v1, Anthropic is on v2 or v3.
Productivity Expectations
GenAI is making a real, measurable impact across industries. 90% of Fortune 100 companies have deployed AI coding tools [8], ~85% of developers use them [9], [10], and the market has grown to $7.4 billion [11].
The biggest wins come from automating process work. Teams using AI tools ship 26% more PRs per week [12] with 4x faster turnaround [13].
Defense primes have already scaled. Lockheed has 70,000 users [14] on its Genesis platform. Boeing deployed to 170,000 employees [1] by late 2023. Blue Origin reports 95% of software engineers using AI tools with 70% company-wide adoption [15].
ROI Math
At ~$60/seat/month for Claude Enterprise ($720/year), even a conservative 20% productivity gain on a $200K engineer creates $40,000 in value, a 55x return. Novo Nordisk reduced clinical report writing from 10+ weeks to 10 minutes [16]. TELUS engineering teams ship code 30% faster [17]. Pfizer saves up to 16,000 hours annually [18].
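The arithmetic behind that return, using the figures above (the 20% gain and the $200K fully loaded cost are the document's stated assumptions, not measurements):

```python
seat_cost_per_year = 60 * 12          # Claude Enterprise, ~$60/seat/month
engineer_cost = 200_000               # fully loaded annual cost (assumption)
productivity_gain = 0.20              # conservative estimate

value_created = engineer_cost * productivity_gain   # $40,000
roi_multiple = value_created / seat_cost_per_year   # ~55x

print(f"value: ${value_created:,.0f}, ROI: {roi_multiple:.1f}x")
```

Even if the true gain is a quarter of that estimate, the seat still pays for itself more than tenfold.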
Agentic tools like Claude Code cost more (up to $1000/seat/month at heavy usage) but unlock work that wasn’t previously viable. Public ROI data is limited, so here are real examples from Shield:
HMSDK Upgrades: Upgrading between SDK versions is a significant undertaking. Claude Code enabled AI agents to work around the clock on massive upgrades. This wasn’t about speed. Without AI, these upgrades wouldn’t have been feasible at all. ROI: 5x faster, but more importantly, AI made this work viable when it otherwise wasn’t.
Customer Engagement Acceptance Testing: CE has used Claude Code for SDK acceptance testing since 25.3. It enables fast bug discovery, root cause analysis that CE traditionally couldn’t prioritize, and async execution that expands testing scope. ROI: A bad SDK release could jeopardize million dollar contracts.
Training Material Validation: HMSDK training materials take a new engineer two weeks to complete, making them difficult to keep current. ROI: full validation of the basic training materials takes one hour in a CI pipeline versus a full day for an engineer. Future iterations have massive potential.
Not All Models Are Made the Same
Benchmarking models is hard, so I won't go too deep into it here. Models will likely continue to leapfrog each other over time while staying within a close margin. Tooling, however, is where we are seeing real stratification, and it is what should drive our decision.
Models are trained on solved problems; the problems at Shield aren't solved yet. For this reason we should place a strong emphasis on benchmarks that involve novel problem solving and multi-step software engineering tasks. We should also emphasize performance over price: a small intelligence gain is worth the extra cost given our domain.
SWE-bench Verified [19] is the industry standard benchmark for evaluating AI on real-world software engineering. It tests models against 500 actual GitHub issues from popular Python repositories—the model must read the issue, understand the codebase, and produce a working patch. A score of 70%+ means the model can autonomously resolve most real bugs and feature requests.
| Model | Score | Cost/Instance | GovCloud |
|---|---|---|---|
| Gemini 3 Pro | 76.2% | $0.46 | IL6+ |
| Claude 4.5 Opus | 74.4% | $0.72 | Not Available |
| GPT-5.2 (high reasoning) | 71.8% | $0.52 | IL6+ |
| Claude 4.5 Sonnet | 70.6% | $0.56 | IL5¹ |
Considered Options
Platform Comparison
| Capability | Anthropic | OpenAI | Google | Cursor | Windsurf |
|---|---|---|---|---|---|
| Chat WebUI | Claude.ai | ChatGPT | Gemini | ✗ | ✗ |
| Agentic Chat | Cowork (Jan 2026) | ✗ | ✗ | ✗ | ✗ |
| Desktop App | Claude Desktop | ChatGPT Desktop | Gemini Desktop | ✗ | ✗ |
| AI IDE | Extension | Extension | Antigravity | Cursor | Windsurf |
| CLI Agent | Claude Code | Codex CLI | Gemini CLI | ✗ | ✗ |
| Python SDK | anthropic | openai | google-genai | ✗ | ✗ |
| Embedding Models | ✗ | text-embedding-3 | text-embedding | ✗ | ✗ |
| Image Generation | ✗ | DALL-E | Imagen | ✗ | ✗ |
| Agent SDK | Agent SDK | Agents SDK | ADK | ✗ | ✗ |
| MCP Support | Native (creator) | Mar 2025 | Apr 2025 | Yes | Yes |
| Hooks/Automation | Jul 2025 | ✗ | ✗ | ✗ | Dec 2025 |
| Plugin Marketplace | Oct 2025 | ✗ | ✗ | ✗ | ✗ |
| Computer Use | Oct 2024 | Operator (Jan 2025) | ✗ | ✗ | ✗ |
| Background Agents | Yes | Yes | Jules | Yes | Yes |
| Model Agnostic | ✗ | ✗ | ✗ | Yes | Yes |
| Enterprise SSO | Yes | Yes | Yes | Yes | Yes |
| IL5 Authorization | Yes (Bedrock) | Yes (Azure) | Yes (GDC) | ✗ | Yes |
| IL6+/Classified | In pilot | Yes | Yes | ✗ | In pilot |
| Pro Pricing | $20/mo | $20/mo | $20/mo | $20/mo | $15/mo |
| Enterprise Pricing | ~$60/seat | ~$60/seat | Contact | $40/seat | Contact |
Platforms
Three frontier providers are worth considering. All offer similar capabilities at similar price points; the differentiators matter:
Anthropic leads in agentic tooling. They shipped MCP, Claude Code, hooks, and the plugin marketplace before anyone else. Competitors follow 6-12 months behind. If you want the latest capabilities for autonomous engineering workflows, Anthropic gets there first. Downside: IL6+/classified support is still in pilot.
OpenAI has the broadest ecosystem. If you need to integrate with existing enterprise tools or want the safest vendor choice, OpenAI has the most established relationships. Downside: consistently 6-12 months behind on agentic features, Copilot still lacks IL5 authorization, and their focus skews toward general audiences rather than specialized engineering work.
Google has the deepest government presence. Selected for GenAI.mil serving 3M+ DoD personnel, IL6+ authorized, and tightly integrated with Google Cloud. If GovCloud and classified work are priorities, Google has the strongest position. Downside: agentic tooling is newer and less mature, and enterprise pricing is opaque.
Security and GovCloud
Private Plugin Marketplace: Claude Code is the only tool that lets us host a private marketplace on internal GitLab, distributing proprietary tooling automatically. No competitor offers this.
FedRAMP is no longer the bottleneck. FedRAMP 20x [20] (March 2025) replaced paper processes with automation, and pilot participants have received authorization in under two months rather than years.
Decision Outcome
Adopt the Anthropic ecosystem company-wide:
- Claude Enterprise for all employees (chat, research, general use)
- Claude Code for engineering (agentic coding, automation)
- Agent SDK for custom automation workflows
Every new employee gets a Claude Enterprise subscription. Engineers get Claude Code API keys through the central account.
Implementation
- Procurement: Negotiate Claude Enterprise agreement
- Rollout: Phase 1 (engineering), Phase 2 (all employees)
- Training: Internal docs, CLAUDE.md templates, MCP server examples
- Plugin Marketplace: Stand up internal GitLab-hosted marketplace
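For the marketplace step, Claude Code discovers plugins from a marketplace manifest checked into the host repo. A minimal sketch of that manifest (the schema shown reflects my understanding of the current `.claude-plugin/marketplace.json` format and should be verified against Anthropic's plugin documentation; the names and paths are placeholders for our GitLab layout):

```json
{
  "name": "shield-internal",
  "owner": { "name": "Shield Platform Team" },
  "plugins": [
    {
      "name": "hmsdk-upgrade",
      "source": "./plugins/hmsdk-upgrade",
      "description": "Agents and commands for HMSDK version upgrades"
    }
  ]
}
```

Hosting this on internal GitLab means proprietary tooling like the HMSDK upgrade agents ships to every engineer automatically.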
References
Footnotes
¹ IL5 encompasses FedRAMP High, CUI, IL4, and IL5 authorizations; essentially all unclassified government work.