Context and Problem Statement
Adoption of GenAI at Shield is lagging and uncoordinated. We have no standard tooling, while defense primes have deployed to 170,000+ employees over the past 3 years [1], [2], [3] and China's military has already deployed domestic AI models at scale [4], [5]. Meanwhile, Shield's developer onboarding doesn't mention AI, there is no standard GenAI software for new employees, and our engineers have no clear guidance on acceptable usage.
I propose not only that we align on a single AI platform, but that we adopt the models and tools offered by Anthropic. Anthropic has consistently led in innovation and novel problem solving over the past two years, which is exactly the type of work we do here at Shield.
Decision Drivers
Tools Lack Cross-Compatibility
Every tool has its own naming conventions, formats, and toolsets (MCP servers, agents, skills, rules). Those differences are solvable with symlinks and scripts. The real blocker is that prompts don’t transfer between models.
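As a sketch of the "solvable with symlinks" half of the problem: keep one canonical instruction file and link each tool's expected filename to it. The target filenames here (AGENTS.md for Codex, GEMINI.md for Gemini CLI) are my understanding of current conventions and may need adjusting per tool version.

```python
from pathlib import Path

# One canonical instruction file; each tool reads its own filename.
# Target filenames are assumptions based on current tool conventions.
CANONICAL = Path("CLAUDE.md")
TOOL_NAMES = ["AGENTS.md", "GEMINI.md"]  # Codex, Gemini CLI

def link_instructions(repo: Path) -> None:
    """Symlink per-tool instruction files to the canonical CLAUDE.md."""
    source = repo / CANONICAL
    if not source.exists():
        raise FileNotFoundError(f"{source} missing; create it first")
    for name in TOOL_NAMES:
        link = repo / name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace any stale copy with a fresh link
        link.symlink_to(CANONICAL)  # relative link within the repo
```

This keeps the mechanical drift in check; it does nothing for the deeper problem that the prompt *content* itself is tuned per model.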
Studies show that even minor formatting changes can cause up to a 76-point accuracy difference in model outputs [6], and these format preferences don't transfer between models. Automated prompt translation improves cross-model performance by 27% on SWE-Bench and 39% on Terminal-Bench [7], which underscores how differently models approach problems. Teams using different platforms are therefore investing in parallel but largely incompatible directions.
I have seen this first hand at Shield. Even with the latest frontier models, the upgrade tooling I created for Claude to perform HMSDK upgrades flat out didn't work in ChatGPT (November 2025). Even after multiple iterations of the tool in Codex, it could not grasp the task or how to use the tooling I had built alongside Claude.
Anthropic Dominates in Tooling Innovation
Anthropic has consistently been first to ship capabilities that matter for engineering:
| Innovation | Anthropic Shipped | Others Followed |
|---|---|---|
| Computer Use | Oct 2024 | OpenAI Operator (Jan 2025) |
| MCP (Tool Protocol) | Nov 2024 | OpenAI (Mar 2025), Google (Apr 2025), VS Code (Jul 2025) |
| Claude Code | Feb 2025 | OpenAI Codex (May 2025) |
| Hooks System | Jul 2025 | Windsurf Cascade Hooks (Dec 2025) |
| Plugin Marketplace | Oct 2025 | No competitors |
| Cowork | Jan 2026 | No competitors — agentic file operations for non-engineers |
OpenAI and Microsoft led on the older innovations (chat interface, code completion), but Anthropic has dominated the 2024-2025 wave of agentic tooling. MCP is now a Linux Foundation project; Anthropic created the standard everyone else adopted.
The real gap isn't just being first to ship; it's iteration velocity. When OpenAI adopted MCP in Mar 2025, Anthropic had already shipped Claude Code. When Codex shipped in May 2025, Claude Code was already iterating on plan mode and background tasks. By the time competitors release their v1, Anthropic is on v2 or v3.
Productivity Expectations
GenAI is making a real, measurable impact across industries. 90% of Fortune 100 companies have deployed AI coding tools [8], ~85% of developers use them [9], [10], and the market has grown to $7.4 billion [11].
The biggest wins come from automating process work. Teams using AI tools ship 26% more PRs per week [12] with 4x faster turnaround [13].
Defense primes have already scaled. Lockheed has 70,000 users [14] on its Genesis platform. Boeing deployed to 170,000 employees [1] by late 2023. Blue Origin reports 95% of software engineers using AI tools with 70% company-wide adoption [15].
ROI Math
At ~$60/seat/month for Claude Enterprise ($720/year), even a conservative 20% productivity gain on a $200K engineer creates $40,000 in value, a 55x return. Novo Nordisk reduced clinical report writing from 10+ weeks to 10 minutes [16]. TELUS engineering teams ship code 30% faster [17]. Pfizer saves up to 16,000 hours annually [18].
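The arithmetic behind that return, using the figures above (the 20% gain and the $200K fully loaded cost are the document's stated assumptions, not measurements):

```python
seat_cost_per_year = 60 * 12          # Claude Enterprise, ~$60/seat/month
engineer_cost = 200_000               # fully loaded annual cost (assumption)
productivity_gain = 0.20              # conservative estimate

value_created = engineer_cost * productivity_gain   # $40,000
roi_multiple = value_created / seat_cost_per_year   # ~55x

print(f"value: ${value_created:,.0f}, ROI: {roi_multiple:.1f}x")
```

Even if the true gain is a quarter of that estimate, the seat still pays for itself more than tenfold.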
Agentic tools like Claude Code cost more (up to $1000/seat/month at heavy usage) but unlock work that wasn’t previously viable. Public ROI data is limited, so here are real examples from Shield:
HMSDK Upgrades: Upgrading between SDK versions is a significant undertaking. Claude Code enabled AI agents to work around the clock on massive upgrades. This wasn’t about speed. Without AI, these upgrades wouldn’t have been feasible at all. ROI: 5x faster, but more importantly, AI made this work viable when it otherwise wasn’t.
Customer Engagement Acceptance Testing: CE has used Claude Code for SDK acceptance testing since 25.3. It enables fast bug discovery, root cause analysis that CE traditionally couldn’t prioritize, and async execution that expands testing scope. ROI: A bad SDK release could jeopardize million dollar contracts.
Training Material Validation: HMSDK training materials take a new engineer two weeks to complete, making them difficult to keep current. ROI: full validation of the basic training materials takes one hour in a CI pipeline versus a full day for an engineer. Future iterations have massive potential.
Not All Models Are Made the Same
Benchmarking models is hard, so I won't go too deep into it here. Models will likely continue to leapfrog each other over time while staying within a close margin. Tooling, however, is where we are seeing real stratification, and it is what should drive our decision.
Models are trained on solved problems; the problems at Shield aren't solved yet. For this reason we should place a strong emphasis on benchmarks that involve novel problem solving and multi-step software engineering tasks. We should also emphasize performance over price: a small intelligence gain is worth the extra cost given our domain.
SWE-bench Verified [19] is the industry standard benchmark for evaluating AI on real-world software engineering. It tests models against 500 actual GitHub issues from popular Python repositories—the model must read the issue, understand the codebase, and produce a working patch. A score of 70%+ means the model can autonomously resolve most real bugs and feature requests.
| Model | Score | Cost/Instance | GovCloud |
|---|---|---|---|
| Gemini 3 Pro | 76.2% | $0.46 | IL6+ |
| Claude 4.5 Opus | 74.4% | $0.72 | Not Available |
| GPT-5.2 (high reasoning) | 71.8% | $0.52 | IL6+ |
| Claude 4.5 Sonnet | 70.6% | $0.56 | IL5¹ |
Considered Options
Platform Comparison
| Capability | Anthropic | OpenAI | Google | Cursor | Windsurf |
|---|---|---|---|---|---|
| Chat WebUI | Claude.ai | ChatGPT | Gemini | ✗ | ✗ |
| Agentic Chat | Cowork (Jan 2026) | ✗ | ✗ | ✗ | ✗ |
| Desktop App | Claude Desktop | ChatGPT Desktop | Gemini Desktop | ✗ | ✗ |
| AI IDE | Extension | Extension | Antigravity | Cursor | Windsurf |
| CLI Agent | Claude Code | Codex CLI | Gemini CLI | ✗ | ✗ |
| Python SDK | anthropic | openai | google-genai | ✗ | ✗ |
| Embedding Models | ✗ | text-embedding-3 | text-embedding | ✗ | ✗ |
| Image Generation | ✗ | DALL-E | Imagen | ✗ | ✗ |
| Agent SDK | Agent SDK | Agents SDK | ADK | ✗ | ✗ |
| MCP Support | Native (creator) | Mar 2025 | Apr 2025 | Yes | Yes |
| Hooks/Automation | Jul 2025 | ✗ | ✗ | ✗ | Dec 2025 |
| Plugin Marketplace | Oct 2025 | ✗ | ✗ | ✗ | ✗ |
| Computer Use | Oct 2024 | Operator (Jan 2025) | ✗ | ✗ | ✗ |
| Background Agents | Yes | Yes | Jules | Yes | Yes |
| Model Agnostic | ✗ | ✗ | ✗ | Yes | Yes |
| Enterprise SSO | Yes | Yes | Yes | Yes | Yes |
| IL5 Authorization | Yes (Bedrock) | Yes (Azure) | Yes (GDC) | ✗ | Yes |
| IL6+/Classified | In pilot | Yes | Yes | ✗ | In pilot |
| Pro Pricing | $20/mo | $20/mo | $20/mo | $20/mo | $15/mo |
| Enterprise Pricing | ~$60/seat | ~$60/seat | Contact | $40/seat | Contact |
Platforms
Three frontier providers are worth considering. All offer similar capabilities at similar price points; the differentiators matter:
Anthropic leads in agentic tooling. They shipped MCP, Claude Code, hooks, and the plugin marketplace before anyone else. Competitors follow 6-12 months behind. If you want the latest capabilities for autonomous engineering workflows, Anthropic gets there first. Downside: IL6+/classified support is still in pilot.
OpenAI has the broadest ecosystem. If you need to integrate with existing enterprise tools or want the safest vendor choice, OpenAI has the most established relationships. Downside: consistently 6-12 months behind on agentic features, Copilot still lacks IL5 authorization, and their focus skews toward general audiences rather than specialized engineering work.
Google has the deepest government presence. Selected for GenAI.mil serving 3M+ DoD personnel, IL6+ authorized, and tightly integrated with Google Cloud. If GovCloud and classified work are priorities, Google has the strongest position. Downside: agentic tooling is newer and less mature, and enterprise pricing is opaque.
Security and GovCloud
Private Plugin Marketplace: Claude Code is the only tool that lets us host a private marketplace on internal GitLab, distributing proprietary tooling automatically. No competitor offers this.
FedRAMP is no longer the bottleneck. FedRAMP 20x [20] (March 2025) replaced paper processes with automation, and pilot participants have received authorization in under two months rather than years.
Decision Outcome
Adopt the Anthropic ecosystem company-wide:
- Claude Enterprise for all employees (chat, research, general use)
- Claude Code for engineering (agentic coding, automation)
- Agent SDK for custom automation workflows
Every new employee gets a Claude Enterprise subscription. Engineers get Claude Code API keys through the central account.
Implementation
- Procurement: Negotiate Claude Enterprise agreement
- Rollout: Phase 1 (engineering), Phase 2 (all employees)
- Training: Internal docs, CLAUDE.md templates, MCP server examples
- Plugin Marketplace: Stand up internal GitLab-hosted marketplace
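For the marketplace step, Claude Code discovers plugins from a marketplace manifest checked into the host repo. A minimal sketch of that manifest (the schema shown reflects my understanding of the current `.claude-plugin/marketplace.json` format and should be verified against Anthropic's plugin documentation; the names and paths are placeholders for our GitLab layout):

```json
{
  "name": "shield-internal",
  "owner": { "name": "Shield Platform Team" },
  "plugins": [
    {
      "name": "hmsdk-upgrade",
      "source": "./plugins/hmsdk-upgrade",
      "description": "Agents and commands for HMSDK version upgrades"
    }
  ]
}
```

Hosting this on internal GitLab means proprietary tooling like the HMSDK upgrade agents ships to every engineer automatically.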
References
Footnotes
¹ IL5 encompasses FedRAMP High, CUI, IL4, and IL5 authorizations; essentially all unclassified government work.