5 papers worth your attention today
1. Credential Leakage in LLM Agent Skills
Zhihao Chen et al. — arXiv:2604.03070
The biggest empirical security study of the LLM agent skills ecosystem to date. 17,022 skills analyzed. 520 vulnerable. 1,708 security issues. The headline finding: 73.5% of credential leakage comes from one mechanism — debug logging that gets piped into the LLM's context window via stdout capture.
Most security scanners can't catch this because 76.3% of leakage cases require joint analysis of code and natural language descriptions. Code-only static analysis misses the majority of issues.
Why it matters: Every developer has written print(api_key) while debugging. In agent frameworks, that line becomes a permanent backdoor — the framework captures stdout, the LLM ingests it, and any follow-up query can retrieve it. This isn't a developer education problem. It's an architectural mismatch between debugging conventions that have existed for decades and a new framework layer that turned stdout into model context. It's the SQL injection of the agent era.
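The mechanism is easy to reproduce. This is a minimal sketch (the skill and framework code below are hypothetical, not from the paper): a leftover debug print in a skill, plus the common framework pattern of capturing stdout and feeding it back as tool output, is enough to put a credential into model-visible context.

```python
import contextlib
import io

# Hypothetical skill with a debug print left in by the developer.
def fetch_weather(city: str, api_key: str) -> str:
    print(f"DEBUG: calling weather API with key={api_key}")  # the leak
    return f"Sunny in {city}"

# Common agent-framework pattern: run the skill, capture its stdout,
# and hand everything back to the model as tool output.
def run_skill_and_build_context(skill, *args):
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        result = skill(*args)
    # stdout, including the debug line, becomes model-visible context
    return f"Tool output:\n{buf.getvalue()}{result}"

context = run_skill_and_build_context(fetch_weather, "Paris", "sk-secret-123")
assert "sk-secret-123" in context  # the credential is now in context
```

Note that nothing here is malicious; the leak falls out of two reasonable-looking conventions composed together, which is the paper's point.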
2. GrandCode: Grandmaster-Level Competitive Programming via Agentic RL
Xiaoya Li et al. — arXiv:2604.02721
First place in 3 consecutive live Codeforces rounds (1087-1089, March 2026). Beat all human participants including legendary grandmasters. This is live competition, not a benchmark replay.
The interesting finding isn't that they won. It's that without their RL pipeline, the same model scores below Gemini 3.1 Pro (72% vs 75% on internal benchmarks). The model isn't the differentiator — Agentic GRPO is. Their extension of GRPO to multi-stage agent rollouts with delayed rewards is what pushed them past humans.
Why it matters: Reinforces the thesis that infrastructure matters more than model capability. Same model + better RL pipeline = grandmaster level. This is the same pattern as Kitchen Loop and Meta-Harness — the wrapper is doing the work, not the LLM.
What deserves skepticism: only 3 contests in one week, no Elo rating disclosed, compute costs hidden, and they violated Codeforces' AI policy to enter. Real result, qualified context.
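The core of GRPO is simple enough to sketch. This is a toy illustration of the group-relative advantage idea, not GrandCode's actual pipeline: sample several rollouts per problem, normalize each rollout's reward against its group, and (in the agentic, delayed-reward setting) credit every step of a rollout with that rollout's advantage. The rollout lengths and rewards below are made up.

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages: normalize each rollout's reward
    against the group of rollouts sampled for the same problem."""
    mean = statistics.mean(group_rewards)
    std = statistics.stdev(group_rewards) or 1.0  # guard all-equal groups
    return [(r - mean) / std for r in group_rewards]

def credit_steps(rollout_lengths, group_rewards):
    """Delayed reward: only the final verdict (e.g. judge acceptance)
    yields a scalar, so every step in a rollout gets the same advantage."""
    advs = grpo_advantages(group_rewards)
    return [[a] * n for a, n in zip(advs, rollout_lengths)]

# Four multi-step rollouts of the same problem; only the first passed.
per_step = credit_steps([3, 5, 2, 4], [1.0, 0.0, 0.0, 0.0])
```

The winning rollout's steps all get a positive advantage (1.5 here) and the failing rollouts' steps a negative one, without any learned value function, which is what makes GRPO attractive for long agentic trajectories.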
3. Opal: Private Memory for Personal AI
Darya Kaviani et al. (UC Berkeley + Google DeepMind) — arXiv:2604.02522
Solves a problem most personal AI products don't even attempt: even with encrypted personal data, the pattern of what you retrieve leaks information. If your AI queries your medical records every Monday morning, an observer doesn't need to decrypt anything — they already know you have a recurring appointment.
Opal puts all data-dependent reasoning inside a trusted hardware enclave. The untrusted storage only sees fixed, oblivious memory accesses (ORAM). Maintenance operations like reindexing piggyback on routine accesses so even housekeeping doesn't leak patterns.
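Opal uses a real ORAM construction; as a toy intuition only (this is not Opal's scheme, and real ORAM also re-encrypts and rewrites blocks), here is why fixed access patterns hide queries: if every logical read touches the same number of physical slots, the storage layer observes identical-looking traces whether you fetched medical records or a grocery list.

```python
import random

class ObliviousStoreSketch:
    """Toy access-pattern hiding, NOT Opal's actual ORAM: every read
    touches a fixed number of physical slots in random order."""

    READS_PER_QUERY = 8  # constant fan-out, independent of the query

    def __init__(self, blocks):
        # untrusted storage: a flat list of (key, value) blocks
        self.blocks = list(blocks)

    def read(self, key):
        real = next(i for i, (k, _) in enumerate(self.blocks) if k == key)
        # pad with dummy slots so exactly READS_PER_QUERY slots are touched
        dummies = random.sample(
            [i for i in range(len(self.blocks)) if i != real],
            self.READS_PER_QUERY - 1,
        )
        value = None
        touched = random.sample([real] + dummies, self.READS_PER_QUERY)
        for i in touched:  # the observer sees only these indices
            k, v = self.blocks[i]
            if k == key:
                value = v
        return value

store = ObliviousStoreSketch([(f"rec{i}", i) for i in range(64)])
assert store.read("rec7") == 7
```

Even this toy version makes the cost model visible: hiding the pattern multiplies physical reads per logical read, which is why Opal's 29x throughput improvement over a secure baseline is the headline engineering result.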
The results: 60.5% retrieval accuracy (beating the insecure Graphiti baseline at 56.9%), plus 29x higher throughput and 15x lower infrastructure cost than a secure baseline. Under consideration for deployment to "millions of users at a major AI provider" — almost certainly Google.
Why it matters: This is what production-grade private AI memory actually looks like. Most products today use the word "private" to mean "we encrypt the data" — which doesn't address access pattern leakage at all. The bar is moving.
4. IndustryCode: Reality Check on Code Generation Benchmarks
Puyu Zeng et al. — arXiv:2604.02729
First comprehensive industrial code benchmark spanning finance, automation, aerospace, and remote sensing across MATLAB, Python, C++, and Stata. 579 sub-problems from 125 real industrial challenges. Top model (Claude 4.5 Opus) achieves 68.1% on sub-problems but only 42.5% when those sub-problems must compose into full projects.
That 25.6-point drop is the diagnostic finding. Models can solve pieces. They can't compose them.
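A back-of-envelope check (my arithmetic, not the paper's analysis): if sub-problem solutions failed independently, a project needing k sub-problems would pass with probability p**k. At p = 0.681 that naive model lands close to the observed project score for k around 2.

```python
# Naive independence model, not the paper's analysis: a project
# needing k independently-solved sub-problems passes with p**k.
p = 0.681  # IndustryCode sub-problem pass rate for the top model

for k in (1, 2, 3):
    print(f"k={k}: predicted project pass rate = {p**k:.3f}")
# k=2 already predicts ~0.464, near the observed 0.425, so much of
# the drop is consistent with failures simply compounding.
```

The gap below even the independence prediction suggests composition introduces failure modes of its own (interfaces, shared state) beyond compounding per-piece errors.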
Other patterns from their error analysis:
- Syntax errors (32.8%) dominate over reasoning failures (8.4%) — fluency with specialized languages matters more than abstract reasoning
- Thinking mode is a double-edged sword — eliminates reasoning failures (11.8% → 0.8%) but causes context confusion to spike from 1.9% to 29.2%. Thinking harder makes the model lose track of the problem.
- No single model wins everywhere — Claude leads overall, Gemini-3-pro leads on MATLAB (77.9%), Qwen3-Max (open-source) beats GPT-5.2 on C++
Why it matters: When models score 90%+ on HumanEval and LiveCodeBench, it's tempting to think coding is solved. IndustryCode shows the gap between "solve the algorithmic puzzle" and "build the chip simulator that actually compiles in MATLAB." Industrial code remains hard.
5. Haiku to Opus in 10 Bits
Roy Rinberg et al. — arXiv:2604.02343
Reframes model capability as compressible information. The mechanism: a smaller model iteratively asks yes/no questions to a larger model, transmitting 1 bit per answer. After 10 binary questions, the smaller model recovers 23-72% of the capability gap with the larger model on standard benchmarks (7-38% on harder ones).
Compression ratios of 0.0006-0.004 — over 100x smaller than prior LLM-based compression.
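The information-theoretic intuition can be sketched directly (this is the generic 20-questions bound, not the paper's actual protocol, where the small model crafts its own questions): n yes/no answers distinguish at most 2**n candidates, so a 10-bit budget can, at best, select one of 1024 candidate answers via bisection.

```python
# Toy sketch: n binary questions distinguish at most 2**n candidates,
# e.g. by bisecting a candidate list with a yes/no oracle standing in
# for the larger model.
def select_with_bits(candidates, oracle, n_bits):
    lo, hi = 0, len(candidates) - 1
    for _ in range(n_bits):
        if lo == hi:
            break
        mid = (lo + hi) // 2
        # "Is the answer in the first half?" costs one bit
        if oracle(candidates[lo:mid + 1]):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]

# A 10-bit budget suffices for up to 2**10 = 1024 candidates.
candidates = list(range(1024))
target = 777
assert select_with_bits(candidates, lambda half: target in half, 10) == target
```

The paper's surprise is that 10 well-chosen bits recover so much of the capability gap at all; the bound above says those bits can only ever select among about a thousand hypotheses, so the small model must already generate the right candidates.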
The skeptical read: On easier benchmarks, the self-refinement baseline (just asking the smaller model to think harder) captures most of the gains. The structured question-asking process helps more than Opus's actual answers. On AIME competition math, Opus's answers actually hurt — the smaller model does better refining its own thinking.
Why it matters: The interesting finding isn't compression. It's that the capability delta between small and large models is surprisingly sparse on structured tasks. If 10 bits captures half the gap, what is the rest of the larger model's parameter count buying? This has implications for distillation, model serving, and understanding what scale actually delivers.
Themes Today
Infrastructure beats model. GrandCode and the credential leakage papers tell the same story from opposite directions. GrandCode shows that the same model becomes grandmaster-level with the right RL infrastructure around it. The credential leakage paper shows that the same models become catastrophically insecure with the wrong framework infrastructure around them. The model isn't the variable. The wrapper is.
Reality checks on benchmarks. IndustryCode and the Haiku-to-Opus paper both expose the gap between aggregate scores and actual capability. IndustryCode: models score 68% on pieces and 42% on wholes. Haiku-to-Opus: 10 bits captures most of the model gap on easy benchmarks because the gap was small to begin with. Aggregate metrics hide the failure modes that matter.
Privacy is moving from "encrypt the data" to "hide the access patterns." Opal is the first production-track paper I've seen that takes oblivious access patterns seriously. Most personal AI products would fail this test entirely. The bar is moving and most products don't know it yet.