
Auto claude code research in sleep

ARIS ⚔️ (Auto-Research-In-Sleep) — Lightweight Markdown-only skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation. No framework, no lock-in — works with Claude Code, Codex, OpenClaw, or any LLM agent.

6,322 · by @wanshuiyin · MIT · GitHub →

Installation

Claude Code:

claude mcp add auto-claude-code-research-in-sleep -- npx -y @openai/codex

npx:

npx -y @openai/codex

npm: @openai/codex

Transport: sse

Tools (20)

Documentation

Auto-claude-code-research-in-sleep (ARIS ⚔️🌙)

💡 Use ARIS in Claude Code / Cursor / Trae as a skill-based workflow, or get the full experience with the standalone CLI — enjoy any way you like!

🤖 AI agents: Read AGENT_GUIDE.md instead — structured for LLM consumption, not human browsing.

🔥 ARIS-Code CLI — standalone installation (独立安装版) · English | ⬇️ Download

📰 ARIS-Code v0.3.10 (2026-04-11) — Proxy/custom base URL (CCSwitch) | Local models (LM Studio/Ollama) | Research Wiki | Meta-Optimize | Atomic sessions | Bash safety | Windows (experimental)

v0.3.0 (2026-04-03) — Multi-file memory index | Rich task system (TodoWrite) | /plan | Security hardening

v0.2.2 (2026-04-03) — /plan step-by-step planning | /tasks persistent tracking

v0.2.1 (2026-04-03) — Persistent Memory | Kimi K2.5 multi-turn fix | CJK cursor fix

v0.2.0 (2026-04-02) — Open source | Kimi + MiniMax + GLM support | Smart LlmReview routing | CI/CD

v0.1.0 (2026-04-02) — Initial release | Multi-executor & reviewer | 42 bundled skills

Chinese README (中文版) | English

🌙 Let Claude Code do research while you sleep. Wake up to find your paper scored, weaknesses identified, experiments run, and narrative rewritten — autonomously.

🪶 Radically lightweight — zero dependencies, zero lock-in. The entire system is plain Markdown files. No framework to learn, no database to maintain, no Docker to configure, no daemon to babysit. Every skill is a single SKILL.md readable by any LLM — swap Claude Code for Codex CLI, OpenClaw, Cursor, Trae, Antigravity, Windsurf, or your own agent and the workflows still work. Fork it, rewrite it, adapt it to your stack.

💡 ARIS is a methodology, not a platform. What matters is the research workflow — take it wherever you go. 🌱

💬 Join Community

Custom Claude Code skills for autonomous ML research workflows. These skills orchestrate cross-model collaboration — Claude Code drives the research while an external LLM (via Codex MCP) acts as a critical reviewer.

  • 🔀 Alternative model combinations (Kimi, LongCat, DeepSeek, etc.) — no Claude or OpenAI API required. For example, MiniMax-M2.7 + GLM-5 or GLM-5 + MiniMax-M2.7
  • 🤖 Codex CLI native — full skill set also available for OpenAI Codex
  • 🖱️ Cursor — works in Cursor too
  • 🖥️ Trae — ByteDance AI IDE
  • 🚀 Antigravity — Google's agent-first IDE
  • 🆓 Free tier via ModelScope — zero cost, zero lock-in

💭 Why not self-play with a single model? Using Claude Code subagents or agent teams for both execution and review is technically possible, but tends to fall into local minima — the same model reviewing its own patterns creates blind spots.

Think of it like adversarial vs. stochastic bandits: a single model self-reviewing is the stochastic case (predictable reward noise), while cross-model review is adversarial (the reviewer actively probes weaknesses the executor didn't anticipate) — and adversarial bandits are fundamentally harder to game.

💭 Why two models, not more? Two is the minimum needed to break self-play blind spots, and 2-player games converge to Nash equilibrium far more efficiently than n-player ones. Adding more reviewers increases API cost and coordination overhead with diminishing returns — the biggest gain is going from 1→2, not 2→4.

Claude Code's strength is fast, fluid execution; Codex (GPT-5.4 xhigh) is slower but more deliberate and rigorous in critique. These complementary styles — speed × rigor — produce better outcomes than either model talking to itself.

🎯 More Than Just a Prompt

These are full pipelines — you can also use each workflow independently. Already have an idea? Skip to Workflow 1.5. Have results? Jump to Workflow 3. Got reviews? Jump to Workflow 4. Want persistent memory? Enable Research Wiki. See Quick Start for all commands and Workflows for the full breakdown.

Basic mode — give ARIS a research direction, it handles everything:

/research-pipeline "factorized gap in discrete diffusion LMs"

🔥 Targeted mode — got a paper you want to improve? Give ARIS the paper + the code:

/research-pipeline "improve method X" — ref paper: https://arxiv.org/abs/2406.04329, base repo: https://github.com/org/project

ARIS reads the paper → finds its weaknesses → clones the codebase → generates ideas that specifically fix those weaknesses with that code → runs experiments → writes your paper. Like telling a research assistant: "read this paper, use this repo, find what's missing, and fix it."

Mix and match: ref paper only = "what can be improved?", base repo only = "what can I build with this code?", both = "improve this paper using this code."

🔥 Rebuttal mode — reviews just dropped? Don't panic. ARIS reads every concern, builds a strategy, and drafts a rebuttal that's grounded, structured, and under the character limit:

/rebuttal "paper/ + reviews" — venue: ICML, character limit: 5000

| Parameter | Default | What it does |
|-----------|---------|--------------|
| venue | ICML | Target venue (ICML/NeurIPS/ICLR/CVPR/ACL/AAAI/ACM) |
| character limit | — | Required. Hard character limit for rebuttal text |
| quick mode | false | Stop after parsing + strategy (Phase 0-3). See what reviewers want before drafting |
| auto experiment | false | Auto-run supplementary experiments via /experiment-bridge when reviewers ask for new evidence |
| max stress test rounds | 1 | How many times GPT-5.4 xhigh stress-tests the draft |
| max followup rounds | 3 | Per-reviewer follow-up round limit |

Three safety gates — rebuttal will NOT finalize if any fails:

  • 🔒 No fabrication — every claim maps to paper/review/user-confirmed result
  • 🔒 No overpromise — every promise is user-approved
  • 🔒 Full coverage — every reviewer concern is tracked

Two outputs: PASTE_READY.txt (exact char count, paste to venue) + REBUTTAL_DRAFT_rich.md (extended version for manual editing).
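To double-check the count locally before pasting, here is a minimal sketch. It assumes only the PASTE_READY.txt filename from above; the check_limit helper name and the 5000 value (taken from the example invocation earlier) are illustrative:

```shell
# check_limit FILE LIMIT: succeed if FILE is within LIMIT characters.
# Hypothetical helper; 5000 mirrors the example "character limit: 5000" above.
check_limit() {
  count=$(wc -c < "$1" | tr -d ' ')
  if [ "$count" -le "$2" ]; then
    echo "OK: $count/$2 characters"
  else
    echo "OVER LIMIT: $count/$2 characters" >&2
    return 1
  fi
}

# Typical use:
# check_limit PASTE_READY.txt 5000
```

Venues count characters differently (some include whitespace, some do not), so treat this as a rough guard rather than the final word.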

After acceptance — your paper is in, now prepare the presentation:

/paper-slides "paper/"     # → Beamer PDF + PPTX + speaker notes + Q&A prep
/paper-poster "paper/"     # → A0/A1 poster PDF + editable PPTX + SVG

💡 From idea to paper to podium — one toolchain. 🌱

🏆 Papers Built with ARIS

| Paper | Score | Venue | Author | Stack |
|-------|:-----:|-------|--------|-------|
| CS Paper | 8/10 "clear accept" | CS Conference | @DefanXue & @Monglitay | Claude Code + GPT-5.4 |
| AAAI Paper | 7/10 "good paper, accept" | AAAI 2026 Main Technical | @xinbo820-web | Pure Codex CLI |
| UAV-CC | Under review | IEEE TGRS | @wxx827 | Claude Opus 4.6 + Codex 5.4 xhigh + Cursor |

🎉 Built with ARIS — from idea to submission. Full details + PDFs →

📢 What's New

  • 2026-04-10 — ⚡ Effort Levels — effort: lite | balanced | max | beast. Controls work intensity across all skills: papers found, ideas generated, review rounds, writing depth. Codex reasoning stays xhigh always. beast = every knob to maximum for top-venue sprints. Default balanced = zero change for existing users. Details →
  • 2026-04-10 — 🔎 DeepXiv integration — progressive paper retrieval via DeepXiv CLI. Opt-in: — sources: deepxiv or — sources: all, deepxiv. Staged reading: search → brief → head → section. pip install deepxiv-sdk to enable. Community contribution by @DreamEnding
  • 2026-04-10 — 🛡️ /experiment-audit — cross-model experiment integrity verification. GPT-5.4 reads your eval scripts and results directly, checks for fake ground truth, self-normalized scores, phantom results, and scope inflation (#131, #57). Advisory — warns loudly, never blocks. /result-to-claim auto-reads audit if present. New experiment-integrity.md shared reference. The executor must never judge its own integrity.
  • 2026-04-10 — 🧠 tools/smart_update.sh — intelligent skill updater. Compares local vs upstream, detects personal customizations (server paths, API keys), only updates safe skills. bash tools/smart_update.sh --apply
  • 2026-04-10 — 🏆 Community paper: UAV-CC — first community paper with full PDF archived. UAV change captioning benchmark for IEEE TGRS by @wxx827. Stack: Claude Opus 4.6 + Codex 5.4 xhigh + Cursor. Papers now archived in community_papers/
  • 2026-04-08 — 📚 /research-wiki — persistent research knowledge base inspired by Karpathy's LLM Wiki. Accumulates papers, ideas, experiments, and claims across the entire research lifecycle with typed relationships. Wiki-aware hooks in /research-lit (ingest papers), /idea-creator (read wiki + write ideas back), and /result-to-claim (update claim status + trigger re-ideation). Failed ideas become anti-repetition memory. ARIS now learns from its mistakes.
  • 2026-04-05 — 🧬 /meta-optimize — outer-loop harness optimization for ARIS. Passively logs skill invocations, tool calls, failures, and parameter overrides via Claude Code hooks. Run /meta-optimize to analyze accumulated usage data and propose SKILL.md improvements — reviewer-gated, user-approved. Inspired by Meta-Harness (Lee et al., 2026). ARIS now optimizes itself.
  • 2026-04-04 — 🔧 Codex Plugin deep integration — /codex:rescue now auto-invoked when experiments fail (Workflow 1.5) or LaTeX won't compile (Workflow 3). GPT independently diagnoses the bug before Claude retries — two AI debuggers are better than one. Optional: codex exec powers nightmare review, /codex:rescue powers auto-debug. Setup →
  • 2026-04-03 — ☁️ Modal serverless GPU — no GPU? gpu: modal in CLAUDE.md, one command (modal run launcher.py), no SSH, no Docker, auto scale-to-zero. $30/month free tier — enough to try ARIS experiments without any hardware. pip install modal && modal setup and go. Community contribution by @zeyuzhangzyz
  • 2026-04-03 — 🎮 Reviewer Difficulty Levels — medium (default, unchanged), hard (reviewer memory + debate protocol), nightmare (GPT reads repo directly via codex exec — Claude can't hide anything). Use — difficulty: nightmare for maximum stress test before submission
  • 2026-03-30 — 🔥 Auto-debug & exhaust-before-surrender — experiment-bridge auto-diagnoses failures (OOM, import, CUDA, NaN) and retries up to 3×. Inspired by PUA

  • 2026-03-30 — ☁️ Vast.ai GPU rental — gpu: vast auto-rents cheapest GPU. By @YIHONG-JIN. 🔧 MiniMax M2.7 upgrade by @octo-patch

  • 2026-03-27 — 📄 IEEE venue support (9 families). 🔎 Semantic Scholar. By @ypd666

  • 2026-03-26 — 📄 Document-based input — RESEARCH_BRIEF.md auto-detect

  • 2026-03-24 — 📝 Workflow 4: /rebuttal — 7-phase pipeline, 3 safety gates

  • 2026-03-23 — 🔧 /training-check, /result-to-claim, /ablation-planner integrated. 📦 compact mode. By @JingxuanKang & @couragec

  • 2026-03-22 — 📋 Templates — input templates for every workflow. 📄 7 venue templates — CVPR, ACL, AAAI, ACM MM added. 🛡️ Anti-hallucination fix — Workflow 2 enforces DBLP → CrossRef → [VERIFY]. 🔗 base repo — clone a GitHub repo as base codebase (— base repo: https://github.com/org/project)

  • 2026-03-22 — 🔍 Codex + Gemini review guide — Codex executes, Gemini reviews via local gemini-review MCP bridge. CN

  • 2026-03-20 — 🚀 Antigravity adaptation guide — use ARIS skills in Google Antigravity (agent-first IDE). Community contribution by @PeppaPigw

  • 2026-03-20 — 🖥️ Trae adaptation guide — use ARIS skills in Trae (ByteDance AI IDE). Community contribution by @Prometheus-cotigo. 🔢 formula-derivation — Community contribution by @Falling-Flower

  • 2026-03-19 — 🖼️ paper-poster — Conference poster. Community contribution by @dengzhe-hou

  • 2026-03-19 — 🔗 Workflow 1.5 upgraded — /experiment-bridge GPT-5.4 code review. 📊 W&B fix

  • 2026-03-18 — 🎤 paper-slides + 🔁 Codex+Claude bridge + 🖱️ Cursor guide + 🤖 Codex CLI skills + 📝 grant-proposal + 🎨 paper-illustration (Gemini) + 📊 CitationClaw

  • 2026-03-17 — 🔧 Git code sync + 🆓 ModelScope guide + parameter pass-through

  • 2026-03-16 — 🔬 research-refine + experiment-plan — turn vague ideas into problem-anchored proposals with claim-driven experiment roadmaps. Now integrated into Workflow 1 (/idea-discovery). Community contribution by @zjYao36

  • 2026-03-16 — 🇨🇳 Alibaba Coding Plan guide — one API key, 4 models (Kimi-K2.5 + Qwen3.5+ + GLM-5 + MiniMax-M2.7), dual-endpoint setup. Community contribution by @tianhao909

  • 2026-03-15 — 🔀 Bring your own model! Any OpenAI-compatible API now works as reviewer via llm-chat MCP server. GLM, MiniMax, Kimi, LongCat, DeepSeek all tested — zero Claude or OpenAI API needed

  • 2026-03-15 — 🐾 OpenClaw adaptation guide — use ARIS research workflows in OpenClaw without Claude Code slash skills

  • 2026-03-15 — 📐 proof-writer — community skill for rigorous theorem proof drafting. 📚 Anti-hallucination citations — /paper-write now fetches real BibTeX from DBLP/CrossRef instead of LLM-generated entries — on by default, zero install

  • 2026-03-14 — 📱 Feishu/Lark integration: three modes (off/push/interactive), mobile notifications for experiments, reviews, and checkpoints

  • 2026-03-13 — 🛑 Human-in-the-loop: configurable AUTO_PROCEED checkpoints across all workflows. Full autopilot or step-by-step approval

  • 2026-03-12 — 🔗 Zotero + Obsidian + local PDFs + arXiv/Scholar: multi-source literature search with cross-model novelty verification

  • 2026-03-12 — 🚀 Three end-to-end workflows complete: one prompt → top-venue-style paper. /research-pipeline chains idea discovery → auto review → paper writing autonomously

  • 2026-03-12 — 📝 /paper-writing workflow: narrative report → structured outline → figures → LaTeX → compiled PDF → 2-round auto-improvement (4/10 → 8.5/10)

🚀 Quick Start

# 1. Install skills
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git
mkdir -p ~/.claude/skills/    # create if it doesn't exist (new Claude Code versions)
cp -r Auto-claude-code-research-in-sleep/skills/* ~/.claude/skills/

# 1b. Update skills (when upstream has new versions)
cd Auto-claude-code-research-in-sleep && git pull
bash tools/smart_update.sh          # dry-run: shows what's new/changed/safe
bash tools/smart_update.sh --apply  # apply: adds new + updates safe ones

# 2. Set up Codex MCP (for review skills)
npm install -g @openai/codex
codex setup                    # set model to gpt-5.4 when prompted
claude mcp add codex -s user -- codex mcp-server

# 3. Use in Claude Code
claude
> /idea-discovery "your research direction"  # Workflow 1 — be specific! not "NLP" but "factorized gap in discrete diffusion LMs"
> /experiment-bridge                         # Workflow 1.5 — have a plan? implement + deploy + collect results
> /auto-review-loop "your paper topic or scope"  # Workflow 2: review → fix → re-review overnight
> /paper-writing "NARRATIVE_REPORT.md"       # Workflow 3: narrative → polished PDF
> /rebuttal "paper/ + reviews" — venue: ICML    # Workflow 4: parse reviews → draft rebuttal → follow-up
> /research-pipeline "your research direction"  # Full pipeline: Workflow 1 → 1.5 → 2 → 3 end-to-end
> /research-wiki init                           # 📚 Enable persistent research memory (one-time)
> /meta-optimize                                # Meta: analyze usage logs → propose skill improvements

📚 Research Wiki (optional): Give ARIS persistent memory across sessions. Papers, ideas, failed experiments — nothing is forgotten:

# In Claude Code:
> /research-wiki init                         # creates research-wiki/ in your project
# That's it. From now on, /research-lit auto-ingests papers, /idea-creator reads
# the wiki before brainstorming (and writes ideas back), /result-to-claim updates
# claim status. Failed ideas become anti-repetition memory for future ideation.

See Research Wiki for the full guide.

🧬 Meta-optimization (optional): Run these in your normal terminal (not inside Claude Code) to enable passive usage logging:

# One-time setup in your project directory
mkdir -p .claude .aris/meta tools/meta_opt
cp Auto-claude-code-research-in-sleep/templates/claude-hooks/meta_logging.json .claude/settings.json
cp Auto-claude-code-research-in-sleep/tools/meta_opt/*.sh tools/meta_opt/
chmod +x tools/meta_opt/*.sh
# Then start Claude Code — hooks are active immediately
claude

Events are logged to both project-level (.aris/meta/events.jsonl) and global (~/.aris/meta/events.jsonl) logs. After 5+ workflow runs, run /meta-optimize to see data-driven improvement proposals. Use /meta-optimize --global to analyze trends across all your projects. See Workflow M for details.
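To eyeball the accumulated log before invoking /meta-optimize, a small sketch follows. Only the events.jsonl path comes from above; show_meta_log is a hypothetical helper and assumes one JSON record per line (the record schema itself is not assumed):

```shell
# show_meta_log [LOG]: print the event count and the most recent records.
# Hypothetical helper; defaults to the project-level log path documented above.
show_meta_log() {
  LOG=${1:-.aris/meta/events.jsonl}
  if [ -f "$LOG" ]; then
    echo "events logged: $(wc -l < "$LOG")"
    tail -n 3 "$LOG"    # last few raw JSONL records
  else
    echo "no events yet; run a workflow first"
  fi
}
```

Point it at ~/.aris/meta/events.jsonl instead to inspect the global log.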

📝 Templates available! See templates/ for ready-to-use input templates for every workflow — research brief (Workflow 1), experiment plan (Workflow 1.5), narrative report (Workflow 3), paper plan (Workflow 3).

🔎 Optional: DeepXiv progressive retrieval

pip install deepxiv-sdk

Then use /deepxiv directly or opt into it from /research-lit with — sources: deepxiv or — sources: all, deepxiv.

🔎 Optional: Exa AI-powered web search

pip install exa-py
export EXA_API_KEY=your-key-here

Then use /exa-search directly or opt into it from /research-lit with — sources: exa or — sources: all, exa. Covers blogs, docs, news, and research papers with built-in content extraction.

🗑️ Uninstall: To remove ARIS skills without affecting your own personal skills:

cd Auto-claude-code-research-in-sleep && ls skills/ | xargs -I{} rm -rf ~/.claude/skills/{}
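To preview what that pipeline would delete before running it for real, the same listing can be piped through echo. A sketch (preview_uninstall is a hypothetical wrapper; run it from inside the cloned repo, as with the command above):

```shell
# preview_uninstall: print the rm commands the uninstall one-liner would run,
# without deleting anything.
preview_uninstall() {
  ls skills/ | xargs -I{} echo rm -rf ~/.claude/skills/{}
}
```

Once the printed list looks right, drop the echo (or run the original one-liner) to actually remove the skills.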

Tip: All pipeline behaviors are configurable via inline overrides — append — key: value to any command:

| Parameter | Default | What it does |
|-----------|---------|--------------|
| AUTO_PROCEED | true | Auto-continue at idea selection gate. Set false to manually pick which idea to pursue before committing GPU time |
| human checkpoint | false | Pause after each review round so you can read the score, give custom modification instructions, skip specific fixes, or stop early |
| sources | all | Which literature sources to search: zotero, obsidian, local, web, semantic-scholar, deepxiv, exa, or all. Note: semantic-scholar, deepxiv, and exa must be explicitly listed — not included in all |
| arxiv download | false | Download top relevant arXiv PDFs during literature survey. When false, only fetches metadata (title, abstract, authors) |
| DBLP_BIBTEX | true | Fetch real BibTeX from DBLP/CrossRef instead of LLM-generated entries. Eliminates hallucinated citations. Zero install |
| code review | true | GPT-5.4 xhigh reviews experiment code before GPU deployment. Set false to skip |
| wandb | false | Auto-add W&B logging to experiment scripts. Set true + configure wandb_project in CLAUDE.md. /monitor-experiment pulls training curves from W&B |
| illustration | gemini | AI illustration in Workflow 3: gemini (default, needs GEMINI_API_KEY), mermaid (free), or false (skip) |
| venue | ICLR | Target venue: ICLR, NeurIPS, ICML, CVPR, ACL, AAAI, ACM. Determines LaTeX style file and page limit |
| base repo | false | GitHub repo URL to clone as base codebase (e.g., — base repo: https://github.com/org/project). No code? Build on top of an open-source project |
| gpu | local | GPU target: local (default), remote (SSH server), or vast (rent on-demand from Vast.ai — auto-provision, auto-destroy) |
| compact | false | Generate compact summary files (IDEA_CANDIDATES.md, findings.md, EXPERIMENT_LOG.md) for short-context models and session recovery |
| ref paper | false | Reference paper to build on (PDF path or arXiv URL). Summarized first, then ideas extend/improve it. Combine with base repo for paper+code workflows |
| effort | balanced | Work intensity: lite (0.4x tokens), balanced (default), max (2.5x), beast (5-8x). Controls breadth/depth/iterations. Codex reasoning always xhigh. See Effort Levels |
| difficulty | medium | Reviewer adversarial level: medium (default), hard (+ memory + debate), nightmare (+ GPT reads repo via codex exec) |

/research-pipeline "your topic" — AUTO_PROCEED: false                          # pause at idea selection gate
/research-pipeline "your topic" — human checkpoint: true                       # pause after each review round to give feedback
/research-pipeline "your topic" — sources: zotero, web                         # only search Zotero + web (skip local PDFs)
/research-pipeline "your topic" — sources: all, deepxiv                        # default sources plus DeepXiv progressive retrieval
/research-pipeline "your topic" — sources: all, exa                            # default sources plus Exa AI-powered web search
/research-pipeline "your topic" — arxiv download: true                         # download top arXiv PDFs during literature survey
/research-pipeline "your topic" — difficulty: nightmare                        # maximum adversarial review before submission
/research-pipeline "your topic" — effort: beast                               # all knobs to maximum — top-venue sprint
/research-pipeline "your topic" — effort: lite                                # quick exploration, save tokens
/research-pipeline "your topic" — effort: max, review_rounds: 3               # max effort but cap review at 3 rounds
/research-pipeline "your topic" — AUTO_PROCEED: false, human checkpoint: true  # combine options

Important: Codex MCP uses the model from ~/.codex/config.toml, not from skill files. Make sure it says model = "gpt-5.4" (recommended). Other options: gpt-5.3-codex, gpt-5.2-codex, o3. Run codex setup or edit the file directly.
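For reference, the relevant line in ~/.codex/config.toml is just the model key (an excerpt only; any other keys in your file are unrelated and left out here):

```toml
# ~/.codex/config.toml (excerpt): Codex MCP reads the model from here,
# not from skill files.
model = "gpt-5.4"
```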

Want Codex to execute but Claude Code to review? See docs/CODEX_CLAUDE_REVIEW_GUIDE.md. That path installs the base skills/skills-codex/*, then overlays skills/skills-codex-claude-review/*, and routes review-heavy skills through the local claude-review MCP bridge.

Want Codex to execute but Gemini to review locally? See docs/CODEX_GEMINI_REVIEW_GUIDE.md and CN. That path installs the base skills/skills-codex/*, then overlays skills/skills-codex-gemini-review/*, and routes the reviewer-aware predefined skills through the local gemini-review MCP bridge using direct Gemini API by default.

See full setup guide for details and alternative model combinations if you don't have Claude/OpenAI API.

🧠 Update skills later? Smart update analyzes what's safe:

cd Auto-claude-code-research-in-sleep
git pull
bash tools/smart_update.sh          # dry-run: shows what's new/changed/safe
bash tools/smart_update.sh --apply  # apply: adds new + updates safe ones

Compares local skills with upstream, detects personal customizations (server paths, API keys, etc.), and only updates skills that are safe to replace. Skills with your personal info are flagged for manual review.
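The same kind of check can be approximated by hand if you want to audit a skill yourself. A rough sketch of the idea, not smart_update.sh's actual logic (flag_customized_skills and the grep patterns are illustrative guesses):

```shell
# flag_customized_skills [DIR]: list SKILL.md files that look personalized.
# Hypothetical helper; the patterns below are guesses, not smart_update.sh's
# real heuristics.
flag_customized_skills() {
  find "${1:-$HOME/.claude/skills}" -name SKILL.md 2>/dev/null |
  while read -r f; do
    if grep -qiE 'api[_-]?key|/home/|/Users/|ssh ' "$f"; then
      echo "REVIEW MANUALLY: $f"
    fi
  done
}
```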

✨ Features

  • 📊 31 composable skills — mix and match, or chain into full pipelines (/idea-discovery, /auto-review-loop, /paper-writing, /research-pipeline)

  • 🔍 Literature & novelty — multi-source paper search (Zotero + Obsidian + local PDFs + arXiv/Scholar) + cross-model novelty verification

  • 💡 Idea discovery — literature survey → brainstorm 8-12 ideas → novelty check → GPU pilot experiments → ranked report

  • 🔄 Auto review loop — 4-round autonomous review, 5/10 → 7.5/10 overnight with 20+ GPU experiments

  • 📝 Paper writing — narrative → outline → figures → LaTeX → PDF → auto-review (4/10 → 8.5/10), one command. Anti-hallucination citations via DBLP/CrossRef

  • 🤖 Cross-model collaboration — Claude Code executes, GPT-5.4 xhigh reviews. Adversarial, not self-play

  • 📝 Peer review — review others' papers as a conference reviewer, with structured scoring and meta-review

  • 🖥️ Review-driven experiments — when GPT-5.4 says "run an ablation", Claude Code automatically writes the script, rsyncs to your GPU server, launches in screen, collects results, and folds them back into the paper. Just configure your server in CLAUDE.md (setup guide). No GPU? Use gpu: vast to rent one from Vast.ai on demand

  • 🔀 Flexible models — default Claude × GPT-5.4, also supports GLM, MiniMax, Kimi, LongCat, DeepSeek, etc. — no Claude or OpenAI API required

  • 🛑 Human-in-the-loop — configurable checkpoints at key decisions. AUTO_PROCEED=true for full autopilot, false to approve each step

  • 📱 Feishu/Lark notifications — three modes: off (default, strongly recommended for most users), push-only (webhook, mobile alerts), interactive (approve/reject from Feishu). Zero impact when unconfigured

    Push Only — group chat cards (experiment done, checkpoint, error, pipeline complete):

    Interactive — private chat with Claude Code (approve/reject, custom instructions):

  • 📚 Research Wiki — persistent knowledge base that accumulates papers, ideas, experiments, and claims across the research lifecycle. Failed ideas become anti-repetition memory. ARIS learns from its mistakes and gets smarter with every run. Inspired by Karpathy's LLM Wiki

  • 🧩 Extensible — domain-specific skills welcome! Add a SKILL.md and open a PR. See community skills like dse-loop (architecture/EDA)


📈 Score Progression (Real Run)

A real overnight 4-round run on an ML research project, from borderline reject to submission-ready:

| Round | Score | What Happened |
|-------|-------|---------------|
| Initial | 5.0/10 | Borderline reject |
| Round 1 | 6.5/10 | Added standard metrics, discovered metric decoupling |
| Round 2 | 6.8/10 | Key claim failed to reproduce, pivoted narrative |
| Round 3 | 7.0/10 | Large seed study killed main improvement claim |
| Round 4 | 7.5/10 ✅ | Diagnostic evidence solidified, submission ready |

The loop autonomously ran 20+ GPU experiments, rewrote the paper's narrative framing, and killed claims that didn't hold up — all without human intervention.

🏆 Community Showcase — Papers Built with ARIS

Real projects where the ARIS pipeline was used end-to-end. If you've used ARIS to complete a paper, we'd love to feature it here — open an issue or PR!

| Paper | Rating | Venue | Built by | Notes |
|-------|:------:|-------|----------|-------|
| CS Paper | 8/10 — "Top 50% of accepted papers, clear accept" | CS Conference | @DefanXue & @Monglitay | Full ARIS pipeline: idea → experiments → auto-review → paper writing. Reviewer: "empirical findings are stark, well-supported, and expose a fundamental flaw" |
| AAAI 2026 Paper | 7/10 — "Good paper, accept" | AAAI 2026 Main Technical | @xinbo820-web | Pure Codex CLI (ARIS-Codex skills). Accepted at AAAI 2026 |
| UAV-CC | Under review | IEEE TGRS | @wxx827 | UAV change captioning benchmark. Claude Opus 4.6 (executor) + Codex GPT-5.4 xhigh (reviewer) + Cursor Opus 4.6 (assist). PDF → |

🎉 Papers built entirely with ARIS — from idea to acceptance. Know more? Let us know!

🧩 Awesome Community Skills & Extensions

Domain-specific skills and external projects contributed by the community. PRs welcome — just add a skills/your-skill/SKILL.md and open a PR!

💡 How to use: Community skills are not auto-wired into core workflows. To use one, ask your executor (Claude Code / OpenClaw / etc.) to read the skill's SKILL.md, then plug it into the appropriate workflow stage based on the description below.

🎉 Community Skills (13): research-refine · experiment-plan · grant-proposal · paper-poster · paper-slides · mermaid-diagram · proof-writer · comm-lit-review · dse-loop · idea-discovery-robot · formula-derivation · paper-illustration · writing-systems-papers

🌐 External Projects & Docs (10): open-source-hardening-skills · CitationClaw · auto-hparam-tuning · paper-to-course · Antigravity Adaptation Guide · OpenClaw Adaptation Guide · Cursor Adaptation Guide · Codex+Claude Review Bridge · Trae Adaptation Guide · paper-illustration

🙌 Thanks to every contributor! We fold the tables below to keep the README readable — but every skill and project here is equally valued. PRs always welcome!

| Name | Domain | Description | Codex MCP? |
|------|--------|-------------|------------|
| 🔬 research-refine | General | Turn a vague idea into a problem-anchored, implementation-oriented method proposal. Best inserted between /idea-discovery and /auto-review-loop | Yes |
| 🧪 experiment-plan | General | Turn a refined proposal into a claim-driven experiment roadmap with ablations, budgets, and run order | No |
| 🧭 research-refine-pipeline | General | One-shot chain: /research-refine → /experiment-plan for method refinement plus experiment planning | Yes |
| 📝 grant-proposal | General | Grant proposal drafting (KAKENHI/NSF/NSFC/ERC/DFG/SNSF/ARC/NWO). Chains /research-lit → /novelty-check → /research-review → /paper-illustration | Yes |
| 🎤 paper-slides | General | Conference talk slides (beamer → PDF + PPTX) with speaker notes, full talk script + Q&A prep. Auto slide count from talk type | Yes |
| 🖼️ paper-poster | General | Conference poster (article + tcbposter → A0/A1 PDF + component PPTX + SVG). Venue-specific colors, visual review loop, Codex MCP review | Yes |
| 📐 proof-writer | ML Theory | Rigorous theorem/lemma proof drafting — feasibility triage, dependency maps, honest blockage reports | No |
| 📡 comm-lit-review | Communications / Wireless | Domain-specific literature review — IEEE/ACM/ScienceDirect priority, venue tiering, PHY/MAC/transport/NTN taxonomy | No |
| 🏗️ dse-loop | Architecture / EDA | Autonomous design space exploration — iteratively run, analyze, and tune parameters (gem5, Yosys, etc.) | No |
| 🤖 idea-discovery-robot | Robotics / Embodied AI | Workflow 1 adaptation — grounds idea discovery in embodiment, benchmark, sim2real path, and real-robot safety constraints | Yes |
| 📐 mermaid-diagram | General | Mermaid diagrams (20+ types) — free alternative to paper-illustration, no API key needed | No |
| 🔢 formula-derivation | General | Research formula development — derivation, verification, and LaTeX formatting | No |
| 🖥️ writing-systems-papers | Systems | Paragraph-level blueprint for 10-12 page systems papers (OSDI/SOSP/ASPLOS/NSDI/EuroSys) — page allocation, writing patterns, self-check | Yes |

| Name | Domain | Description |
|------|--------|-------------|
| 🛡️ open-source-hardening-skills | DevOps / OSS | 10-skill pipeline to harden research code into production-ready open-source projects — audit, refactor, test, CI, docs, review |
| 📊 CitationClaw | General | Citation impact analysis — input paper title → citation crawling, scholar identification, tiered analysis, HTML dashboard |
| 🚀 Antigravity Adaptation Guide | General | Use ARIS skills in Google Antigravity — native SKILL.md support, dual model (Claude Opus 4.6 / Gemini 3.1 Pro), MCP setup, EN + CN guides |
| 🐾 OpenClaw Adaptation Guide | General | Use ARIS workflow methodology in OpenClaw — skill-to-stage mapping, file-based orchestration, no Claude Code CLI needed |
| 🖱️ Cursor Adaptation Guide | General | Use ARIS skills in Cursor — @-reference skills, MCP setup, workflow mapping, state file recovery across sessions |
| 🖥️ Trae Adaptation Guide | General | Use ARIS skills in Trae (ByteDance AI IDE) — EN + CN guides |
| 🎨 paper-illustration | General | AI-generated architecture diagrams via Gemini. Built on PaperBanana. Integrated into Workflow 3 |
| 🤖 skills-codex | General | Codex CLI sync pack for the main research skills, now including training-check, result-to-claim, ablation-planner, rebuttal, plus the shared-references/ support directory |
| 🎛️ auto-hparam-tuning | General | Automatic hyperparameter tuning — AI agent reads project, plans strategy, runs experiments, analyzes TensorBoard, learns from results. Hydra-based |
| 🔁 Codex+Claude Review Bridge | General | Codex executes + Claude reviews via local claude-review MCP bridge with async polling |
| 📚 paper-to-course | Education | Convert research papers (PDF/LaTeX) into interactive six-module HTML courses with formula breakdowns, literature timelines, quizzes, and glossary tooltips — single bundled file, no server needed |

🔄 Workflows

These skills compose into a full research lifecycle. The workflows can be used independently or chained together:

  • Exploring a new area (e.g., writing a survey)? Start with Workflow 1 → /idea-discovery
  • Have a plan, need to implement and run? Workflow 1.5 → /experiment-bridge
  • Already have results, need iterative improvement? Workflow 2 → /auto-review-loop
  • Ready to write the paper? Workflow 3 → /paper-writing (or step by step: /paper-plan → /paper-figure → /paper-write → /paper-compile → /auto-paper-improvement-loop)
  • Got reviews back? Need to write a rebuttal? Workflow 4 → /rebuttal — parse reviews, draft safe rebuttal, follow-up rounds
  • Full pipeline? Workflow 1 → 1.5 → 2 → 3 → submit → 4 → /research-pipeline + /rebuttal — from idea to acceptance
  • Want ARIS to remember and learn? 📚 /research-wiki init — persistent memory across sessions. Papers, ideas, failed experiments compound over time
  • Want ARIS to improve itself? Workflow M → /meta-optimize — analyze usage logs, propose skill improvements, reviewer-gated

⚠️ Important: These tools accelerate research, but they don't replace your own critical thinking. Always review generated ideas with your domain expertise, question the assumptions, and make the final call yourself. The best research comes from human insight + AI execution, not full autopilot.

Full Pipeline 🚀

/research-lit → /idea-creator → /novelty-check → /research-refine → /experiment-bridge → /auto-review-loop → /paper-writing → submit → /rebuttal → accept! 🎉
  (survey)      (brainstorm)    (verify novel)   (refine method)   (implement+deploy)  (review & fix)      (write paper)   (send)   (reply to reviewers)
  ├────────────── Workflow 1: Idea Discovery ──────────────┤ ├ Workflow 1.5 ─┤ ├── Workflow 2 ──┤ ├── Workflow 3 ──┤         ├── Workflow 4 ──┤

                                     📚 research-wiki (persistent memory — papers, ideas, experiments, claims)
                                        ↕ reads before ideation, writes after every stage, failed ideas = anti-repetition memory

                                              /meta-optimize (Workflow M — runs independently, improves ARIS itself)
                                                 ↑ reads .aris/meta/events.jsonl (accumulated from all runs above)
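
The events log that /meta-optimize reads can be summarized in a few lines. The sketch below is illustrative only: the per-event schema (a `skill` field on each JSON line) is an assumption, not the documented `.aris/meta/events.jsonl` format.

```python
import json
from collections import Counter
from pathlib import Path

def summarize_events(path):
    """Count events per skill from a JSONL usage log.

    Assumes one JSON object per line with a hypothetical 'skill'
    field -- the real .aris/meta/events.jsonl schema may differ.
    """
    counts = Counter()
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        counts[event.get("skill", "unknown")] += 1
    return counts
```

A meta-optimizer would then look at skills with many events but poor outcomes as candidates for improvement.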

📝 Blog post: 梦中科研全流程开源 (open-sourcing the full research-in-your-sleep pipeline)

Workflow 1: Idea Discovery & Method Refinement 🔍

"What's the state of the art? Where are the gaps? How do we solve it?"

Don't have a concrete idea yet? Just give a research direction — /idea-discovery handles the rest:

  1. 📚 Survey the landscape (recent papers, open problems, recurring limitations)
  2. 🧠 Brainstorm 8-12 concrete ideas via GPT-5.4 xhigh
  3. 🔍 Filter by feasibility, compute cost, and quick novelty search
  4. 🛡️ Validate top ideas with deep novelty check + devil's advocate review
  5. 🧪 Pilot top 2-3 ideas in parallel on different GPUs (30 min - 2 hr each)
  6. 🏆 Rank by empirical signal — ideas with positive pilot results rise to the top
  7. 🔬 Refine the top idea into a problem-anchored proposal via iterative GPT-5.4 review
  8. 🧪 Plan claim-driven experiments with ablations, budgets, and run order

The output is a ranked IDEA_REPORT.md plus a refined proposal (refine-logs/FINAL_PROPOSAL.md) and experiment plan (refine-logs/EXPERIMENT_PLAN.md) for the top idea. Dead-end ideas are documented too, saving future exploration.
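The rank-by-empirical-signal step (6) boils down to a sort in which tested ideas with positive pilot deltas float above everything else. A minimal sketch, with a hypothetical idea-record shape (the fields in the real IDEA_REPORT.md may differ):

```python
def rank_ideas(ideas):
    """Sort ideas so positive pilot signal rises to the top.

    Each idea is a dict with hypothetical fields:
      'novelty' -- 0-10 score from the novelty check
      'pilot'   -- measured pilot delta vs. baseline, None if not run
    Ideas with a positive pilot result outrank all others; untested
    ideas (pilot=None) sink below any tested one.
    """
    def key(idea):
        pilot = idea.get("pilot")
        has_signal = pilot is not None and pilot > 0
        return (has_signal,
                pilot if pilot is not None else float("-inf"),
                idea.get("novelty", 0))
    return sorted(ideas, key=key, reverse=True)
```

Note the deliberate choice: a high-novelty idea with no pilot run ranks below a modest idea with a measured gain, matching the "empirical signal first" policy.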

┌─────────────────────────────────────────────────────────────────┐
│              Idea Discovery & Method Refinement                  │
│                                                                  │
│   /research-lit    /idea-creator    /novelty-check               │
│   (find papers)    (brainstorm)     (verify novelty)             │
│         │               │                │                       │
│         ▼               ▼                ▼                       │
│   ┌──────────┐    ┌──────────┐     ┌──────────┐                │
│   │ Scan     │───▶│ Generate │────▶│ Check if │                │
│   │ local    │    │ 8-12     │     │ idea is  │                │
│   │ papers + │    │ ideas    │     │ novel    │                │
│   │ search   │    │ + rank   │     │          │                │
│   └──────────┘    └──────────┘     └──────────┘                │
│                         │                │                       │
│                         ▼                ▼                       │
│                   ┌──────────┐     ┌──────────┐                │
│                   │ Filter   │────▶│ External │                │
│                   │ by cost, │     │ LLM      │                │
│                   │ novelty  │     │ evaluates│                │
│                   └──────────┘     └──────────┘                │
│                                          │                       │
│                   /research-refine       ▼                       │
│                   (refine method)   ┌──────────┐                │
│                         │          │ Freeze   │                │
│                         ▼          │ problem  │                │
│                   ┌──────────┐     │ anchor + │                │
│                   │ Iterate  │◀───▶│ refine   │                │
│                   │ until    │     │ method   │                │
│                   │ score≥9  │     └──────────┘                │
│                   └──────────┘          │                       │
│                         │               ▼                       │
│                   /experiment-plan  ┌──────────┐                │
│                         │          │ Claim-   │                │
│                         ▼          │ driven   │                │
│                   ┌──────────┐     │ experiment│               │
│                   │ Plan     │────▶│ roadmap  │                │
│                   │ runs     │     └──────────┘                │
│                   └──────────┘                                  │
│                                                                  │
│   Typical flow:                                                  │
│   1. /research-lit "discrete diffusion models"                   │
│   2. /idea-creator "DLLMs post training"                         │
│   3. Review ranked ideas, pick top 2-3                           │
│   4. /novelty-check "top idea" (deep verification)               │
│   5. /research-review "top idea" (critical feedback)             │
│   6. /research-refine "top idea" (problem anchor + method)       │
│   7. /experiment-plan (claim-driven roadmap)                     │
│   8. /run-experiment → /auto-review-loop                         │
└─────────────────────────────────────────────────────────────────┘

Skills involved: research-lit + idea-creator + novelty-check + research-review + research-refine-pipeline

💡 One-command shortcut: /idea-discovery "your research direction" runs this entire workflow automatically.

🔄 Human-in-the-loop: Each phase presents results and waits for your feedback. Not happy? Tell it what's missing — it refines the prompt and regenerates. Trust the defaults? It auto-proceeds with the top-ranked option. You decide how hands-on to be.

⚙️ Pilot experiment budgets (max hours, timeout, GPU budget) are configurable — see Customization.
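
As a sketch, such a budget configuration might look like the following — the key names and defaults here are illustrative assumptions, not ARIS's actual config schema; see the project's Customization docs for the real options.

```python
# Hypothetical pilot-budget settings -- key names are illustrative,
# not the actual ARIS configuration schema.
PILOT_BUDGET = {
    "max_hours_per_idea": 2.0,   # hard cap per pilot run (30 min - 2 hr typical)
    "timeout_minutes": 150,      # kill a hung run after this long
    "max_parallel_gpus": 3,      # top 2-3 ideas piloted in parallel
}

def within_budget(elapsed_hours, budget=PILOT_BUDGET):
    """True while a pilot run is still inside its time budget."""
    return elapsed_hours <= budget["max_hours_per_idea"]
```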

📝 Blog post: Claude Code 两月 NeurIPS 指北 (a two-month guide to NeurIPS with Claude Code)

Workflow 1.5: Experiment Bridge 🔗

"I have a plan. Now implement it, deploy it, and get me initial results."

Already have an experiment plan (from Workflow 1 or your own)? /experiment-bridge turns it into running code:

  1. 📋 Parse the experiment plan (refine-logs/EXPERIMENT_PLAN.md)
  2. 💻 Implement experiment scripts (reuse existing code, add proper argparse/logging/seeds)
  3. 🔍 GPT-5.4 code review — cross-model review catches logic bugs before wasting GPU hours (code review: true by default)
  4. Sanity check — run the smallest experiment first to catch runtime bugs
  5. 🚀 Deploy full experiment suite to GPU via /run-experiment
  6. 📊 Collect initial results and update the experiment tracker
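
Step 2's "proper argparse/logging/seeds" boilerplate can be sketched as below — a minimal skeleton, with hypothetical flag names; the numpy/torch seeding is guarded so the sketch runs even where they are not installed:

```python
import argparse
import logging
import random

def parse_args(argv=None):
    """CLI for a single experiment run; argv=None reads sys.argv."""
    parser = argparse.ArgumentParser(description="Experiment runner skeleton")
    parser.add_argument("--seed", type=int, default=0, help="RNG seed for reproducibility")
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    parser.add_argument("--out", default="runs/exp0", help="directory for logs and checkpoints")
    return parser.parse_args(argv)

def set_seed(seed):
    """Seed every RNG the run touches; numpy/torch are optional here."""
    random.seed(seed)
    try:  # guard so the sketch still runs without numpy/torch
        import numpy as np
        np.random.seed(seed)
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

def main(argv=None):
    args = parse_args(argv)
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    set_seed(args.seed)
    logging.info("config: %s", vars(args))  # log the full config up front
    return args
```

Logging the full config and seeding every RNG up front is exactly what makes the later sanity check (step 4) and full GPU deployment (step 5) reproducible.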
┌─────────────────────────────────────────────────────────────────┐
│                Workflow 1.5: Experiment Bridge                    │
│                                                                  │
│   EXPERIMENT_PLAN.md                                             │
│         │                                                        │
│         ▼                                                        │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐               │
│   │ Claude   │────▶│ GPT-5.4  │────▶│ Sanity   │               │
│   │ Code     │     │ xhigh    │     │ Check    │               │
│   │ writes   │     │ reviews  │     │ (1 GPU)  │               │
│   │ code     │     │ code     │     │          │               │
│   └──────────┘     └──────────┘     └──────────┘               │
│                                          │                       │
│                                          ▼                       │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐               │
│   │ Collect  │◀────│ Monitor  │◀────│ Deploy   │               │
│   │ results  │     │ progress │     │ to GPUs  │               │
│   │          │     │ (+ W&B)  │     │          │               │
│   └──────────┘     └──────────┘     └──────────┘               │
│         │                                                        │
│         ▼                                                        │
│   Ready for /auto-review-loop                                    │
└─────────────────────────────────────────────────────────────────┘

Skills involved: `experiment-bridge`
