bigdata analysis skill
AI coding skill for Hive/Impala/Spark ETL — 10 rules to prevent silent data bugs on HDFS/YARN
An AI coding skill that makes your AI assistant production-safe when writing Hive, Impala, and Spark ETL code on HDFS/YARN.
It targets bugs that are invisible at first glance: row counts look correct and no errors are thrown, but the data is silently wrong or performance collapses.
Install
npx skills add Oak-B/bigdata-analysis-skill@bigdata-analysis
What It Covers
| Rule | Guideline and Problem It Prevents |
|------|---------------------|
| Rule 0 | DESCRIBE before coding — never guess column names or types |
| Rule 1 | Never hard-code table names in Spark source |
| Rule 2 | Keep long-text fields out of GROUP BY (control characters cause silent row explosion) |
| Rule 3 | Filter first, then aggregate — prevents OOM on billion-row tables |
| Rule 4 | Use Spark SQL, not DataFrame API (real benchmark: 3h → 15min) |
| Rule 5 | Control broadcast JOIN threshold — prevents task explosion |
| Rule 6 | Never use SELECT * in INSERT — prevents silent column shifts |
| Rule 7 | Use LEFT JOIN for optional fields — prevents silent row loss |
| Rule 8 | Refresh metadata after Spark write |
| Rule 9 | UDF type safety — nested collection return types crash at runtime |
Plus: date window off-by-one, Scala string interpolation pitfalls, regex engine differences, and more.
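As an illustration of Rule 6, here is a minimal sketch of how `INSERT ... SELECT *` can shift columns without raising an error. It uses Python's built-in `sqlite3` as a stand-in engine, on the assumption that the target engine also matches `INSERT ... SELECT` columns by position. The tables `source` and `target` and their columns are hypothetical and not part of the skill itself.

```python
import sqlite3


def make_tables():
    """Build a hypothetical source/target pair whose column order differs."""
    conn = sqlite3.connect(":memory:")
    # Target was created against the old source layout: (id, city, country).
    conn.execute("CREATE TABLE target (id INTEGER, city TEXT, country TEXT)")
    # The source layout changed later: country now comes before city.
    conn.execute("CREATE TABLE source (id INTEGER, country TEXT, city TEXT)")
    conn.execute("INSERT INTO source VALUES (1, 'DE', 'Berlin')")
    return conn


def load_with_select_star():
    """Anti-pattern: columns are matched by position, not by name."""
    conn = make_tables()
    conn.execute("INSERT INTO target SELECT * FROM source")
    return conn.execute("SELECT id, city, country FROM target").fetchall()


def load_with_explicit_columns():
    """Safer: name every column on both sides of the INSERT."""
    conn = make_tables()
    conn.execute(
        "INSERT INTO target (id, city, country) "
        "SELECT id, city, country FROM source"
    )
    return conn.execute("SELECT id, city, country FROM target").fetchall()


if __name__ == "__main__":
    print(load_with_select_star())       # [(1, 'DE', 'Berlin')]  city and country swapped
    print(load_with_explicit_columns())  # [(1, 'Berlin', 'DE')]
```

Both loads succeed and both write exactly one row, which is why a row-count check alone would not catch the shifted version.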
Two Modes
| Mode | Behavior |
|------|----------|
| Analysis | Run SQL → present numbers → ask the user before making decisions |
| Coding | Follow the 10 rules strictly; never guess types, column order, or table names |
Quick Error Reference
| Symptom | Likely Root Cause |
|---------|-------------------|
| New column all NULL / field values shifted | SELECT * + schema change (Rule 6) |
| 45+ Spark Jobs, 3-hour runtime | DataFrame API + multiple .count() (Rule 4) |
| Job timeout, 26k+ tasks | Auto-broadcast on medium table (Rule 5) |
| Row explosion, field misalignment | Control characters in GROUP BY field (Rule 2) |
| OOM on aggregation | Direct GROUP BY on billion-row table (Rule 3) |
| Silent row loss after JOIN | INNER JOIN on optional field (Rule 7) |
| Hive/Impala sees no data after write | Metadata not refreshed (Rule 8) |
| UDF NoClassDefFoundError | Nested Scala collection return type (Rule 9) |
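The "silent row loss after JOIN" symptom (Rule 7) can be reproduced on a small scale. The sketch below again uses Python's built-in `sqlite3` as a stand-in engine. The `orders` and `coupons` tables are hypothetical; `coupon_id` plays the role of an optional field that may be NULL or point to a missing row.

```python
import sqlite3


def joined_row_count(join_type):
    """Count orders after joining to coupons with the given join type."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER, coupon_id INTEGER);
        CREATE TABLE coupons (coupon_id INTEGER, label TEXT);
        -- Order 2 has no coupon; order 3 references a coupon that does not exist.
        INSERT INTO orders VALUES (1, 10), (2, NULL), (3, 99);
        INSERT INTO coupons VALUES (10, 'WELCOME');
        """
    )
    # join_type is restricted to two known keywords, so it is safe to interpolate.
    if join_type not in ("INNER", "LEFT"):
        raise ValueError("join_type must be 'INNER' or 'LEFT'")
    sql = (
        "SELECT COUNT(*) FROM orders o "
        f"{join_type} JOIN coupons c ON o.coupon_id = c.coupon_id"
    )
    return conn.execute(sql).fetchone()[0]


if __name__ == "__main__":
    print(joined_row_count("INNER"))  # 1  (orders 2 and 3 are dropped without any error)
    print(joined_row_count("LEFT"))   # 3  (every order is kept; coupon columns are NULL)
```

Comparing the joined row count against the row count of the driving table is a cheap way to spot this kind of loss.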
File Structure
bigdata-analysis/
├── SKILL.md # Main skill instructions (10 rules + quick reference)
└── references/
    ├── spark-pitfalls.md # Deep-dive: root cause analysis & extended examples
    └── sql-patterns.md # AI-specific SQL anti-patterns
Who Is This For
- Data engineers writing Hive/Impala SQL or Spark Scala ETL jobs
- Anyone using AI coding assistants for big data workflows
- Teams that have been bitten by "data looks right but isn't" bugs
License
MIT