bigdata analysis skill

AI coding skill for Hive/Impala/Spark ETL — 10 rules to prevent silent data bugs on HDFS/YARN

★ 1 · by @Oak-B · updated 105 days ago

Compatibility

Claude Code

Description

bigdata-analysis-skill

An AI coding skill that makes your AI assistant production-safe when writing Hive, Impala, and Spark ETL code on HDFS/YARN.

It targets bugs that are invisible at first glance: row counts look correct and no errors are thrown, yet the data is silently wrong or performance collapses.

Install

npx skills add Oak-B/bigdata-analysis-skill@bigdata-analysis

What It Covers

| Rule | Problem It Prevents |
|------|---------------------|
| Rule 0 | DESCRIBE before coding: never guess column names or types |
| Rule 1 | Never hard-code table names in Spark source |
| Rule 2 | Keep long-text fields out of GROUP BY (control characters cause silent row explosion) |
| Rule 3 | Filter first, then aggregate: prevents OOM on billion-row tables |
| Rule 4 | Use Spark SQL, not DataFrame API (real benchmark: 3h → 15min) |
| Rule 5 | Control broadcast JOIN threshold: prevents task explosion |
| Rule 6 | Never use SELECT * in INSERT: prevents silent column shifts |
| Rule 7 | Use LEFT JOIN for optional fields: prevents silent row loss |
| Rule 8 | Refresh metadata after Spark write |
| Rule 9 | UDF type safety: nested collection return types crash at runtime |
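Rule 6 can be reproduced without a cluster. The sketch below is not taken from the skill itself: it uses Python's built-in `sqlite3` module as a stand-in for Hive/Impala (an assumption made only so the example runs anywhere), and the tables `src` and `dst` and their columns are hypothetical. It shows that `INSERT ... SELECT *` matches columns by position, so a target table with a different column order receives shifted values and no error is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: same columns, but dst declares them in a different order.
conn.execute("CREATE TABLE src (user_id INTEGER, first_name TEXT, last_name TEXT)")
conn.execute("CREATE TABLE dst (user_id INTEGER, last_name TEXT, first_name TEXT)")
conn.execute("INSERT INTO src VALUES (1, 'Ada', 'Lovelace')")

# Anti-pattern: SELECT * is mapped to dst by position, not by column name.
conn.execute("INSERT INTO dst SELECT * FROM src")
shifted = conn.execute(
    "SELECT user_id, first_name, last_name FROM dst"
).fetchone()

# Safer pattern: name the columns on both sides of the INSERT.
conn.execute("DELETE FROM dst")
conn.execute(
    "INSERT INTO dst (user_id, first_name, last_name) "
    "SELECT user_id, first_name, last_name FROM src"
)
correct = conn.execute(
    "SELECT user_id, first_name, last_name FROM dst"
).fetchone()

print("SELECT * insert:      ", shifted)
print("explicit column list: ", correct)
```

The row count is 1 in both cases, which is why this kind of shift tends to go unnoticed until someone inspects the values.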

Plus: date window off-by-one, Scala string interpolation pitfalls, regex engine differences, and more.
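The skill does not spell out its date-window rule here, so the following is only one plausible reading of "date window off-by-one": an inclusive N-day window ending on a given day starts N-1 days earlier, not N. The function name `inclusive_window` and the dates are hypothetical. This matters because SQL `BETWEEN` includes both endpoints.

```python
from datetime import date, timedelta


def inclusive_window(end_day, days):
    """Return (start, end) for a window covering exactly `days` calendar days."""
    start_day = end_day - timedelta(days=days - 1)
    return start_day, end_day


end_day = date(2024, 3, 10)

# Correct: 7 days, both endpoints included.
start, end = inclusive_window(end_day, 7)
n_days = (end - start).days + 1

# Common mistake: subtracting the full window length yields 8 days
# once both endpoints are included (as BETWEEN does).
naive_start = end_day - timedelta(days=7)
naive_days = (end_day - naive_start).days + 1

print(start, end, n_days)
print(naive_start, end_day, naive_days)
```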

Two Modes

| Mode | Behavior |
|------|----------|
| Analysis | Run SQL → present numbers → ask the user before making decisions |
| Coding | Follow the 10 rules strictly; never guess types, column order, or table names |

Quick Error Reference

| Symptom | Likely Root Cause |
|---------|-------------------|
| New column all NULL / field values shifted | SELECT * + schema change (Rule 6) |
| 45+ Spark Jobs, 3-hour runtime | DataFrame API + multiple .count() (Rule 4) |
| Job timeout, 26k+ tasks | Auto-broadcast on medium table (Rule 5) |
| Row explosion, field misalignment | Control characters in GROUP BY field (Rule 2) |
| OOM on aggregation | Direct GROUP BY on billion-row table (Rule 3) |
| Silent row loss after JOIN | INNER JOIN on optional field (Rule 7) |
| Hive/Impala sees no data after write | Metadata not refreshed (Rule 8) |
| UDF NoClassDefFoundError | Nested Scala collection return type (Rule 9) |
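The "silent row loss after JOIN" symptom (Rule 7) is easy to see on a toy dataset. This is a minimal sketch rather than the skill's own example: `sqlite3` again stands in for Hive/Impala, and the `orders` and `coupons` tables are hypothetical. Orders whose optional `coupon_id` is NULL or has no match disappear under an INNER JOIN but survive a LEFT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders  (order_id INTEGER, coupon_id INTEGER);
    CREATE TABLE coupons (coupon_id INTEGER, discount INTEGER);
    -- coupon_id is optional: order 2 has none, order 3 points at a missing coupon.
    INSERT INTO orders  VALUES (1, 10), (2, NULL), (3, 99);
    INSERT INTO coupons VALUES (10, 5);
    """
)

# INNER JOIN keeps only orders with a matching coupon.
inner_rows = conn.execute(
    "SELECT COUNT(*) FROM orders o "
    "JOIN coupons c ON o.coupon_id = c.coupon_id"
).fetchone()[0]

# LEFT JOIN keeps every order; discount is NULL where no coupon matches.
left_rows = conn.execute(
    "SELECT COUNT(*) FROM orders o "
    "LEFT JOIN coupons c ON o.coupon_id = c.coupon_id"
).fetchone()[0]

total_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print("orders:", total_orders, "inner join:", inner_rows, "left join:", left_rows)
```

Comparing the joined row count against the driving table's count, as the last line does, is a cheap check that catches this class of loss before results are published.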

File Structure

bigdata-analysis/
├── SKILL.md                          # Main skill instructions (10 rules + quick reference)
└── references/
    ├── spark-pitfalls.md             # Deep-dive: root cause analysis & extended examples
    └── sql-patterns.md               # AI-specific SQL anti-patterns

Who Is This For

Data engineers writing Hive/Impala SQL or Spark Scala ETL jobs
Anyone using AI coding assistants for big data workflows
Teams that have been bitten by "data looks right but isn't" bugs

License
