bigdata analysis skill

AI coding skill for Hive/Impala/Spark ETL — 10 rules to prevent silent data bugs on HDFS/YARN

1 · by @Oak-B · updated 13d ago · GitHub →

Compatibility

Claude Code

Description

bigdata-analysis-skill

An AI coding skill that makes your AI assistant production-safe when writing Hive, Impala, and Spark ETL code on HDFS/YARN.

It targets bugs that raise no alarm: row counts look correct and no errors are thrown, yet the data is silently wrong or performance collapses.

Install

npx skills add Oak-B/bigdata-analysis-skill@bigdata-analysis

What It Covers

| Rule | Problem It Prevents |
|------|---------------------|
| Rule 0 | DESCRIBE before coding; never guess column names or types |
| Rule 1 | Never hard-code table names in Spark source |
| Rule 2 | Keep long-text fields out of GROUP BY (control characters cause silent row explosion) |
| Rule 3 | Filter first, then aggregate; prevents OOM on billion-row tables |
| Rule 4 | Use Spark SQL, not DataFrame API (real benchmark: 3h → 15min) |
| Rule 5 | Control broadcast JOIN threshold; prevents task explosion |
| Rule 6 | Never use SELECT * in INSERT; prevents silent column shifts |
| Rule 7 | Use LEFT JOIN for optional fields; prevents silent row loss |
| Rule 8 | Refresh metadata after Spark write |
| Rule 9 | UDF type safety; nested collection return types crash at runtime |
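
As an illustration of Rule 6, the following minimal sketch shows how `INSERT ... SELECT *` maps columns by position rather than by name. It is an assumption-laden stand-in, not the skill's own example: SQLite (via Python's standard library) replaces Hive/Impala purely so the snippet runs anywhere, and the tables `src`, `dst_star`, and `dst_named` are hypothetical.

```python
import sqlite3

# SQLite is used only as a runnable stand-in for Hive/Impala (assumption);
# all table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (user_id INTEGER, city TEXT, country TEXT);
    INSERT INTO src VALUES (1, 'Berlin', 'DE');

    -- The targets declare country BEFORE city, unlike the source.
    CREATE TABLE dst_star  (user_id INTEGER, country TEXT, city TEXT);
    CREATE TABLE dst_named (user_id INTEGER, country TEXT, city TEXT);

    -- Positional mapping: src.city lands in dst_star.country without any error.
    INSERT INTO dst_star SELECT * FROM src;

    -- Explicit column lists on both sides keep names aligned.
    INSERT INTO dst_named (user_id, country, city)
    SELECT user_id, country, city FROM src;
""")

shifted = conn.execute("SELECT user_id, country, city FROM dst_star").fetchone()
correct = conn.execute("SELECT user_id, country, city FROM dst_named").fetchone()

print(shifted)  # (1, 'Berlin', 'DE')  -> country silently holds the city
print(correct)  # (1, 'DE', 'Berlin')
```

Both inserts succeed and both tables hold one row, so a row-count check would not catch the shift; only the explicit column list protects against a reordered or extended source schema.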

Plus: date window off-by-one, Scala string interpolation pitfalls, regex engine differences, and more.
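
This listing does not spell out the date-window case, so the sketch below shows one common form of it as an assumption: a "last N days" window whose bounds are both inclusive (as with SQL `BETWEEN`) covers N + 1 days if the start is computed as `end - N`. The helper name `inclusive_window` is hypothetical.

```python
from datetime import date, timedelta


def inclusive_window(end_day: date, days: int) -> tuple[date, date]:
    """Return (start, end) covering exactly `days` calendar days, both ends inclusive."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return end_day - timedelta(days=days - 1), end_day


start, end = inclusive_window(date(2024, 3, 10), 7)
covered_days = (end - start).days + 1        # 7 days: 2024-03-04 .. 2024-03-10

# The naive version subtracts the full N and ends up one day too wide.
naive_start = end - timedelta(days=7)
naive_days = (end - naive_start).days + 1    # 8 days: 2024-03-03 .. 2024-03-10

print(start, end, covered_days, naive_days)
```

The resulting `start` and `end` can then be passed as the two bounds of an inclusive partition filter.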

Two Modes

| Mode | Behavior |
|------|----------|
| Analysis | Run SQL → present numbers → ask the user before making decisions |
| Coding | Follow the 10 rules strictly; never guess types, column order, or table names |

Quick Error Reference

| Symptom | Likely Root Cause |
|---------|-------------------|
| New column all NULL / field values shifted | SELECT * + schema change (Rule 6) |
| 45+ Spark Jobs, 3-hour runtime | DataFrame API + multiple .count() (Rule 4) |
| Job timeout, 26k+ tasks | Auto-broadcast on medium table (Rule 5) |
| Row explosion, field misalignment | Control characters in GROUP BY field (Rule 2) |
| OOM on aggregation | Direct GROUP BY on billion-row table (Rule 3) |
| Silent row loss after JOIN | INNER JOIN on optional field (Rule 7) |
| Hive/Impala sees no data after write | Metadata not refreshed (Rule 8) |
| UDF NoClassDefFoundError | Nested Scala collection return type (Rule 9) |
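
The "silent row loss after JOIN" symptom (Rule 7) can be reproduced with a small sketch. As before, SQLite stands in for Hive/Impala only to keep the snippet runnable, and the `orders` and `coupons` tables are hypothetical, with `coupon_id` playing the role of the optional field.

```python
import sqlite3

# SQLite as a runnable stand-in for Hive/Impala (assumption); hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders  (order_id INTEGER, coupon_id INTEGER);
    CREATE TABLE coupons (coupon_id INTEGER, label TEXT);

    -- Order 2 has no coupon; order 3 references a coupon missing from the dimension.
    INSERT INTO orders  VALUES (1, 10), (2, NULL), (3, 99);
    INSERT INTO coupons VALUES (10, 'WELCOME');
""")

inner_rows = conn.execute("""
    SELECT o.order_id, c.label
    FROM orders o
    INNER JOIN coupons c ON o.coupon_id = c.coupon_id
    ORDER BY o.order_id
""").fetchall()

left_rows = conn.execute("""
    SELECT o.order_id, c.label
    FROM orders o
    LEFT JOIN coupons c ON o.coupon_id = c.coupon_id
    ORDER BY o.order_id
""").fetchall()

print(inner_rows)  # [(1, 'WELCOME')]
print(left_rows)   # [(1, 'WELCOME'), (2, None), (3, None)]
```

The inner join raises no error while dropping two of three orders. Comparing the row count before and after the join is a cheap way to surface this kind of loss.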

File Structure

bigdata-analysis/
├── SKILL.md                          # Main skill instructions (10 rules + quick reference)
└── references/
    ├── spark-pitfalls.md             # Deep-dive: root cause analysis & extended examples
    └── sql-patterns.md               # AI-specific SQL anti-patterns

Who Is This For

  • Data engineers writing Hive/Impala SQL or Spark Scala ETL jobs
  • Anyone using AI coding assistants for big data workflows
  • Teams that have been bitten by "data looks right but isn't" bugs

License

MIT
