๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Data Analysis/Data Engineering

๋ฐ์ดํ„ฐ ์ €์žฅ ํ˜•์‹ TextInputFormat/ Parquet ๊ด€๋ จํ•ด์„œ ์ •๋ฆฌ

by Hagrid 2024. 8. 6.
๋ฐ˜์‘ํ˜•

๋ฌธ์ œ

๐Ÿ“Œ Hue ์—์„œ ์ปฌ๋Ÿผ์ˆ˜์ •์‹œ >> ํ…Œ์ด๋ธ” ์Šคํ‚ค๋งˆ ๊ตฌ์กฐ๊ฐ€ ๋ฐ”๋€Œ๋ฉด backfill ํ•ด์ค˜์•ผ ํ•˜๋Š” ํ˜„์ƒ ์•„๋‹ˆ๋ฉด Mismatch ์˜ค๋ฅ˜๊ฐ€ ๋œธ

์›์ธ

  • ํ…Œ์ด๋ธ” ๋นŒ๋“œ์‹œ parquet ๊ฐ€ ์•„๋‹Œ ๊ฒฝ์šฐ (TextInputFormat) ์—๋Š” ์ปฌ๋Ÿผ ์ˆœ์„œ๋ฅผ ์ง€์ผœ์ค˜์•ผ ํ•จ
    • ์˜ํ–ฅ์„ ๋ฐ›๋Š” ํ˜•์‹
      • 'org.apache.hadoop.mapred.**TextInputFormat**'
  • ๊ทธ๋ ‡๋‹ค๋ฉด ? >>
    • ๋นŒ๋“œํ• ๋•Œ PARQUET ํ˜•์‹์œผ๋กœ ๋นŒ๋“œํ•ด์•ผํ•จ
      • external table ์ธ๋ฐ spark.sql.sources.schema ๋‹ฌ๋ฆฌ๋ฉด ์•ˆ๋จ
        • spark.sql.... property ๋ฅผ ๋ชจ๋‘ ์ œ๊ฑฐํ•˜๊ฑฐ๋‚˜
        • spark.sql.sources.schema json ์— ์ƒˆ๋กœ ์ถ”๊ฐ€ํ•˜๋Š” ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€

ROW FORMAT SERDE

  • ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    • Parquet ํ˜•์‹์„ ์‚ฌ์šฉํ•˜๋ ค๋ฉด, ์˜ฌ๋ฐ”๋ฅธ SerDe๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, **org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe**๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • ๋นŒ๋“œํ• ๋•Œ ์ฝ”๋“œ (ํ™•์ธํ•ด๋ณด๋‹ˆ ๊ตณ์ด ์•ˆํ•ด์ค˜๋„ STORED AS PARQUET ์š”๊ฒƒ๋งŒ ๋„ฃ์–ด๋„ ๋ฌด๋ฐฉํ•จ)
      • **ParquetHiveSerDe**๋Š” ์ด ๋‘ ์‹œ์Šคํ…œ ๊ฐ„์˜ ํ˜ธํ™˜์„ฑ์„ ์ œ๊ณตํ•˜์—ฌ, Hive๊ฐ€ Parquet ํŒŒ์ผ ํ˜•์‹์œผ๋กœ ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ๊ณ  ์“ธ ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
    • CREATE TABLE my_table ( ... ) STORED AS PARQUET WITH SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe';

TextInputFormat

  • **TextInputFormat**์€ ์ผ๋ฐ˜์ ์œผ๋กœ CSV๋‚˜ TSV ๊ฐ™์€ ํ…์ŠคํŠธ ํŒŒ์ผ์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ์‚ฌ์šฉ

TextInputFormat๊ณผ Parquet์„ ๋น„๊ต

<aside> ๐Ÿ“Œ ์š”์•ฝ

  • TextInputFormat์€ ํ˜ธํ™˜์„ฑ์ด ๋†’๊ณ  ์ดํ•ดํ•˜๊ธฐ ์‰ฌ์šฐ๋ฉฐ, ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ์ด๋‚˜ ์†Œ๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋งŒ, ์••์ถ• ๋ฐ I/O ํšจ์œจ์„ฑ์ด ๋‚ฎ๊ณ  ์Šคํ‚ค๋งˆ๋ฅผ ์ง€์›ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
  • Parquet์€ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ๋ถ„์„์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์œผ๋ฉฐ, ๋†’์€ ์••์ถ•์œจ๊ณผ ์ปฌ๋Ÿผ ๋‹จ์œ„ ์ฒ˜๋ฆฌ๋กœ ์ฟผ๋ฆฌ ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•ฉ๋‹ˆ๋‹ค. ์Šคํ‚ค๋งˆ๋ฅผ ๋‚ด์žฅํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์ด์ง„ ํ˜•์‹์œผ๋กœ ์‚ฌ๋žŒ์ด ์ฝ๊ธฐ ์–ด๋ ต๊ณ  ๋ฐ์ดํ„ฐ ์ค€๋น„๊ฐ€ ๋ณต์žกํ•ฉ๋‹ˆ๋‹ค. </aside>

ํŠน์ง• TextInputFormat Parquet

๋ฐ์ดํ„ฐ ํ˜•์‹ ํ…์ŠคํŠธ (CSV, TSV ๋“ฑ) ์ปฌ๋Ÿผํ˜•
ํ˜ธํ™˜์„ฑ ๋†’์€ ํ˜ธํ™˜์„ฑ (๋งŽ์€ ๋ฐ์ดํ„ฐ ์†Œ์Šค๊ฐ€ ํ…์ŠคํŠธ ํ˜•์‹) ์ฃผ๋กœ ๋น…๋ฐ์ดํ„ฐ ๋ฐ ๋ถ„์„ ๋„๊ตฌ (Hadoop, Spark ๋“ฑ)
์ดํ•ด ๋ฐ ์ˆ˜์ • ์šฉ์ด์„ฑ ์‚ฌ๋žŒ ์ฝ๊ธฐ ๋ฐ ์ˆ˜์ • ์šฉ์ด ์ด์ง„ ํ˜•์‹, ์‚ฌ๋žŒ์ด ์ฝ๊ธฐ ์–ด๋ ค์›€
์••์ถ• ๋ฐ ์ €์žฅ ํšจ์œจ์„ฑ ์ผ๋ฐ˜์ ์œผ๋กœ ๋‚ฎ์Œ ๋†’์€ ์••์ถ•์œจ ๋ฐ ์ €์žฅ ํšจ์œจ์„ฑ
์Šคํ‚ค๋งˆ ์ง€์› ์—†์Œ (์™ธ๋ถ€์—์„œ ์ •์˜ ํ•„์š”) ๊ฐ•๋ ฅํ•œ ์Šคํ‚ค๋งˆ ๋‚ด์žฅ
์ฟผ๋ฆฌ ์„ฑ๋Šฅ ์ „์ฒด ํ–‰ ์Šค์บ” ํ•„์š” ํŠน์ • ์ปฌ๋Ÿผ๋งŒ ์ฝ๊ธฐ ๊ฐ€๋Šฅ, ๋†’์€ ์ฟผ๋ฆฌ ์„ฑ๋Šฅ
I/O ์„ฑ๋Šฅ ๋‚ฎ์Œ ๋†’์Œ (์ปฌ๋Ÿผ ๋‹จ์œ„ ์ฒ˜๋ฆฌ)
๋ฐ์ดํ„ฐ ์ค€๋น„ ์šฉ์ด์„ฑ ์‰ฌ์›€ (๊ฐ„๋‹จํ•œ ์Šคํฌ๋ฆฝํŠธ๋‚˜ ํŽธ์ง‘๊ธฐ๋กœ ์ƒ์„ฑ ๊ฐ€๋Šฅ) ๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ณต์žก, ์ „์šฉ ๋„๊ตฌ ํ•„์š”
์‹ค์‹œ๊ฐ„ ์ฒ˜๋ฆฌ ์ ํ•ฉ์„ฑ ์ ํ•ฉ (์‹ค์‹œ๊ฐ„ ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ์ดํ„ฐ์— ์šฉ์ด) ์ ํ•ฉํ•˜์ง€ ์•Š์Œ
์‚ฌ์šฉ ์‚ฌ๋ก€ ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ, ์†Œ๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ๋ถ„์„, ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค, BI ๋„๊ตฌ
ํŒŒ์ผ ํฌ๊ธฐ ๋น„๊ต์  ํผ ๋น„๊ต์  ์ž‘์Œ (์••์ถ• ์ ์šฉ)

TextInputFormat ์‚ฌ์šฉ ์ด์œ :

  1. ํ˜ธํ™˜์„ฑ ๋ฐ ๋‹จ์ˆœ์„ฑ:
    • ํ…์ŠคํŠธ ํŒŒ์ผ ํ˜ธํ™˜์„ฑ: ๋งŽ์€ ๋ฐ์ดํ„ฐ ์†Œ์Šค๊ฐ€ CSV, TSV ๊ฐ™์€ ํ…์ŠคํŠธ ํ˜•์‹์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ณตํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ˜ธํ™˜์„ฑ์ด ์ข‹์Šต๋‹ˆ๋‹ค.
    • ๋‹จ์ˆœ์„ฑ: ํ…์ŠคํŠธ ํ˜•์‹์€ ์ดํ•ดํ•˜๊ธฐ ์‰ฝ๊ณ , ์‚ฌ๋žŒ์ด ์ฝ๊ณ  ์“ธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ ํ™•์ธ ๋ฐ ๋””๋ฒ„๊น…์ด ์šฉ์ดํ•ฉ๋‹ˆ๋‹ค.
  2. ๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ฐ ๋กœ๋”ฉ:
    • ๋ฐ์ดํ„ฐ ์ค€๋น„๊ฐ€ ์‰ฝ๋‹ค: ํ…์ŠคํŠธ ํ˜•์‹์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๊ฑฐ๋‚˜ ์ˆ˜์ •ํ•˜๋Š” ๊ฒƒ์€ ์ƒ๋Œ€์ ์œผ๋กœ ์‰ฝ์Šต๋‹ˆ๋‹ค. ๊ฐ„๋‹จํ•œ ์Šคํฌ๋ฆฝํŠธ๋‚˜ ํ‘œ์ค€ ํ…์ŠคํŠธ ํŽธ์ง‘๊ธฐ๋งŒ์œผ๋กœ๋„ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
    • ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜์Œ ๋กœ๋”ฉํ•˜๊ฑฐ๋‚˜ ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹์„ ๋‹ค๋ฃฐ ๋•Œ๋Š” ํ…์ŠคํŠธ ํ˜•์‹์ด ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  3. ๋ฐ์ดํ„ฐ ์ŠคํŠธ๋ฆฌ๋ฐ:
    • ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ: ์‹ค์‹œ๊ฐ„ ๋˜๋Š” ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ์˜ ๊ฒฝ์šฐ, ํ…์ŠคํŠธ ํ˜•์‹์€ ๋น ๋ฅด๊ฒŒ ์ฒ˜๋ฆฌํ•˜๊ณ  ์†Œ๋น„ํ•  ์ˆ˜ ์žˆ๋Š” ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

Parquet ์‚ฌ์šฉ ์ด์œ :

  1. ํšจ์œจ์„ฑ ๋ฐ ์„ฑ๋Šฅ:
    • ์ปฌ๋Ÿผํ˜• ์ €์žฅ: Parquet์€ ์ปฌ๋Ÿผํ˜• ์ €์žฅ ํ˜•์‹์„ ์‚ฌ์šฉํ•˜์—ฌ, ํŠน์ • ์—ด๋งŒ ์ฝ๊ฑฐ๋‚˜ ์“ธ ๋•Œ ๋” ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค. ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์—์„œ ํŠน์ • ์—ด์— ๋Œ€ํ•ด ์ฟผ๋ฆฌ๋ฅผ ์‹คํ–‰ํ•  ๋•Œ ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.
    • ์••์ถ• ๋ฐ ์ธ์ฝ”๋”ฉ: Parquet์€ ๊ณ ๋„๋กœ ์••์ถ•๋œ ํ˜•์‹์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜์—ฌ ์ €์žฅ ๊ณต๊ฐ„์„ ์ ˆ์•ฝํ•˜๊ณ  I/O ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
  2. ์Šคํ‚ค๋งˆ ์ง€์›:
    • ๊ฐ•๋ ฅํ•œ ์Šคํ‚ค๋งˆ ์ง€์›: Parquet ํŒŒ์ผ์—๋Š” ์Šคํ‚ค๋งˆ ์ •๋ณด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์–ด ๋ฐ์ดํ„ฐ ํƒ€์ž…๊ณผ ๊ตฌ์กฐ๋ฅผ ๋ช…ํ™•ํžˆ ์ •์˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ์„ ์œ ์ง€ํ•˜๊ณ , ๋ถ„์„ ์ž‘์—…์„ ๋” ์‰ฝ๊ฒŒ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
  3. ๋ถ„์‚ฐ ์ฒ˜๋ฆฌ ์ตœ์ ํ™”:
    • ๋ถ„์‚ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ: Parquet์€ Hadoop, Spark, Presto, Athena ๊ฐ™์€ ๋ถ„์‚ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์‹œ์Šคํ…œ๊ณผ ์ž˜ ํ†ตํ•ฉ๋˜์–ด ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ๋ถ„์„ ์ž‘์—…์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

์„ ํƒ ๊ธฐ์ค€:

  • ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: ์†Œ๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์ด๋‚˜ ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ์˜ ๊ฒฝ์šฐ, TextInputFormat์ด ๋” ๊ฐ„ํŽธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์—์„œ๋Š” Parquet์ด ๋” ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค.
  • ์„ฑ๋Šฅ ์š”๊ตฌ ์‚ฌํ•ญ: ๋ถ„์„ ์„ฑ๋Šฅ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ, Parquet์ด ๋” ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ, ํŠน์ • ์—ด์— ๋Œ€ํ•œ ๋นˆ๋ฒˆํ•œ ์ฟผ๋ฆฌ๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ Parquet์˜ ์ปฌ๋Ÿผํ˜• ์ €์žฅ ๋ฐฉ์‹์ด ํฐ ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ: ๋ฐ์ดํ„ฐ ํƒ€์ž…๊ณผ ๊ตฌ์กฐ๋ฅผ ๋ช…ํ™•ํžˆ ์œ ์ง€ํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ Parquet์ด ๋” ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. Parquet์€ ์Šคํ‚ค๋งˆ๋ฅผ ๋‚ด์žฅํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฐ์ดํ„ฐ ํ˜•ํƒœ: ์‹ค์‹œ๊ฐ„ ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ์ดํ„ฐ๋‚˜ ๋กœ๊ทธ ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ, ํ…์ŠคํŠธ ํ˜•์‹์ด ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ…Œ์ŠคํŠธ๋ฅผ ํ•ด๋ณด์ž

  1. ํ…Œ์ด๋ธ” ๋นŒ๋“œ 1
CREATE EXTERNAL TABLE temp.test_format_table_v2 (
    user_id         bigint ,
    contents_id     bigint,
    contents_type   string,
    keycol_cnt      bigint
       )
PARTITIONED BY (
  base_dt string
       )
stored as parquet
location
  's3://bucketplace-emr-ba/hive-ba/output/temp/test_format_table'
;
  1. ํ•˜๋ฃจ์น˜ insert
INSERT INTO  temp.test_format_table_v2
select *
,'2024-04-10' as base_Dt
from ( SELECT
           user_id,
            project_name,
           contents_id,
           contents_type,
           count(DISTINCT concat(cast(contents_type AS VARCHAR), cast(contents_id AS VARCHAR))) AS keycol_cnt
       --     , base_dt
-- select *
       FROM contents.contents_meta_daily cmd
       WHERE 1 = 1
         AND base_dt = '2024-04-10'
       GROUP BY 1, 2, 3,4
       limit 100)
where 1=1

  1. ํœด์—์„œ ์ปฌ๋Ÿผ๋ณ€๊ฒฝ
-- -- content_stat_monthly_test_v2 ํ…Œ์ด๋ธ” ๊ต์ฒด
 alter table temp.test_format_table_v2
 add columns (project_name string)
-- ;

alter table temp.test_format_table_v2
change column project_name project_name string AFTER user_id
  1. ์กฐํšŒ
select * from temp.test_format_table_v2
  1. ๋ณ€๊ฒฝ๋œ ์ปฌ๋Ÿผ insert
INSERT INTO  temp.test_format_table_v2
select *
,'2024-04-11' as base_Dt
from ( SELECT
           user_id,
            project_name,
           contents_id,
           contents_type,
           count(DISTINCT concat(cast(contents_type AS VARCHAR), cast(contents_id AS VARCHAR))) AS keycol_cnt
       --     , base_dt
-- select *
       FROM contents.contents_meta_daily cmd
       WHERE 1 = 1
         AND base_dt = '2024-04-11'
       GROUP BY 1, 2, 3,4
       limit 100
       )
where 1=1

;
  1. ์กฐํšŒ
select * from temp.test_format_table_v2 where base_dt = '2024-04-11'
  • ์ž˜ ๋œ๋‹ค !

์‹ค์Šต

  • content_stat_monthly >> ๋ณ€๊ฒฝ์™„๋ฃŒ 

๊ธฐ์กด์— Textinputformat์œผ๋กœ ๋งŒ๋“  ํ…Œ์ด๋ธ”์€ ์–ด๋–กํ•˜๋‚˜ ?๋ฌธ์ œ

๐Ÿ“Œ Hue ์—์„œ ์ปฌ๋Ÿผ์ˆ˜์ •์‹œ >> ํ…Œ์ด๋ธ” ์Šคํ‚ค๋งˆ ๊ตฌ์กฐ๊ฐ€ ๋ฐ”๋€Œ๋ฉด backfill ํ•ด์ค˜์•ผ ํ•˜๋Š” ํ˜„์ƒ ์•„๋‹ˆ๋ฉด Mismatch ์˜ค๋ฅ˜๊ฐ€ ๋œน๋‹ˆ๋‹ค ใ… 

์›์ธ

  • ํ…Œ์ด๋ธ” ๋นŒ๋“œ์‹œ parquet ๊ฐ€ ์•„๋‹Œ ๊ฒฝ์šฐ (TextInputFormat) ์—๋Š” ์ปฌ๋Ÿผ ์ˆœ์„œ๋ฅผ ์ง€์ผœ์ค˜์•ผ ํ•จ
    • ์˜ํ–ฅ์„ ๋ฐ›๋Š” ํ˜•์‹
      • 'org.apache.hadoop.mapred.**TextInputFormat**'
  • ๊ทธ๋ ‡๋‹ค๋ฉด ? >>
    • ๋นŒ๋“œํ• ๋•Œ PARQUET ํ˜•์‹์œผ๋กœ ๋นŒ๋“œํ•ด์•ผํ•จ
      • external table ์ธ๋ฐ spark.sql.sources.schema ๋‹ฌ๋ฆฌ๋ฉด ์•ˆ๋จ
        • spark.sql.... property ๋ฅผ ๋ชจ๋‘ ์ œ๊ฑฐํ•˜๊ฑฐ๋‚˜
        • spark.sql.sources.schema json ์— ์ƒˆ๋กœ ์ถ”๊ฐ€ํ•˜๋Š” ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€

ROW FORMAT SERDE

  • ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    • Parquet ํ˜•์‹์„ ์‚ฌ์šฉํ•˜๋ ค๋ฉด, ์˜ฌ๋ฐ”๋ฅธ SerDe๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, **org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe**๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • ๋นŒ๋“œํ• ๋•Œ ์ฝ”๋“œ (ํ™•์ธํ•ด๋ณด๋‹ˆ ๊ตณ์ด ์•ˆํ•ด์ค˜๋„ STORED AS PARQUET ์š”๊ฒƒ๋งŒ ๋„ฃ์–ด๋„ ๋ฌด๋ฐฉํ•จ)
      • **ParquetHiveSerDe**๋Š” ์ด ๋‘ ์‹œ์Šคํ…œ ๊ฐ„์˜ ํ˜ธํ™˜์„ฑ์„ ์ œ๊ณตํ•˜์—ฌ, Hive๊ฐ€ Parquet ํŒŒ์ผ ํ˜•์‹์œผ๋กœ ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ๊ณ  ์“ธ ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
    • CREATE TABLE my_table ( ... ) STORED AS PARQUET WITH SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe';

TextInputFormat

  • **TextInputFormat**์€ ์ผ๋ฐ˜์ ์œผ๋กœ CSV๋‚˜ TSV ๊ฐ™์€ ํ…์ŠคํŠธ ํŒŒ์ผ์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ์‚ฌ์šฉ

TextInputFormat๊ณผ Parquet์„ ๋น„๊ต

 ๐Ÿ“Œ ์š”์•ฝ

  • TextInputFormat์€ ํ˜ธํ™˜์„ฑ์ด ๋†’๊ณ  ์ดํ•ดํ•˜๊ธฐ ์‰ฌ์šฐ๋ฉฐ, ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ์ด๋‚˜ ์†Œ๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋งŒ, ์••์ถ• ๋ฐ I/O ํšจ์œจ์„ฑ์ด ๋‚ฎ๊ณ  ์Šคํ‚ค๋งˆ๋ฅผ ์ง€์›ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
  • Parquet์€ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ๋ถ„์„์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์œผ๋ฉฐ, ๋†’์€ ์••์ถ•์œจ๊ณผ ์ปฌ๋Ÿผ ๋‹จ์œ„ ์ฒ˜๋ฆฌ๋กœ ์ฟผ๋ฆฌ ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•ฉ๋‹ˆ๋‹ค. ์Šคํ‚ค๋งˆ๋ฅผ ๋‚ด์žฅํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์ด์ง„ ํ˜•์‹์œผ๋กœ ์‚ฌ๋žŒ์ด ์ฝ๊ธฐ ์–ด๋ ต๊ณ  ๋ฐ์ดํ„ฐ ์ค€๋น„๊ฐ€ ๋ณต์žกํ•ฉ๋‹ˆ๋‹ค. </aside>

ํŠน์ง• TextInputFormat Parquet

๋ฐ์ดํ„ฐ ํ˜•์‹ ํ…์ŠคํŠธ (CSV, TSV ๋“ฑ) ์ปฌ๋Ÿผํ˜•
ํ˜ธํ™˜์„ฑ ๋†’์€ ํ˜ธํ™˜์„ฑ (๋งŽ์€ ๋ฐ์ดํ„ฐ ์†Œ์Šค๊ฐ€ ํ…์ŠคํŠธ ํ˜•์‹) ์ฃผ๋กœ ๋น…๋ฐ์ดํ„ฐ ๋ฐ ๋ถ„์„ ๋„๊ตฌ (Hadoop, Spark ๋“ฑ)
์ดํ•ด ๋ฐ ์ˆ˜์ • ์šฉ์ด์„ฑ ์‚ฌ๋žŒ ์ฝ๊ธฐ ๋ฐ ์ˆ˜์ • ์šฉ์ด ์ด์ง„ ํ˜•์‹, ์‚ฌ๋žŒ์ด ์ฝ๊ธฐ ์–ด๋ ค์›€
์••์ถ• ๋ฐ ์ €์žฅ ํšจ์œจ์„ฑ ์ผ๋ฐ˜์ ์œผ๋กœ ๋‚ฎ์Œ ๋†’์€ ์••์ถ•์œจ ๋ฐ ์ €์žฅ ํšจ์œจ์„ฑ
์Šคํ‚ค๋งˆ ์ง€์› ์—†์Œ (์™ธ๋ถ€์—์„œ ์ •์˜ ํ•„์š”) ๊ฐ•๋ ฅํ•œ ์Šคํ‚ค๋งˆ ๋‚ด์žฅ
์ฟผ๋ฆฌ ์„ฑ๋Šฅ ์ „์ฒด ํ–‰ ์Šค์บ” ํ•„์š” ํŠน์ • ์ปฌ๋Ÿผ๋งŒ ์ฝ๊ธฐ ๊ฐ€๋Šฅ, ๋†’์€ ์ฟผ๋ฆฌ ์„ฑ๋Šฅ
I/O ์„ฑ๋Šฅ ๋‚ฎ์Œ ๋†’์Œ (์ปฌ๋Ÿผ ๋‹จ์œ„ ์ฒ˜๋ฆฌ)
๋ฐ์ดํ„ฐ ์ค€๋น„ ์šฉ์ด์„ฑ ์‰ฌ์›€ (๊ฐ„๋‹จํ•œ ์Šคํฌ๋ฆฝํŠธ๋‚˜ ํŽธ์ง‘๊ธฐ๋กœ ์ƒ์„ฑ ๊ฐ€๋Šฅ) ๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ณต์žก, ์ „์šฉ ๋„๊ตฌ ํ•„์š”
์‹ค์‹œ๊ฐ„ ์ฒ˜๋ฆฌ ์ ํ•ฉ์„ฑ ์ ํ•ฉ (์‹ค์‹œ๊ฐ„ ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ์ดํ„ฐ์— ์šฉ์ด) ์ ํ•ฉํ•˜์ง€ ์•Š์Œ
์‚ฌ์šฉ ์‚ฌ๋ก€ ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ, ์†Œ๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ๋ถ„์„, ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค, BI ๋„๊ตฌ
ํŒŒ์ผ ํฌ๊ธฐ ๋น„๊ต์  ํผ ๋น„๊ต์  ์ž‘์Œ (์••์ถ• ์ ์šฉ)

TextInputFormat ์‚ฌ์šฉ ์ด์œ :

  1. ํ˜ธํ™˜์„ฑ ๋ฐ ๋‹จ์ˆœ์„ฑ:
    • ํ…์ŠคํŠธ ํŒŒ์ผ ํ˜ธํ™˜์„ฑ: ๋งŽ์€ ๋ฐ์ดํ„ฐ ์†Œ์Šค๊ฐ€ CSV, TSV ๊ฐ™์€ ํ…์ŠคํŠธ ํ˜•์‹์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ณตํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ˜ธํ™˜์„ฑ์ด ์ข‹์Šต๋‹ˆ๋‹ค.
    • ๋‹จ์ˆœ์„ฑ: ํ…์ŠคํŠธ ํ˜•์‹์€ ์ดํ•ดํ•˜๊ธฐ ์‰ฝ๊ณ , ์‚ฌ๋žŒ์ด ์ฝ๊ณ  ์“ธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ ํ™•์ธ ๋ฐ ๋””๋ฒ„๊น…์ด ์šฉ์ดํ•ฉ๋‹ˆ๋‹ค.
  2. ๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ฐ ๋กœ๋”ฉ:
    • ๋ฐ์ดํ„ฐ ์ค€๋น„๊ฐ€ ์‰ฝ๋‹ค: ํ…์ŠคํŠธ ํ˜•์‹์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๊ฑฐ๋‚˜ ์ˆ˜์ •ํ•˜๋Š” ๊ฒƒ์€ ์ƒ๋Œ€์ ์œผ๋กœ ์‰ฝ์Šต๋‹ˆ๋‹ค. ๊ฐ„๋‹จํ•œ ์Šคํฌ๋ฆฝํŠธ๋‚˜ ํ‘œ์ค€ ํ…์ŠคํŠธ ํŽธ์ง‘๊ธฐ๋งŒ์œผ๋กœ๋„ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
    • ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜์Œ ๋กœ๋”ฉํ•˜๊ฑฐ๋‚˜ ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹์„ ๋‹ค๋ฃฐ ๋•Œ๋Š” ํ…์ŠคํŠธ ํ˜•์‹์ด ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  3. ๋ฐ์ดํ„ฐ ์ŠคํŠธ๋ฆฌ๋ฐ:
    • ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ: ์‹ค์‹œ๊ฐ„ ๋˜๋Š” ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ์˜ ๊ฒฝ์šฐ, ํ…์ŠคํŠธ ํ˜•์‹์€ ๋น ๋ฅด๊ฒŒ ์ฒ˜๋ฆฌํ•˜๊ณ  ์†Œ๋น„ํ•  ์ˆ˜ ์žˆ๋Š” ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

Parquet ์‚ฌ์šฉ ์ด์œ :

  1. ํšจ์œจ์„ฑ ๋ฐ ์„ฑ๋Šฅ:
    • ์ปฌ๋Ÿผํ˜• ์ €์žฅ: Parquet์€ ์ปฌ๋Ÿผํ˜• ์ €์žฅ ํ˜•์‹์„ ์‚ฌ์šฉํ•˜์—ฌ, ํŠน์ • ์—ด๋งŒ ์ฝ๊ฑฐ๋‚˜ ์“ธ ๋•Œ ๋” ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค. ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์—์„œ ํŠน์ • ์—ด์— ๋Œ€ํ•ด ์ฟผ๋ฆฌ๋ฅผ ์‹คํ–‰ํ•  ๋•Œ ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.
    • ์••์ถ• ๋ฐ ์ธ์ฝ”๋”ฉ: Parquet์€ ๊ณ ๋„๋กœ ์••์ถ•๋œ ํ˜•์‹์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜์—ฌ ์ €์žฅ ๊ณต๊ฐ„์„ ์ ˆ์•ฝํ•˜๊ณ  I/O ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
  2. ์Šคํ‚ค๋งˆ ์ง€์›:
    • ๊ฐ•๋ ฅํ•œ ์Šคํ‚ค๋งˆ ์ง€์›: Parquet ํŒŒ์ผ์—๋Š” ์Šคํ‚ค๋งˆ ์ •๋ณด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์–ด ๋ฐ์ดํ„ฐ ํƒ€์ž…๊ณผ ๊ตฌ์กฐ๋ฅผ ๋ช…ํ™•ํžˆ ์ •์˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ์„ ์œ ์ง€ํ•˜๊ณ , ๋ถ„์„ ์ž‘์—…์„ ๋” ์‰ฝ๊ฒŒ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
  3. ๋ถ„์‚ฐ ์ฒ˜๋ฆฌ ์ตœ์ ํ™”:
    • ๋ถ„์‚ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ: Parquet์€ Hadoop, Spark, Presto, Athena ๊ฐ™์€ ๋ถ„์‚ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์‹œ์Šคํ…œ๊ณผ ์ž˜ ํ†ตํ•ฉ๋˜์–ด ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ๋ถ„์„ ์ž‘์—…์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

์„ ํƒ ๊ธฐ์ค€:

  • ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: ์†Œ๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์ด๋‚˜ ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ์˜ ๊ฒฝ์šฐ, TextInputFormat์ด ๋” ๊ฐ„ํŽธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์—์„œ๋Š” Parquet์ด ๋” ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค.
  • ์„ฑ๋Šฅ ์š”๊ตฌ ์‚ฌํ•ญ: ๋ถ„์„ ์„ฑ๋Šฅ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ, Parquet์ด ๋” ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ, ํŠน์ • ์—ด์— ๋Œ€ํ•œ ๋นˆ๋ฒˆํ•œ ์ฟผ๋ฆฌ๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ Parquet์˜ ์ปฌ๋Ÿผํ˜• ์ €์žฅ ๋ฐฉ์‹์ด ํฐ ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ: ๋ฐ์ดํ„ฐ ํƒ€์ž…๊ณผ ๊ตฌ์กฐ๋ฅผ ๋ช…ํ™•ํžˆ ์œ ์ง€ํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ Parquet์ด ๋” ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. Parquet์€ ์Šคํ‚ค๋งˆ๋ฅผ ๋‚ด์žฅํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฐ์ดํ„ฐ ํ˜•ํƒœ: ์‹ค์‹œ๊ฐ„ ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ์ดํ„ฐ๋‚˜ ๋กœ๊ทธ ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ, ํ…์ŠคํŠธ ํ˜•์‹์ด ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ…Œ์ŠคํŠธ๋ฅผ ํ•ด๋ณด์ž

  • ํ…Œ์ด๋ธ” ๋นŒ๋“œ 1
CREATE EXTERNAL TABLE temp.test_format_table_v2 (
    user_id         bigint ,
    contents_id     bigint,
    contents_type   string,
    keycol_cnt      bigint
       )
PARTITIONED BY (
  base_dt string
       )
stored as parquet
location
  's3://bucketplace-emr-ba/hive-ba/output/temp/test_format_table'
;
  • ํ•˜๋ฃจ์น˜ insert
INSERT INTO  temp.test_format_table_v2
select *
,'2024-04-10' as base_Dt
from ( SELECT
           user_id,
            project_name,
           contents_id,
           contents_type,
           count(DISTINCT concat(cast(contents_type AS VARCHAR), cast(contents_id AS VARCHAR))) AS keycol_cnt
       --     , base_dt
-- select *
       FROM contents.contents_meta_daily cmd
       WHERE 1 = 1
         AND base_dt = '2024-04-10'
       GROUP BY 1, 2, 3,4
       limit 100)
where 1=1

  • ํœด์—์„œ ์ปฌ๋Ÿผ๋ณ€๊ฒฝ
-- -- content_stat_monthly_test_v2 ํ…Œ์ด๋ธ” ๊ต์ฒด
 alter table temp.test_format_table_v2
 add columns (project_name string)
-- ;

alter table temp.test_format_table_v2
change column project_name project_name string AFTER user_id
  • ์กฐํšŒ
select * from temp.test_format_table_v2
  • ๋ณ€๊ฒฝ๋œ ์ปฌ๋Ÿผ insert
INSERT INTO  temp.test_format_table_v2
select *
,'2024-04-11' as base_Dt
from ( SELECT
           user_id,
            project_name,
           contents_id,
           contents_type,
           count(DISTINCT concat(cast(contents_type AS VARCHAR), cast(contents_id AS VARCHAR))) AS keycol_cnt
       --     , base_dt
-- select *
       FROM contents.contents_meta_daily cmd
       WHERE 1 = 1
         AND base_dt = '2024-04-11'
       GROUP BY 1, 2, 3,4
       limit 100
       )
where 1=1

;
  • ์กฐํšŒ
select * from temp.test_format_table_v2 where base_dt = '2024-04-11'
  • ์ž˜ ๋œ๋‹ค !

์‹ค์Šต

  • content_stat_monthly >> ๋ณ€๊ฒฝ์™„๋ฃŒ 

๊ธฐ์กด์— Textinputformat์œผ๋กœ ๋งŒ๋“  ํ…Œ์ด๋ธ”์€ ์–ด๋–กํ•˜๋‚˜ ?

๋ฐ˜์‘ํ˜•

๋Œ“๊ธ€