Tag: spark

A Summary of MapReduce: Background, Processes, Example & Extension

Post author By Kittisak Chotikkakamthorn
Post date March 16, 2024

Recently, I read an interesting research article titled “MapReduce: Simplified Data Processing on Large Clusters” written by Google employees Jeffrey Dean and Sanjay Ghemawat.

After reading the article, I summarized its key points, including the background, the processes, and how the idea was later extended into Apache Hadoop.
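To make the processes concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using the word-count example from the paper. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample sentences are my own assumptions for illustration. This is not how Google's implementation or Hadoop is actually written, since those run the same steps across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, only for illustration.
counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
```

Because each call to `map_phase` only looks at one document and each call to `reduce_phase` only looks at one key, the calls do not depend on each other, which is what lets a real cluster run them in parallel.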

Tags Apache Hadoop, Big Data, data engineer, Data Processing, Distributed Processing, Flink, Hadoop, Map, MapReduce, Parallel Processing, Shuffle, spark

Computer Data

#22 MapReduce: Origins, How It Works, and How to Use It

Post author By Kittisak Chotikkakamthorn
Post date March 12, 2024

In the previous posts I wrote about Data Structures & Algorithms, covering Big-O Notation, Searching and Sorting Algorithms, and Shortest Path algorithms such as Dijkstra’s and Bellman-Ford’s Algorithm, as well as the A* Search Algorithm.

This time we move on to a topic that is one of the fundamentals of data: MapReduce.

Tags Big Data, data, data engineering, Distributed Processing, Flink, google, Hadoop, Map, MapReduce, Parallel Processing, programming, spark, ดาต้า, บิ๊กดาต้า

Computer Data

#14 Pulling Data from a Database to Show in a Dashboard

Post author By Kittisak Chotikkakamthorn
Post date January 31, 2024

Following on from the previous project, which built a Data Pipeline to pull Excel files from the website of the Ministry of Higher Education, Science, Research and Innovation (MHESI), this time we do another project: building a Data Pipeline that pulls data from a database in order to build a Dashboard.

Computer Data

#13 Building a Data Pipeline to Pull Annual Cost-per-Student Data

Post author By Kittisak Chotikkakamthorn
Post date January 26, 2024

The picture compares a transport pipeline with a Data Pipeline. Both convey the same idea: moving something from a Source to a Destination.

A Data Pipeline is the process of moving data from a Data Source to a Destination.

The benefits of building a Data Pipeline this way are that it gathers the data into one place (Locality), that it removes the need to connect the Data Source directly to the Destination (Decoupling), and that it can be repeated (Reproducible), so the stored data can be reprocessed as many times as we like [1]
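As a rough sketch of Locality, Decoupling, and Reproducible, the example below copies rows from a source into a staging area once, then processes the staged copy twice with different logic. All of it is assumed for illustration: the `extract` and `load` names, the in-memory CSV buffer standing in for staging storage, and the sample rows are hypothetical and are not taken from the project in the post.

```python
import csv
import io


def extract(source_rows, staging):
    """Copy raw rows from the source into the staging buffer as CSV."""
    csv.writer(staging).writerows(source_rows)


def load(staging, transform):
    """Read the staged copy from the start and apply a transform to each row."""
    staging.seek(0)
    return [transform(row) for row in csv.reader(staging)]


# Hypothetical source data: (year, cost) pairs stored as text.
source_rows = [["2022", "100"], ["2023", "120"]]

# Locality: everything lands in one staging area.
staging = io.StringIO()
extract(source_rows, staging)

# Decoupling: the destination logic reads only from staging, never from the source.
first_run = load(staging, lambda row: (int(row[0]), int(row[1])))

# Reproducible: the same staged data is reprocessed with different logic.
second_run = load(staging, lambda row: (int(row[0]), int(row[1]) * 2))
```

In a real pipeline the staging buffer would be a file, an object store, or a database table, but the shape stays the same: extract once, then reprocess from the stored copy whenever the logic changes.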