COUPANG
Almost 10 years ago, there was a time when I thought Hadoop was all there was to big data. I believed I was keeping up with things like Hive, Spark, and DAG configuration when I met someone from Google and asked how Google uses data. I was curious how they configured and managed pipelines to process data at Google's scale. He didn't even understand what I was asking. At Google, queries mostly run directly against the raw (ledger) data, so they contain many joins and are cumbersome to write, but no separate pipeline is needed. I couldn't help but be amazed. Big data implementations in the open source camp, such as Hadoop, were often inferior implementations of Google's papers, and Google was already far ahead.
To put it in today's terms, I was asking how he did ETL, and he answered that he did ELT. Data engineers sometimes become a bottleneck at the transform stage, and ELT removes that bottleneck by loading the raw data first and leaving the transformation to query time. More team members can get closer to the data, which is a great help for data-driven decision-making.
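The ELT idea above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for a warehouse (it is not BigQuery), and the table names `raw_orders` and `raw_customers` and their contents are hypothetical. Raw records are loaded untouched, and the "transform" is just a join written at query time.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; this is NOT BigQuery.
# Table names and rows are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE raw_customers (customer_id INTEGER, city TEXT);
""")

# Extract + Load: raw records go in as-is, with no transform step in between.
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10, 12.5), (2, 11, 8.0), (3, 10, 20.0)],
)
conn.executemany(
    "INSERT INTO raw_customers VALUES (?, ?)",
    [(10, "Seoul"), (11, "Busan")],
)

# Transform: done at query time, with a join directly on the raw tables.
query = """
    SELECT c.city, SUM(o.amount) AS total_amount
    FROM raw_orders AS o
    JOIN raw_customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.city
    ORDER BY c.city
"""
amount_by_city = conn.execute(query).fetchall()
print(amount_by_city)
```

The trade-off matches the one described above: the query carries the join itself, which is more cumbersome to write, but no one has to build and maintain a pipeline before the question can be asked.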
I first encountered BigQuery at Lyft in 2018. At the time, Lyft mainly used AWS and its data infrastructure was S3 + Redshift, but because it was difficult to keep that setup stable, the company tried going back to HDFS + Hive. Hive is stable for large-scale jobs but slow, so we reviewed several other solutions and ended up using Google BigQuery. All event data was being stored in AWS, and we went as far as using Lambda functions to copy those events to Google Cloud Storage in real time so that we could use BigQuery. It met our requirements of being stable, fast, and inexpensive. The built-in query and visualization tools are decent, so the learning curve is gentle, and newer features such as ML are good too.
Knowing these advantages, I looked into using BigQuery when I came to Korea, but at the time Google Cloud had no Seoul region, and because we were a fintech startup, there were various restrictions. Now that I have some time, I'm looking at cloud service providers other than AWS, and BigQuery is still attractive. These days, best practices are well established, so even startups no longer see scalable back-end engineering and stable operations as especially difficult, and it looks like building and using data infrastructure will soon reach the same level. BigQuery and AutoML will be the two main pillars. (Leaving aside the fierce competition among Big Tech companies that build their own infrastructure.)
This book stays faithful to its original purpose of getting you familiar with BigQuery, but it also helps you understand the bigger picture described above. You can learn or review SQL in Chapters 2, 3, 7, and 8, understand data preprocessing in Chapters 4 and 5, and learn about the BigQuery architecture and related Google technologies in Chapters 1 and 7. Chapter 9 covers machine learning only briefly, but it provides good examples of accessing and using data to improve business competitiveness, so it will be helpful to anyone working at a tech company, whatever their role.
Lastly, I liked how smoothly the translation reads. I tend to read O'Reilly books in the original, but with this one there was no need.