(Note: this is adapted from my talk at 2021 Scale by the Bay, Location-Based Data Engineering for Good)
Hey everyone, it’s been a while since I’ve blogged. As many of you know we’ve moved up to BC, Canada. Some of you might ask why, and discussing that is worth many blogs in itself. I’d like to keep this post focused on tips for moving up here, aimed towards those moving from the US or similar places – especially as it relates to cars, immigration and other topics. Some of it is BC-focused but most of it should apply to Canada in general. We’re doing fine - as fine as can be expected in a pandemic - but I’ll just say for fun that I’m still getting used to living here in a temperate rainforest (lots of grey skies, but tons of wildlife, very green, and almost all the electric power here comes from water), and that I’ve never had so much Korean fried chicken in my life before. :)
Yes, as an immigrant myself, I care very deeply about those families separated at the border trying to come into the country and escape war and gang violence in Central America. No, this is not a post about politics or why. It’s just a really short post to help those who want to help, a list of resources.
Writing high performance code is not easy in any language. I’m hoping to share some tips learned from working on many high performance data projects, including FiloDB, my high performance, Spark-based analytical database written all in Scala! Also, general tips on concurrency, actors, and other relevant topics will be included.
The election of Donald Trump was like an earthquake shattering our quiet, small, mountain town lives. Immigrants and people of color were targeted all over the country, as the election seemed to embolden the nativist/racist/xenophobic elements in the nation. Folks who looked like me were being told to get out of the country, and in some cases shot at and killed. Others had eggs thrown at them, had their garage doors vandalized, were shot at by police, or were violently dragged off United Airlines planes (don’t ever fly United!). We were confused and scared - and suddenly, acutely aware that we were like the ONLY Asian family in town, amidst a sea of white folks. Everyone around us was nice, but when it came to racial hatred, nobody seemed to understand how we felt (yes, we did engage in some dialog). We felt very alone. I thought about writing something, but then voices popped up. Maybe we should just assimilate. After all, we weren’t being physically harmed ourselves. We just had to brush off rude, annoying, and racist Facebook comments about dead Chinese bodies at a nearby lake and things like that. Life was busy enough with three kids and all their drama. Just do the Asian thing - stay silent and outwork everybody else. Put your head down. Go to church. Enjoy the mountains!
This is my May assignment for Exploring the Frontier’s “WE35” year-long exploration of the 35mm focal length in photography. If you are into photography and are interested in joining a passionate and welcoming community united by the goal of learning and honing your craft, I highly suggest you check it out. Armando and Justin are great leaders and teachers.
This article dives into how Spark SQL does multi-table JOINs and how we achieved really fast JOINs using Datastax DSE (Spark and Cassandra), and FiloDB.
Many of you know that I’m into photography as a side hobby. For many years I was happily shooting Canon DSLRs and using Apple Aperture to organize and develop photos. In 2014, Aperture development stopped, but I hadn’t had time to research alternatives. This year, I picked up a Fuji X-series mirrorless, and renewed my devotion to photography. I love it - by the way - but that’s the subject of another post. I joined multiple online groups. It was the perfect time to take a step back to re-examine my workflow and see if I could simplify it.
What’s wrong with this Scala for-comprehension, which attempts to carry out three asynchronous I/O operations? (This comes from FiloDB’s actor code for creating a new dataset. There are three operations that must be done: First, creating a dataset, second, creating a whole bunch of column definitions, and third, initializing the actual data table. The second future is not supposed to execute if the first one did not return
Apache Spark is increasingly thought of as the new jack-of-all-trades distributed platform for big data crunching – what with everything from traditional MapReduce-like workloads, streaming, graph computation, statistics, and machine learning all in one package. Except for Spark Streaming, with its micro-batches, Spark is focused for the most part on higher-latency, rich/complex analytics workloads. What about using Spark as an embedded, web-speed / low-latency query engine? This post will dive into using Apache Spark for low-latency, higher concurrency reporting / dashboard / SQL-like applications - up to hundreds of queries a second!
Apache Cassandra is one of the most widely used distributed NoSQL databases in the modern open source data engineering repertoire. While lots of posts exist on best practices, data modeling and specific features, there hasn’t really been a comprehensive performance analysis that compares the two key TCO metrics: storage cost and query efficiency. This post that I just wrote should give users and architects deep insights into the key factors affecting Cassandra TCO when used for analytics.
In the course of working on FiloDB and other Scala data apps, we often have to work with data whose types are not known at compile time. For FiloDB, for example, a user can ingest data with any schema. At runtime, the schema tells us what type each column of data is. What are some approaches we can take to work with data like that?
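As a taste of one simple approach, here is a minimal sketch that models the runtime schema as a sealed trait and pattern matches on it to decide how each field gets parsed. All the names here (`ColumnType`, `Column`, `DynamicSchema`) are made up for illustration and are not FiloDB's actual API, and the `Seq[Any]` result is the naive option that gives up static typing and boxes every value.

```scala
// Hypothetical names throughout; this is an illustration, not FiloDB's code.
sealed trait ColumnType
case object IntColumn extends ColumnType
case object DoubleColumn extends ColumnType
case object StringColumn extends ColumnType

// A column definition as it might be read from a schema at runtime.
final case class Column(name: String, colType: ColumnType)

object DynamicSchema {
  // Parse one raw field according to the type the schema reports at runtime.
  // The match is exhaustive because ColumnType is sealed.
  def parseField(raw: String, colType: ColumnType): Any = colType match {
    case IntColumn    => raw.toInt
    case DoubleColumn => raw.toDouble
    case StringColumn => raw
  }

  // Parse a whole row by pairing each raw field with its column definition.
  def parseRow(schema: Seq[Column], raw: Seq[String]): Seq[Any] = {
    require(schema.length == raw.length, "row width must match schema width")
    schema.zip(raw).map { case (col, field) => parseField(field, col.colType) }
  }
}
```

Given a schema of `Seq(Column("id", IntColumn), Column("price", DoubleColumn), Column("name", StringColumn))`, calling `DynamicSchema.parseRow(schema, Seq("42", "9.5", "widget"))` yields an `Int`, a `Double` and a `String` in that order. The trade-off is that every consumer has to cast or match again on the way out, which is part of why other approaches are worth considering.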
If you are a big data analyst, or build big data solutions for fast analytical queries, you are likely familiar with columnar storage technologies. The open source Parquet file format for HDFS saves space and powers query engines from Spark to Impala and more, while cloud solutions like Amazon Redshift use columnar storage to speed up queries and minimize I/O. Being a file format, Parquet is much more challenging to work with directly for real-time data ingest. For applications like IoT, time-series, and event data analytics, many developers have turned to NoSQL databases such as Apache Cassandra, due to their combination of high write scalability and the ease of using an idempotent, primary key-based database API. Most NoSQL databases are not designed for fast, bulk analytical scans, but instead for highly concurrent key-value lookups. What is missing is a solution that combines the ease of use of a database API and the scalability of NoSQL databases with columnar storage technology for fast analytics.
Docker is the new hotness for app deployment – and for good reason. It seems all the infrastructure providers are supporting it, as is Mesos, etc. This is a guide to dockerizing your Scala apps using sbt-docker as well as setting up a dev environment for Docker on OSX. This guide will show how to use the power of Scala and SBT to generate Docker configs and images, and you can never have enough blog posts to teach SBT :)
I just decided on my next software job, and I have an immense sense of peace about my decision. I’d like to share about the extended time I spent to reflect and decide – so that you won’t just rush to the next big name / hot startup / highest paying job, but really aim to make a values-based career decision you’ll be at peace with. I’ll share about my values of relationship and “love your neighbor”, how they came to influence my decision process, and how they apply to the software world.
I gave this talk at the inaugural SF Spark and Friends Meetup group in San Francisco during the week of the Spark Summit this year. While researching this talk, I realized there is very little material out there giving an overview of the many rich options for deploying and configuring Apache Spark. There are some specific articles by vendors - targeting YARN, or DSE, etc., but I think what developers really want is a broad overview. So, this post will give you that, but you will have to look through the slides here to dig through the meat of it.
Welcome to my humble blog! I’ll be writing about random big data / distributed systems / high-tech / Scala / Cassandra / Spark topics, or if I’m inspired, topics about life!