This article dives into how Spark SQL executes multi-table JOINs and how we achieved really fast JOINs using DataStax DSE (Spark and Cassandra) and FiloDB.
Many of you know that I’m into photography as a side hobby. For many years I was happily shooting Canon DSLRs and using Apple Aperture to organize and develop photos. In 2014, Aperture development stopped, but I hadn’t had time to research alternatives. This year, I picked up a Fuji X-series mirrorless camera and renewed my devotion to photography. I love it, by the way, but that’s the subject of another post. I also joined multiple online groups. It was the perfect time to take a step back, re-examine my workflow, and see if I could simplify it.
What’s wrong with this Scala for-comprehension, which attempts to carry out three asynchronous I/O operations? (This comes from FiloDB’s actor code for creating a new dataset. There are three operations that must be done: first, creating a dataset; second, creating a whole bunch of column definitions; and third, initializing the actual data table. The second future is not supposed to execute if the first one did not return
Apache Spark is increasingly thought of as the new jack-of-all-trades distributed platform for big data crunching – what with everything from traditional MapReduce-like workloads, streaming, graph computation, statistics, and machine learning all in one package. Except for Spark Streaming, with its micro-batches, Spark is focused for the most part on higher-latency, rich/complex analytics workloads. What about using Spark as an embedded, web-speed / low-latency query engine? This post will dive into using Apache Spark for low-latency, higher concurrency reporting / dashboard / SQL-like applications - up to hundreds of queries a second!
Apache Cassandra is one of the most widely used distributed NoSQL databases in the modern open source data engineering repertoire. While lots of posts exist on best practices, data modeling, and specific features, there hasn’t really been a comprehensive performance analysis that compares the two key TCO metrics: storage cost and query efficiency. This post that I just wrote should give users and architects deep insights into the key factors affecting Cassandra TCO when used for analytics.
In the course of working on FiloDB and other Scala data apps, we often have to work with data whose types are not known at compile time. For FiloDB, for example, a user can ingest data with any schema. At runtime, the schema tells us what type each column of data is. What are some approaches we can take to work with data like that?
If you are a big data analyst, or build big data solutions for fast analytical queries, you are likely familiar with columnar storage technologies. The open source Parquet file format for HDFS saves space and powers query engines from Spark to Impala and more, while cloud solutions like Amazon Redshift use columnar storage to speed up queries and minimize I/O. Being a file format, Parquet is much more challenging to work with directly for real-time data ingest. For applications like IoT, time-series, and event data analytics, many developers have turned to NoSQL databases such as Apache Cassandra, due to their combination of high write scalability and the ease of using an idempotent, primary key-based database API. Most NoSQL databases are not designed for fast, bulk analytical scans, but instead for highly concurrent key-value lookups. What is missing is a solution that combines the ease of use of a database API and the scalability of NoSQL databases with columnar storage technology for fast analytics.
Docker is the new hotness for app deployment – and for good reason. It seems all the infrastructure providers are supporting it, as is Mesos, etc. This is a guide to dockerizing your Scala apps using sbt-docker, as well as setting up a dev environment for Docker on OS X. This guide will show how to use the power of Scala and SBT to generate Docker configs and images, and you can never have enough blog posts to teach SBT :)
I just decided on my next software job, and I have an immense sense of peace about my decision. I’d like to share about the extended time I spent reflecting and deciding, so that you won’t just rush to the next big name / hot startup / highest paying job, but really aim to make a values-based career decision you’ll be at peace with. I’ll share about my values of relationship and “love your neighbor”, how they came to influence my decision process, and how they apply to the software world.
I gave this talk at the inaugural SF Spark and Friends Meetup group in San Francisco during the week of the Spark Summit this year. While researching this talk, I realized there is very little material out there giving an overview of the many rich options for deploying and configuring Apache Spark. There are some specific articles by vendors - targeting YARN, or DSE, etc., but I think what developers really want is a broad overview. So, this post will give you that, but you will have to look through the slides here to dig through the meat of it.
Welcome to my humble blog! I’ll be writing about random big data / distributed systems / high-tech / Scala / Cassandra / Spark topics, or if I’m inspired, topics about life!