I’m a software developer and data engineer with more than 20 years of experience leading and building data systems and databases at massive scale, some of them productionized at companies such as Apple. I’m also a hobbyist photographer. Community and diversity, including gender diversity, are important to me, as is tackling global challenges.
This page summarizes my professional credentials, including major open source projects. You will find code, documentation, talks, and interviews below.
Keywords: Spark, Kafka, SQL, OLAP, data warehouse, database, time series, Prometheus, distributed systems, data engineering
Software Projects and Experience
Most of these are projects I created or co-created :)
FiloDB (2015-)
FiloDB and filo - high performance distributed Prometheus time series database.
I created FiloDB from scratch and led its development from before Apple acquired my company all the way to its productionization as the core time series database technology at Apple, where it has handled billions of time series metrics and Prometheus PromQL queries since 2019.
- Heavily involved in the design, architecture, and development of virtually all components, including high speed data processing, columnar format, recovery, indexing, and the PromQL query engine which we wrote from scratch
- Got FiloDB productionized at Apple, where it has been in production since 2019
- Scales up to process hundreds of thousands of Prometheus data samples per second per node
- Innovative histogram support, way more scalable than what is in Prometheus
- Created many innovative custom data structures for very high performance, many of them zero-copy and with minimal allocations
- Besides serving as tech lead, also led the related open source efforts
FiloDB started its life as an Apache Spark-compatible OLAP layer and datasource on top of Apache Cassandra. This means traditional data warehouse applications can be supported on top of Cassandra thanks to a custom columnar data format which sped up queries by orders of magnitude.
- Worked with enterprise customers on integrating FiloDB as a data warehouse
- Worked on data modeling and custom Apache Spark query optimizers
Rust / Columnar Compression
I am a columnar compression and OLAP expert, and have created multiple columnar formats, including the one used in FiloDB.
- ying-profiler - a sampling Rust memory profiler I wrote, designed for production use, which can track allocation length
- compressed-vec - Rust SIMD columnar compression library, with an innovative format designed for very fast reads, especially of sparse columnar data
- telemetry-subscribers - A bunch of tracing subscribers for common app telemetry, such as Jaeger, span instrumentation, Tokio console, etc.
I have also worked on query engines based on Apache Arrow (Rust) and DataFusion, and contributed to the Arrow Rust codebase.
Other Data Projects I Created
- Spark Job Server - REST API for Apache Spark job submission, logging, control
- Job Server is being included in DataStax Enterprise!
- Scala links - useful links for learning Scala and Scala projects
- ScalaStorm - Scala API for Apache Storm real-time stream processor
- msgpack4s - a fast, streaming-friendly, type-safe MessagePack library for Scala
ScalaStorm was used for Ooyala’s real-time analytics application, which started its life as my special project and which I led through to productionization.
Other Projects I Have Contributions To
- https://github.com/MystenLabs/sui - Layer1 Blockchain
- https://github.com/apache/arrow-rs - contributed timestamp column functionality to Arrow Rust
- Apache Spark
Presentations and Conferences
See SlideShare and my presentations site
MinneAnalytics 2024 - Time-State Analytics
SBTB 2023 - Porting a Streaming Data Pipeline from Scala to Rust - slides
SBTB 2021 - Location-Based Data Engineering for Good - slides
CNCF 2021 Rust Day - Allocating Less: Really Thin Rust Cloud Apps
Reactive Summit 2020 - Designing Stateful Apps for Cloud and Kubernetes
SBTB 2019 - Rust and Scala, Sitting in a Tree - YouTube video
Monitorama PDX 2019 - Rich Histograms at Scale: A New Hope
SBTB 2018 - FiloDB: Real-time, In-Memory Time Series at Massive SMACK Scale (video)
KEYNOTE - Reactive Summit 2018 - FiloDB: Reactive, Real-time, In-Memory Time Series at Scale
Scale by the Bay 2017 - High Performance Database with Scala, Akka, Spark
Scala by the Bay 2016 - Building a High-Performance Database in Scala, Akka, and Spark
Spark Summit 2016 - 700 Queries Per Second with Updates: Spark as a Real-Time Web Service and video
Strata San Jose 2016 - NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics
Strata Singapore 2015 - Breakthrough OLAP on Cassandra and Spark - sorry, the video doesn’t seem to be up yet, but here is the synopsis.
Spark Summit EU 2015 talk - Productionizing Spark and the Spark Job Server - video and slides
Big Data Scala 2015 - End to End Pipeline Training and Breakthrough OLAP on Cassandra and Spark
SF Spark and Friends - Nov 2015 - FiloDB: Combining Spark Streaming and Ad-Hoc Analytics
Scala Days 2015 talk on Productionizing Akka
SF Spark and Friends, Cassandra South Bay Meetup, Scala Days 2015 SF, FOSS4G-NA 2015, Cassandra Summit 2013 and 2014, Spark Summit 2013 and 2014
Interviews, WebCasts, Blogs
I appeared on the Data Exchange podcast episode 2023 Trends in Data Engineering and Infrastructure with good friends Ben Lorica and Jesse Anderson.
Reactive Foundation blog post - Decoupling Space: Create Flexibility by Embracing the Network - my blog post on how Actors and message passing decouple space and allow for flexible app architecture
O’Reilly Webcast - Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming
Typesafe blog/interview - Fast Forward With Fast Data, Scala and Akka: Q/A with Spark Job Server creator
O’Reilly blog: Apache Cassandra for Analytics: A Performance and Storage Cost Analysis
O’Reilly Podcast Interview - Building a Scalable Platform for Streaming Updates and Analytics
Where you can find me
- Discord - @tahoe_fp
- Gitter/spark-jobserver