Aggregate streaming data in real-time with WebAssembly

nielsbot · on Aug 24, 2021

This feature ("Aggregations for Smart Streams") sounds like what I'd normally call "reduce":

"Aggregates let you define functions that combine each record in a stream with some long-running state, or 'accumulator'."

nicholastmosher · on Aug 24, 2021

Yes, this is very similar to "reduce" from the functional programming world; in fact, it is equivalent to the "fold" pattern. The main difference is that fold is slightly more flexible, since your accumulator may have a different type than your stream elements.

In Rusty pseudocode, reduce requires a function with two inputs and an output of the same type:

fn reduce<T>(f: Fn(T, T) -> T)

Whereas fold may use one type for the accumulator and another type for the elements, but requires an initial accumulator value to be given explicitly:

fn fold<A, T>(init: A, f: Fn(A, T) -> A)
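
To make the difference concrete, here is a minimal runnable sketch using the standard library's `Iterator::reduce` and `Iterator::fold`. The helper names `sum_with_reduce` and `join_with_fold` are made up for illustration and do not come from the blog post.

```rust
// Sum a stream of integers with `reduce`: the accumulator and the elements
// share one type, and an empty stream yields `None`.
fn sum_with_reduce(items: &[i32]) -> Option<i32> {
    items.iter().copied().reduce(|acc, x| acc + x)
}

// Build a comma-separated string with `fold`: the accumulator (String) has a
// different type than the elements (i32), so an initial value is required.
fn join_with_fold(items: &[i32]) -> String {
    items.iter().fold(String::new(), |mut acc, x| {
        if !acc.is_empty() {
            acc.push(',');
        }
        acc.push_str(&x.to_string());
        acc
    })
}

fn main() {
    let items = [1, 2, 3, 4];
    println!("{:?}", sum_with_reduce(&items)); // Some(10)
    println!("{}", join_with_fold(&items)); // 1,2,3,4
    println!("{:?}", sum_with_reduce(&[])); // None
}
```

Note that `reduce` has no initial value to fall back on, which is why it returns an `Option`, while `fold` always returns the accumulator type.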

The Aggregate SmartStreams discussed in the blog follow this fold pattern, applied to a distributed persistent log as the stream and using WebAssembly modules as the functions.
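
As a rough illustration of that fold pattern applied to a stream of records, here is a self-contained sketch. It is not Fluvio's actual API: the function names, the byte-slice signature, the choice to treat unparsable values as 0, and the assumption that the running total is emitted after every record are all hypothetical choices made for this example.

```rust
// Hypothetical aggregate function, NOT Fluvio's real API. Both the accumulator
// and the record are raw bytes holding a UTF-8 decimal integer.
fn aggregate_sum(accumulator: &[u8], record: &[u8]) -> Vec<u8> {
    // Assumption: an empty or unparsable value counts as 0, so one bad record
    // does not stop the stream.
    let parse = |bytes: &[u8]| -> i64 {
        std::str::from_utf8(bytes)
            .ok()
            .and_then(|s| s.trim().parse::<i64>().ok())
            .unwrap_or(0)
    };
    (parse(accumulator) + parse(record)).to_string().into_bytes()
}

// Apply the aggregate to each record in order, collecting the accumulator
// after every step (assumed here to be what a streaming aggregate outputs).
fn running_totals(records: &[&[u8]]) -> Vec<String> {
    let mut accumulator: Vec<u8> = Vec::new(); // initial (empty) state
    let mut outputs = Vec::new();
    for record in records {
        accumulator = aggregate_sum(&accumulator, record);
        outputs.push(String::from_utf8_lossy(&accumulator).into_owned());
    }
    outputs
}

fn main() {
    let records: [&[u8]; 3] = [b"10", b"5", b"7"];
    println!("{:?}", running_totals(&records)); // ["10", "15", "22"]
}
```

The accumulator here has the same byte representation as the records, but nothing in the fold shape requires that; it could just as well be a serialized struct.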

nielsbot · on Aug 24, 2021

I didn't know about "fold"... In Swift it's still called `reduce` even if your accumulator type is arbitrary:

https://developer.apple.com/documentation/swift/array/229868...

ram_rar · on Aug 24, 2021

Finally, we are beginning to see some real back-end applications of wasm apart from Envoy proxy. This seems very similar to Apache Storm [1], where users can define UDFs (user-defined functions) on their streams.

Although, I don't understand what the value add of wasm is (apart from security) if the user still has to write code in Rust and compile it to wasm. Why not just execute it in Rust alone?

[1] https://storm.apache.org/

sehz · on Aug 24, 2021

Until now, containers were the only way to provide an isolation boundary, and that boundary sits at the process level. With WASM, we can provide very fine-grained isolation and execution control.

You can compile almost any language to WASM, not just Rust. For example, Python, Go, JavaScript: https://github.com/appcypher/awesome-wasm-langs.

ahunyady · on Aug 24, 2021

As mentioned in a prior reply, the product has 3 components, where WebAssembly is the programmability part. In short, WebAssembly gives us the ability to work on the data streams in real-time as the data hits the cluster (we call this data gravity). That allows us to process records within milliseconds. That being said, we are happy to work with the Storm community on a connector if there is such demand.

jgraettinger1 · on Aug 25, 2021

A similar strategy we [1] are pursuing is to use `reduce` JSON Schema annotations to make it easy to define generalized reductions over your JSON data types [2].

Taking the examples from the article, equivalent schemas might be:

  type: object
  properties:
    mySum:
      type: number
      reduce: { strategy: sum }

Or even:

  type: object
  properties:
    myDeeplyAggregatedMap:
      type: object
      reduce: { strategy: merge }
      additionalProperties:
        reduce: { strategy: sum }
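
One way to read the nested schema above, sketched in Rust: this assumes `merge` unions the keys of two objects and `sum` adds the values when a key appears in both. The function name `merge_sum` and these semantics are illustrative assumptions, not Estuary's implementation.

```rust
use std::collections::HashMap;

// Assumed semantics: `merge` unions the keys of two objects, and `sum` adds
// the values when the same key appears on both sides.
fn merge_sum(
    mut left: HashMap<String, f64>,
    right: HashMap<String, f64>,
) -> HashMap<String, f64> {
    for (key, value) in right {
        *left.entry(key).or_insert(0.0) += value;
    }
    left
}

fn main() {
    let a = HashMap::from([("x".to_string(), 1.0), ("y".to_string(), 2.0)]);
    let b = HashMap::from([("y".to_string(), 3.0), ("z".to_string(), 4.0)]);
    let merged = merge_sum(a, b);
    println!("{:?}", merged.get("y")); // Some(5.0)
}
```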

Use of WASM is really interesting. We've been exploring it as a means for powering user-defined reduction strategies, in cases where the built-in strategies are insufficient.

[1] https://estuary.dev

[2] https://docs.estuary.dev/reference/catalog-reference/schemas...

alexchamberlain · on Aug 24, 2021

Why use WASM here? Security? Apologies if I missed that in the post.

nicholastmosher · on Aug 24, 2021

No worries, it wasn't mentioned in this post in particular :)

Security is certainly one of the reasons to use WASM. Because it runs in a sandbox, untrusted user code can be uploaded to Fluvio's Streaming Processing Units and do the processing inline, rather than on the client side. This can save a lot of network bandwidth, especially with a dataset where filtering cuts the volume down substantially.

Other reasons include that WASM is a fast and portable bytecode format and that there is very good tooling and support for compiling Rust to WASM as well as embedding WASM runtimes in Rust, which works well for Fluvio as it's written in Rust.

Here's another post with a bit more detail about some of the design and motivational factors if you're interested: https://www.infinyon.com/blog/2021/06/introducing-fluvio/#fl...

rad_gruchalski · on Aug 24, 2021

This very much sounds like Storm with bolts in wasm. I am not really sure why there is so much focus on the technology used for this product rather than what it actually does.

ahunyady · on Aug 24, 2021

This blog focuses on a small piece of technology. The product has 3 core components: immutable stores, data streaming, and programmability. The goal of the product is to make data streaming easily accessible to all engineers. The cluster is easy to roll out, has a powerful CLI, and covers multiple use cases from log aggregation to data cleansing.

BenoitP · on Aug 24, 2021

Interesting! I'd love to see the SQL version of that though.

CQRS reactive patterns with Flink or Spark computing a result to be sent to the client could benefit from it: you could decide to move some aggregation client-side in the same business language.

sehz · on Aug 24, 2021

Yes, we are thinking of supporting SQL

tomrod · on Aug 24, 2021

Seconded, would love to see it support SQL. A subset of PostgreSQL would go super far.

kumarski · on Aug 24, 2021

wasmer.io anyone?