Launch HN: Prequel (YC W21) – Sync data to your customer’s data warehouse
118 points by ctc24 on Sept 26, 2022 | 40 comments
Hey HN! We’re Conor and Charles from Prequel (https://prequel.co). We make it easy for B2B companies to send data to their customers. Specifically, we help companies sync data directly to their customer's data warehouse, on an ongoing basis.

We’re building Prequel because we think the current ETL paradigm isn’t quite right. Today, it’s still hard to get data out of SaaS tools: customers have to write custom code to scrape APIs, or procure third-party tools like Fivetran to get access to their data. In other words, the burden of data exports is on the customer.

We think this is backwards! Instead, vendors should make it seamless for their customers to export data to their data warehouse. Not only does this make the customer’s life easier, it benefits the vendor too: they now have a competitive advantage, and they get to generate new revenue if they choose to charge for the feature. This approach is becoming more popular: companies like Stripe, Segment, Heap, and most recently Salesforce offer some flavor of this capability to their customers.

However, just as it doesn’t make sense for each customer to write their own API-scraping code, it doesn’t make sense for every SaaS company to build their own sync-to-customer-warehouse system. That’s where Prequel comes in. We give SaaS companies the infrastructure they need to easily connect to their customers’ data warehouses, start writing data to them, and keep that data updated on an ongoing basis. Here's a quick demo: https://www.loom.com/share/da181d0c83e44ef9b8c5200fa850a2fd.

Prequel takes less than an hour to set up: you (the SaaS vendor) connect Prequel to your source database/warehouse, configure your data model (aka which tables to sync), and that’s pretty much it. After that, your customers can connect their database/warehouse and start receiving their data in a matter of minutes. All of this can be done through our API or in our admin UI.
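
To give a rough flavor of the API-driven path, here's a toy sketch in Go of registering a source programmatically. The base URL, auth header, and request fields below are placeholders for illustration only, not our actual API schema (that lives in our docs):

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    // NOTE: base URL, auth scheme, and field names are illustrative
    // placeholders, not Prequel's documented request schema.
    const baseURL = "https://api.example-prequel-host.invalid"

    // sourceConfig is a guess at the shape of a "create source" request.
    type sourceConfig struct {
        Name     string `json:"name"`
        Vendor   string `json:"vendor"` // e.g. "postgres", "snowflake"
        Host     string `json:"host"`
        Database string `json:"database"`
        Username string `json:"username"`
        Password string `json:"password"`
    }

    func main() {
        body, _ := json.Marshal(sourceConfig{
            Name:     "production-replica",
            Vendor:   "postgres",
            Host:     "db.internal.example.com",
            Database: "app",
            Username: "prequel_reader",
            Password: os.Getenv("SOURCE_DB_PASSWORD"),
        })
        req, _ := http.NewRequest(http.MethodPost, baseURL+"/sources", bytes.NewReader(body))
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("Authorization", "Bearer "+os.Getenv("PREQUEL_API_KEY"))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }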

Moving all this data accurately and in a timely manner is a nontrivial technical problem. We potentially have to transfer billions of rows / terabytes of data per day, while guaranteeing that transfers are completely accurate. Since companies might use this data to drive business decisions or in financial reporting, we really can't afford to miss a single row.

There are a few things that make this particularly tricky. Each data warehouse speaks a slightly different dialect of SQL and has a different type system (which is not always well documented, as we've come to learn!). Each warehouse also has slightly different ingest characteristics (for example, Redshift has a hard cap of 16MB on any statement), meaning you need different data loading strategies to optimize throughput. Finally, most of the source databases we read data from are multi-tenant — meaning they contain data from multiple end customers, and part of our job is to make sure that the right data gets routed to the right customer. Again, it's pretty much mission-critical that we don't get this wrong, not even once.
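
To make "different loading strategies" a bit more concrete, here's a toy sketch (not our production code) of one such strategy: packing pre-escaped row tuples into as few multi-row INSERT statements as possible while staying under a per-statement byte cap like Redshift's:

    package loader

    import (
        "fmt"
        "strings"
    )

    // maxStatementBytes mirrors a warehouse cap such as Redshift's 16MB
    // limit on a single statement, with a little headroom left over.
    const maxStatementBytes = 15 * 1024 * 1024

    // buildInsertBatches packs pre-escaped VALUES tuples into as few
    // INSERT statements as possible without exceeding the byte cap.
    // The table name and tuples are assumed to be sanitized upstream.
    func buildInsertBatches(table string, tuples []string) []string {
        prefix := fmt.Sprintf("INSERT INTO %s VALUES ", table)
        var statements, batch []string
        size := len(prefix)

        for _, t := range tuples {
            // The +1 accounts for the comma between tuples.
            if len(batch) > 0 && size+len(t)+1 > maxStatementBytes {
                statements = append(statements, prefix+strings.Join(batch, ","))
                batch, size = nil, len(prefix)
            }
            batch = append(batch, t)
            size += len(t) + 1
        }
        if len(batch) > 0 {
            statements = append(statements, prefix+strings.Join(batch, ","))
        }
        return statements
    }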

As a result, we've invested in extensive testing a lot earlier than it makes sense for most startups to. We also tend to write code fairly defensively: we always try to think about the ways in which our code could fail (or anticipate what bugs might be introduced in the future), and make sure that the failure path is as innocuous as possible. Our backend is written in Go, our frontend is in React + Typescript (we're big fans of compiled languages!), we use Postgres as our application db, and we run the infra on Kubernetes.

The last piece we'll touch on is security and privacy. Since we're in the business of moving customer data, we know that security and privacy are paramount. We're SOC 2 Type II certified, and we go through annual white-box pentests to make sure that all our code is up to snuff. We also offer on-prem deployments, so data never has to touch our servers if our customers don't want it to.

It's kind of surreal to launch on here – we’re long time listeners, first time callers, and have been surfing HN since long before we first started dreaming about starting a company. Thanks for having us, and we're happy to answer any questions you may have! If you wanna take the product for a spin, you can sign up on our website or drop us a line at hn (at) prequel.co. We look forward to your comments!



Wow I've been looking for this for years! Always thought SaaS companies waste time building yet another mediocre analytics dashboard when they should just sync their data

My main thing is I don't want to think in terms of raw data from my database to the customer database, I have higher level API concepts.

Would be cool if there was some kind of sync protocol I could implement where prequel sent a request with a "last fetched" timestamp and the endpoint replied with all data to be updated.

Kind of like this: https://doc.replicache.dev/server-pull
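
Something like this rough sketch, maybe (the endpoint shape, parameter name, and Row type are all made up, loosely modeled on Replicache's pull):

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
        "time"
    )

    // Row stands in for whatever higher-level API object the vendor exposes.
    type Row struct {
        ID        string    `json:"id"`
        UpdatedAt time.Time `json:"updated_at"`
        Payload   any       `json:"payload"`
    }

    // fetchUpdatedSince would query the vendor's own store; stubbed here.
    func fetchUpdatedSince(since time.Time) []Row { return nil }

    // pullHandler: the sync engine sends a last-fetched timestamp and the
    // vendor replies with everything updated since, plus a new cursor.
    func pullHandler(w http.ResponseWriter, r *http.Request) {
        since, err := time.Parse(time.RFC3339, r.URL.Query().Get("last_fetched"))
        if err != nil {
            http.Error(w, "invalid last_fetched", http.StatusBadRequest)
            return
        }
        resp := struct {
            Rows        []Row     `json:"rows"`
            LastFetched time.Time `json:"last_fetched"`
        }{fetchUpdatedSince(since), time.Now().UTC()}
        _ = json.NewEncoder(w).Encode(resp)
    }

    func main() {
        http.HandleFunc("/sync/pull", pullHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }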


That’s actually pretty close to how we connect to APIs today, but I love the idea of streamlining it as a protocol. We’re still working on improving the dev experience, especially around APIs, and would love to chat more if you have more feedback!


Check out the Airbyte Protocol - this is exactly the kind of thing we made it for! https://docs.airbyte.com/understanding-airbyte/airbyte-proto...


Congrats on the launch. Very cool idea and always wondered why this hasn't been done before.

One question though - don't you see Snowflake (or the other cloud data warehouse vendors) building this? Snowflake would just have to build native support for CDC from common production databases like Postgres, MySQL, Oracle, etc. Once the data has landed in the SaaS vendor's Snowflake, it can be shared (well, the relevant rows which a customer should have access to) with each customer.

Isn't that the right long-term solution? Or am I missing something here?


Snowflake has made it really easy to share to other Snowflake instances, and the other major cloud data warehouses are working on similar warehouse-specific features as well. This makes sense because it can be a major driver of growth for them. The way we see the space playing out is that every cloud warehouse develops some version of same-vendor sharing, while neglecting support for competing warehouses.

Long term, we'd like to be the interoperable player focused purely on data sharing that plays nicely with any of the upstream sources, but also facilitates connecting to any data destination. (This also means we can spend more time building thoughtful interfaces – API and UI – for onboarding and managing source and destination connections)


Got it. Makes sense.

However, don't you think it makes sense for Snowflake to support "Replicate my Postgres/MySQL/Oracle" to Snowflake? Given how much they are investing in making it easier to get data into Snowflake.


Oh, yeah, it probably does make sense for the warehouses to make that part easier, at least for the more popular transactional db choices. You may have seen Google/BigQuery recently announced their off the shelf replication service for Oracle and MySQL. As far as Prequel goes, we connect to either (db or data warehouse sources), so we're largely agnostic to how the data moves around internally before it gets sent to customers.


There’s always someone bigger that could be building your idea. Can’t let that stop you.

However, what a great idea for Snowflake, just for the insane vendor lock-in they would get.


This is excellent (the idea, I don't know anything about prequel!), and a much needed tool to support a reasonable trend. I fully support B2B companies taking on the responsibility to make data more readily available for analytics, beyond just exposing a fragile API.

For those unaware, this is a relatively recently established practice (direct to warehouse instead of via 3rd party ETL)

https://techcrunch.com/2022/09/15/salesforce-snowflake-partn...

https://stripe.com/en-gb-es/data-pipeline


Awesome product.

1. Do you expect to support SQL Server? If so, do you know when?

2. Watched the Loom video. How should we handle multi-tenant data that requires a join? For example, let's say I want to send data specific to a school. The student would belong to a Teacher, who belongs to a School.


Thanks for watching and for the kind words! Re 1. – it’s definitely on the roadmap – we’re planning on getting to it in Q4/Q1, but we can move it up depending on customer need. Re 2. – for tables without a tenant ID column, we suggest creating a view on top of that table that performs the join to add the tenant column (e.g., "school_id") – it's a pretty common pre-req.
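
Sketching that out with the school example (the table and column names — students, teachers, schools, school_id — are purely illustrative):

    package tenantviews

    import (
        "context"
        "database/sql"
    )

    // Table and column names come from the example above and are
    // purely illustrative.
    const createStudentsWithTenant = `
    CREATE OR REPLACE VIEW students_with_tenant AS
    SELECT s.*, sc.id AS school_id
    FROM students s
    JOIN teachers t ON t.id = s.teacher_id
    JOIN schools sc ON sc.id = t.school_id`

    // EnsureTenantView stamps each student row with the tenant (school) it
    // belongs to, so the sync layer can filter and route on school_id.
    func EnsureTenantView(ctx context.Context, db *sql.DB) error {
        _, err := db.ExecContext(ctx, createStudentsWithTenant)
        return err
    }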


How do you deal with incompatible data types between the source and destination systems? For example, the source might have a timestamp-with-timezone data type and the destination might only support timestamp in UTC.


That’s something that we spend a lot of time working on. Our general approach is to do the simple and predictable thing (i.e., what format would I want to get the data in if I were the data team receiving it). For the specific example you give, we’d convert the timestamptz into a timestamp before writing it to the destination.

Another type where this comes up is JSON. Some warehouses support the type whereas others don’t. In those cases, we typically write the data to a text/varchar column instead.

The way it works under the hood is we’ve effectively written our own query planner. It takes in the source database flavor (eg Snowflake) and types, the destination database flavor, and figures out the happy path for handling all the specified types.
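
As a toy illustration of that idea (this is nowhere near our actual planner, and the mapping table is heavily simplified):

    package typemap

    import "fmt"

    // Destination identifies the warehouse flavor we're writing to.
    type Destination string

    const (
        Snowflake Destination = "snowflake"
        BigQuery  Destination = "bigquery"
        Redshift  Destination = "redshift"
    )

    // destinationType picks a column type for a canonical source type,
    // falling back to text when the destination lacks native support
    // (e.g. JSON on warehouses without a JSON/VARIANT type).
    func destinationType(sourceType string, dest Destination) (string, error) {
        switch sourceType {
        case "timestamptz":
            // Normalize to a plain UTC timestamp everywhere.
            return "TIMESTAMP", nil
        case "json":
            switch dest {
            case Snowflake:
                return "VARIANT", nil
            case BigQuery:
                return "JSON", nil
            default:
                return "VARCHAR", nil
            }
        case "text":
            if dest == BigQuery {
                return "STRING", nil
            }
            return "VARCHAR", nil
        default:
            return "", fmt.Errorf("unmapped source type %q", sourceType)
        }
    }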


Love it.

As we continue to move toward more and more composable architectures combining SaaS, an offering like this is really going to give your users a leg up.

Edit: deleted redundant question about SOC 2. I missed the whole paragraph on mobile


Very cool, we've had great success using Snowflake's sharing to give our customers access to their data... That obviously falls apart if the customer wants data in BigQuery or somewhere else.


One of the first Show HN's i've read where my first thought was "would invest based purely on the single sentence description"

Nice idea. Have fun executing!


Congrats on your launch! What's the data source that I can add? Does it also need to be another database listed here (1)? Does that mean that I need to move the data to one of these databases in order to sync our customer data to their data warehouses?

(1) https://docs.prequel.co/reference/post_sources


Thank you! And exactly right – Postgres, Snowflake, BigQuery, Redshift, and Databricks are the sources we support today. We also have some streaming sources (like Kafka) in beta with a couple pilot users. At this point, it's fairly negligible work for us to add support for new SQL based sources, so we can add new sources quickly as needed.


Very cool. We've just started exploring customer requests to sync our data to their warehouses, so great timing.

What sort of scale can you guys handle? One of our DBs is in the billions of rows.

I assume we'd need to potentially create custom sources for each destination as well? Or does your system automatically figure out the best "common" schema across all destinations? For example, an IP subnet column.


Re: scale – we do handle billions of rows! As you can imagine, exact throughput depends on the source and destination database (as well as the width of the rows) but to give you a rough sense – on most warehouses, we can sync on the order of 5M rows per minute for each customer that you want to send data to. In practice, for a source with billions of rows, the initial backfill might take a few hours, and each incremental sync thereafter will be much faster. We can hook you up with a sandbox account if you want to run your own speed test!
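
Back-of-the-envelope, for a hypothetical 2B-row table at that ~5M rows/minute rate:

    2,000,000,000 rows ÷ 5,000,000 rows/minute = 400 minutes ≈ 6.7 hours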

Re: configuration – you would create a config file for each "source" table that you want to make available to customers, including which columns should be sent over. Then at the destination level, you can specify the subset of tables you'd like to sync. This could be a single common schema for all customers, or different schemas based on the products the customer uses.


Oh man I love this idea. I've worked on a team that utilized our own data warehouse extensively, and recently went through an extensive private API integration with a third party. Being able to just sync data back and forth between each other’s warehouses solves a whole host of common partnership problems much more quickly and cheaply.


Hey, congrats on the launch! One question that comes to mind is about the T in ETL: ingestion tools like Fivetran and Stitch allow you to do light transformations on the incoming data. This process is usually done by the ingestion team that hands the data off to different value teams.


Thanks! In the case of traditional ETL tools, a lot of that transformation is usually schema cleaning, which is necessary because the data is extracted from an API and so needs a little love before it’s in analysis-ready shape (renaming columns, joining a couple tables, and so on). When the vendor is in charge of defining the data model, they know what format is most helpful to surface the data in and so can transform it to that format before the load (TEL?!).

In this model, it’s also still possible for the data team on the recipient side to further transform data before handing it to value teams. They can run dbt or whichever transform tool they use once it lands in their warehouse (so more like ELT, which is broadly what the industry is moving to).


Would you guys support the inverse of this - sending data to your vendors? If you buy a SaaS tool that needs to ingest certain data from your data warehouse (or SaaS tools) into the vendor's Snowflake / data warehouse, would Prequel be worth looking at to achieve that?


As you can imagine, the underlying tech is pretty much the same – we’re still moving bits from one db/dwh to another – so we can use our existing transfer infra. The only difference is the UX and API we surface on top of it.

We got this request from a couple teams we work with, and so we have an alpha that’s live with them. It’ll probably go live in GA in Q4 or Q1 (and if you want access to it before then, drop us a line at hn (at) prequel.co!).


Hey guys we spoke a while back, glad to see you’re still going at it! Good luck with everything.


Hey Ian, great to hear from you – thanks for the note, and hope you're doing well!


It's an interesting take, but I worry this might not be possible (for a lot of cases) because not everything is database-backed, and sometimes you have logic behind the API and you actually want the result of that API in your data warehouse. Think the AWS API, for example: even their own AWS Config team uses the same AWS APIs because there is no one place where the data just resides (of course it would be great if that were the case; it would make life very easy).

I think this is a good tweet explaining the problem that will never be solved - https://twitter.com/mattrickard/status/1542193426979909634

Full disclosure: I'm the founder and original author of https://github.com/cloudquery/cloudquery


Very much hear you about the amount of logic contained within API layers. That said, it actually hasn’t come up a lot so far. What we’re finding is that a lot of teams that have complex API-layer logic also end up landing their data in a data warehouse and replicating the API’s data model there. In those cases, we can use that as the source. For those that don’t have a dwh, we do actually support connecting to an API as a source; it just requires a bit more config and it’s less efficient, so we don’t promote it quite as much :).

As far as the tweet goes, there’s a couple points we might disagree with. One of the assumptions made is that the problem is solved by an outsourced third-party. That’s actually exactly what we want to change! We think the problem should be solved by a first-party (the software vendor), and we want to give them the tools to accomplish that. Another assumption in there is that companies who charge for dashboards wouldn’t do this because they’d lose money. If they charge for dashboards, they can also choose to charge for data exports.


I very much hope it will work for you and it will succeed. This can be def an amazing turning point in data integration and ELT world.

I do still have a hard time understanding how this is technically feasible. Let's say a company has their data stored in PostgreSQL, but the data model there is not exactly what's in their API because they do some minimal transformation and then expose it to the user. How will the user know what to expect? The new data model is not documented at all, and they usually want what is documented and exposed via the API. So it seems you still have to go via the API, and if you're there already then you're in the ELT space.

Per your comment that it currently falls on the user and should fall on the vendor - I agree, but if it's a matter of the ELT space then they should just maintain a plugin for CloudQuery (https://github.com/cloudquery/cloudquery) or Airbyte or something similar, similar to how vendors maintain Terraform or Pulumi plugins.


The API-based part sounds a lot like https://www.singer.io, the API-based data ETL tooling Stitch Data developed -- is that accurate?


There’s definitely some similarities (both are generic-ish ways to get data out of an API). As mentioned in another comment, we’re still deciding whether to adopt an existing protocol or roll our own for those API connections. We’ve done the latter so far but it’s a work in progress.


Do you know about the name clash with PRQL [1]?

While it doesn't have the same spelling, the pronunciation is the same.

[1] https://github.com/prql/prql


Love the idea! I really see the value in shifting the conversation towards the vendor themselves being responsible for pushing the data to customers and it makes a lot of sense to do it directly from DB -> DB.

However, building a data product myself (Shipyard), we really try to encourage the idea of "connecting every data touchpoint together" so you can get an end-to-end view of how data is used and prevent downstream issues from ever occurring. This raised a few questions:

1. If the vendor owns the process of when the data gets delivered, how would a data team be able to have their pipelines react to the completion or failure of that specific vendor's delivery? Or does the ingestion process just become more of a black box?

While relying on a 3rd party ingestion platform or running ingestion scripts on your own orchestration platform isn't ideal, it at least centralizes the observability of ongoing ingestion processes into a single location.

2. From a business perspective, do you see a tool like Prequel encouraging businesses to restrict their data exports behind their own paywall rather than making the data accessible via external APIs?

--

Would love to connect and chat more if you're interested! Contact is in bio.


1. I think in practice, people are already using a mix of sources for ingestion today. It’s rare that a data team would rely on a single tool to ingest all their data – instead, they might get some data via one or more ETL tools, some data via a script they wrote themselves, and some other data from their own db. So in that regard, I don’t think a world where the vendor provides the data pipeline makes this a lot more complex.

One way we’re hoping to pre-empt some of this is by helping vendors surface more observability primitives in the schema that they write data to. To give an example: Prequel writes a _transfer_status table in each destination with some metadata about the last time data was updated. The goal there is to decouple the means of moving data from the observability piece.
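
For instance, a downstream job could gate on that table before running. In this sketch the column names (table_name, last_synced_at) and the placeholder style are illustrative assumptions, not our documented schema:

    package obs

    import (
        "context"
        "database/sql"
        "time"
    )

    // freshEnough gates a downstream job on the vendor-written
    // _transfer_status table. The column names (table_name, last_synced_at)
    // and the $1 placeholder style are assumptions for illustration.
    func freshEnough(ctx context.Context, db *sql.DB, table string, maxLag time.Duration) (bool, error) {
        var lastSynced time.Time
        err := db.QueryRowContext(ctx,
            `SELECT last_synced_at FROM _transfer_status WHERE table_name = $1`,
            table,
        ).Scan(&lastSynced)
        if err != nil {
            return false, err
        }
        return time.Since(lastSynced) <= maxLag, nil
    }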

We can also help vendors expose hooks that people’s data pipelines & observability tools plug into (think webhooks and the like).

2. We don’t really – anecdotally, companies that offer data warehouse syncs tend to be pretty focused on providing a great user-experience. At the end of the day, that decision is pretty much entirely with the business. We see a pretty wide range today: some teams choose to make exports widely available, some choose to reserve it for their pro or enterprise tier, and some choose to sell it as a standalone SKU. It’s pretty similar to what already happens with APIs designed for data exports.

Would love to continue the convo and hear more about your take on obs! Sending you a note now.


Brilliant idea. Can see many use cases for this with our company. Congrats on the launch!


Since a lot of services do integrations these days, I wonder whether there are some common connectors they're using?


This is very cool. I applied! We have a use for it now (we're starting a B2B thing). Good luck!


Just saw and sent you a note – would love to hear more!


This looks awesome!



