Launch HN: Prequel (YC W21) – Sync data to your customer’s data warehouse

thdxr · on Sept 26, 2022

Wow I've been looking for this for years! Always thought SaaS companies waste time building yet another mediocre analytics dashboard when they should just sync their data

My main thing is I don't want to think in terms of raw data from my database to the customer database, I have higher level API concepts.

Would be cool if there was some kind of sync protocol I could implement where prequel sent a request with a "last fetched" timestamp and the endpoint replied with all data to be updated.

Kind of like this: https://doc.replicache.dev/server-pull

ctc24 · on Sept 26, 2022

That’s actually pretty close to how we connect to APIs today, but I love the idea of streamlining it as a protocol. We’re still working on improving the dev experience, especially around APIs, and would love to chat more if you have more feedback!

evantahler · on Sept 26, 2022

Check out the Airbyte Protocol - this is exactly the kind of thing we made it for! https://docs.airbyte.com/understanding-airbyte/airbyte-proto...

soumyadeb · on Sept 26, 2022

Congrats on the launch. Very cool idea and always wondered why this hasn't been done before.

One question though - don't you see Snowflake (or the cloud data warehouse vendors) building this? Snowflake has to build native support for CDC from some production databases like Postgres, MySQL, Oracle etc. Once the data has landed in the SaaS vendor's Snowflake, it can be shared (well the relevant rows which a customer should have access to) with each customer.

Isn't that long term right solution? Or am I missing something here?

conormccarter · on Sept 26, 2022

Snowflake has made it really easy to share to other Snowflake instances, and the other major cloud data warehouses are working on similar warehouse specific features as well. This makes sense because it can be a major driver of growth for them. The way we see the space play out is that every cloud warehouse develops some version of same-vendor sharing, while neglecting competitive warehouse support.

Long term, we'd like to be the interoperable player focused purely on data sharing that plays nicely with any of the upstream sources, but also facilitates connecting to any data destination. (This also means we can spend more time building thoughtful interfaces – API and UI – for onboarding and managing source and destination connections)

soumyadeb · on Sept 26, 2022

Got it. Makes sense.

However, don't you think it makes sense for Snowflake to support "Replicate my Postgres/MySQL/Oracle" to Snowflake? Given how much they are investing in making it easier to get data into Snowflake.

conormccarter · on Sept 27, 2022

Oh, yeah, it probably does make sense for the warehouses to make that part easier, at least for the more popular transactional db choices. You may have seen Google/BigQuery recently announced their off the shelf replication service for Oracle and MySQL. As far as Prequel goes, we connect to either (db or data warehouse sources), so we're largely agnostic to how the data moves around internally before it gets sent to customers.

rco8786 · on Sept 27, 2022

There’s always someone bigger that could be building you’re idea. Can’t stop you.

However, what a great idea for snowflake just for the insane vendor lock-in they would get

sails · on Sept 26, 2022

This is excellent (the idea, I don't know anything about prequel!), and a much needed tool to support a reasonable trend. I fully support B2B companies taking on the responsibility to make data more readily available for analytics, beyond just exposing a fragile API.

For those unaware, this is a relatively recently established practice (direct to warehouse instead of via 3rd party ETL)

https://techcrunch.com/2022/09/15/salesforce-snowflake-partn...

https://stripe.com/en-gb-es/data-pipeline

wasd · on Sept 26, 2022

Awesome product.

1. Do you expect to support SQL Server? If so, do you know when?

2. Watched the Loom video. How should we handle multi-tenant data that requires a join? For example, let' say I want to send data specific school. The student would belong to a Teacher who belongs to a School.

conormccarter · on Sept 26, 2022

Thanks for watching and for the kind words! Re 1. – it’s definitely on the roadmap – we’re planning on getting to it in Q4/Q1, but we can move it up depending on customer need. Re 2. – for tables without a tenant ID column, we suggest creating a view on top of that table that performs the join to add the tenant column (e.g., "school_id") – it's a pretty common pre-req.

gourabmi · on Sept 26, 2022

How do you deal with incompatible data types between the source and destination systems ? For example, the source might have a timestamp with timezone data type and the destination could just support timestamp in UTC.

ctc24 · on Sept 26, 2022

That’s something that we spend a lot of time working on. Our general approach is to do both the simple and predictable thing (ie what format would I want to get the data in if I were the data team receiving it). For the specific example you give, we’d convert the timestamptz into a timestamp before writing it to the destination.

Another type where this comes up is JSON. Some warehouses support the type whereas others don’t. In those cases, we typically write the data to a text/varchar column instead.

The way it works under the hood is we’ve effectively written our own query planner. It takes in the source database flavor (eg Snowflake) and types, the destination database flavor, and figures out the happy path for handling all the specified types.

leetrout · on Sept 26, 2022

Love it.

As we continue to move toward more and more composable architectures combining SaaS an offering like this is going to really give your users a leg up.

Edit: deleted redundant question about SOC 2. I missed the whole paragraph on mobile

sv123 · on Sept 26, 2022

Very cool, we've had great success using Snowflake's sharing to give our customers access to their data... That obviously falls apart if the customer wants data in BigQuery or somewhere else.

ztratar · on Sept 26, 2022

One of the first Show HN's i've read where my first thought was "would invest based purely on the single sentence description"

Nice idea. Have fun executing!

buremba · on Sept 26, 2022

Congrats on your launch! What's the data source that I can add? Does it also need to be another database listed here (1)? Does that mean that I need to move the data to one of these databases in order to sync our customer data to their data warehouses?

(1) https://docs.prequel.co/reference/post_sources

conormccarter · on Sept 26, 2022

Thank you! And exactly right – Postgres, Snowflake, BigQuery, Redshift, and Databricks are the sources we support today. We also have some streaming sources (like Kafka) in beta with a couple pilot users. At this point, it's fairly negligible work for us to add support for new SQL based sources, so we can add new sources quickly as needed.

mfrye0 · on Sept 26, 2022

Very cool. We've just started exploring customer requests to sync our data to their warehouses, so great timing.

What sort of scale can you guys handle? One of our DBs is in the billions of rows.

I assume we'd need to potentially create custom sources for each destination as well? Or does your system automatically figure out the best "common" schema across all destinations? For example, an IP subnet column.

conormccarter · on Sept 26, 2022

Re: scale – we do handle billions of rows! As you can imagine, exact throughput depends on the source and destination database (as well as the width of the rows) but to give you a rough sense – on most warehouses, we can sync on the order of 5M rows per minute for each customer that you want to send data to. In practice, for a source with billions of rows, the initial backfill might take a few hours, and each incremental sync thereafter will be much faster. We can hook you up with a sandbox account if you want to run your own speed test!

Re: configuration – you would create a config file for each "source" table that you want to make available to customers, including which columns should be sent over. Then at the destination level, you can specify the subset of tables you'd like to sync. This could be a single common schema for all customers, or different schemas based on the products the customer uses.

rco8786 · on Sept 27, 2022

Oh man I love this idea. After working both on a team that utilized our own data warehouse extensively. And then recently went through an extensive private api integration with a third party. Being able to just sync data back and forth to each other’s warehouses solves a whole host of common partnership problems much more quickly and cheaply.

mardo5 · on Sept 27, 2022

Hey congarts on the launch ! One question comes to mind is about the T in ETL, ingestion tools like Fivetran and Stitch allows you to do light transformations on the incoming data. This process is usually done by the ingestion team that hands the data to different value teams.

ctc24 · on Sept 27, 2022

Thanks! In the case of traditional ETL tools, a lot of that transformation is usually schema cleaning, which is necessary because the data is extracted from an API and so needs a little love before it’s in analysis-ready shape (renaming columns, joining a couple tables, and so on). When the vendor is in charge of defining the data model, they know what format is most helpful to surface the data in and so can transform it to that format before the load (TEL?!).

In this model, it’s also still possible for the data team on the recipient side to further transform data before handing it to value teams. They can run dbt or whichever transform tool they use once it lands in their warehouse (so more like ELT, which is broadly what the industry is moving to).

r_thambapillai · on Sept 26, 2022

Would you guys support the inverse of this - sending data to your vendors? If you buy a SaaS tool that needs to ingest certain data from your data warehouse (or SaaS tools) into the vendors Snowflake / Data warehouse, would Prequel be worth looking at trying to achieve that with?

ctc24 · on Sept 26, 2022

As you can imagine, the underlying tech is pretty much the same – we’re still moving bits from one db/dwh to another – so we can use our existing transfer infra. The only difference is the UX and API we surface on top of it.

We got this request from a couple teams we work with, and so we have an alpha that’s live with them. It’ll probably go live in GA in Q4 or Q1 (and if you want access to it before then, drop us a line at hn (at) prequel.co!).

ianbutler · on Sept 26, 2022

Hey guys we spoke a while back, glad to see you’re still going at it! Good luck with everything.

conormccarter · on Sept 26, 2022

Hey Ian, great to hear from you – thanks for the note, and hope you're doing well!

yevpats · on Sept 26, 2022

It's an interesting take but I worry this might not be possible (for a lot of cases) because not everything is database backed and sometime you have logic behind the API and you actually want the result of that API in your data warehouses. Think AWS API for example, even their own AWS Config team is using the same AWS APIs because there is no one place where the data just resides (of-course it would be great if it was the case, it would make life very easy)

I think this is a good tweet explaining the problem that will never be solved - https://twitter.com/mattrickard/status/1542193426979909634

Full Disclaimer: Im the founder and original author of https://github.com/cloudquery/cloudquery

ctc24 · on Sept 26, 2022

Very much hear you about the amount of logic contained within API layers. That said, it actually hasn’t come up a lot so far. What we’re finding is that a lot of teams that have complex API-layer logic also end up landing their data in a data warehouse and replicating the API’s data model there. In those cases, we can use that as the source. For those that don’t have a dwh, we do actually support connecting to an API as a source, it just requires a bit more config and it’s less efficient and so we don’t promote it quite as much :).

As far as the tweet goes, there’s a couple points we might disagree with. One of the assumptions made is that the problem is solved by an outsourced third-party. That’s actually exactly what we want to change! We think the problem should be solved by a first-party (the software vendor), and we want to give them the tools to accomplish that. Another assumption in there is that companies who charge for dashboards wouldn’t do this because they’d lose money. If they charge for dashboards, they can also choose to charge for data exports.

yevpats · on Sept 26, 2022

I very much hope it will work for you and it will succeed. This can be def an amazing turning point in data integration and ELT world.

I do still have hard time understanding how this is technically feasible because let's say a company have their data stored in PostgreSQL but the data model their is not exactly as in their API because they are doing some minimal transformation and then expose it to the user. How the user will know what to expect the new data model is not documented at all and they usually want what is documented and exposed via the API. so seems you still have to go via the API and if you are there already then your are in the ELT space.

Per your comment that it falls into the user and it should fall on the vendor - I agree but if it's a matter in the ELT space then they should just maintain a plugin to CloudQuery (https://github.com/cloudquery/cloudquery) or AirByte or something similar. Similarly how vendors maintain terraform or pulumi plugins.

cayleyh · on Sept 26, 2022

The API-based part sounds a lot link https://www.singer.io, the API-based data ETL tooling Stitch Data developed -- is that accurate?

ctc24 · on Sept 26, 2022

There’s definitely some similarities (both are generic-ish ways to get data out of an API). As mentioned in another comment, we’re still deciding whether to adopt an existing protocol or roll our own for those API connections. We’ve done the latter so far but it’s a work in progress.

aerzen · on Sept 28, 2022

Do you know about name clash with PRQL [1]?

While it doesn't have the same spelling, pronunciation is the same.

[1] https://github.com/prql/prql

blakeburch · on Sept 26, 2022

Love the idea! I really see the value in shifting the conversation towards the vendor themselves being responsible for pushing the data to customers and it makes a lot of sense to do it directly from DB -> DB.

However, building a data product myself (Shipyard), we really try to encourage the idea of "connecting every data touchpoint together" so you can get an end-to-end view of how data is used and prevent downstream issues from ever occurring. This raised a few questions:

1. If the vendor owns the process of when the data gets delivered, how would a data team be able to have their pipelines react to the completion or failure of that specific vendor's delivery? Or does the ingestion process just become more of a black box?

While relying on a 3rd party ingestion platform or running ingestion scripts on your own orchestration platform isn't ideal, it at least centralizes the observability of ongoing ingestion processes into a single location.

2. From a business perspective, do you see a tool like Prequel encouraging businesses to restrict their data exports behind their own paywall rather than making the data accessible via external APIs?

--

Would love to connect and chat more if you're interested! Contact is in bio.

ctc24 · on Sept 27, 2022

1. I think in practice, people are already using a mix of sources for ingestion today. It’s rare that a data team would rely on a single tool to ingest all their data – instead, they might get some data via one or more ETL tools, some data via a script they wrote themselves, and some other data from their own db. So in that regard, I don’t think a world where the vendor provides the data pipeline makes this a lot more complex.

One way we’re hoping to pre-empt some of this is by helping vendors to surface more observability primitives in the schema that they write data to. To give an example: Prequel writes a _transfer_status table in each destination with some metadata about the last time data was updated. The goal there is to decouple the means of moving data with the observability piece.

We can also help vendors expose hooks that people’s data pipelines & observability tools plug into (think webhooks and the like).

2. We don’t really – anecdotally, companies that offer data warehouse syncs tend to be pretty focused on providing a great user-experience. At the end of the day, that decision is pretty much entirely with the business. We see a pretty wide range today: some teams choose to make exports widely available, some choose to reserve it for their pro or enterprise tier, and some choose to sell it as a standalone SKU. It’s pretty similar to what already happens with APIs designed for data exports.

Would love to continue the convo and hear more about your take on obs! Sending you a note now.

dwiner · on Sept 26, 2022

Brilliant idea. Can see many use cases for this with our company. Congrats on the launch!

mrwnmonm · on Sept 26, 2022

Since a lot of services do integrations these days, I wonder are there some common connectors they are using?

sgammon · on Sept 26, 2022

This is very cool. I applied! We have a use for it now (we're starting a B2B thing). Good luck!

conormccarter · on Sept 26, 2022

Just saw and sent you a note – would love to hear more!

publiccomps · on Oct 6, 2022

This looks awesome!