I continue to be surprised that in these discussions correctness is treated as some optional highest possible level of quality, not the only reasonable state.
Suppose we're talking about multiplayer game networking, where the central store receives torrents of UDP packets and it is assumed that like half of them will never arrive. It doesn't make sense to view this as "we don't care about the player's actual position". We do. The system just has tolerances for how often the updates must be communicated successfully. Lost packets do not make the system incorrect.
A soft-realtime multiplayer game is always incorrect (unless no one is moving).
There are various decisions the netcode can make about how to reconcile with this incorrectness, and different games make different tradeoffs.
For example in hitscan FPS games, when two players fatally shoot one another at the same time, some games will only process the first packet received, and award the kill to that player, while other games will allow kill trading within some time window.
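To make the two policies concrete, here is a minimal sketch; the 50 ms window and all names (TRADE_WINDOW_MS, Hit, resolve_kills) are invented for illustration, not taken from any real engine.

    # Hedged sketch: resolving two fatal hits under "first packet wins" vs. kill trading.
    from dataclasses import dataclass

    TRADE_WINDOW_MS = 50  # fatal hits arriving this close together count as a trade

    @dataclass
    class Hit:
        shooter: str
        target: str
        server_ms: int  # arrival time at the authoritative server

    def resolve_kills(first: Hit, second: Hit, allow_trades: bool) -> list[str]:
        """Return the players credited with a kill, in arrival order."""
        if allow_trades and second.server_ms - first.server_ms <= TRADE_WINDOW_MS:
            return [first.shooter, second.shooter]  # both shots count: a kill trade
        return [first.shooter]                      # only the first packet received wins

    a, b = Hit("alice", "bob", 1000), Hit("bob", "alice", 1030)
    print(resolve_kills(a, b, allow_trades=False))  # ['alice']
    print(resolve_kills(a, b, allow_trades=True))   # ['alice', 'bob']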
A tolerance is just an amount of incorrectness that the designer of the system can accept.
When it comes to CRUD apps using read-replicas, so long as the designer of the system is aware of and accepts the consistency errors that will sometimes occur, does that make that system correct?
If you’re live streaming video, you can make every frame a P-frame, which brings your bandwidth costs to a minimum, but then a single lost packet permanently corrupts the stream. Or you periodically refresh the stream with I-frames sent over a reliable channel, so that lost packets only corrupt the video momentarily.
Sure, if performance characteristics were the same, people would go for strong consistency. The reason many different consistency models are defined is that there are different tradeoffs that are preferable for a given problem domain with specific business requirements.
If the video is streaming, people don't really care if a few frames drop, hell, most won't notice.
It's only when several frames in a row are dropped that people start to notice, and even then they rarely care as long as the message within the video has enough data points for them to make an (educated) guess.
P/B frames (which are usually most of them) reference other frames to compress motion effectively. So losing a packet doesn't mean a dropped frame, it means corruption that lasts until the next I-frame/slice. This can be seconds. If you've ever seen corrupt video that seems to "smear" wrong colors, etc. across the screen for a bunch of frames, that's what we're talking about here.
Again - the viewer rarely cares when that happens.
Minor annoyance, maybe, rage quit the application? Not a chance.
If you’re never sending an I-frame then it’s permanently corrupt. Sending an I-frame is the equivalent of eventual consistency.
Your users must be very different from the ones I'm familiar with.
The bar is at not rage quitting the application? A good experience is not even thought about?
If the area affected literally doesn't change for minutes afterwards it will not get refreshed and fixed.
Okay but now you're explaining that correctness is not necessarily the only reasonable state. It's possible to sacrifice some degree of correctness for enormous gains in performance because having absolute correctness comes at a cost that might simply not be worth it.
Back in the day there were some P2P RTS games that just sent duplicates. Like each UDP packet would have a new game state and then 1 or more repetitions of previous ones. For lockstep P2P engines, the state that needs to be transferred tends towards just being the client's input, so it's tiny, just a handful of bytes. Makes more sense to just duplicate ahead of time vs ack/nack and resend.
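A rough sketch of that idea, assuming each datagram carries the current tick's input plus the last few ticks' inputs so one lost packet doesn't stall the lockstep simulation; the wire format here is invented for illustration, not any particular game's protocol.

    # Hedged sketch: redundant input packets for a lockstep P2P engine.
    import struct

    REDUNDANCY = 3  # how many previous ticks to repeat in every packet

    def pack_inputs(tick: int, inputs_by_tick: dict) -> bytes:
        """Pack the input byte for `tick` plus up to REDUNDANCY older ticks."""
        ticks = [t for t in range(tick, tick - REDUNDANCY - 1, -1) if t in inputs_by_tick]
        payload = struct.pack("<B", len(ticks))
        for t in ticks:
            payload += struct.pack("<IB", t, inputs_by_tick[t])  # tick number + input bitmask
        return payload

    def unpack_inputs(payload: bytes) -> dict:
        """Recover every (tick, input) pair; duplicates from earlier packets are harmless."""
        (count,) = struct.unpack_from("<B", payload, 0)
        out, offset = {}, 1
        for _ in range(count):
            t, inp = struct.unpack_from("<IB", payload, offset)
            out[t] = inp
            offset += struct.calcsize("<IB")
        return out

    history = {10: 0b0001, 11: 0b0011, 12: 0b0010, 13: 0b0110}
    # Even if the packets for ticks 10-12 were lost, this one packet carries them again.
    assert unpack_inputs(pack_inputs(13, history)) == history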
I think we should stop calling these systems eventually consistent. They are actually never consistent. If the system is complex enough and there are always incoming changes, there is never a point in time at which these "eventually consistent systems" are in a consistent state. The problem of inconsistency is pushed to the users of the data.
Someone else stated this implicitly, but with your reasoning no complex system is ever consistent with ongoing changes. From the perspective of one of many concurrent writers outside of the database there’s no consistency they observe. Within the database there could be pending writes in flight that haven’t been persisted yet.
That’s why these consistency models are defined from the perspective of “if you did no more writes after write X, what happens”.
They are consistent (the C in ACID) for a particular transaction ID / timestamp. You are operating on a consistent snapshot. You can also view consistent states across time if you are archiving the log.
"... with your reasoning no complex system is ever consistent with ongoing changes. From the perspective of one of many concurrent writers outside of the database there’s no consistency they observe."
That was kind of my point. We should stop calling such systems consistent.
It is possible, however, to build a complex system, even with "event sourcing", that has consistency guarantees.
Of course your comment has the key term "outside of the database". You will need to either use a database or build a homegrown system that has similar features to what databases provide.
One way is to pipe everything through a database that enforces the consistency. I have actually built such an event sourcing platform.
A second way is to have a reconciliation process that guarantees consistency at a certain point in time. For example, bank payment systems use reconciliation to achieve end-of-day consistency. Even those are not really "guaranteed" to be consistent, just that inconsistencies are sufficiently improbable that they can be handled manually and with agreed-on timeouts.
The way you're defining "eventually consistent" seems to imply it means "the current state of the system is eventually consistent," which is not what I think that means. Rather, it means "for any given previous state of the system, the current state will eventually reflect that."
"Eventually consistent," as I understand it, always implies a lag, whereas the way you're using it seems to imply that at some point there is no lag.
I was not trying to "define" eventually consistent, but to point out that people typically use the term quite loosely, for example when referring to the state of the system-of-systems of multiple microservices or event sourcing.
Those are never guaranteed to be in a consistent state in the sense of the C in ACID, which means it becomes the responsibility of the systems that use the data to handle consistency. I see this often ignored, causing user interfaces to be flaky.
Inconsistent sounds so bad :)
I know. That is why it is a useful way to think about it, because it is both true and makes you think.
They eventually become consistent from the frame of a single write. They would become consistent if you stopped writes, so they will eventually get there
Both of your statements are true.
But in practice we are rarely interested in single writes when we talk about consistency, but rather in the consistency of multiple writes ("transactions") to multiple systems such as microservices.
It is difficult to guarantee consistency by stopping writes, because whatever enforces the stopping typically does not know at what point all the writes that belong together have been made.
If you "stop the writes" for sufficiently long, the probability of inconsistencies becomes low, but it is still not guaranteed to be non-existent.
For instance, in bank payment systems, end-of-day consistency is handled by a secondary process called "reconciliation", which makes end-of-day conflicts so improbable that any remaining conflict is handled by a manual tertiary process. And then there are agreed timeouts for multi-bank transactions etc. so that the payments ultimately end up in a consistent state.
That’s a fair point. To be fair to the academic definitions, “eventually consistent” is a quiescent state in most definitions, and there are more specific ones (like “bounded staleness”, or “monotonic prefix”) that are meaningful to clients of the system.
But I agree with you in general - the dynamic nature of systems means, in my mind, that you need to use client-side guarantees, rather than state guarantees, to reason about this stuff in general. State guarantees are nicer to prove and work with formally (see Adya, for example) while client side guarantees are trickier and feel less fulfilling formally (see Crooks et al “Seeing is Believing”, or Herlihy and Wing).
I have no beef with the academic, careful definitions, although I dislike the practice where academics redefine colloquial terms more formally. That actually causes more, not less confusion. I was talking about the colloquial use of the term.
If I search for "eventual consistency", the AI tells me that one of the cons for using eventual consistency is: "Temporary inconsistencies: Clients may read stale or out-of-date data until synchronization is complete."
I see time and time again in actual companies that have "modern" business systems based on microservices that developers can state the same idea, but have never actually paused to think that something is needed to do the "synchronization". Then they build web UIs that just ignore the fact, causing the application to become flaky.
Changing terminology is hard once a name sticks. But yeah, "eventual propagation" is probably more accurate. I do get the impression that "eventual consistency" often just means "does not have a well-defined consistency model".
Yes, I agree. I don't really believe we can change the terminology. But maybe we can get some people to at least think about the consistency model when using the term.
> If the system is complex enough and there are always incoming changes, there is never a point in time in these "eventually consistent systems" that they are in consistent state.
Sure, but that fact is almost never relevant. (And most systems do have periods where there aren't incoming changes, even if it's only when there's a big Internet outage). What would be the benefit in using that as a name?
It is highly relevant in many contexts. I see in my work all the time that developers building frontends on top of microservices believe that because the system is called "eventually consistent", they can ignore consistency issues in references between objects, causing flaky apps.
> I see in my work all the time that developers building frontends on top of microservices believe that because the system is called "eventually consistent", they can ignore consistency issues in references between objects, causing flaky apps.
If someone misunderstands that badly, why would one believe they would do any better if it was called "never consistent"?
That is a fair point, although I believe the misleading name is contributing to the confusion.
> They are actually never consistent
I don't see it this way. Let's take a simple example - banks. Your employer sends you the salary from another bank. The transfer is (I'd say) eventually consistent - at some point, you WILL get the money. So how can it be "never consistent"?
Your example is too simple to show the problem with "eventual consistency" as people use the term in real life.
Let's say you have two systems, one containing customers (A) and the other containing contracts for those customers (B).
Now you create a new contract by first creating the customer in system A and then the contract in system B.
It may happen that the web UI shows the contract from system B, which refers to the customer by id (in system A), but that customer only becomes visible in system A slightly later.
The web UI either has to be built to manage the situation where fetching the customer by id may temporarily fail -- or accept the risk that such cases are rare and just throw an error.
If the system were actually "eventually consistent" in the sense you use the term, the web UI could get a guarantee from the system-of-systems to fetch objects in a way that it would see either both the contract and the customer info, or neither.
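To make that concrete, here is a hedged sketch of what the web UI ends up having to do; the service URLs, the fetch_json helper, and the retry policy are assumptions for illustration only.

    # Hedged sketch: a UI tolerating a temporarily-dangling customer reference.
    import json, time, urllib.error, urllib.request

    def fetch_json(url: str):
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp)

    def load_contract_view(contract_id: str) -> dict:
        contract = fetch_json(f"https://system-b.example/contracts/{contract_id}")
        customer_id = contract["customer_id"]
        for attempt in range(3):  # the customer may not be visible in system A yet
            try:
                customer = fetch_json(f"https://system-a.example/customers/{customer_id}")
                return {"contract": contract, "customer": customer}
            except urllib.error.HTTPError as err:
                if err.code != 404:
                    raise
                time.sleep(0.2 * (attempt + 1))  # brief backoff before retrying
        # Give up gracefully: render a "still syncing" placeholder instead of erroring out.
        return {"contract": contract, "customer": None}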
Because I will have spent it before it becomes available :)
For the record (IMO) banks are an EXCELLENT example of eventually consistent systems.
They're also EXCELLENT for demonstrating Event Sourcing (bank statements, which are really projections of the bank's internal event log, but enough people have encountered them that most people understand them).
I have worked with core systems in several financial institutions, as well as built several event sourcing production systems used as the core platform. One of these event sourcing systems was actually providing real consistency guarantees.
Based on my experience, I would recommend against using bank systems as an example of event sourcing, because they are actually much more complex than what people typically mean when they talk about event sourcing systems.
Bank systems cannot use normal event sourcing exactly because of the problem I describe. They have various other processes to achieve sufficiently probable consistency (needed by the bank statements, for example), such as "reconciliation".
Even those do not actually "guarantee" anything; you need a tertiary manual process to fix any inconsistencies (some days after the transaction). They also have timeouts agreed between banks to eventually resolve any inconsistencies related to cross-bank payments over several weeks.
In practice this means that the bank statements for source account and target account may actually be inconsistent with each other, although these are so rare that most people never encounter them.
If the bank transaction is eventually consistent, it means that the state can flip and the person receiving the money will "never" be sure. A state that says the transaction will be finished later is itself a consistent state.
Just like Git. Why bother with all these branches, commits and merges? Just make it so everyone's revision steps forward in perfect lockstep.
Branches, commits and merges are the means by which people manually resolve conflicts, so that a single repository can be used to see a state where revisions step forward in perfect lockstep.
In many branching strategies this consistent state is called "main". There are alternative branching strategies as well; for example, the consistent state could be a release branch.
Obviously that does not guarantee ordering across repos, hence the popularity of "monorepo".
Different situations require different solutions.
> If the system is complex enough and there are always incoming changes
You literally don't understand the definition of eventual consistency. The weakest form of eventual consistency, quiescent consistency, requires [0]:
that in any execution where the updates stop at some point (i.e. where there are only finitely many updates), there must exist some state, such that each session converges to that state (i.e. all but finitely many operations e in each session [f] see that state).
Emphasis on the "updates stop[ping] at some point," or there being only "finitely many updates." By positing that there are always incoming changes you already fail to satisfy the hypothesis of the definition.
In this model all other forms of eventual consistency exhibit at least this property of quiescent consistency (and possibly more).
[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...
My point was kind of tongue-in-cheek. Like the other comment suggests, I was talking about how people actually use the term "eventually consistent", for example to refer to a system-of-systems of multiple microservices or to event sourcing systems. It is possible to define and use the terms more exactly, like you suggest, and I have no problem with that kind of use. But even if you use the terms carefully, most people do not, meaning that when you talk about these systems using the misunderstood terms, people may misunderstand you although you are careful.
The GP proposed that the definition should be changed. That in no way implies a lack of understanding of the present definition.
It's wishful thinking. It's like choosing Newtonian physics over relativity because it's simpler or the equations are neater.
If you have strong consistency, then you have at best availability xor partition tolerance.
"Eventual" consistency is the best tradeoff we have for an AP system.
Computation happens at a time and a place. Your frontend is not the same computer as your backend service, or your database, or your cloud providers, or your partners.
So you can insist on full-ACID on your DB (which it probably isn't running btw - search "READ COMMITTED".) but your DB will only be consistent with itself.
We always talk about multiple bank accounts in these consistency modelling exercises. Do yourself a favour and start thinking about multiple banks.
There is no reason a database can’t be both strongly consistent (linearizable, or equivalent) and available to clients on the majority side of a partition. This is, by far, the common case of real-world partitions in deployments with 3 data centers. One is disconnected or fails. The other two can continue, offering both strong consistency and availability to clients on their side of the partition.
The Gilbert and Lynch definition of CAP calls this state ‘unavailable’, in that it’s not available to all clients. Practically, though, it’s still available for two thirds of clients (or more, if we can reroute clients from the outside), which seems meaningfully ‘available’ to me!
If you don’t believe me, check out Phil Bernstein’s paper (Bernstein and Das) about this. Or read the Gilbert and Lynch proof carefully.
> The Gilbert and Lynch definition of CAP calls this state ‘unavailable’, in that it’s not available to all clients. Practically, though, it’s still available for two thirds of clients (or more, if we can reroute clients from the outside), which seems meaningfully ‘available’ to me!
That's great for those two thirds but not for the other one third. (Indeed you will notice that it's "available" precisely to the clients that are not "partitioned").
When does AP help?
It helps in the case where clients are (a) able to contact a minority partition, and (b) can tolerate eventual consistency, and (c) can’t contact the majority partition. These cases are quite rare in modern internet-connected applications.
Consider a 3AZ cloud deployment with remote clients on the internet, and one AZ partitioned off. Most often, clients from the outside will either be able to contact the remaining majority (the two healthy AZs), or will be able to contact nobody. Rarely, clients from the outside will have a path into the minority partition but not the majority partition, but I don’t think I’ve seen that happen in nearly two decades of watching systems like this.
What about internal clients in the partitioned off DC? Yes, the trade-off is that they won’t be able to make isolated progress. If they’re web servers or whatever, that’s moot because they’re partitioned off and there’s no work to do. Same if they’re a training cluster, or other highly connected workloads. There are workloads that can tolerate a ton of asynchrony where being able to continue while disconnected is interesting, but they’re the exception rather than the rule.
Weak consistency is much more interesting as a mechanism for reducing latency (as DynamoDB does, for example) or increasing scalability (as the typical RDBMS ‘read replicas’ pattern does).
> Rarely, clients from the outside will have a path into the minority partition but not the majority partition, but I don’t think I’ve seen that happen in nearly two decades of watching systems like this.
It happens any time you have a real partition, where e.g. one country or one office is cut off from the rest of the network. You're assuming that all of your system's use is external users from the internet, and you don't care about losing a small region when it's isolated from the internet, but most software systems are internal and if you're a company with 3 locations then being able to continue to work when one is cut off from the other 2 is pretty valuable.
> There are workloads that can tolerate a ton of asynchrony where being able to continue while disconnected is interesting, but they’re the exception rather than the rule.
I'd say it's pretty normal if you've got a system that actually does anything rather than just gluing together external stuff. Although sadly that may be the minority these days.
Latency = Partition.
There is code running on a server. When you poll that server, that code will (in the best case) choose between giving you whatever answer it has available (Available), or checking with its peers (on a different continent) to make sure it serves you up-to-date (Consistent) information.
CAP problems being a "rare occurrence" isn't a thing. The running code is either executing AP code or CP code.
The author addresses that in a linked post: https://brooker.co.za/blog/2024/07/25/cap-again.html
They don't address it so much as assume it away. Of course if all your load balancers and end clients can still talk to both sides of your partition then you don't have a problem - that's because you don't actually have a partition in that case.
The author’s position seems to be that “actual partition” in this sense is a very unusual case for most cloud applications.
Having worked as the lead architect for bank core payment systems, I'd say the multiple-bank scenario is a special case that is way too complex for the purpose of these discussions.
It is a multi-layered process that ultimately makes it very probable that the state of a payment transaction is consistent between banks, involving reconciliation processes, manual handling of failed transactions over an extended time period if reconciliation fails, settlement accounts for each of the involved banks, and sometimes even central banks for instant payments.
But I can imagine scenarios where even those fail to make the transaction state globally consistent. For example, a catastrophic event destroys a small bank's systems, the bank has failed to take off-site backups, and one payment has some hiccup, so the receiving bank cannot know what happened with the transaction. They would probably have to assume something.
True serializability doesn't model the real world. IRL humans observe something, then make decisions and take action, without holding "locks" on the thing they observed. Everything from the stock market to the sitcom industry depends on this behavior.
Other models exist and are more popular than serializability, e.g. for practicality, PostgreSQL uses MVCC and read consistency, not serializability.
Eventual consistency arises from necessity -- a need to prioritise AP more. Not every application needs strong consistency as a primary constraint. Why would you optimise for that, at the cost of availability, when eventual consistency is an acceptable default?
Practically, the difference in availability for a typical internet-connected application is very small. Partitions do happen, but in most cases it's possible to route user traffic around them, given the paths that traffic tends to take into large-scale data center clusters (redundant, and typically not the same paths as the cross-DC traffic). The remaining cases do exist, but are exceedingly rare in practice.
Note that I’m not saying that partitions don’t happen. They do! But in typical internet connected applications the cases where a significant proportion of clients is partitioned into the same partition as a minority of the database (i.e. the cases where AP actually improves availability) are very rare in practice.
For client devices and IoT, partitions off from the main internet are rare, and there local copies of data are a necessity.
Because the incidence and cost of mistaken under-consistency are both generally higher than those of mistaken over-consistency—especially at the scale where people would need to rely on managed off-the-shelf services like aurora instead of being able to build their own.
I would be hesitant to generalise that. There is an inherent tension with its impact on the larger availability of your system. We can't analyse the effect in isolation.
Most systems can tolerate downtime but not data incorrectness. Also, eventual consistency is a bit of a misnomer because it implies that the only cost you’re paying is staleness. In reality these systems are “never consistent” because you often give up guarantees like full serializability making you susceptible to outright data corruption.
It might arise from necessity, but what I see in practice is that even senior developers deprioritize consistency on platforms and backends, apparently just because scalability and performance are so fashionable.
That pushes the hard problem of maintaining a consistent experience for the end users to the frontend. Frontend developers are often less experienced.
So in practice you end up with flaky applications, and frontend and backend developers blaming each other.
Most systems do not need "webscale". I would challenge the idea that "eventual consistency" is an acceptable default.
This isn't a reason to have strong consistency and pay the costs, it's a reason to not do read-modify-write. Indeed I'd argue it's actually a reason to prefer eventually consistent systems, as they will nudge you away from adopting this misguided architecture before your system becomes too big to migrate.
Adopt an event streaming/sourcing architecture and all these problems go away, and you are forced to have a sensible dataflow rather than the deadlocky nonsense that strongly-consistent systems nudge you towards.
(Op here) No deadlocks needed! There’s nothing about providing strong consistency (or even strong isolation) that requires deadlocks to be a thing. DSQL, for example, doesn’t have them*.
Event sourcing architectures can be great, but they also tend to be fairly complex (a lot of moving parts). The bigger practical problem is that they make it quite hard to offer clients ‘outside the architecture’ meaningful read-time guarantees stronger than a consistent prefix. That makes clients’ lives hard for the reasons I argue in the blog post.
I really like event-based architectures for things like observability, metering, reporting, and so on where clients can be very tolerant to seeing bounded stale data. For control planes, website backends, etc, I think strongly consistent DB architectures tend to be both simpler and offer a better customer experience.
* Ok, there’s one edge case in the cross-shard commit protocol where two committers can deadlock, which needs to be resolved by aborting one of them (the moral equivalent of WAIT-DIE). This never happens with single-shard transactions, and can’t be triggered by any SQL patterns.
> There’s nothing about providing strong consistency (or even strong isolation) that requires deadlocks to be a thing. DSQL, for example, doesn’t have them*.
If you want to have the kind of consistency people expect (transactional) in this kind of environment, they're unavoidable, right? I see you have optimistic concurrency control, which, sure, but that then means read-modify-write won't work the way people expect (the initial read may be a phantom read if their transaction gets retried), and fundamentally there's no good option here, only different kinds of bad option.
> Event sourcing architectures can be great, but they also tend to be fairly complex (a lot of moving parts).
Disagree. I would say event sourcing architectures are a lot simpler than consistent architectures; indeed most consistent systems are built on top of something that looks rather like an event based architecture underneath (e.g. that's presumably how your optimistic concurrency control works).
> The bigger practical problem is that they make it quite hard to offer clients ‘outside the architecture’ meaningful read-time guarantees stronger than a consistent prefix. That makes clients’ lives hard for the reasons I argue in the blog post.
You can give them a consistent snapshot quite easily. What you can't give them is the ability to do in-place modification while maintaining consistency.
It makes clients' lives hard if they want to do the wrong thing. But the solution to that is to not let them do that thing! Yes, read-modify-write won't work well in an event-based architecture, but read-modify-write never works well in any architecture. You can paper over the cracks at an ever-increasing cost in performance and complexity and suppress most of the edge cases (but you'll never get rid of them entirely), or you can commit to not doing it at the design stage.
I've never had the luxury of multiple DBs, but speaking of race conditions in general...
Could you put an almost-empty DB in front that only records recent changes? Deletes become rows; updates require posting all values of the row. If no record is found, forward the query to the read DB. If modifications are posted, forward them to the write DB.
If correctness is merely nice to have, I always use a "Pain" value that influences the sleep duration. Things rarely get very busy instantaneously; activity usually changes gradually.
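Something along those lines could look roughly like this; a toy sketch of the overlay idea, with plain dicts standing in for the real stores and all names invented.

    # Toy sketch: a small "recent changes" overlay consulted before the (stale) read DB.
    TOMBSTONE = object()                                 # deletes become rows too

    recent = {}                                          # overlay: only recent changes
    read_db = {"user:1": {"name": "old"}, "user:2": {"name": "bob"}}   # lagging replica
    write_db = dict(read_db)                             # primary

    def write(key, value):
        write_db[key] = value                            # forward to the write DB
        recent[key] = value                              # and remember until replicas catch up

    def delete(key):
        write_db.pop(key, None)
        recent[key] = TOMBSTONE

    def read(key):
        if key in recent:                                # overlay answers for anything recent
            value = recent[key]
            return None if value is TOMBSTONE else value
        return read_db.get(key)                          # otherwise forward to the read DB

    write("user:1", {"name": "new"})
    assert read("user:1") == {"name": "new"}             # not the stale replica value
    assert read("user:2") == {"name": "bob"}             # untouched keys still served by the replica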
Yes, you can do stuff like that. You might enjoy the CRAQ paper by Terrace et al, which does something similar to what you are saying (in a very different setting, chain replication rather than DBs).
> read-modify-write is the canonical transactional workload. That applies to explicit transactions (anything that does an UPDATE or SELECT followed by a write in a transaction), but also things that do implicit transactions (like the example above)
Your "implicit transaction" would not be consistent even if there was no replication involved at all. Explicit db transactions exist for a reason - use them.
The point is that, in a disaggregated system, the transaction processor has less flexibility about how to route parts of the same transaction (that section is a point about internal implementation details of transaction systems).
it sometimes can be just an architectural issue...
You can use the critical query against the RW instance, the first point.
The other point is that most of the time, specially concerning to web where the amount of concurrent access may be critical, the data doesn't need to be time-critical.
With the advent of reactive in apps and web things became overcomplex.
Yes, strong consistency will always be an issue. And mitigation should start in the architecture. More often than not, the problem arise from architectural overcomplication. Each case is a case.
In the read-after-write scenario, why not use something like consistency tokens, and redirect to the primary if the secondary detects it has not caught up?
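Roughly like this, perhaps; a hedged sketch where the write hands back a token (here just the primary's LSN) and the read path falls back to the primary when the chosen replica hasn't caught up. The classes and names are illustrative, not any real system's API.

    # Hedged sketch: consistency-token routing between a primary and a lagging replica.
    class Primary:
        def __init__(self):
            self.lsn = 0
            self.data = {}
        def write(self, key, value) -> int:
            self.lsn += 1
            self.data[key] = value
            return self.lsn                      # the "consistency token" returned to the client

    class Replica:
        def __init__(self, primary):
            self.primary = primary
            self.applied_lsn = 0                 # how far replication has caught up
            self.data = {}

    def read(replica, key, token):
        if replica.applied_lsn >= token:
            return replica.data.get(key)         # replica is fresh enough to serve the read
        return replica.primary.data.get(key)     # otherwise redirect to the primary

    p = Primary()
    r = Replica(p)
    token = p.write("x", 42)                     # replica has not replayed this write yet
    assert read(r, "x", token) == 42             # so the read is redirected to the primary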
I spoke about this exact thing at a conference (HPTS’19) a while back. This can work, but it introduces modal behaviors into systems that make reasoning about availability very difficult, and it tends to cause metastable behaviors and long outages.
The feedback loop is replicas slow -> traffic increases to primary -> primary slows -> replicas slow, etc. The only way out of this loop is to shed traffic.
The argument seems to rely on the point that the replicas are only valuable if you can send reads to them, which I don't think is true. Eventually-consistent replicated databases are valuable on their own terms even if you can only send traffic to the leader.
Can you spell this out for a newb like myself? It looks like you’re saying that a read replica that can’t be read from is still useful, but that doesn’t sound right. What is its usefulness if I can’t read from it?
I assume he is referring to the master holding all the primary data, and you treat replicas as backups. So if a master goes down, a replica becomes the master that handles all the data. You never get inconsistencies between write and read data, as it's all on the same node.
The issue is that you give up 1x, 2x, ... extra read capacity, depending on your RF=x. But you gain strong consistency without the overhead of dealing with eventual consistency or full consistency over multiple nodes (with stuff like Raft).
1. If you use something like Raft to ensure that you get full consistency, your writes slow down, as there are multiple rounds of network communication per write (insert/update/delete). That takes two network round trips (send data, receive acknowledgement; send confirm to commit, receive confirmation it committed).
2. If you use eventual consistency you have fast writes, and only one hop (send data, receive acknowledgement). But like the article talked about, if you write something and then also try to read that data, it can happen that the server doing the writing is slower than the next part of your code that is reading. So you write X = X + 1; it goes to server A, but your connection then reads from server B and still sees the old X, because the change has not yet been committed on server B.
3. The above-mentioned comment simply assumes there is one master for your data: server A. You write AND read from server A all the time. Any changes to the data are sent to server B, but only as a hot standby. The advantage is that writes are fast and very consistent. The disadvantage is that you cannot scale your reads beyond server A.
----
Now imagine that your data is cut into smaller pieces, or you do not write sequentially like 1 2 3 ... but with hash distribution, where a goes to server 2, b goes to server 3, f goes to server 2 ...
What is better really depends on your workload in combination with the server setup you're running: single node, sharded, replicated sharded, distributed seq, ...
Blogs like this make me go on the same rant for the n-th time:
Consistency for distributed systems is impossible without APIs returning cookies containing vector clocks.
The idea is simple: every database has a logical sequence number (LSN), which the replicas try to catch up to -- but may be a little bit behind. Every time an API talks to a set of databases (or their replicas) to produce a JSON response (or whatever), it ought to return the LSNs of each database that produced the query in a cookie. Something like "db1:591284;db2:10697438".
Client software must then union this with their existing cookie, and return the result of that to the next API call.
That way if they've just inserted some value into db1 and the read-after-write query ends up going to a read replica that's slightly behind the write master (LSN 591280 instead of 591284) then the replica can either wait until it sees LSN >= 591284, or it can proxy the query back to the write master. A simple "expected latency of waiting vs proxying" heuristic can be used for this decision.
That's (almost entirely) all you need for read-after-write transactional consistency at every layer, even through Redis caches and stateless APIs layers!
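A sketch of what the client and replica sides of that could look like, using the "db1:591284;db2:10697438" cookie format from the comment above; everything else (function names, reducing the wait-vs-proxy decision to a boolean) is simplified for illustration.

    # Hedged sketch: merging LSN cookies client-side and checking them replica-side.
    def parse(cookie: str) -> dict:
        return {name: int(lsn) for name, lsn in
                (part.split(":") for part in cookie.split(";") if part)}

    def merge(old: str, new: str) -> str:
        """Client side: union two cookies, keeping the highest LSN per database."""
        merged = parse(old)
        for name, lsn in parse(new).items():
            merged[name] = max(merged.get(name, 0), lsn)
        return ";".join(f"{n}:{v}" for n, v in sorted(merged.items()))

    def can_serve(replica_lsns: dict, cookie: str) -> bool:
        """Replica side: answer only if it has replayed past every LSN the client has seen."""
        return all(replica_lsns.get(db, 0) >= lsn for db, lsn in parse(cookie).items())

    cookie = merge("db1:591280;db2:10697438", "db1:591284")
    assert cookie == "db1:591284;db2:10697438"
    # This replica is 4 LSNs behind on db1, so it must wait or proxy to the write master.
    assert not can_serve({"db1": 591280, "db2": 10697438}, cookie)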
(OP here). I don’t love leaking this kind of thing through the API. I think that, for most client/server shaped systems at least, we can offer guarantees like linearizability to all clients with few hard real-world trade-offs. That does require a very careful approach to designing the database, and especially to read scale-out (as you say) but it’s real and doable.
By pushing things like read-scale-out into the core database, and away from replicas and caches, we get to have stronger client and application guarantees with less architectural complexity. A great combination.
For the love of all that’s holy, please stop doing read-after-write. In nearly all cases, it isn’t needed. The only cases I can think of are if you need a DB-generated value (so, DATETIME or UUIDv1) from MySQL, or you did a multi-row INSERT in a concurrent environment.
For MySQL, you can get the first auto-incrementing integer created from your INSERT from the cursor. If you only inserted one row, congratulations, there’s your PK. If you inserted multiple rows, you could also get the number of rows inserted and add that to get the range, but there’s no guarantee that it wasn’t interleaved with other statements. Anything else you wrote, you should already have, because you wrote it.
For MariaDB, SQLite, and Postgres, you can just use the RETURNING clause and get back the entire row with your INSERT, or specific columns.
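For example, with Python's DB-API this looks roughly like the following; sqlite3 is used only because it is in the standard library (RETURNING needs SQLite 3.35+), and MySQL drivers expose the generated key the same way via cursor.lastrowid.

    # Getting generated values back from the INSERT itself, with no read-after-write.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    # Option 1: the driver reports the generated key (MySQL's cursor.lastrowid works similarly).
    cur = conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
    print(cur.lastrowid)       # there's your PK, no SELECT needed

    # Option 2 (SQLite 3.35+, MariaDB, Postgres): RETURNING hands back the whole row.
    row = conn.execute(
        "INSERT INTO users (name) VALUES (?) RETURNING id, name", ("bob",)
    ).fetchone()
    print(row)                 # (2, 'bob')
    conn.commit()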
But that could be applied only in the context of a single function. What if I save a resource and then mash F5 in the browser to see what was saved? I could hit a read replica that wasn't fast enough, and the consistency promise breaks. I don't know how to solve that.
The comment at [1] hints at a solution: in the response to the write request, return the id of the transaction or its commit position in the TX log (LSN). When routing a subsequent read request to a replica, the system can either wait until the transaction is present on the replica, or redirect to the primary node. I discussed a similar solution a while ago in a talk [2], in the context of serving denormalized data views from a cache.
Yep. Your SQL transactions are only consistent to the extent that they stay in the db.
Mashing F5 is a perfect example of stepping outside the bounds of consistency.
If you want to update a counter, do you read the number on your frontend, add 2, then send it back to the backend? If someone else does the same, that's a lost write, regardless of how "strongly consistent" your db vendor promises to be.
But that's how the article says programmers work. Read, update, write.
If you thought "that's dumb, just send in (+2)", congrats, that's EC thinking!
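In other words, ship the operation rather than the new state. A small sketch of both styles (sqlite3 is just a convenient stand-in for whatever store you use):

    # Read-modify-write vs. sending the delta.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, n INTEGER)")
    conn.execute("INSERT INTO counters VALUES ('likes', 0)")

    def racy_add(amount: int) -> None:
        # Read on the frontend, add, write back: two concurrent callers can overwrite each other.
        (n,) = conn.execute("SELECT n FROM counters WHERE name = 'likes'").fetchone()
        conn.execute("UPDATE counters SET n = ? WHERE name = 'likes'", (n + amount,))

    def send_the_delta(amount: int) -> None:
        # Ship "+2" instead of the computed total: the increment commutes, so nothing is lost.
        conn.execute("UPDATE counters SET n = n + ? WHERE name = 'likes'", (amount,))

    send_the_delta(2)
    send_the_delta(2)
    print(conn.execute("SELECT n FROM counters WHERE name = 'likes'").fetchone())  # (4,)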
So why isn't the section that needs consistency enclosed in a transaction, with all operations between BEGIN TRANSACTION and COMMIT TRANSACTION? That's the standard way to get strong consistency in SQL. It's fully supported in MySQL, at least for InnoDB. You have to talk to the master, not a read slave, when updating, but that's normal.
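For completeness, the "standard way" being referred to is something like this, sketched with sqlite3 here; against MySQL/InnoDB you would run the same statements on the master.

    # Read-modify-write wrapped in an explicit transaction against the primary.
    import sqlite3

    conn = sqlite3.connect(":memory:", isolation_level=None)   # manage transactions by hand
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES (1, 100)")

    conn.execute("BEGIN IMMEDIATE")                            # take the write lock up front
    (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
    conn.execute("UPDATE accounts SET balance = ? WHERE id = 1", (balance - 30,))
    conn.execute("COMMIT")                                     # read and write become visible together

    print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())    # (70,)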
The point of that section, which maybe isn’t obvious enough, is to reflect on how eventually-consistent read replicas limit the options of the database system builder (rather than the application builder). If I’m building the transaction layer of a database, I want to have a bunch of options for where to send my reads, so I don’t have to send the whole read part of every RMW workload to the single leader.
I don't understand this article, and it's like the author doesn't really know what they're talking about. They don't want eventual consistency, they want read-your-writes, a consistency level that's stronger than EC yet still not strong.
Read-your-writes is indeed useful because it makes code easier to write: every process can behave as if it was the only one in the world, devs can write synchronous code, that's great! But you don't need strong consistency.
I hope developers learn a little bit more about the domain before going to strong consistency.
I am not an expert, but from the examples in the article I think the author is looking for a bit more than read-your-writes.
E.g. they mention reading a list of attachments and want to ensure they get all currently created attachments, which includes the ones created by other processes.
So they want to have "read-all-writes" or something like that.
Read-your-writes is a client guarantee that requires stickiness (i.e. a definition of “your”) to be meaningful. It’s not a level of consistency I love, because it raises all kinds of edge-case questions. For example, if I have to reconnect, am I still the same “your”? This isn’t even some rare edge case! If I’m automating around a CLI, for example, how is the server meant to know that the next CLI invocation from the same script (a different process) is the same “your”? Sure, I can fix that with some kind of token, but then I’ve made the API more complicated.
Linearizability, as a global guarantee, is much nicer because it avoids all those edge cases.
I continue to be surprised that in these discussions correctness is treated as some optional highest possible level of quality, not the only reasonable state.
Suppose we're talking about multiplayer game networking, where the central store receives torrents of UDP packets and it is assumed that like half of them will never arrive. It doesn't make sense to view this as "we don't care about the player's actual position". We do. The system just has tolerances for how often the updates must be communicated successfully. Lost packets do not make the system incorrect.
A soft-realtime multiplayer game is always incorrect(unless no one is moving).
There are various decisions the netcode can make about how to reconcile with this incorrectness, and different games make different tradeoffs.
For example in hitscan FPS games, when two players fatally shoot one another at the same time, some games will only process the first packet received, and award the kill to that player, while other games will allow kill trading within some time window.
A tolerance is just an amount of incorrectness that the designer of the system can accept.
When it comes to CRUD apps using read-replicas, so long as the designer of the system is aware of and accepts the consistency errors that will sometimes occur, does that make that system correct?
If you’re live streaming video, you can make sure every frame is a P-frame which brings your bandwidth costs to a minimum, but then a lost packet completely permanently disables the stream. Or you periodically refresh the stream with I-frames sent over a reliable channel so that lost packets corrupt the video going forward only momentarily.
Sure, if performance characteristics were the same, people would go for strong consistency. The reason many different consistency models are defined is that there’s different tradeoffs that are preferable to a given problem domain with specific business requirements.
You've got the frame types backwards, which is probably contributing to the disagreement you're seeing.
If the video is streaming, people don't really care if a few frames drop, hell, most won't notice.
It's only when several frames in a row are dropped that people start to notice, and even then they rarely care as long as the message within the video has enough data points for them to make an (educated) guess.
P/B frames (which is usually most of them) reference other frames to compress motion effectively. So losing a packet doesn't mean a dropped frame, it means corruption that lasts until the next I-frame/slice. This can be seconds. If you've ever seen corrupt video that seems to "smear" wrong colors, etc. across the screen for a bunch of frames, that's what we're talking about here.
Again - the viewer rarely cares when that happens
Minor annoyance, maybe, rage quit the application? Not a chance.
If you’re never sending an I-frame then it’s permanently corrupt. Sending an I-frame is the equivalent of eventual consistency.
Your users must be very different from the ones I'm familiar with.
The bar is at not rage quitting the application? A good experience is not even thought about?
If the area affected literally doesn't change for minutes afterwards it will not get refreshed and fixed.
Okay but now you're explaining that correctness is not necessarily the only reasonable state. It's possible to sacrifice some degree of correctness for enormous gains in performance because having absolute correctness comes at a cost that might simply not be worth it.
Back in the day there were some P2P RTS games that just sent duplicates. Like each UDP packet would have a new game state and then 1 or more repetitions of previous ones. For lockstep P2P engines, the state that needs to be transferred tends towards just being the client's input, so it's tiny, just a handful of bytes. Makes more sense to just duplicate ahead of time vs ack/nack and resend.
I think we should stop calling these systems eventually consistent. They are actually never consistent. If the system is complex enough and there are always incoming changes, there is never a point in time in these "eventually consistent systems" that they are in consistent state. The problem of inconsistency is pushed to the users of the data.
Someone else stated this implicitly, but with your reasoning no complex system is ever consistent with ongoing changes. From the perspective of one of many concurrent writers outside of the database there’s no consistency they observe. Within the database there could be pending writes in flight that haven’t been persisted yet.
That’s why these consistency models are defined from the perspective of “if you did no more writes after write X, what happens”.
They are consistent (the C in ACID) for a particular transaction ID / timestamp. You are operating on a consistent snapshot. You can also view consistent states across time if you are archiving log.
"... with your reasoning no complex system is ever consistent with ongoing changes. From the perspective of one of many concurrent writers outside of the database there’s no consistency they observe."
That was kind of my point. We should stop callings such systems consistent.
It is possible, however, to build a complex system, even with "event sourcing", that has consistency guarantees.
Of course your comment has the key term "outside of the database". You will need to either use a database or built a homegrown system that has similar features as databases do.
One way is to pipe everything through a database that enforces the consistency. I have actually built such an event sourcing platform.
Second way is to have a reconciliation process that guarantees consistency at certain point of time. For example, bank payments systems use reconciliation to achieve end-of-day consistency. Even those are not really "guaranteed" to be consistent, just that inconsistencies are sufficiently improbable, so that they can be handled manually and with agreed on timeouts.
The way you're defining "eventually consistent" seems to imply it means "the current state of the system is eventually consistent," which is not what I think that means. Rather, it means "for any given previous state of the system, the current state will eventually reflect that."
"Eventually consistent," as I understand it, always implies a lag, whereas the way you're using it seems to imply that at some point there is no lag.
I was not trying to "define" eventually consistent, but to point out that people typically use the term quite loosely, for example when referring to the state of the system-of-systems of multiple microservices or event sourcing.
Those are never guaranteed to be in consistent state in the sense of C in ACID, which means it becomes the responsibility of the systems that use the data to handle the consistency. I see this often ignored, causing user interfaces to be flaky.
Inconsistent sounds so bad :)
I know. That is why it is useful way to think about it, because it both is true and makes you think.
They eventually become consistent from the frame of a single write. They would become consistent if you stopped writes, so they will eventually get there
Both of your statements are true.
But in practice we are rarely interested in single writes when we talk about consistency, but the consistency of multiple writes ("transactions") to multiple systems such as microservices.
It is difficult to guarantee consistency by stopping writes, because whatever enforces the stopping typically does not know at what point all the writes that belong together have been made.
If you "stop the writes" for sufficiently long, the probability of inconsistencies becomes low, but it is still not guaranteed to be non-existant.
For instance in bank payment systems, end-of-day consistency is handled by a secondary process called "reconciliation" which makes the end-of-day conflicts so improbable that any conflict is handled by a manual tertiary process. And then there are agreed timeouts for multi-bank transactions etc. so that the payments ultimately end up in consistent state.
That’s a fair point. To be fair to the academic definitions, “eventually consistent” is a quiescent state in most definitions, and there are more specific ones (like “bounded staleness”, or “monotonic prefix”) that are meaningful to clients of the system.
But I agree with you in general - the dynamic nature of systems means, in my mind, that you need to use client-side guarantees, rather than state guarantees, to reason about this stuff in general. State guarantees are nicer to prove and work with formally (see Adya, for example) while client side guarantees are trickier and feel less fulfilling formally (see Crooks et al “Seeing is Believing”, or Herlihy and Wing).
I have no beef with the academic, careful definitions, although I dislike the practice where academics redefine colloquial terms more formally. That actually causes more, not less confusion. I was talking about the colloquial use of the term.
If I search for "eventual consistency", the AI tells me that one of the cons for using eventual consistency is: "Temporary inconsistencies: Clients may read stale or out-of-date data until synchronization is complete."
I see time and time again in actual companies that have "modern" business systems based on microservices that developers can state the same idea but have never actually paused to think that you something is needed to do the "synchronization". Then they build web UIs that just ignore the fact, causing application to become flaky.
Changing terminology is hard once a name sticks. But yeah, "eventual propagation" is probably more accurate. I do get the impression that "eventual consistency" often just means "does not have a well-defined consistency model".
Yes, I agree. I don't really believe we can change the terminology. But maybe we can get some people to at least think about the consistency model when using the term.
> If the system is complex enough and there are always incoming changes, there is never a point in time in these "eventually consistent systems" that they are in consistent state.
Sure, but that fact is almost never relevant. (And most systems do have periods where there aren't incoming changes, even if it's only when there's a big Internet outage). What would be the benefit in using that as a name?
It is highly relevant in many contexts. I see in my work all the time that developers building frontends on top of microservices believe that because the system is called "eventually consistent", they can ignore consistency issues in refences between objects, causing flaky apps.
> I see in my work all the time that developers building frontends on top of microservices believe that because the system is called "eventually consistent", they can ignore consistency issues in refences between objects, causing flaky apps.
If someone misunderstands that badly, why would one believe they would do any better if it was called "never consistent"?
That is a fair point, although I believe that misleading name is contributing to the confusion.
> They are actually never consistent
I don't see it this way. Let's take a simple example - banks. Your employer sends you the salary from another bank. The transfer is (I'd say) eventually consistent - at some point, you WILL get the money. So how it can be "never consistent"?
Your example is too simple to show the problem with the "eventual consistency" as people use the term in real life.
Let's say you have two systems, one containing customers (A) and other containing contracts for the customers (B).
Now you create a new contract by first creating the customer in system A and then the contract on system B.
It may happen that web UI shows the contract in system B, which refers to the customer by id (in system A), but that customer becomes visible slightly after in system A.
The web UI has to either be built to manage the situation where fetching customer by id may temporarily fail -- or accept the risk that such cases are rare and you just throw an error.
If a system would be actually "eventually consistent" in the sense you use the term, it would be possible for the web UI to get guarantee from the system-of-systems to fetch objects in a way that they would see either both the contract and the customer info or none.
Because I will have spent it before it becomes available :)
For the record (IMO) banks are an EXCELLENT example of eventually consistent systems.
They're also EXCELLENT for demonstrating Event Sourcing (Bank statements, which are really projections of the banks internal Event log, but enough people have encountered them in such a way that that most people understand them)
I have worked with core systems in several financial institutions, as well as built several event sourcing production systems used as the core platform. One of these event sourcing systems was actually providing real consistency guarantees.
Based on my experience, I would recommend against using bank systems as an example of event sourcing, because they are actually much more complex than what people typically mean when they talk about event sourcing systems.
Bank systems cannot use normal event sourcing exactly because of the problem I describe. They have various other processes to have sufficiently probable consistency (needed by the bank statements for example), such as "reconciliation".
Even those do not actually "guarantee" anything, but you need tertiary manual process to fix any inconsistencies (on some days after the transaction). They also have timeouts agreed between banks to eventually resolve any inconsistencies related to cross-bank payments over several weeks.
In practice this means that the bank statements for source account and target account may actually be inconsistent with each other, although these are so rare that most people never encounter them.
If the bank transaction is eventually consistent, it means that the state can flip and the person receiving will "never" be sure. A state that the transaction will be finished later is a consistent state.
Just like Git. Why bother with all these branches, commits and merges?
Just make it so everyone's revision steps forward in perfect lockstep.
Branches, commits and merges are the means how people manually resolve conflicts so that a single repository can be used to see a state where revision steps forward in perfect lockstep.
In many branching strategies this consistent state is called "main". There are alternative branching stragies as well. For example the consistent state could be a release branch.
Obviously that does not guarantee ordering across repos, hence the popularity of "monorepo".
Different situations require different solutions.
> If the system is complex enough and there are always incoming changes
You literally don't understand the definition of eventual consistency. The weakest form of eventual consistency, quiescent consistency, requires [0]:
Emphasis on the "updates stop[ping] at some point," or there being only "finitely many updates." By positing that there are always incoming changes you already fail to satisfy the hypothesis of the definition.In this model all other forms of eventual consistency exhibit at least this property of quiescent consistency (and possibly more).
[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...
My point was kind of tongue-in-cheek. Like the other comment suggests, I was talking about how people actually use the term "eventually consistent" for example to refer to system-of-systems of multiple microservices or event sourcing systems. It is possible to define and use the terms more exactly like you suggest. I have no problem with that kind of use. But even if you use the terms more carefully, most people do not, meaning that when you talk about these systems using the misunderstood terms, people may misunderstand you although you are careful.
The GP proposed that the definition should be changed. That in no way implies a lack of understanding of the present definition.
It's wishful thinking. It's like choosing Newtonian physics over relativity because it's simpler or the equations are neater.
If you have strong consistency, then you have at best availability xor partition tolerance.
"Eventual" consistency is the best tradeoff we have for an AP system.
Computation happens at a time and a place. Your frontend is not the same computer as your backend service, or your database, or your cloud providers, or your partners.
So you can insist on full-ACID on your DB (which it probably isn't running btw - search "READ COMMITTED".) but your DB will only be consistent with itself.
We always talk about multiple bank accounts in these consistency modelling exercises. Do yourself a favour and start thinking about multiple banks.
There is no reason a database can’t be both strongly consistent (linearizable, or equivalent) and available to clients on the majority side of a partition. This is, by far, the common case of real-world partitions in deployments with 3 data centers. One is disconnected or fails. The other two can continue, offering both strong consistency and availability to clients on their side of the partition.
The Gilbert and Lynch definition of CAP calls this state ‘unavailable’, in that it’s not available to all clients. Practically, though, it’s still available for two thirds of clients (or more, if we can reroute clients from the outside), which seems meaningfully ‘available’ to me!
If you don’t believe me, check out Phil Bernstein’s paper (Bernstein and Das) about this. Or read the Gilbert and Lynch proof carefully.
> The Gilbert and Lynch definition of CAP calls this state ‘unavailable’, in that it’s not available to all clients. Practically, though, it’s still available for two thirds of clients (or more, if we can reroute clients from the outside), which seems meaningfully ‘available’ to me!
That's great for those two thirds but not for the other one third. (Indeed you will notice that it's "available" precisely to the clients that are not "partitioned").
When does AP help?
It helps in the case where clients are (a) able to contact a minority partition, and (b) can tolerate eventual consistency, and (c) can’t contact the majority partition. These cases are quite rare in modern internet-connected applications.
Consider a 3AZ cloud deployment with remote clients on the internet, and one AZ partitioned off. Most often, clients from the outside will either be able to contact the remaining majority (the two healthy AZs), or will be able to contact nobody. Rarely, clients from the outside will have a path into the minority partition but not the majority partition, but I don’t think I’ve seen that happen in nearly two decades of watching systems like this.
What about internal clients in the partitioned off DC? Yes, the trade-off is that they won’t be able to make isolated progress. If they’re web servers or whatever, that’s moot because they’re partitioned off and there’s no work to do. Same if they’re a training cluster, or other highly connected workloads. There are workloads that can tolerate a ton of asynchrony where being able to continue while disconnected is interesting, but they’re the exception rather than the rule.
Weak consistency is much more interesting as a mechanism for reducing latency (as DynamoDB does, for example) or increasing scalability (as the typical RDBMS ‘read replicas’ pattern does).
> Rarely, clients from the outside will have a path into the minority partition but not the majority partition, but I don’t think I’ve seen that happen in nearly two decades of watching systems like this.
It happens any time you have a real partition, where e.g. one country or one office is cut off from the rest of the network. You're assuming that all of your system's use comes from external users on the internet, and that you don't care about losing a small region when it's isolated from the internet. But most software systems are internal, and if you're a company with 3 locations, being able to keep working when one is cut off from the other 2 is pretty valuable.
> There are workloads that can tolerate a ton of asynchrony where being able to continue while disconnected is interesting, but they’re the exception rather than the rule.
I'd say it's pretty normal if you've got a system that actually does anything rather than just gluing together external stuff. Although sadly that may be the minority these days.
Latency = Partition.
There is code running on a server. When you poll that server, that code will choose (in the best case) between giving you whatever answer it has Available, or checking with its peers (on a different continent) to make sure it serves you up-to-date (Consistent) information.
CAP problems being a "rare occurrence" isn't a thing. The running code is either executing AP code or CP code.
The author addresses that in a linked post: https://brooker.co.za/blog/2024/07/25/cap-again.html
They don't address it so much as assume it away. Of course if all your load balancers and end clients can still talk to both sides of your partition then you don't have a problem - that's because you don't actually have a partition in that case.
The author’s position seems to be that “actual partition” in this sense is a very unusual case for most cloud applications.
Having worked as the lead architect for bank core payment systems, I can say the multiple-bank scenario is a special case that is way too complex for the purpose of these discussions.
It is a multi-layered process that ultimately makes it very probable that the state of a payment transaction is consistent between banks. It involves reconciliation processes, manual handling of failed transactions over an extended time period if reconciliation fails, settlement accounts for each of the involved banks, and sometimes even central banks for instant payments.
But I can imagine scenarios where even those fail to make the transaction state globally consistent. For example, a catastrophic event destroys the systems of a small bank that has failed to take off-site backups, and one payment has some hiccup, so the receiving bank cannot know what happened with the transaction. They would probably have to assume something.
True serializability doesn't model the real world. IRL humans observe something then make decisions and take action, without holding "locks" on the thing they observed. Everything from the stock market to the sitcom industry depend on this behavior.
Other models exist and are more popular than serializability. For practicality, PostgreSQL uses MVCC with read-committed isolation by default rather than serializability (serializable isolation is available, but it is not the default).
Eventual consistency arises from necessity -- a need to prioritise AP more. Not every application needs strong consistency as a primary constraint. Why would you optimise for that, at the cost of availability, when eventual consistency is an acceptable default?
Practically, the difference in availability for a typical internet-connected application is very small. Partitions do happen, but in most cases it's possible to route user traffic around them, given the paths that traffic tends to take into large-scale data center clusters (redundant, and typically not the same paths as the cross-DC traffic). The remaining cases do exist, but are exceedingly rare in practice.
Note that I’m not saying that partitions don’t happen. They do! But in typical internet connected applications the cases where a significant proportion of clients is partitioned into the same partition as a minority of the database (i.e. the cases where AP actually improves availability) are very rare in practice.
For client devices and IoT, partitions off from the main internet are rare, and there local copies of data are a necessity.
Because the incidence and cost of mistaken under-consistency are both generally higher than those of mistaken over-consistency, especially at the scale where people need to rely on managed off-the-shelf services like Aurora instead of being able to build their own.
I would be hesitant to generalise that. There is an inherent tension with its impact on the larger availability of your system. We can't analyse the effect in isolation.
Most systems can tolerate downtime but not data incorrectness. Also, eventual consistency is a bit of a misnomer, because it implies that the only cost you're paying is staleness. In reality these systems are "never consistent": you often give up guarantees like full serializability, making you susceptible to outright data corruption.
It might arise from necessity, but what I see in practice is that even senior developers deprioritize consistency on platforms and backends, apparently just because scalability and performance are so fashionable.
That pushes the hard problem of maintaining a consistent experience for the end users to the frontend. Frontend developers are often less experienced.
So in practice you end up with flaky applications, and frontend and backend developers blaming each other.
Most systems do not need "webscale". I would challenge the idea that "eventual consistency" is an acceptable default.
This isn't a reason to have strong consistency and pay the costs, it's a reason to not do read-modify-write. Indeed I'd argue it's actually a reason to prefer eventually consistent systems, as they will nudge you away from adopting this misguided architecture before your system becomes too big to migrate.
Adopt an event streaming/sourcing architecture and all these problems go away, and you are forced to have a sensible dataflow rather than the deadlocky nonsense that strongly-consistent systems nudge you towards.
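For illustration, here's a minimal sketch of that idea, using Python's built-in sqlite3 and a made-up append-only events table: writers record the intent (a delta) rather than reading the current state, modifying it, and writing it back.

    # Minimal sketch of "record the intent, don't read-modify-write".
    # The table name and event shape are made up for illustration.
    import json
    import sqlite3

    db = sqlite3.connect("events.db")
    db.execute("CREATE TABLE IF NOT EXISTS events"
               " (seq INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)")

    def record(event: dict) -> None:
        # Append-only write: no read of current state, so no lost-update race.
        db.execute("INSERT INTO events (body) VALUES (?)", (json.dumps(event),))
        db.commit()

    def current_balance(account: str) -> int:
        # Readers fold the event log into state; any staleness is explicit.
        total = 0
        for (body,) in db.execute("SELECT body FROM events ORDER BY seq"):
            event = json.loads(body)
            if event["account"] == account:
                total += event["delta"]
        return total

    record({"account": "a-1", "delta": 2})
    print(current_balance("a-1"))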
(Op here) No deadlocks needed! There’s nothing about providing strong consistency (or even strong isolation) that requires deadlocks to be a thing. DSQL, for example, doesn’t have them*.
Event sourcing architectures can be great, but they also tend to be fairly complex (a lot of moving parts). The bigger practical problem is that they make it quite hard to offer clients ‘outside the architecture’ meaningful read-time guarantees stronger than a consistent prefix. That makes clients’ lives hard for the reasons I argue in the blog post.
I really like event-based architectures for things like observability, metering, reporting, and so on where clients can be very tolerant to seeing bounded stale data. For control planes, website backends, etc, I think strongly consistent DB architectures tend to be both simpler and offer a better customer experience.
* Ok, there’s one edge case in the cross-shard commit protocol where two committers can deadlock, which needs to be resolved by aborting one of them (the moral equivalent of WAIT-DIE). This never happens with single-shard transactions, and can’t be triggered by any SQL patterns.
> There’s nothing about providing strong consistency (or even strong isolation) that requires deadlocks to be a thing. DSQL, for example, doesn’t have them*.
If you want to have the kind of consistency people expect (transactional) in this kind of environment, they're unavoidable, right? I see you have optimistic concurrency control, which, sure, but that means read-modify-write won't work the way people expect (the initial read may effectively be a phantom read if the transaction gets retried), and fundamentally there's no good option here, only different kinds of bad options.
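To make the retry concern concrete, here is a generic optimistic-concurrency sketch (purely illustrative, using an in-memory stand-in, not DSQL's actual protocol): the read sits inside the retry loop, so a conflicted transaction re-reads and may observe different data than its first attempt.

    # Generic optimistic-concurrency retry loop (illustrative only).
    # On a version conflict the whole read-modify-write is re-run, so values
    # observed by the first attempt are discarded rather than committed.
    from dataclasses import dataclass

    @dataclass
    class Row:
        version: int
        value: int

    store = {"counter": Row(version=0, value=0)}

    def read(key):
        row = store[key]
        return row.version, row.value

    def compare_and_swap(key, expected_version, new_value) -> bool:
        row = store[key]
        if row.version != expected_version:
            return False                      # someone else committed first
        store[key] = Row(expected_version + 1, new_value)
        return True

    def increment(key, delta):
        while True:
            version, value = read(key)        # the read happens inside the loop...
            if compare_and_swap(key, version, value + delta):
                return                        # ...so a retry re-reads fresh state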
> Event sourcing architectures can be great, but they also tend to be fairly complex (a lot of moving parts).
Disagree. I would say event sourcing architectures are a lot simpler than consistent architectures; indeed most consistent systems are built on top of something that looks rather like an event based architecture underneath (e.g. that's presumably how your optimistic concurrency control works).
> The bigger practical problem is that they make it quite hard to offer clients ‘outside the architecture’ meaningful read-time guarantees stronger than a consistent prefix. That makes clients’ lives hard for the reasons I argue in the blog post.
You can give them a consistent snapshot quite easily. What you can't give them is the ability to do in-place modification while maintaining consistency.
It makes clients' lives hard if they want to do the wrong thing. But the solution to that is to not let them do that thing! Yes, read-modify-write won't work well in an event-based architecture, but read-modify-write never works well in any architecture. You can paper over the cracks at an ever-increasing cost in performance and complexity and suppress most of the edge cases (but you'll never get rid of them entirely), or you can commit to not doing it at the design stage.
> You can give them a consistent snapshot quite easily.
How would you do that in a standard event sourcing system where data originates from multiple sources?
I've never had the luxury of multiple DBs, but in general, race conditions..
Could you put an almost-empty DB in front that only records recent changes? Deletes become rows, and updates require posting all values of the row. If no record is found, forward the query to the read DB. If modifications are posted, forward them to the write DB.
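A rough sketch of that overlay idea, with made-up store handles (write_db and read_db are hypothetical): recent writes, including tombstones for deletes, are checked first, and anything not found falls through to the possibly stale replica.

    # Rough sketch of a "small DB of recent changes in front" (names are made up).
    TOMBSTONE = object()
    recent = {}                         # overlay: key -> full row, or TOMBSTONE

    def write(key, row, write_db):
        recent[key] = row               # full row recorded, as described above
        write_db.put(key, row)          # forwarded to the write DB

    def delete(key, write_db):
        recent[key] = TOMBSTONE         # "deletes become rows"
        write_db.delete(key)

    def read(key, read_db):
        hit = recent.get(key)
        if hit is TOMBSTONE:
            return None
        if hit is not None:
            return hit
        return read_db.get(key)         # miss: forward to the (possibly stale) replica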
If correctness is merely nice to have, I always use a "Pain" value that influences the sleep duration. It rarely gets very busy instantaneously; activity usually changes gradually.
Yes, you can do stuff like that. You might enjoy the CRAQ paper by Terrace et al, which does something similar to what you are saying (in a very different setting, chain replication rather than DBs).
> read-modify-write is the canonical transactional workload. That applies to explicit transactions (anything that does an UPDATE or SELECT followed by a write in a transaction), but also things that do implicit transactions (like the example above)
Your "implicit transaction" would not be consistent even if there was no replication involved at all. Explicit db transactions exist for a reason - use them.
The point is that, in a disaggregated system, the transaction processor has less flexibility about how to route parts of the same transaction (that section is a point about internal implementation details of transaction systems).
It sometimes can be just an architectural issue...
You can run the critical query against the RW instance; that's the first point.
The other point is that most of the time, especially on the web, where the amount of concurrent access may be critical, the data doesn't need to be time-critical.
With the advent of reactive patterns in apps and on the web, things became overcomplex.
Yes, strong consistency will always be an issue. And mitigation should start in the architecture. More often than not, the problem arises from architectural overcomplication. Every case is different.
I keep wondering how the recent 15h outage has affected these eventually consistent systems.
I really hope to see a paper on the effects of it.
In the read-after-write scenario, why not use something like consistency tokens, and redirect to the primary if the secondary detects it has not caught up?
I spoke about this exact thing at a conference (HPTS'19) a while back. This can work, but it introduces modal behaviors into systems that make reasoning about availability very difficult, and it tends to cause metastable behaviors and long outages.
The feedback loop is replicas slow -> traffic increases to primary -> primary slows -> replicas slow, etc. The only way out of this loop is to shed traffic.
The argument seems to rely on the point that the replicas are only valuable if you can send reads to them, which I don't think is true. Eventually-consistent replicated databases are valuable on their own terms even if you can only send traffic to the leader.
Can you spell this out for a newb like myself? It looks like you’re saying that a read replica that can’t be read from is still useful, but that doesn’t sound right. What is its usefulness if I can’t read from it?
I assume he is referring to the master holding all the primary data, and you treat replicas as backups. So if a master goes down, a replica becomes the master that handles all the data. You never get inconsistencies between write and read data, as it's all on the same node.
The issue is that you give up 1x, 2x ... extra read capability, depending on your RF=x. But you gain strong consistency without the overhead of dealing with eventual consistency or full consistency over multiple nodes (with stuff like Raft).
1.
If you use something like Raft to ensure that you get full consistency, your writes slow down, as it's multiple trips of network communication per write (insert/update/delete). That takes two network round trips (send data, receive acknowledgement; send confirm to commit, receive confirm it committed).
2.
If you use eventual consistency you have fast writes and only one hop (send data, receive acknowledgement). But as the article talked about, if you write something and then immediately try to read that data, it can happen that the server applying the write is slower than the next part of your code that is reading. So you write X = X + 1, it goes to server A, but your connection now reads from server B, which still returns the old value because the change has not yet been committed on server B.
3.
The above-mentioned comment simply assumes there is one master for your data: it's server A. You write AND read from server A all the time. Any changes to the data are sent to server B, but only as a hot standby. The advantage is that writes are fast and very consistent. The disadvantage is that you cannot scale your reads beyond server A.
----
Now imagine that your data is cut into smaller pieces, or you do not write sequentially like 1 2 3 ... but with hash distribution, where a goes to server 2, b goes to server 3, f goes to server 2 ... (a toy sketch of this routing is below).
What is better really depends on your workload in combination with the server setup you're running: single node, sharded, replicated sharded, distributed seq, ...
The DB field is complex ...
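As a toy illustration of the hash-distribution case mentioned above (a plain modulo scheme for brevity, not production consistent hashing), each key deterministically maps to one server, so every key still has a single owner:

    # Toy sketch of hash-distributed routing: each key maps to exactly one server.
    import hashlib

    SERVERS = ["server-1", "server-2", "server-3"]

    def owner(key: str) -> str:
        digest = hashlib.sha256(key.encode()).digest()
        return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]

    for key in ["a", "b", "f"]:
        print(key, "->", owner(key))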
Blogs like this make me go on the same rant for the n-th time:
Consistency for distributed systems is impossible without APIs returning cookies containing vector clocks.
The idea is simple: every database has a logical sequence number (LSN), which the replicas try to catch up to -- but may be a little bit behind. Every time an API talks to a set of databases (or their replicas) to produce a JSON response (or whatever), it ought to return the LSNs of each database that produced the query in a cookie. Something like "db1:591284;db2:10697438".
Client software must then union this with their existing cookie and send the result back with the next API call.
That way if they've just inserted some value into db1 and the read-after-write query ends up going to a read replica that's slightly behind the write master (LSN 591280 instead of 591284) then the replica can either wait until it sees LSN >= 591284, or it can proxy the query back to the write master. A simple "expected latency of waiting vs proxying" heuristic can be used for this decision.
That's (almost entirely) all you need for read-after-write transactional consistency at every layer, even through Redis caches and stateless APIs layers!
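For illustration, a sketch of the client/API-side bookkeeping this describes, using the cookie format from the example above (the helper names are made up):

    # Sketch of the "LSN cookie" bookkeeping; cookie format "db1:591284;db2:10697438".
    def parse(cookie: str) -> dict:
        if not cookie:
            return {}
        return {db: int(lsn) for db, lsn in
                (part.split(":") for part in cookie.split(";"))}

    def merge(old: str, new: str) -> str:
        # Union of the two cookies, keeping the highest LSN seen per database.
        merged = parse(old)
        for db, lsn in parse(new).items():
            merged[db] = max(merged.get(db, 0), lsn)
        return ";".join(f"{db}:{lsn}" for db, lsn in sorted(merged.items()))

    def replica_can_serve(cookie: str, db: str, replica_lsn: int) -> bool:
        # If the replica is behind the cookie's LSN, it should wait or proxy instead.
        return replica_lsn >= parse(cookie).get(db, 0)

    assert merge("db1:591284", "db1:591280;db2:10697438") == "db1:591284;db2:10697438"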
(OP here). I don’t love leaking this kind of thing through the API. I think that, for most client/server shaped systems at least, we can offer guarantees like linearizability to all clients with few hard real-world trade-offs. That does require a very careful approach to designing the database, and especially to read scale-out (as you say) but it’s real and doable.
By pushing things like read-scale-out into the core database, and away from replicas and caches, we get to have stronger client and application guarantees with less architectural complexity. A great combination.
FWIW, I think that’s essentially how Aurora DSQL works, and sort of explained at the end of the article.
For the love of all that’s holy, please stop doing read-after-write. In nearly all cases, it isn’t needed. The only cases I can think of are if you need a DB-generated value (so, DATETIME or UUIDv1) from MySQL, or you did a multi-row INSERT in a concurrent environment.
For MySQL, you can get the first auto-incrementing integer created from your INSERT from the cursor. If you only inserted one row, congratulations, there’s your PK. If you inserted multiple rows, you could also get the number of rows inserted and add that to get the range, but there’s no guarantee that it wasn’t interleaved with other statements. Anything else you wrote, you should already have, because you wrote it.
For MariaDB, SQLite, and Postgres, you can just use the RETURNING clause and get back the entire row with your INSERT, or specific columns.
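For illustration, hedged sketches of both approaches through typical Python drivers (pymysql and psycopg2 assumed; the users table and its columns are made up):

    # MySQL: the auto-increment PK of the row you just inserted is on the cursor.
    import pymysql
    conn = pymysql.connect(host="localhost", user="app", password="secret", database="appdb")
    with conn.cursor() as cur:
        cur.execute("INSERT INTO users (name) VALUES (%s)", ("alice",))
        new_id = cur.lastrowid          # no follow-up SELECT needed
    conn.commit()

    # Postgres: RETURNING hands back the row, including DB-generated defaults.
    # (MariaDB and SQLite support RETURNING similarly.)
    import psycopg2
    pg = psycopg2.connect("dbname=appdb user=app")
    with pg.cursor() as cur:
        cur.execute("INSERT INTO users (name) VALUES (%s) RETURNING id, created_at",
                    ("alice",))
        new_id, created_at = cur.fetchone()
    pg.commit()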
> please stop doing read-after-write
But that can be applied only in the context of a single function. What if I save a resource and then mash F5 in the browser to see what was saved? I could hit a read replica that wasn't fast enough, and the consistency promise breaks. I don't know how to solve it.
The comment at [1] hints at a solution: in the response of the write request, return the id of the transaction or its commit position in the TX log (LSN). When routing a subsequent read request to a replica, the system can either wait until the transaction is present on the replica, or redirect to the primary node. I discussed a similar solution a while ago in a talk [2], in the context of serving denormalized data views from a cache.
[1] https://news.ycombinator.com/item?id=46073630 [2] https://speakerdeck.com/gunnarmorling/keep-your-cache-always...
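For a concrete (if hedged) version of that idea, here is a sketch assuming Postgres streaming replication and psycopg2, with autocommit connections and a made-up things table: the write path returns the primary's WAL position, and the read path waits for the replica to replay past it or falls back to the primary.

    # Sketch only: Postgres streaming replication, psycopg2, autocommit connections.
    import time
    import psycopg2

    def write_and_get_lsn(primary):
        with primary.cursor() as cur:
            cur.execute("UPDATE things SET name = %s WHERE id = %s", ("new-name", 42))
            # With autocommit, the UPDATE is already committed here, so the
            # current WAL position is at or past its commit record.
            cur.execute("SELECT pg_current_wal_lsn()")
            (lsn,) = cur.fetchone()
        return lsn

    def read_after_write(replica, primary, lsn, timeout_s=0.2):
        deadline = time.monotonic() + timeout_s
        with replica.cursor() as cur:
            while time.monotonic() < deadline:
                cur.execute(
                    "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s::pg_lsn) >= 0",
                    (lsn,))
                if cur.fetchone()[0]:                      # replica has caught up
                    cur.execute("SELECT name FROM things WHERE id = %s", (42,))
                    return cur.fetchone()
                time.sleep(0.01)
        # Replica hasn't caught up in time: redirect the read to the primary.
        with primary.cursor() as cur:
            cur.execute("SELECT name FROM things WHERE id = %s", (42,))
            return cur.fetchone()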
Yep. Your SQL transactions are only consistent to the extent that they stay in the db.
Mashing F5 is a perfect example of stepping outside the bounds of consistency.
If you want to update a counter, do you read the number on your frontend, add 2, then send it back to the backend? If someone else does the same, that's a lost write, regardless of how "strongly consistent" your db vendor promises to be.
But that's how the article says programmers work. Read, update, write.
If you thought "that's dumb, just send in (+2)", congrats, that's EC thinking!
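A tiny illustration of the two mindsets (the counters table is made up; DB-API style placeholders):

    # Read-modify-write: two round trips and a race window between them.
    def bump_rmw(cur):
        cur.execute("SELECT value FROM counters WHERE id = 1")
        (value,) = cur.fetchone()
        cur.execute("UPDATE counters SET value = %s WHERE id = 1", (value + 2,))
        # Another client doing the same concurrently can silently lose one +2.

    # Send the delta instead: one statement, and the DB serializes the increments.
    def bump_delta(cur):
        cur.execute("UPDATE counters SET value = value + 2 WHERE id = 1")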
Local storage, sticky sessions, consistent hashing cache
I think the point is that read-after-write is exactly the desired property here.
And connection affinity
Assuming that the stickied datastore hasn't experienced an "issue"
So why isn't the section that needs consistency enclosed in a transaction, with all operations between BEGIN TRANSACTION and COMMIT TRANSACTION? That's the standard way to get strong consistency in SQL. It's fully supported in MySQL, at least for InnoDB. You have to talk to the master, not a read slave, when updating, but that's normal.
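For reference, a sketch of the pattern described here, assuming MySQL/InnoDB via pymysql and a made-up accounts table: SELECT ... FOR UPDATE locks the row so the read-modify-write is atomic, and all of it has to go to the primary.

    # Standard explicit transaction against the primary (pymysql, InnoDB assumed).
    import pymysql

    conn = pymysql.connect(host="primary.db", user="app", password="secret", database="appdb")
    try:
        conn.begin()
        with conn.cursor() as cur:
            # Row lock held until COMMIT, so no concurrent lost update.
            cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (42,))
            (balance,) = cur.fetchone()
            cur.execute("UPDATE accounts SET balance = %s WHERE id = %s", (balance - 10, 42))
        conn.commit()
    except Exception:
        conn.rollback()
        raise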
(OP here).
The point of that section, which maybe isn't obvious enough, is to reflect on how eventually-consistent read replicas limit the options of the database system builder (rather than the application builder). If I'm building the transaction layer of a database, I want to have a bunch of options for where to send my reads, so I don't have to send the whole read part of every RMW workload to the single leader.
I don't understand this article, and it's like the author doesn't really know what they're talking about. They don't want eventual consistency, they want read-your-writes, a consistency level that's stronger than EC yet still not strong.
https://jepsen.io/consistency/models/read-your-writes
Read-your-writes is indeed useful because it makes code easier to write: every process can behave as if it were the only one in the world, devs can write synchronous code, that's great! But you don't need strong consistency.
I hope developers learn a little bit more about the domain before going to strong consistency.
I am not an expert, but from the examples in the article I think the author is looking for a bit more than read-your-writes.
E.g. they mention reading a list of attachments and wanting to ensure they get all currently created attachments, which includes the ones created by other processes.
So they want to have "read-all-writes" or something like that.
Read-your-writes is a client guarantee that requires stickiness (i.e. a definition of “your”) to be meaningful. It’s not a level of consistency I love, because it raises all kinds of edge-case questions. For example, if I have to reconnect, am I still the same “your”? This isn’t even some rare edge case! If I’m automating around a CLI, for example, how is the server meant to know that the next CLI invocation from the same script (a different process) is the same “your”? Sure, I can fix that with some kind of token, but then I’ve made the API more complicated.
Linearizability, as a global guarantee, is much nicer because it avoids all those edge cases.