CAP Theorem Part 6: Conclusion

This is the sixth and final part of a multi-part blog post about the CAP Theorem. You can read the first part here. You can download all six parts in one document here.

Conclusion

Curt Monash begins his post on “Transparent relational OLTP scale-out” (http://www.dbms2.com/2011/10/23/transparent-relational-oltp-scale-out/) by saying:

There’s a perception that, if you want (relatively) worry-free database scale-out, you need a non-relational/NoSQL strategy. That perception is false.

As more and more people try NoSQL solutions, we are beginning to see that they are not the universal “cure-all” that they are often made out to be. As we have shown in this series of blog posts, the CAP Theorem does not say that in “picking two” you have to entirely forsake the third.

As Daniel Abadi has pointed out in his PACELC blog post (http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html), a normally running system must make some tradeoffs between latency and consistency. What becomes interesting is how the system reacts to a network partition: does it favor availability or consistency?

For the vast majority of applications, a traditional relational database is just fine, and can provide worry-free scale-out. But if you really are building a database with a very large number of nodes, and it is going to run on an unreliable network or on a widely distributed one, the implications of the CAP Theorem are more profound.

The Domain Name System (DNS) is an example of such a system; if you think about it, it was a NoSQL database before the term was even coined. Similarly, if you are building a huge distributed search infrastructure (such as Google's or Yahoo's), the implications of the CAP Theorem are certainly significant. However, if you aren’t operating in that rarefied atmosphere and your infrastructure is running on tens or hundreds of servers in a data center or the cloud, the implications of the CAP Theorem are just not that applicable to you!

To understand why, consider this. In a system with a relatively small number of nodes, operating on a reliable network and on reliable infrastructure, one can very readily reduce Tp and thereby provide a very low floor for Ta and Tc. However, if you are operating a very large number of nodes, or a highly distributed system, or one that has a high likelihood of partitioning for some other reason, you are likely to have a higher value for Tp, and you may in fact have to contend with that reality. When that is the case, as it is with DNS or the distributed search infrastructures at Google or Yahoo, you are forced into a situation where Ta + Tc may in fact be pretty large.

And in situations like that, the traditional database (which is fundamentally a CA system) does face some serious challenges.
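To make the argument above concrete, here is a minimal Python sketch. It assumes the floor takes the form Ta + Tc >= Tp, which is only this sketch's reading of the discussion above and not a formula stated in this part. The function names and the sample values (in seconds) are hypothetical and chosen purely for illustration.

```python
def respects_floor(ta, tc, tp):
    """True if the (ta, tc) pair satisfies the ASSUMED floor ta + tc >= tp."""
    return ta + tc >= tp


def min_tc(ta, tp):
    """Smallest tc permitted for a given ta under the ASSUMED floor.

    If ta alone already covers tp, no inconsistency window is needed.
    """
    return max(0.0, tp - ta)


# Hypothetical values: a small cluster on a reliable network (low Tp)
# versus a very large, widely distributed system (high Tp).
small_cluster_tp = 0.05
wide_area_tp = 30.0

# With the same availability budget of 2 seconds, the small cluster can
# stay fully consistent, while the wide-area system cannot.
small_cluster_tc = min_tc(2.0, small_cluster_tp)
wide_area_tc = min_tc(2.0, wide_area_tp)
```

Under these assumptions a low Tp leaves room to keep both Ta and Tc small, while a high Tp forces at least one of them to grow, which is the tradeoff described above.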

The relationship between Ta, Tc and Tp helps you determine how your system will behave, and lets you make meaningful tradeoffs between availability and consistency based on the infrastructure and environment within which the distributed database is functioning.
