Dec 9, 2022

The Tragedy of Database Schema

10 mins read

Schema was inadvertently thrown into the battle of SQL vs. NoSQL when it is really just an innocent bystander. Even the person who coined the term NoSQL — Carlo Strozzi when he created a NoSQL relational database in 1988 — commented that the current NoSQL movement should have been called the "NoREL" movement because that movement seems more focused on doing away with relational databases than with SQL itself.

The same is true of Schemas. All data eventually coalesces into some kind of Schema — another term for database structure — even a text document, even if that schema doesn't exist at the start.

Blanketly doing away with all schema tools and concepts is like getting rid of all tools invented before the Modern Age. Goodbye, geometry, because that was first used back in Mesopotamia.

Schema confusion

Like many words in the rapidly developing tech scene, schema has changed in meaning and adopted nuances.

Non-technically speaking, a schema is just the design of a data structure with no reference to language. It could be as simple as a diagram describing the database.

But in the world of RDMSes — relational database management systems — schemas refer specifically to the description of a database's structure using formal language constructs. In this sense, schemas are tightly coupled with the database itself.

Built into the schema language are aspects that enforce that schema and maintain the database's integrity — constraints such as foreign keys, primary keys, and various rules to ensure that the database does not ever deviate from its blueprint.

Schema could also be called a blueprint.

Unfortunately, this starts to get murky when we move into different database providers. In MySQL, a SCHEMA is the same as a DATABASE but this isn't the case for SQL SERVER. Oracle also has its own concept of schema which applies to only a portion of the database.

Clear as mud?

This is probably part of the problem. People reject things they don't understand and corporate competition means that corporations will usurp a term, create their own implementation and then say that their way is the right way.

So let's agree on a simple definition of schema forthis article and skip the technical implementations of it that are as varied as directions of the wind: Schema is a blueprint and it describes the structure of data. This typically means organizing complex data into a set of fields, each with a name and a type (e.g. string, integer, etc.).

The NoSQL movement — burn the schema witches!

The concept of NoSQL as a philosophical movement came about when it was used as the name of a 2009 conference covering open-source distributed databases. This movement had nothing to do with that original 1988 NoSQL database created by Strozzi.

Amazon probably helped give the movement much of its clout, first with its SimpleDB service, and then with DynamoDB. So did Google and Facebook, with their own NoSQL databases — BigTable and Cassandra respectively.

As is typical of all ad-hoc internet movements, NoSQL's principles are varied, driven largely by opinion, "vetted" by people and companies with axes to grind, and "distributed" — many different opinions spread across different locations.

This last attribute is somewhat humorous, seeing as one of the key concepts of NoSQL is supposed to be its easily distributed nature (which, by the way, is entirely possible with SQL databases, too).

The point is that there is no "official" NoSQL standards group and there probably never will be — at least not one that everyone in the field will agree with. So people form their own opinions about what this movement is all about. Some of the concepts that are most often mentioned for this movement are:

  • Do away with Structured Query Language (SQL) to access data.
  • No schema, at least initially. The schema is established by the data itself. (But certainly no schema in the sense described earlier, where it's tightly coupled to the database itself.)
  • NoSQL DBs are supposed to be faster than relational databases (this statement is too oversimplified).
  • It's supposed to be easier for coders to work with NoSQL databases because they don't need to "worry about schema or data types." (This is delusory. The next section on "Data Swamps" covers it more in depth.)
  • NoSQL DBs are supposed to scale more easily (this is another oversimplification).
  • They are supposed to be more agile.
  • They are "eventually consistent" and favor availability over consistency.

Iggy Fernandez, author of Beginning Oracle Database 11g Administration, does a good job of debunking most of the above in this excellent blog post.

Our concern here is schema — the innocent bystander caught in the driveby.

So let's talk about structure. Or, more importantly, the lack of it.

Data swamps — where schemaless data goes very wrong

By 2014, five years after that NoSQL conference, the world had a new problem, and needed a new term for it: Data swamps. It's exactly what it sounds like. A total mess of data. And it happens with structureless data. It also happens when programming is completely decoupled from data. This paradigm simply doesn't work.

The "breakthrough" was that, for data to be useful, it can't just be dumped in a repository but should be handled with proper "data management and governance." Wasn't that what the NoSQL movement had been fighting against? Well, it doesn't work. Data needs governance.

Put another way, coders need to understand something about the data's structure. Eventually.

Impedance Mismatching

Let's introduce another term: Impedance mismatch. And I'll tie this in with Data Swamps and NoSQL in a second.

Microsoft expert Raymond Chen gives the best explanation I've ever read about impedance mismatch in this short blog post. The term comes from electronics and has been appropriated into programming to mean, basically, that two systems don't jibe. In database programming, it refers to the data-mismatching between an underlying datastore and a programming language. For example, relational databases don't support objects. Their data types also don't exactly match those of a programming language. There are cases where the NULL value in databases really has no equivalent in a programming language (for example, a string type in the Go programming language can be empty, but it cannot be null or nil.)

NoSQL supporters like to cite impedance mismatching as a huge relational database problem. Some NoSQL databases let you dump any type of object into the datastore and then retrieve it.

But relational databases let you do this, too. Binary data types let you dump deserialized blobs into a field and then serialize those back into an object. Some SQL databases also support a JSON type which is similar to a TEXT (long string) field but with more JSON-specific tooling. In some cases this goes as far as even being able to search and index individual fields within the JSON.

Data swamps and impedance mismatching

Let's get back to the data swamp. "Oh, NoSQL databases are so awesome because you can just code code code and not give a hoot about the underlying data store!"

Yes, and then you have a data swamp.

NoSQL databases didn't do away with impedance mismatching. They basically told programmers that the underlying data store was a "fairyland with absolutely no schema and you can do what you want as a coder and the fairyland will take care of everything!"

Uhm, no. Fairyland ended up becoming a swamp and then we did have impedance mismatching because no one knew how to retrieve anything from that morass.

So here's the point: You eventually need a schema and coders who are entirely divorced from the structure and nature of the underlying data store will code in a state that is totally disassociated from the underlying reality. Eventually, someone needs to bring the coders and the data people together and work out how they can communicate.

This works entirely with the agile paradigm — teamwork, iterative progress, regular meetings, changing the schema as needed, input from all stakeholders.

Relational Database models have a strong record of changing to suit the times, so why wouldn't they keep doing that when Google and Amazon cited their need for "availability as a priority"?

It's possible that the conflict between "schemaless NoSQL" and "SQL with schema" probably has nothing to do with the NoSQL movement, and might have more to do with the fact that Amazon had a new product and it differed from Microsoft's (SQL Server), so Amazon engaged in a campaign to tell the world that its product was much better.

Conjecture, yes. But likely also very true.

NoSQL brought in a slew of other problems

By burning SQL relational databases at the stake, we opened the door to a slew of problems that these robust and highly tested products had solved in the many decades they had been used. We're not even talking about SQL or database tech. We're just talking about product maturity.

When installing a MySQL, MS SQL, or Oracle Database, nobody worries about whether they'll be prompted to enter basic credentials for these database systems. Data Design 101 teaches us that "Data is valuable, so password-protect it!"

Somehow, MongoDB didn't get this memo. The database was at the heart of a rash of hacks (one estimate puts it at around 100,000) in its early years because its default installation didn't insist on basic credentials. Installers for the database were easy to find, and inexperienced users installed it on their systems with no further configuration.

MongoDB has had a series of other insecure elements, all typical of an immature product. The question here is: Why introduce a new type of product when the old one works just fine and can be upgraded?

CouchDB, another popular NoSQL database, has likewise been no stranger to hacks. Here's the irony: The fact that it isn't SQL-based was the source of this particular flaw.

Another irony is that coders who were familiar with SQL injections went the extra mile to sanitize user input. The opposite has happened in NoSQL coding. Coders think these databases are impervious to SQL injections, don't sanitize user input (or don't do it well enough), and so open the door to "NoSQL Injections."

Out of the frying pan and into the fire. Same problems, different product.

The reality is that data is relational

Whether SQL or NoSQL, data is almost always relational when it involves multiple entities. Representing that data relationally makes the most sense. The sooner this gets done in the development process, the more complexities will be avoided later in the development stages.

Perhaps the tightly coupled schema language that was built into RDMSes went too far, making it difficult to change anything once the schema was "set in stone." But schema itself — the data's blueprint — isn't the one at fault.

Programmers are smart people. If a team of developers and experts could drastically change the structure of a $142-billion blockchain (the Ethereum merge) with zero bugs, we're pretty sure it isn't too difficult to figure out how to make relational SQL databases and their schemas more agile.

Some ideas on how to solve SQL/relational database shortcomings

Every software product has its shortcomings, and the same is true of relational SQL databases (just as it is true of NoSQL databases).  

One complaint from developers has been the inefficiency of SQL queries on big data when it isn't backed by an index. The answer to that is the same answer that prevents Data Swamps — the devs should talk to the data people. Devs who create a data swamp in NoSQL data stores are the same type of devs who write random SQL queries without a thorough grasp of the underlying data. The result is the same: Awful data access, regardless of whether it's SQL or NoSQL.

So the most obvious (and simplest) solution to most relational database shortcomings is to employ agile principles — evaluate, meet, plan, design, test. In essence, just get the teams talking to each other and working together. It's really very simple.

Yes, SQL might be slightly less familiar to devs who prefer the programmatic-style syntax used to access some NoSQL databases. Microsoft tried to solve this with LINQ and did a fairly good job except that LINQ is difficult to debug when queries become inefficient because so much is happening under the hood. Here, again, is the same scene: Completely divorcing devs from data access simply doesn't work. It's too much of a "black box" solution.

The solution is to learn SQL. Coders learn new languages and frameworks every day. Why should learning SQL be such a pain?

If they do have some deep moral disagreement with learning SQL, then just hire an SQL guru and tell the guru what data you need and how you need it, and then wrap the guru's queries in a code class so you, as a coder, don't have to think about it.

Hopefully, SQL databases will evolve to a point where working with schema could be as easy as executing the command CREATE OR MODIFY TABLE ( /* fields */ ). If field data was stored internally in the database in a way that the database engine didn’t have to rewrite the table contents to effect this change, performing schema changes on a live database would become drastically simpler.

Lots of ideas. And these are just off the top of our heads. If you're saying that the developers of DynamoDB couldn't come up with solutions better than these and keep schemas and SQL, then we'll tell you that someone had an ax to grind.

In Conclusion — the wheel has been invented

Don't buy that SQL and schemas are dead. Some of their implementations might be dated but that happens to all solutions in this rapid-paced world.

For example, file systems have been around for ages, and while some early file systems are dated and no longer common (FAT12/FAT16 anyone?), the core concept of a hierarchical data storage for blobs of data as “files” is alive and well. Many modern file systems were born out of solving inefficiencies of earlier file systems while still retaining these core concepts.

Building software with zero schema in mind makes zero sense. The schema will eventually come into play. And the question of whether to choose SQL or NoSQL is moot because, in both cases, schema is necessary.

So let the SQL versus NoSQL battle rage on but pull this innocent bystander out of the fray because its place in both worlds is vital and necessary.