- Family Photos
- Dave's Blog
- About Us
The Wentzels website is a collection of my opinions and findings on technology, especially SQL Server. I also occassionally mention politics and economics, I am an anarcho-capitalistic (die-hard libertarian) in the Rothbard tradition.
I also run my LLC from this website and have sections devoted to showcasing specific skills that I have (ipod/ipad development, data architecture, SQL, Oracle, and Drupal). My latest creation is www.tigersbaseball.info which showcases an automated baseball scoring and statistics generation system that uses Drupal. Please bear in mind that I have no artistic capabilities so everything I create tends to look terrible, the content is the key.
This is the next post in my NoSQL series. A fundamental concept endorsed by many NoSQL solutions is "eventual consistency." Each product can implement eventual consistency however it likes. RDBMS people are familiar with ACID properties (Atomicity, Consistency, Isolation, Durability). NoSQL people like to bend the rules a bit and I like to call these BASE properties (Basic Availability, Soft-state, Eventual Consistency).
So, when would you use BASE transactions?
One case is when your network is not reliable. For instance, transactions that must span your local network as well as a cloud provider. Local updates can be queued and batched and sent later. While the remote transactions are in-flight the "syste" is in an inconsistent state, but eventually all updates will be applied and the system will be in balance again.
Inventory systems can withstand eventual consistency, usually, thus are good candidates.
In general, consider BASE transactions in cases where ACID transactions are not immediately required and the data eventually becomes consistent and is never lost.
I am presenting at the December meeting of PSSUG. December The topic is "Metadata Driven Database Deployments." I have a mechanism I have been using for about 13 years where we can deploy a new database or upgrade an existing database in a fast, reliable, repeatable manner. And the best part is, you don't need to know esoteric DDL commands. Certainly you can use things like "scripting wizards" and "SQL compare" tools to do this but I'll show you some benefits that having a custom DDL deployer can handle that other tools cannot.
Feel free to come out and learn about my deployment techniques. Or just come out to heckle me.
|Data Model||Description||Has Schema/No Schema||Relationships are Modeled As...||Data Is...||Examples||Possible Disadvantages|
|Relational||We probably all know this by now.||Has Schema||Predefined||Typed||SQL Server, MySQL, Oracle||Everybody has different opinions on this.|
Probably the oldest electronic data storage mechanism. Data is modeled in a pyramid fashion (parent and child records).With the advent of the relational model most hierarchical data stores died out. However, XML started a bit of a renaissance in hierarchical data stores. Many NoSQL solutions could be categorized as hierarchical.
|Has Schema||Predefined||Typed||IBM IMS, the Windows registry,XML|
|Network||very similar to hierarchical datastores. It was the next evolution of the hierarchical model, allowing multiple parents to have multiple children. Much like a graph. Many healthcare data stores are based on MUMPS which falls into this category. "data" is stored as data and methods (instructions on what to do with the data)||Has Schema||Predefined||Typed||CODASYL, Intersystems Cache and other Object-oriented databases||Never gained much traction because IBM wouldn't embrace it and the relational model came along and displaced both.|
|Graph||could be considered the next generation of network data stores. These are prevalent in the social-networking space where applications like Twitter, LinkedIn, and Facebook want to visualize your "network" for you. Nodes represent users who have relationships (the edges) to each other. Modeling this relationally is challenging at best. A graph system can query these structures easily. Basically, if you can model your data on a whiteboard, then a graph database can model it too.||It depends (but generally no schema)||Edges||Ad hoc||Neo4j||data is usually untyped. Very complex.|
|Document||designed for storing document-oriented data, such as XML, or semi-structured data.||No Schema||None (generally)||Typed||CouchDb, MongoDb||Does not support ad hoc reporting tools, ie Crystal Reports|
|Columnar||stores its data as columns or groups of columns of data, rather than as rows of data. Good for OLAP applications and anywhere where aggregation query is important.||Has Schema||Predefined (similar to relational)||Typed||HBase||generally not good for OLTP applications.|
|Key-Value||stores its data like an EAV model. Entirely schema-less. Data can be stored as byte-streams so you can persist the programming language's objects directly in the key-value store||No Schema||Links via keys||Semi-typed||Redis, Riak|
I mentioned in my last NoSQL post that developers complain that SQL is a lousy declarative language. On some levels I agree with that. But those same developers then damn all relational theory and data stores because of shortcomings in the SQL language. This is wrong. It is usually vendor implementations of SQL that are the problem. There are relational languages other than SQL. This post will cover some of them and how they are better than SQL in many regards.
TutorialD is not really a language, but rather a specification co-authored by CJ Date. It describes the characteristics that a good declarative, relational language should have. CJ Date devised TutorialD as an educational aide in one of his books. The "D", I believe, stands for "data". Date believed that SQL was not a particularly good language for expressing relational concepts. D was supposed to be a better alternative to SQL.
There is currently no real-world implementation of D, it is merely a proscriptive list of items that a good relational query language should expouse. It is merely educational. Rel is an open-source implementation of D, written in Java. I'm not aware of anything that formally supports it.
Perhaps it is time that SQL people understand what the NoSQL people hate about SQL. Then we can focus on providing extensions that cover some of the things listed above that would make all of us more productive without having to abandon the many good things of the relational model.
This is my third post on my evaluation of NoSQL products for a client who wishes to replace their (rather expensive) SQL Servers. Many relational guys ask themselves, "why all the fuss about NoSQL? What is wrong with relational data stores?" I felt the same way 15 years ago when people wanted to replace RDBMS with XML stores. But the movement is underway because people feel the relational model is flawed. In my mind, NoSQL is not a direct replacement for relational data stores, NoSQL has its place, just like relational does. But, we relational guys are making it tough on ourselves. We don't want to understand WHY people like relational alternatives.
This post is a series of things that annoy me with relational people. These annoyances bother lots of people...the same people who control the decision making powers of the data persistence engines that are chosen for new projects. If we relational guys don't change our ways we are going to be out of jobs. My complaints, in no particular order:
Relational database Schemas are Inflexible
I've been to numerous NoSQL presentations and this is the key point that these vendors drive home.
How many times have you asked your DBA to add a new column to a table and they refused? With <insert NoSQL product here> we are schema-less. We don't care if you add new data elements to your schema or not, because we have no schema. Add whatever data elements you want, whenever you want. We'll support it.
The fact is, relational databases can do this too. It's not the RDBMS that is inflexible, it is the DBA. More on that in the next section.
There are no valid reasons why a relational database schema has to be inflexible. The reasons are man-made contrivances. I have many blog posts where I show that you can add new columns and value them while a system is up with ZERO impact to the users. Some examples are here,here,here, and most importantly here.
There are some schema changes that are disallowed by the major RDBMS vendors. For instance, adding a non-nullable column to a billion row table generally means some downtime. But that is not a flaw of the relational model, that is a flaw in the vendor's implementation of the relational model. There is nothing in relational theory that limits schema flexibility. The problem is that the data modeler and data architect are not sufficiently well-versed in how to decouple the logical from the physical model such that evolutionary relational databases can be a reality. These people want to tie table structure to its on-disk representation. Data architects and data modelers need to update their skills.
DBAs, Data Architects, and Database Developers SUCK
I don't feel this way, but many of our fellow software developers do. Sorry, but it's true.
DBAs tend to ask too many questions and we make the developers think differently about their data. They don't like that. We data professionals like to think we are doing our employers a valuable service by asking the difficult questions about new data requirements, but in reality we are not. Developers are fighting us at every turn and we don't even realize it.
We've all seen this happen: A developer/analyst requests a new column in a table. What does the DBA do?
Get the picture? DBAs are notorious for not being helpful. Sorry, but that's my opinion, and the perception of many others too.
Don't believe me. Then why do we all see so many of the following traits in our databases:
To my DBA friends, PLEASE change your attitudes.
I Can't "Discover" My Schema in a Relational Database
This is another complaint I hear too often. Developers want to experiment and discover their schemas without a lot of up-front formal modeling. And I agree. And I do that every day too. Nobody says you need to have DBA-approval to build some tables on your scrum server and do a proof-of-concept. When you are ready then you should get proper change management approval. This argument is another function of rigid policies and processes.
Many companies practice rigid "waterfall" development...they MUST do data modeling before any coding is done. And that is often WRONG. In these companies the model is ALWAYS determined to be lacking late in the development phase. The schema can't support all of the data requirements perfectly. But again, the developers and analysts fear the DBAs so they "work around" the schema deficiences.
Did you ever notice there aren't a lot of jobs for NoSQL DBAs? This is because the NoSQL vendors don't want DBAs. DBAs limit the creativity process. Flexibility to assist in prototyping and experimentation is not a function solely of NoSQL.
SQL is hard to use and is not expressive enough
This is true. But it is getting better. Doing "paging" in SQL 10 years ago required lots of lines of code and performed poorly. Now we have constructs like TOP and LIMIT that are easier to use. And there will be even more improvements to the SQL language that will make this even easier, for instance, we'll likely soon have the ability to use a TOP without needing a CTE first. That's just a guess.
The NATURAL JOIN syntax would be a welcome addition too.
Here are some other things SQL really needs:
It is getting better. And I defy you to tell me that Hive or Pig is more expressive than SQL.
And, of course, we always have ORMs to help us avoid hand-crafting SQL for mundane, repetitive tasks.
There are good reasons to use a NoSQL solution. There are also bad reasons. This blog post was an attempt to outline a few egregious reasons why people choose NoSQL solutions. These are all issues of perception and education.
If you deal with a lot of SQL Servers you know that Central Management Server (CMS) is your friend. If you use CMS correctly you can run queries against all of your instances with very little additional work. But there is *some* additional work. For instance, what if you need to run queries against databases with different names or naming conventions on each instance? Or only run the query on a given database if it contains a given application table? Or only run a query for user databases? Here are the query patterns I use to overcome some of these issues. I hope you find them useful.
Show me all register servers in my CMS
You should connect directly to your CMS and switch your db context to msdb to run this query.
declare @name varchar(256), @server_name varchar(256) declare dave cursor for select name, server_name from dbo.sysmanagement_shared_registered_servers_internal --from dbo.sysmanagement_shared_server_groups_internal where server_group_id = 6 open dave fetch next from dave into @name, @server_name while (@@FETCH_STATUS = 0) begin print 'echo Running: ' + @name print 'PAUSE' print 'powershell .\something -ServerName "' + @server_name + '"' fetch next from dave into @name, @server_name end close dave deallocate dave
The remaining queries can be run by connecting to your CMS server in "Registered Servers" in Management Studio, and then selecting "Connect to All"
Run a command on a subset of databases based on a property and name
declare @something varchar(4000) declare dave cursor for select name from sys.databases WHERE name like '%tnd%' and is_broker_enabled = 1 open dave fetch next from dave into @something while @@FETCH_STATUS = 0 begin select @something = 'EXEC ' + @something + '.dbo.Blah' exec (@something) fetch next from dave into @something end close dave deallocate dave
Run a query against all databases named something
declare @something varchar(4000) declare dave cursor for select 'select COUNT(*) from ' + name + '.dbo.Something WHERE Blah IS NULL AND StpDTime IS NULL ' from sys.databases WHERE name like '%blah%' open dave fetch next from dave into @something while @@FETCH_STATUS = 0 begin exec (@something) fetch next from dave into @something end close dave deallocate dave
This is a follow-on to Performant ETL and SSIS Patterns. I really need to do a longer, explanatory post on this. Two of the largest performance problems I see with SSIS packages is their lack of parallelism and the fact that they are written to run large batches during a defined time window. The former is totally unnecessary, the latter is unecessary if you structure your processing smartly.
It's a helluva job market out there right now if you have ETL, SSIS, DataStage, or equivalent experience. I guess you can make some generalized deductions about this:
Unfortunately, too many job postings ask candidates to have specific ETL tooling experience such as SSIS or DataStage. This is unfortunate. Too many candidates have great tooling experience but have very little grounding in ETL best practices, regardless of chosen tool. I've been called in a lot lately to help fix various ETL processes. Each one is using a different ETL tool and each one is exhibiting the same terrible ETL anti-patterns. When I fix those anti-patterns everything just magically runs better. I have yet to touch actual ETL code.
To quickly summarize the most egregious issue...developers are doing too much work in the ETL tool and not enough work in their RDBMS. The RDBMS will almost always do things faster than the ETL tool can. There are few exceptions to this rule (string manipulation and regexp is better in most ETL tools than in SQL for instance).
I've written tons of blog posts (here's a link to an entire series of ETL Best Practices) about how to do performant ETL with good patterns. However, I find myself constantly searching my blog to find a succinct list of things to check whenever I'm brought into another ETL engagement. Here's the biggies:
Are some new SQL Server features a response to the NoSQL movement?
The last few releases of SQL Server, in my mind, were focused heavily on business intelligence. And I believe the best SQL Server jobs in the last few years are focused on BI and data warehousing. OLTP does not seem to be where the new features and money are these days. Here is a quick list of relatively new SQL Server features targeted to "bigger data":
But there is a shift. I see many more new features centered around "BigData" and as a response to the NoSQL movement (if it is a movement, is it of type religious...or bowel?). Here are some examples: