
Welcome

You've found davewentzel.com.  I am an independent consultant and principal with David Wentzel LLC, focusing on anything related to data.  Here you'll find a collection of my thoughts, mostly on data architecture and SQL Server.  I host a blog as well as a Features section with searchable content.  I also showcase a Portfolio of open source projects I am working on, especially Drupal integration.  I host some family media to showcase some of those skills.  Lastly, I have done quite a few presentations on SQL Server topics that you may be interested in.  

Data is one of your company's most valuable assets.  We can help you make the most of it.  

Creating a Vertica Database

This is the next post in my Vertica series.  In a previous post we covered Installing Vertica 7.  I finished that post by mentioning that once you have a node installed you can connect and do basic tasks using /opt/vertica/bin/admintools.  If you aren't familiar with Linux then even that simple command can throw you for a loop.  Here is the exact process to start your exploration:

  1. Log out of Ubuntu if needed.  
  2. Log in as dbadmin with your dbadmin password
  3. cd /opt/vertica/bin
  4. ./admintools

You should be presented with an "old-school console GUI".   This will work over telnet or SSH and although it looks kinda like a Windows GUI you won't be able to use your mouse for anything.  You navigate using Enter and the up and down arrows.  You can poke around the menu options but you'll find you can't do anything with Vertica yet because we have not created any databases. 

VMart

 HP's sample database is called VMart. It is a data warehouse for a supermarket chain. Every installation of VMart is a bit different because the data is generated on-the-fly via a data generator. This makes query comparisons between Vertica installations a bit of a hassle vs something like AdventureWorks. HP also supplies a series of sample queries that demonstrate various Vertica features and can also be used for performance tuning with Database Designer. In this blog post we'll cover creating our first database by installing VMart.  

Installing the VMart shell db

  1. Start AdminTools
  2. Enter "Configuration Menu"
  3. "Create Database"
  4. Call it "VMart"
  5. You will need to enter a password.  This is equivalent to "sa" in SQL Server.  
  6. You will need to select the hosts for this database.  As mentioned in a previous post, Vertica nodes are named using their IP addresses.  Since you only have one host currently you should see that host listed for this database.  Click OK.  
  7. Enter the catalog and data file locations.  This will default to the home folder for your dbadmin user.  
  8. You'll get a warning that a database with 1 or 2 hosts cannot be k-safe.  That's fine, we'll fix that in a future post.  Just click OK.  

And that's it.  You now have a shell database akin to the model database in SQL Server.  At this point it is interesting to look first at the catalog and data directories.  You'll end up with a structure similar to this:  

| Folder / File | Purpose |
| --- | --- |
| /home/dbadmin | Home folder for the dbadmin user. |
| VMart/ | Folder containing your new VMart database. |
| VMart/dblog | Contains the log file for the spread daemon, an open source messaging system not unlike JMS.  Spread is used so individual cluster nodes can communicate with each other. |
| VMart/port.dat | The port Vertica is listening on.  nano port.dat will show you 5433, which is the default Vertica port. |
| v_vmart_node0001_catalog/ | The catalog folder.  You'll find various error logs and configuration files here. |
| v_vmart_node0001_catalog/vertica.log | This, along with dblog, is a log file for your database. |
| v_vmart_node0001_data/ | Contains your data.  This folder will have 0 files since we haven't created any data or schema yet.  We'll come back to this later. |

This gives you an idea of what the file structures and naming conventions look like with a Vertica database.  

Running the VMart scripts

With the VMart database created we need to create the schema and load the data.  

  1. From the admintools Main Menu choose "2 Connect to Database".  This will launch vsql which is similar to sqlcmd in SQL Server.  We'll cover this in detail in a future post.  
  2. The VMart schema files are located in /opt/vertica/examples/VMart_Schema.  The README file shows you various parameters for generating smaller or larger sample data sets.  You can view any of the files in that folder using nano.  We'll take a look at some of these files in more detail in a later post.  For now we just want to install a base VMart db.  
  3. From vsql:  \i /opt/vertica/examples/VMart_Schema/vmart_define_schema.sql
  4. \q to quit vsql, which will return you to admintools
  5. Exit admintools.
  6. cd /opt/vertica/examples/VMart_Schema
  7. ./vmart_gen :  this will create a small default db.  What it really does is create a bunch of .tbl files in your folder.  These files are pipe delimited text files that can be read using Vertica's COPY command, which is equivalent to SQL Server's bcp command.  
  8. We have the .tbl files that we now need to load into VMart.  Run: vsql -h <IP address> -d VMart -U dbadmin  (this is an alternative method to launching vsql without launching admintools first).  
  9. \i vmart_load_data.sql :  this will load the .tbl files.  This may take some time.  
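
Once the load finishes you can sanity-check it from the same vsql session.  A minimal check (the schema and table names below are the ones the stock VMart scripts create, so adjust if your copy differs):

    -- quick row counts against two of the VMart fact tables
    SELECT COUNT(*) FROM store.store_sales_fact;
    SELECT COUNT(*) FROM online_sales.online_sales_fact;

If both return non-zero counts, the load worked.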

In the next post we'll take a look at the VMart schema, run some sample queries, and see how the data files are laid out on disk.  


You have just read "Creating a Vertica Database" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Vertica Physical Architecture

This post is the next in my Vertica series and covers how Vertica is physically architected.  I covered pieces of this when we installed our first Vertica instance.  Now we'll go into a bit more detail.  This is all leading up to installing a second Vertica node and sharding our data in a manner that helps us with both recoverability and performance.  

Instances, Nodes, Clusters, Databases

Only one instance of Vertica can run on a given host; this is also called a node.  If a node fails the database can still recover, provided the cluster has at least 3 nodes.  Kinda like RAID 5 for your nodes.  HP recommends 4 nodes minimum, and more is always better because additional nodes work just like SQL Server's readable secondaries...they are active for querying and updates.  They are not just passive cluster nodes.  And they are not necessarily COMPLETE copies of any other node in the cluster, unlike SQL Server.  This means, similarly to the concept of RAID 5 for storage, that you can lose a certain number of nodes without bringing your database down.  Additional nodes with partial (or complete) copies of data also give you a performance boost when querying.  

There is also something called a "control node" in Vertica.  You always have at least one control node, but you can have many if your cluster is huge or geographically dispersed.  If the control node goes down, its role migrates to another node automatically, without any DBA intervention.  The control node is the traffic cop.  As queries are submitted to Vertica, the control node sends the requests to other nodes to collect data (similar to Map) and return it to the control node, which then does any final calculations before returning the data to the requestor (similar to Reduce in MapReduce).  All nodes run a process called "spread" which assigns a node to be the control node if you haven't configured it yourself. 

K-safety

"K-safety" is how Vertica measures fault tolerance and is the key to understanding the node-to-data relationship. Let K equal the number of replicas of the data in a given cluster. This is not the same as the number of nodes. The more K-safety you have the more nodes can fail without impacting performance or having the db go offline. When a node comes back online it will automatically recover just like a failed disk in a RAID 5 array.  The failed node queries neighboring nodes for missing data.  The recovery status can be queried using the Management Console which we'll install and look at in a future post.  

The "K" can be 0, 1, or 2 and depends on your physical design. You don't want to screw up your design and end up with something that isn't as k-safe as you planned, not to mention not as performant.  The Database Designer (again, we'll cover this in a future post) helps you by noting your design's K-safety level. It offers suggestions on how to improve your k-safety and can even help you convert your existing design to be either more or less k-safe depending on your needs.  

  • K=0 means that if ANY node fails the cluster fails and your database will be marked offline for querying until that node is back online. If there was a hardware problem all you need to do is fix the hardware and upon reboot Vertica will begin the recovery process automatically.  If you need to entirely rebuild the node from bare metal then you MUST ensure the new node has the SAME IP address.  Vertica uses IP addressing, not FQDNs to refer to other nodes.  
  • K=1 means that any single node can be lost and the database will remain up (perhaps with performance degradation). This means that every node's data is "replicated" to at least one other node.
  • K=2 means that your db can remain up if ANY 2 nodes fail. Here's where it gets tricky. If you WANT K=2 then your design must ensure that if any 2 nodes fail, a third node still has your data.

The formula for K-safety is simple: to get K-safety you must have AT LEAST 2K+1 nodes.

So, to have a K-safety of 2 (in other words, you want to survive 2 nodes going down simultaneously) you must have AT LEAST 5 nodes. Vertica only officially supports K=2. For argument's sake, if you really wanted K=3 you would need 7 total nodes, minimum.

So, how does data get to the other nodes?

Unlike distributing data in SQL Server, which ALWAYS requires DBA intervention (be it replication, log shipping, or readable secondaries), everything is automatic in Vertica.  You simply add a clause to the CREATE PROJECTION statement telling Vertica how to handle the sharding.  You have 2 choices: 

  1. When the projection is small and unsegmented it is replicated to ALL nodes.  No need to add the overhead of segmentation in this case.  
  2. When the projection is large and segmented, a "buddy projection" is copied to at least one other node.  This is done using offsets.  For example, a fact table might be segmented by a hash function across every node in the cluster; OFFSET 1 indicates that the buddy projection for a given node will be available on the next neighboring node in the cluster, as sketched below.  
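
Roughly what those two choices look like in CREATE PROJECTION syntax.  This is a sketch from memory with made-up table and column names, so treat it as illustrative rather than something to copy and paste:

    -- 1. a small lookup projection, replicated (unsegmented) to every node
    CREATE PROJECTION dim_region_p AS
        SELECT region_id, region_name
        FROM   dim_region
        UNSEGMENTED ALL NODES;

    -- 2. a large fact projection, hash-segmented across the cluster, plus its
    --    buddy projection created one offset over
    CREATE PROJECTION fact_sales_b0 AS
        SELECT sale_id, customer_id, amount
        FROM   fact_sales
        ORDER BY customer_id
        SEGMENTED BY HASH(sale_id) ALL NODES OFFSET 0;

    CREATE PROJECTION fact_sales_b1 AS
        SELECT sale_id, customer_id, amount
        FROM   fact_sales
        ORDER BY customer_id
        SEGMENTED BY HASH(sale_id) ALL NODES OFFSET 1;   -- the buddy, shifted one node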

All of this k-safety stuff probably sounds complicated, and it is easy to screw it up if you go off-the-reservation and do something that the Database Designer did not recommend for you.  Even if YOU think you should be at k-safety = 2, that doesn't mean Vertica agrees with you.  SELECT current_fault_tolerance FROM system; will show you your current k-safety.  If it isn't what you expected, just rerun the Database Designer.  

Fault Groups

None of this K-safety stuff matters if all of your nodes are in the same rack and that rack loses power. This is a "correlated failure" and is a function of externalities. If this happens you certainly can't blame Vertica for your stupidity.  In Vertica you can define your own fault groups to give Vertica hints to understand these correlated failure points better and influence its activities accordingly.  For instance, defining a fault group will also let you smartly define control nodes.  You wouldn't want all of your nodes to be in one rack and your control node to be connected to a different switch in a rack on the other side of your data center.  

When half of your cluster nodes fail, your cluster fails, regardless of your K-safety.  But it's common to have buddy projections on many nodes in your cluster.  If possible it is best if those buddy projections/nodes have as little shared infrastructure as possible.  Say for instance, SAN storage.  If you have multiple SANs (or even VM hosts) you want those buddy projections to be as separate as possible for DR and performance reasons.  Fault groups are the ticket for all of this.  

What is the catalog?

When you install Vertica you must provide catalog and data directories and the path must be identical on all nodes. The catalog directory stores all of the metadata about your database; basically everything except your actual data. The catalog is akin to the PRIMARY filegroup in SQL Server although it contains many files. It holds the system tables, data about your nodes, snapshots, file locations, etc.

The catalog files are replicated to all nodes in a cluster whereas the data files are unique to each node.  Vertica uses lots of files to store its data.  Once a file is written to, it is never altered, which makes recovery quite easy.  Vertica simply needs to copy missing files to the recovering node from any other node with the necessary replicated data or buddy projections.  Since files are never altered, Vertica has no concept of FILLFACTOR or PCTFREE.  Since the files are columnstores it is guaranteed that neighboring data will have the same datatype, so Vertica's ability to encode and compress data is absolutely amazing.  As mentioned in the last post, this becomes a challenge when deleting data.  Since files are never altered, Vertica uses "delete vectors", which are markers as to which rows in which files should be discarded during query execution.  At certain intervals a background process will rewrite files to purge data that was deleted.  

Summary

This post was a quick overview of Vertica's physical architecture.  Hopefully you learned about some really cool features like buddy projections that guarantee that your data is safe even if you lose a few nodes.  Let's say you are a SQL Server Guy, like most of my readers.  You may be wondering why you would want to learn about Vertica if you have no plans to implement it.  I find it fascinating to learn about competing technologies to help me understand limitations in the solutions I work with that I didn't previously understand.  This makes us all a bit more well-rounded.  In the next post we'll cover installing Management Console which is the GUI for Vertica.  


You have just read "Vertica Physical Architecture" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Vertica Objects

It's been a few weeks since I covered any Vertica topics, the last being Installing Vertica 7 .  Before we create our first Vertica database it's important to understand the core Vertica objects that we will eventually interact with.  Remember that Vertica data is stored in a column-oriented manner.  It is optimized for read-only data warehouse-style applications.  

Logical Modeling

You do your logical modeling just like you would on any other relational platform. Vertica is relational, not hierarchical or network-modeled. From an implementation perspective, it has schemas, tables, views, RI constraints, etc. that are all visible to your SQLers. There's no reason why a good data warehouse schema (star schema) that you would concoct for, say, SQL Server or Oracle, wouldn't logically be the same on Vertica.  

Physical Modeling

The difference between Vertica and other RDBMSs is at the physical modeling layer.  Physical modeling is quite a bit different from what you may be accustomed to. "Tables" are physical storage objects in the relational world. With Vertica a "table" is a collection of table columns called projections. Every table must have at least one projection that stores all of the table's columns' data. This is called a superprojection. Every table has one and only one superprojection.  

As mentioned above, you have schemas, tables, and constraints.  What you don't have is indexes.  Instead of indexes data is materialized directly into the necessary projections.  

More on Projections

So, remember that projections are really just groups of columns.  Almost everything we do in Vertica revolves around a projection.  

More simply, a projection (any projection) is persisted data, usually optimized. Projections are not always raw data like an RDBMS table (though a superprojection is). Instead, a projection is more like a materialized view or a saved result set. It is optimized for reads and is automatically updated as data changes.  Vertica can encode and compress the data in projections since like data types are always stored next to each other.  For example, a Name "column" is physically stored together and is all the same data type, so it will encode and compress quite nicely.  

Just like a SQL Server partition or index, you don't query a projection directly (or a superprojection...from now on when I mention "projection" you can assume either a projection or superprojection unless I specify otherwise), you query a table. Vertica picks the best projection for you just as SQL Server would pick the best index for you.  

Think of a projection just like a SQL Server indexed view or an Oracle materialized view. It is legal to put just about anything in a projection that you would put into an indexed view. You can join tables/projections (but only on PK/FK cols and only as INNER JOINs...these are called pre-join projections), calculations and aggregations (as of a new feature in Vertica 7)...you can even put ORDER BY clauses which you can't generally do in a view in SQL Server. In fact, this is very much preferred.  

When modeling a projection, store columns together that are regularly accessed together to optimize IO.  This is called a column grouping.  

One exception that may seem odd at first is that a projection CANNOT have a WHERE clause.  The reason is quite simple.  Remember that a projection is really just a set of materialized columns.  You can't filter the columns otherwise the projection isn't a true representation of the underlying table.  To overcome this the conventional Vertica wisdom is to ensure that often-used predicates are contained in leading columns of your projections. 
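
A sketch of what that looks like (table and column names are illustrative; the point is simply that the commonly filtered column leads the sort order):

    CREATE PROJECTION employee_by_dept AS
        SELECT DepartmentId, HireDate, EmployeeName
        FROM   Employee
        ORDER BY DepartmentId, HireDate;   -- the often-filtered column leads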

Segmentation

Small projections (lookup tables for instance) are replicated to every node in your cluster. They are not segmented, they are unsegmented projections.

Large projections are segmented (similar to SQL Server partitioning, or maybe the better analogy is "sharding" which SQL Server does not have natively) and the segments are smartly copied to other cluster nodes for redundancy and performance.  

Segmentation is handled by built-in hashing and there are multiple algorithms available to you to do this. The "segmentation key" should have a high cardinality (like a PK) to be efficient; otherwise the data will be skewed to certain segments/nodes. You will never achieve perfect segmentation so some skewing will always occur.  No need to over-engineer this.  

In the next post I'll cover how segmentation and hashing and something called "buddy projections" ensure that your data is smartly copied to other nodes to increase availability, recoverability, and performance.  

Table Partitioning

Vertica also has table partitioning.  Conceptually this is similar to SQL Server's partitioning.  Partitioning simply divides a large table into smaller pieces.  It applies to all projections for that table. Partitioning is generally used for fast data purging and query performance. It will segregate data on each node. Partitions can be dropped quickly and switched in and out just like SQL Server.  
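
A sketch with made-up names (and, if memory serves, DROP_PARTITION is the Vertica 7-era function for dropping a partition by value):

    -- partition by year; every projection for this table inherits the partitioning
    CREATE TABLE sales (
        sale_id   INT,
        sale_date DATE,
        amount    NUMERIC(10,2)
    )
    PARTITION BY EXTRACT(year FROM sale_date);

    -- purging an old year is then a fast metadata operation rather than a huge DELETE
    SELECT DROP_PARTITION('sales', 2013);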

DELETEs (and UPDATEs which are really DELETEs followed by INSERTs) are cumbersome for a columnstore like Vertica.  In a rowstore a given row is generally physically contiguous on disk.  Not so with a columnstore.  Therefore Vertica handles this by using delete vectors which are simply markers that a row(s) should be removed from any result sets.  This means that data can be as compact as possible on disk without any risk of fragmentation.  There are background "Tuple Mover" processes that actually rewrite the underlying files (projection data) to remove deletes.  Therefore, it's best to avoid DELETE processing as much as possible and this is where smart partitioning can help.  

So, other than as an alternative to deleting large chunks of data, what is the purpose of table partitioning?  Same as SQL Server.  If your query is written correctly you can benefit from "partition elimination" where Vertica does not even need to scan a given partition if it knows none of that partition's data will qualify for the result.  

Partitioning and Segmentation Working Together

Segmentation is defined on the projection and is used to gain distributed computing in Vertica.  That is its sole purpose.  Different projections for the same table will have identical partitioning but can be segmented differently.  Partitioning is declared on the CREATE TABLE statement.  Segmentation is declared on the CREATE PROJECTION statement.  

Views

Vertica has the concept of a view or a saved query. A view will dynamically access and do any necessary computations from the underlying data at execution time. Unlike a projection, a view is not materialized and stores no data itself. A view does not need to be refreshed, unlike a projection, whenever underlying table data changes. A view can reference tables (and temp tables) and other views. Unlike most relational products, views in Vertica are read-only.  

Licensing and Projections and Segments

When you use the Community Edition of Vertica you only get 3 nodes and 1TB of data.  The definition of "data" is important since you can have a small amount of base table (superprojection) data but tons and tons of projections.  Data, in terms of licensing, is the base table data ONLY.  If you begin to run low on available space (in a future post I'll show you how to determine this and stay in license compliance) you can always drop partitions or delete data.  

Summary

In a future post we'll actually get to work hands-on with all of this stuff.  For now I think it's just important to understand key Vertica concepts and how they compare with SQL Server.  


You have just read "Vertica Objects" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Vertica Certification

I have just received news that I am now an HP ATP certified Big Data Solutions architect for Vertica.  I have been using Vertica off and on for a few years now but have really ramped up my exposure over the past year.  A few months ago I began a blog series on getting to know Vertica but I haven't had the time to continue that series.  Now that I have been studying for the Vertica exam I hope to continue sharing that knowledge on my blog.  

 

 

Upgrading Guacamole

I wrote about Guacamole (a client-less RDP Gateway) a few months ago.  In that post I covered how to install Guacamole 0.9.0 on a Ubuntu server.  In this post I'll quickly cover how to upgrade 0.9.0 to guac 0.9.2.  Actually, this method should work for any guac upgrade.  

Firstly, what is guacamole?  Let's say you need to RDP (or even SSH or VNC) to a machine at a client site (or your home).  You can certainly use a VPN to tunnel the RDP traffic for port 3389.  Or you could poke a hole in your firewall for 3389 (which is a bad idea).  Guacamole is a nifty alternative.  It runs on a Ubuntu VM (for one or two connections you need about 256MB RAM) allowing you to RDP to any server behind the firewall using HTTP(S) ports ONLY.  It's wicked fast and even works on a smartphone without having to deal with a VPN setup.  When you log in to your Guacamole server you pick which internal server you want to RDP to and you get an RDP session right in the browser.  To the right is a screenshot of an RDP session from IE 9.  It's like the mstsc client is right in your browser!  Below is a screenshot of a guac connection using my Galaxy S5...no VPN, no RDP app...you do it right from the browser.  

The Upgrade Process

Sometimes OSS like guacamole lacks documentation, or at least good documentation for noobs.  Guac doesn't have an official upgrade document so I worked out the kinks on my own.  

We are going to follow the steps in my original post on setting up guac.  Please refer to that as needed.  

  1. Take a backup of your VM or at least your .properties file and user file.  
  2. Do Step 4 and 5 in my original blog post but change the file name to 0.9.2.  
  3. Replace the old war file with the new version you just downloaded:  sudo mv guacamole-0.9.2.war /var/lib/tomcat7/webapps/guacamole.war
  4. Do Step 6 and 7.  
  5. I think there is a bug in make with guac where they reference uuid incorrectly, but I'm not the expert.  I didn't want to change the official configure script (which has lots of checkins regarding the uuid bug) so I got it to finally work by installing a few extra packages.  Run this command:  sudo apt-get install uuid osspd osspd-alsa osspd-dbg osspd-pulseaudio libossp-sa12 libossp-sa-dev libossp-uuid16 libossp-uuid-dev libossp-uuid-perl
  6. Now you can run steps 8 - 12.  
  7. Restart Ubuntu.  
  8. Close your browser and reopen it (otherwise the javascript from the old version will be cached and guac will act strangely).  
  9. Connect to your previous guac URL and log in.  

I don't advocate upgrading software unless there is a compelling reason to do so.  Guacamole 0.9.2 anecdotally seems to be quite a bit faster than 0.9.0, especially over low-bandwidth connections such as Android tablets.  In all, I can't praise the guac team enough for this great piece of software.  I use guac daily for hours at a time and it just works perfectly.  


You have just read Upgrading Guacamole on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Understand accounting and you'll understand the trends in software development and IT

(Overheard at the water cooler many years ago)..."Why are we moving to kanban and agile?  I don't understand it, our teams always deliver."

(Overheard at a recent SQL Saturday I attended)..."One of our ISVs (Independent Software Vendors) is now offering subscription-based licensing in addition to their standard release-based licensing.  The subscriptions are almost twice as much as release-based licensing, yet my stupid company just switched to subscription.  There goes my raise."

(Overheard by a DBA in the hallway)..."I can't believe we are moving all of our data to the cloud.  Security is going to suck and by my calculations it's going to actually cost us more money.  I don't understand the rationale for this.  I guess we'll all be laid off soon."  

There are trends in our industry that baffle IT folks.  The above are just a few examples.  There are (somewhat) hidden reasons for these trends.

Accounting "gimmicks".  

If you are involved in software development or IT it behooves you to understand this stuff. YOUR CAREER AND FUTURE IS AT STAKE.  

Let's dive right in.  

Quick Accounting Primer on Expenses

Companies spend money on stuff.  These are expenses.  Expenses are always classified as either Capital Expenses or Operational Expenses.  A table is a good way to visually represent the differences.  

| | Capital Expense (capex) | Operational Expense (opex) |
| --- | --- | --- |
| Definition | The cost of developing or providing non-consumable goods that support the business. | The ongoing cost of operating a business. |
| "Value" | The expense has a usable life of more than a year, hence it has longer-term value. | The expense is consumed within a year and then has zero value. |
| Example | A 3D printer. | The toner material for the 3D printer. |
| Another example | A new delivery truck. | Toilet paper (it has no value once consumed). |
| One last example | Building a new warehouse that the company owns. | Leasing a warehouse from another company that owns the land and erected the building for you. |
| Accounting treatment | Added to an asset account and the company's cash flow statement as an investment. | Shown as a current expense, recorded immediately; subtracts from income, reducing net profit and thus taxes. |
| Effect on profits and taxes | Deducted from earnings/reduces profits over its usable life.  This is called depreciation. | Deducted from earnings, reducing profits and taxes in the year it is paid/incurred. |
| Extremely simple example | A truck is purchased for 100K and is added as a fixed asset with a useful 10-year life.  10K may be deducted to offset profit/taxes in each year for the next 10 years (a VERY simplified example).  The annual 10K allotment is handled similarly to opex for that year. | Toilet paper purchases are expensed and deduct from profits/taxes in the year the TP is used. |

One quick note...R&D Expenses

The above table shouldn't be difficult to understand.  If it is, just trust me, or do your own research.  

Now let's start to get a bit tricky.  GAAP (Generally Accepted Accounting Principles) has a classification for expenses called "research and development".  These expenses are opex.  This is sometimes counter-intuitive for people.  If I'm a pharma company my R&D could lead to a breakthrough drug that nets me millions in profits over decades.  Shouldn't these be capex?  Is this an "investment"?  

Not generally.  The rationale is that at any point an R&D project might be abandoned and there will be no "asset" to the pharma company.  

If you work in software development then you may consider yourself R&D.  Right?  

Not always.  

But let's not get ahead of ourselves.  

Which is better...capex or opex?

First of all, there are a lot of gray areas where a given expense might be classified as capex or opex depending on just how you are willing to bend the GAAP rules and justify your position.  

In situations where an expense could be either capex or opex some companies will prefer one or the other based on how it will benefit them the most. 

A good auditor/investor/accountant can learn a lot about a company's management and goals by looking at how it handles decisions around whether an expense is on-the-books as capex or opex.  Frankly, many expenses, especially in IT, like developing and maintaining software, could be either capex or opex depending on your world view.  Here are some generalities I believe in that others may disagree with:

| Entity | Capex | Opex | Reasoning |
| --- | --- | --- | --- |
| Pre-IPO companies | x | | Pre-IPO companies prefer large expenses to be capex because the expense can be split among many years, which has the appearance of inflating the current year's profits at the expense of future profits.  Pre-IPO companies want to look as profitable as possible to get investors excited. |
| Investors | | x | See above.  Investors would rather see higher opex because nothing is "hidden" by using depreciation methods that hide expenses in later years. |
| Companies interested in minimizing taxes | | x | Costs are accounted for sooner.  This also has a "smoothing" effect on profits.  Forecasting is easier and you won't see as many huge "one time charges" to profits for some big capex item.  (Note that investors don't want to lose their profits to taxes, which is why investors like companies that don't try to hide big capex expenses.) |
| Software development expenses (R&D) at ISVs (Independent Software Vendors) | | x | Many will disagree with me on this.  I'll discuss this more below. |
| IT department expenses (including software development) at non-ISVs (banks, pharma companies, finance, etc.) | x | | Here is the sticking point and what I feel is the answer to all of those questions at the beginning of this post.  Companies want to move to opex nowadays. |

Some companies may want to defer as much expense as possible to look profitable (those pre-IPOs).  Those companies will want to capitalize as much as they can.  Otherwise, generally, companies nowadays prefer opex to capex.  

Even within a company there are conflicting interests.  Some folks want opex, some want capex.  A good CFO/CEO will favor opex because that is what their investors want to see.  But within those companies the CIO/CTO may feel differently.  Many companies view IT costs as out-of-control.  How better to make the budget look smaller than to shift expenses to capex?  So now you have the CIO/CTO working at cross-purposes with the goals of the company.  

Isn't this stuff complicated?  

The Rules for Various IT Expenses

Here are some "rules" for various IT expenditures.  Look at the list carefully and you'll see that the trend in our industry today is to move away from capex and towards opex.  This has been the general trend in business since at least Enron and the Dot Com Bubble.  

| Expenditure | Capex | Opex | Reasoning |
| --- | --- | --- | --- |
| Software licenses | x | | They have a usable life of more than one year. |
| Software subscriptions | | x | You are only "renting" the software.  Fail to pay the rent and you have no asset. |
| Purchased laptop | x | | Has a usable life of more than one year. |
| Leased laptop | | x | No asset after the lease expires. |
| "Cloud" | | x | "Renting"...no asset...do you see the trend? |
| Data centers | x | | Huge upfront costs.  This is an asset. |
| Software licenses for software deployed in the cloud | | x | Yup, you can buy a license for an IDE and deploy it on AWS and move it all to opex.  I'll say it again: if you purchase software that would otherwise run in your data center, yet deploy it on a VM in the cloud, magically the capex becomes opex. |

 

The Rules for Software Development Expenses

There are actually FASB and IFRS rules that govern how to expense software development.   They are very complex.  This post is a simplification of the issues.  Feel free to use google to confirm.  You may find a lot of information that conflicts what I have here.  I suggest you actually read the rules and you may find your company is doing things incorrectly, at its detriment.  But I'm not an accountant nor do I play one on TV.  

First your company's primary business activity determines whether you should be using capex or opex for software development/IT costs.   

| | ISV | Non-ISV |
| --- | --- | --- |
| Core business | The company's core business is developing software to sell to others. | The company's core business is not software development, but it undertakes software development to increase efficiencies in its core competencies. |
| Example | DaveSQL LLC creates tools that are sold to other companies to aid them in SQL Server DBA tasks. | Dave BioPharma, Inc. creates new drugs.  It buys DaveSQL's products to help manage its servers. |
| IT expenditures should be... | Always opex.  People disagree with this...they are wrong.  Go read the rules.  R&D is always opex.  If the ISV cancels the development effort at any time there is no asset. | At times capex, at other times opex.  More in the next section. |

For non-ISVs...when is software development capex vs opex?

Remember, all software development costs (well, most I guess) should be opex for an ISV.  This table is solely for IT shops at traditional companies.  The "phase" of the software development project at a non-ISV determines how the expense is handled.  

| Expenditure | Capex | Opex | Reasoning |
| --- | --- | --- | --- |
| Functional design/"evaluation phase" | | x | If the project is not feasible and is scrapped there is no asset, so it is R&D, which is opex in the traditional sense. |
| Development phase, including detailed technical specifications | x | | The outcome is an asset.  Even if the software is useless or obsolete by the end of this phase and is scrapped, it is still capex.  There is still an asset.  That asset may be worthless and can't be sold, but it is still an asset. |
| Post-implementation | | x | This one should be obvious.  This is production support. |


If you haven't noticed thus far in this post, there is a tendency for most companies to prefer opex over capex.  This is not an iron-clad rule, but that is the trend in the business world today.  So, if we were accountants/CFOs/analysts/investors we would want to figure out ways to get more opex and less capex from our software development efforts.  This helps us pay less taxes.  

First thing you should note is that the last table is very waterfall-ish in its "phases".  Design to development to ops.  But what if we were agile and used cross-functional teams?  Could we make some of that capex into opex?  Yep.  And there's the trick.  

Waterfall generates assets too quickly under accounting rules.  It has detailed design documents after all...and those are assets.  So there's another reason why agilists tout "Working Software over comprehensive documentation".  I'll bet you didn't know that.  Agile, if practiced properly and understood by your Finance Guys, will have less capex.  

Agile is the best way I've ever seen to shift almost all non-ISV software development costs to opex.  Just get some working software out the door and then bug fix it.  That probably seems like an oversimplification to a technologist, but not-so-much to an accountant.  

Bending the Rules Even More

You can justify anything if you try hard enough.  For instance, you can front-load opex using waterfall if you lump that comprehensive documentation as part of your "evaluation phase" documentation.  Using that trick we could re-classify just about anything.  

Please note that pre-IPO companies can also bend the rules in the reverse direction to generate more capex to make their current year profits higher.  Like I said at the beginning of this post, this is all "accounting gimmicks".  

The Ultimate Rule Bender...the Cloud

Quick thought experiment...your customer comes to you and says, "Your software doesn't work because it doesn't do X properly."  You decide that you agree and proceed to rectify it.  Is this work capex or opex?  Here is the rule...upgrades and enhancements to non-ISV software are capex...maintenance and bug fixes are opex.  So, is the work you are about to undertake capex or opex?  That depends.  Your customer would probably label the "issue" a bug (hence opex), but your company may disagree and deem it a "requirements change", hence an enhancement, hence capex.  

But wait, we don't want capex...we want opex, so do we have to admit our software is buggy to get an opex classification?

Nope.  

Enter the cloud.  

All cloud application development, even enhancements and upgrades, is opex because the software is rented.  Nice.  Now you can get an opex expenditure and never admit that your software was buggy.  

More on the Cloud and Software Subscriptions

With traditional release-based licensing an ISV would only make money when the next release was available.  This had an unpredictable effect on profits.  If you missed a release date you may not make any money.  Subscription-based licensing fixes that by "smoothing" out the profits.  Recently Adobe moved their software packages to a subscription-only model.  When they first released their earnings under this model their profits were down radically based on where most of their customers were in the release cycle.  They basically picked an inopportune time to change their model.  

The buyer of software loves subscriptions for the same reason.  "Smoothed" expenses and no capex.  

Open Source Software and "Services"

I'm convinced that the opex/capex debate is one of the key reasons for the rise in the open source software (OSS) movement.  Most OSS vendors offer a version of their software for free and then try to make money by offering services.  To the user of OSS this is very appealing.  There is no upfront cost for the software (capex) and the customization services are opex.  

Not all OSS uses this model, but it is prevalent.  

Think of every blogger that offers free software to do performance analysis for SQL Servers.  Altruism aside, they do this to get you to use their tools hoping that you will attend their seminars to learn more.  Or purchase their consulting services.  It's really a great model.  

 

A History Lesson and Concluding Thoughts

Back in the Dot Com Days every company preferred capex to defer the hit to profits.  And times were good for IT guys who didn't look quite so expensive because their salaries were more-or-less capitalized.  Then the dot com bubble burst, the economy tanked, Enron blew up the world, and Sarbox came along.  Now most companies want to be as transparent as possible with their accounting.  And that means opex and less capex "one-time" charges to earnings.  

Every trend in our industry in the last 15 years is geared toward the move to opex.  

  • Why is there a push to be more agile and to use Kanban?  To get us to opex faster.  
  • Why are developers asked to do support work...a la "DevOps"?  To get more people under the support budget (opex).  
  • Why is every company tearing down their 10 year old data centers and moving to the cloud?  (Do I have to say it?)  
  • Why are ISVs moving to subscription-based software models?  So that their customers can opex the "rent".  (This also "smooths" the ISV's monthly recurring revenues too).  
  • Why is your company trying to move your on-premise SQL Servers to SQL Azure (or whatever it is called now)?  I think you got it.  

It behooves all technologists to understand the basics of accounting and economics.  Many of the trends in our industry can be traced to how those trends will ultimately affect profits.  You should be designing your software accordingly.  I have no clue what the future holds in our industry, but I sleep better knowing that I understand the economics of the decisions being made.  


You have just read "Understand accounting and you'll understand the trends in software development and IT" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

Convention over Configuration

"Convention over configuration" (CoC) is one of the newer catch phrases you'll hear as an architect. I really don't know where this started, but when I learned Ruby on Rails it started to make a lot of sense to me. Also known as "coding by convention", this is a design paradigm that seeks to limit the number of decisions that developers need to make, keeping things simple. When something needs to be unconventional you simply "manage by exception" and specify that which is unconventional. In other words, lengthy web.config files are no longer needed unless something is unconventional. This leads to less code, more simplicity, fewer bugs, less documentation, etc.  In this post I'll cover some areas for CoC improvement in SQL, including one of my ideas that I've never seen proposed anywhere else that would eliminate TONS of bugs in SQL, not to mention cut down on needless SQL verbosity.  

We need more of this in SQL Server, RDBMS, and data technologies in general.  That's the gist of this post.  

An Example of CoC
Your programming language du jour probably has classes and some of those will loosely map to your database tables. Instead of a class being named "classEmployee" and the table being "dbo.Employee" we have a convention that "the class name is the table name"...unless of course it is specified as unconventional.  So the class would be Employee as would the underlying table.  Now there is no need to map this relationship explicitly.  Best practices suddenly become implicitly enforceable.  

The Need for CoC in SQL Server...Cursors

TSQL and SQL can be very verbose languages.  For the most part, these languages are declarative and not procedural.  If you can eliminate declaring some things by using sensible defaults, then you are improving the language.  IMHO.  Let's look at a simple example of where CoC could help us be less verbose...cursors.  After that I'll show you my biggest pet peeve with the SQL language and how I propose solving it using a CoC approach.  

How do you write cursors? I bet it's a tedious and error-prone process.  If you say that you NEVER use cursors and NEVER found a good use for them then stop reading, you are apparently a genius and I am not worthy of your time.  For the rest of us that know that cursors are sometimes the ONLY way to do something in tsql, read on.  

Here is a very hacky cursor (sketched below, since the original screenshot isn't reproduced here).  It merely prints out a list of database objects and their type.  This is probably the simplest cursor you could create in SQL Server.  You'll notice a bit of verbosity.  For instance, the FETCH NEXT statement is repeated in the code.  No big deal, but we shouldn't have repeated code.  I'll bet you that at some point in your career you wrote a cursor and, in haste, forgot that second FETCH NEXT statement and put your code into an infinite loop.  
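
A sketch of that kind of cursor (sys.objects stands in for whatever the original queried; the shape is what matters):

    DECLARE @name varchar(2000), @type varchar(200);   -- lazy, oversized datatypes

    DECLARE cur CURSOR FOR
        SELECT name, type_desc FROM sys.objects;

    OPEN cur;
    FETCH NEXT FROM cur INTO @name, @type;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT @name + ' : ' + @type;
        FETCH NEXT FROM cur INTO @name, @type;   -- the second, easily-forgotten FETCH NEXT
    END;

    CLOSE cur;
    DEALLOCATE cur;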

Ugh.  

You'll also notice in the variable declarations that I was too lazy to see what the actual datatypes were for the cols in question, so I just set them to varchar(2000) and varchar(200), which should be big enough.  Right?  I'll bet you've done the same thing in haste at some point in your career too.  

Cursors are just too darn tedious.  They follow a pattern (or you can use one of the templates in SSMS) but even so, it is still tedious, and therefore error-prone.  

One way we could eliminate the datatyping tedium, as well as remove the entire declaration of "cursor fetch variables", is if Microsoft gave us a shortcut similar to Oracle's %ROWTYPE.  In Oracle if you add %ROWTYPE to a variable the variable will assume the datatype of the underlying referenced object.  It does this by declaring a "record variable" over the input.  A bit of TSQL pseudo-code might help to illustrate this.  
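
Here's that pseudo-code, rewriting the same cursor.  To be clear, nothing like %ROWTYPE exists in T-SQL today; the syntax below is borrowed from Oracle purely to illustrate the idea:

    DECLARE cur CURSOR FOR
        SELECT name, type_desc FROM sys.objects;

    DECLARE @rec cur%ROWTYPE;   -- hypothetical: record variable inherits the cursor's columns and datatypes

    OPEN cur;
    FETCH NEXT FROM cur INTO @rec;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT @rec.name + ' : ' + @rec.type_desc;   -- dot notation against the record
        FETCH NEXT FROM cur INTO @rec;
    END;
    CLOSE cur;
    DEALLOCATE cur;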

First, you'll notice we no longer have to declare the individual variables for every column referenced in the cursor.  Our example only has 2 columns, but I'm sure you've written cursors with 10 cols being referenced, first in the "declare" section, then at least twice in the FETCH statements.  That's a lot of repeated code to screw up.  

Instead we declare a record type against the cursor and tell the record type that it should inherit the underlying columns and datatypes.  We can then reference the fields using dot notation.  Using something like %ROWTYPE saves a lot of typing and it makes things a lot easier to read.  But it's still too wordy.  

What would be REALLY nice is something like this.  
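
(Again, purely hypothetical pseudo-code.)

    FOR EACH rec IN cur (SELECT name, type_desc FROM sys.objects)
    BEGIN
        PRINT rec.name + ' : ' + rec.type_desc;
    END;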

Look at how nice and compact this is.  Further, it follows the FOR EACH...LOOP looping construct that every major language has these days.  Here I don't even need %ROWTYPE because I get the record object right from the cursor (cur) object.  No FETCH NEXT statements, no CLOSE/DEALLOCATE statements, no individual variable declarations for each column.  

Yes I know that this syntax severely limits your ability to use advanced cursor techniques.  This syntax is entirely forward only fetching only the NEXT row.  I'm cool with that.  I'll bet 99% of cursors you've written have been forward only, fetch-one-at-a-time cursors anyway.  And when you need that other functionality you can always fall back to the old TSQL cursor syntax.  

That's in the spirit of Convention Over Configuration...or "managing by exception".  

 

My Contribution to CoC - the KEY JOIN (quasi-NATURAL JOIN)

Here is my (I think) original contribution to CoC in SQL Server.  I've never seen this proposed...if it has been then I'm sorry and did not mean to steal anyone's idea.  It's called the KEY JOIN.  

How many times have you seen a query with a structure like this?
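
Something along these lines (the Employee/Department tables and columns are illustrative):

    SELECT  e.EmpName, d.DeptName
    FROM    dbo.Employee e
    JOIN    dbo.Department d ON e.DepartmentId = d.DepartmentId;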

No big deal right? I want to see all employees with the data for the department that they are assigned to.  

To me that query seems very "wordy". Think about it, 99% of the time when you are JOINing employee to department it is going to be by DepartmentId. There is no reason why it would ever be anything else.

And, IMHO, in about 99% of EVERY JOIN I've EVER written, I've JOINed the tables by the same cols...and those cols are the PK on the parent side and the FK on the child side.

The 1% of the JOINs that aren't by key are really edge cases.  Sometimes these are analytic queries and sometimes these are queries where I'm looking to find "fuzzy" matches so I don't want to use the key.  

For the other 99%, I see an opportunity for CoC.  We can manage the 1% as an exception.  

NATURAL JOIN?

Well, it just so happens that the ANSI SQL standard has a NATURAL JOIN syntax that seeks to alleviate the ON clause, thus aiding readability.  Here is the above query using the NATURAL JOIN syntax:
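
(Illustrative only; SQL Server won't parse this since it doesn't support NATURAL JOIN.)

    SELECT  e.EmpName, d.DeptName
    FROM    dbo.Employee e
    NATURAL JOIN dbo.Department d;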

So much easier to read without that ON clause.

But few vendors (Oracle is one) support NATURAL JOIN. In fact, NATURAL JOIN is highly discouraged from use. Why?

  • a NATURAL JOIN actually joins tables by like col names, not DRI. 
  • it's generally thought that if a key col name changes then the NATURAL JOIN would break everywhere, so the ANSI standard protects us from our own devices. But seriously folks, if you rename a key col you better realize that you are going to have A LOT of code that changes.  This is a ridiculous argument.  

To be clear, in the employee/department example if I used NATURAL JOIN the join would be by DepartmentId, assuming it is common to both tables.  But if Name is common to both tables then the join would be by that col as well...and clearly dbo.employee.Name is something entirely different from dbo.department.Name.  Also, many tables you work with probably have auditing cols that are always named the same, CreUserId and CreDTime for instance.  In the NATURAL JOIN world those cols would also be part of the equi-JOIN, which is clearly not right.  

So NATURAL JOIN, while far more succinct, is worthless in modern RDBMSs, which is why it is discouraged from use.  Darn.  This is also why most RDBMS vendors don't even bother to support it.  

So, I would love to see something in the ANSI specification (or MS could just implement it as their own extension in TSQL) called something like KEY JOIN.  

KEY JOIN's Conventions

  1. A KEY JOIN will always join two objects by their DRI.  This will be a PK or Unique Key on the parent side and a FOREIGN KEY on the child side.  
  2. If DRI is not present between two objects referenced in a KEY JOIN, then an error should be thrown.  
  3. A KEY JOIN will assume INNER (INNER KEY JOIN) unless otherwise specified.  Just like JOIN is short for INNER JOIN, KEY JOIN is short for INNER KEY JOIN.  
  4. LEFT (OUTER) KEY JOIN would indicate an optional JOIN from tableA to tableB, following the same semantics as a LEFT OUTER JOIN.  
  5. RIGHT KEY JOIN and FULL KEY JOIN would work the same as RIGHT JOIN and FULL JOIN, except the ON clause would be assumed.  
  6. If an ON clause is found during query parsing with a KEY JOIN then an error should be thrown.  

Can anyone see a downside to KEY JOIN?  I can't.  This would really solve readability and errors due to accidentally picking the wrong join condition.  Here is a query I wrote that attempts to show all column names for all tables.  But it's not working.  Why?
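
A reconstruction of that sort of query (the original screenshot isn't reproduced here; sys.tables and sys.columns stand in for whatever it queried):

    -- attempts to list every column for every table
    SELECT  t.name AS TableName, c.name AS ColumnName
    FROM    sys.tables t
    JOIN    sys.columns c ON t.object_id = c.column_id;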

Yeah, the ON clause is wrong.  That was probably easy to spot check, but as your queries grow larger your odds of JOINing incorrectly increase and the ability to spot the bug becomes more difficult.  With a KEY JOIN it is almost impossible to screw the query up, and it's much easier to read:  
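
(Hypothetical syntax, of course; and pretend for the moment that the two catalog views had declared DRI between them.)

    SELECT  t.name AS TableName, c.name AS ColumnName
    FROM    sys.tables t
    KEY JOIN sys.columns c;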

I have no clue why this isn't part of the ANSI standard.  

 

Lastly...CREATE or REPLACE

Here's another pet peeve of mine with SQL Server...the verbosity of simple tasks like deploying a new version of a procedure in a database.  I'm sure EVERY stored proc you've EVER written followed some pattern like this:
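
(A sketch; the proc name and body are placeholders.)

    IF OBJECT_ID('dbo.MyProc', 'P') IS NOT NULL
        DROP PROCEDURE dbo.MyProc;
    GO

    CREATE PROCEDURE dbo.MyProc
    AS
    BEGIN
        SELECT 1;   -- real body goes here
    END;
    GO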

Maybe you construct the existence check a little differently, but I'm sure the basic pattern is the same.  

Why oh why can't we just have this?
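
(A sketch of the wished-for syntax; other platforms have it, T-SQL does not.)

    CREATE OR REPLACE PROCEDURE dbo.MyProc
    AS
    BEGIN
        SELECT 1;   -- real body goes here
    END;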

Oh well.  

How can you spot an opportunity for CoC?

I like to write code using known patterns that I've used for years.  Those patterns (like the cursor example) work 99% of the time and I don't need to ever think about the verbosity of the code I'm trying to write.  I just use the pattern.  In the past few years I've realized that patterns really aren't that good after all.  A pattern is just needless verbosity and configuration that will lead to subtle errors.  If we had a shorthand for the given pattern then we could make that the convention and merely manage by exception.  You should look for patterns in your own work that you can factor out and set up as an implicit convention.  This will save you a lot of bug grief and make your code easier to document.  

I've proposed and up-voted many of the above items on Connect (such as %ROWTYPE) and Microsoft just doesn't care.  As for KEY JOIN...well, one day I'm going to seriously propose that.  

 


You have just read Convention over Configuration on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

Vertica on AWS

In the last post I covered the Vertica Installation Prereqs.  If you really don't want to install Vertica yourself you can always download a fully-functional VM from HP on the myVertica portal.  

Or you can run Vertica on Amazon in AWS.  I won't cover this in detail since AWS costs money and I'm frugal.  Also, Vertica is a *bit* different on AWS than when run natively.  I have a slight bit of knowledge of running Vertica on AWS, so I'll briefly document what I know in case you decide to run Vertica on AWS instead of in-house.  Running in the cloud is an excellent choice if you are an ISV and wish to package Vertica (probably the Community Edition) with your application.  

  • You'll need to be familiar with Amazon Management Console since most of the work is done from there.  
  • You specify the HP Vertica AMI you want to install using install_vertica.
  • You can still use Community Edition with 3 nodes and 1TB of data on AWS gratis.
  • You cannot create a cluster on AWS using MC (or some other things cluster-related). Instead use the Amazon EC2 console.  

You have just read "Vertica on AWS" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


Installing Vertica 7

In this post we are going to install Vertica 7.0.x from the command line for the first node.  In a future post we'll look at installing Vertica using the Management Console (MC), which is much easier and can be done for all subsequent nodes.  The method outlined below is also the EXACT same process you follow when performing an upgrade.  The only difference is that prudence dictates you back up your existing databases PRIOR to upgrading.

Installing from the commandline

There is no apt-get for Vertica unfortunately.

  1. From your laptop (remember, Ubuntu server has no GUI) log in to the myVertica portal (accts are free) and download the HP Vertica package for the correct OS and processor architecture that you need.  
  2. I assume you are connecting to your Ubuntu box using PuTTY with SSH. If so you can use pscp to copy your downloaded .deb file to your server. You can google to download pscp if you are running stand-alone PuTTY.
  3. Open a DOS prompt (elevated) and cd to the location of pscp
  4. pscp <location of .deb file> <user>@<ip>:<path on ubuntu>
  5. You should see a progress indicator.  After the copy is complete you are ready to install the .deb package
  6. sudo dpkg -i <pathname>.  This only takes a few secs. 
  7. You now have the Vertica package installed. Now you have to install Vertica.  This is a bit confusing, just remember that installing Vertica is a two-step process.  
  8. You now have to run the install script. This can be run for a single node or it can be simultaneously installed to many cluster nodes using --hosts. In other words, you only need to install the Vertica package once, on your first node. We will install on a single node and look at the other options for our subsequent dev nodes.
  9. sudo /opt/vertica/sbin/install_vertica --hosts 192.168.0.x --deb <path to deb> -L CE --accept-eula .  The ip_address can be an actual hostname as well. Don't use localhost or 127.0.0.1 else you can't add additional nodes to your cluster. This would be considered a "standalone, single node Vertica installation".  And there is no known way to fix it later.  The -L CE instructs the installer that you are using the Community Edition
  10. It's likely that you'll see errors.  Simply research and fix them one-by-one, re-running install_vertica as many times as it takes to resolve all issues.  Remember that you are running on Ubuntu and not Windows, so expect some customizations.  Eventually you'll need to ignore certain system requirement checks in install_vertica, such as the checks for ext3/ext4 filesystems and the absence of LVM.  If you have other filesystems or LVM present, install_vertica will keep failing.  It's OK to run a test Vertica instance on these filesystem configurations.  When you are comfortable that any remaining FAIL messages can be ignored, just add --failure-threshold NONE to the install_vertica command.  Never do that on a prod system, but it works for testing and playing with Vertica.
  11. At this point install_vertica should complete after a few minutes.  
  12. Logout
  13. Log in to Linux using the dbadmin account that install_vertica created.  
  14. You can now play around.  Try the following (or wait until my next blog post where we will have some more Vertica fun):   /opt/vertica/bin/admintools
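
Pulling those steps together, the end-to-end sequence looks roughly like this (the file name, user name, paths, and IP address are placeholders; substitute your own values):  

    :: On your Windows laptop, from an elevated prompt in the folder containing pscp:
    pscp vertica_7.0.x_amd64.deb youruser@192.168.0.x:/home/youruser/

    # On the Ubuntu server, logged in as a sudo-capable user:
    sudo dpkg -i /home/youruser/vertica_7.0.x_amd64.deb
    sudo /opt/vertica/sbin/install_vertica --hosts 192.168.0.x \
         --deb /home/youruser/vertica_7.0.x_amd64.deb -L CE --accept-eula
    # Add --failure-threshold NONE only once you're sure the remaining FAIL
    # messages are safe to ignore (test boxes only, never prod).

    # Log out, then log back in as the dbadmin account that install_vertica created:
    /opt/vertica/bin/admintools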

You have just read "Installing Vertica 7" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  


On Recruiters

Quick disclaimer:  I don't HATE recruiters.  I love them!  The good ones, that is.  Without them I wouldn't have had ANY of the success I think I've had in my career.  But unfortunately most recruiters are sleazeballs.  That's why, over my almost 20-year career, I've only used 4 different recruiters.  And I've never used a job board.  A good recruiter is your partner.  This post is about every other recruiter.  

Have you ever received an email from a recruiter with the text "Please send me an updated copy of your resume in Word format"?  Assuming you thought you were a good fit, I'll bet you responded with an attached copy without giving it a second thought.  Well, this is a bad idea.  I'm going to outline a few horror stories for you.  Then I'll cover some things I've done over the last 20 years to avoid "recruiter fraud".  

Your resume might get published to online databases without your permission

In one case I had a recruiter, who shall remain nameless, take my resume and graciously post it on a public job board without my knowledge.  How do I know this?  Tip #1:  slightly alter every copy of your resume that you send to ANYONE.  Then, when you see a copy of your resume somewhere, you can trace that "version" back and determine who shared your resume without your consent.  I usually change the grammar on a single sentence and then log the change and who I sent that version to.  

Why go through this rigamarole?  Many recruiters will claim they have a position that is a perfect fit for you.  Meanwhile, there is no position.  The recruiter wants to add your name to their database.  So, what's the problem with that?  Here's a litany:

  1. These internal recruiting firm databases are EXTREMELY valuable intellectual property.  It is very common for a smaller recruiting firm to be bought out by a larger firm solely for the database.  Assuming the database accurately reflects your information then this might be OK.  But what if you live in Philly and have no desire to relocate to San Diego and the new owner of the database has offices in San Diego and you get calls DAILY from San Diego wondering if you are interested in a .net position there?  This happened to me...twice.  The latest time was when a large recruiting firm was bought out by one of the MONSTROUSLY large, ubiquitous job boards.  I had read about the buyout in WSJ and within a few weeks I was getting more spam than ever from India-based recruiters.  After some legwork I found a copy of my resume on this MONSTER of a job board, and after a few hours of fighting with their customer service people, got it removed.  BTW, the copy of my resume they had was over 10 years old.  
  2. When times get tough, recruiters rent their database assets to non-recruiters.  They sell your information so you can later be spammed and scammed.  How do I know this?  I have methods that I use to track this that I don't want to divulge.  Hint:  it involves using different email addresses, just like the subtle differences in resume text, to track who is using my data without my consent.  
  3. Recruiters will post your resume on some job boards, again, without your consent.  I call this the "buy side vs sell side" recruiting approach.  Most recruiters are "sell-side", meaning they focus on selling a service to their client.  The fact is, this generates the most money in the industry.  The theory is that there are far fewer COMPANIES looking for candidates than candidates looking for companies.  So the recruiter focuses on selling to the client first, then finding a candidate later.  But there are a few buy-side recruiters.  These guys find a superstar candidate...one who looks great on paper, is well-spoken, and knows his skills...and then market that candidate to ANYONE who will listen.  This isn't necessarily a bad thing...I know because it has happened to me.  A recruiter realizes that someone is a HOT commodity and he markets the candidate to all of his clients even if they may not have a current need.  It works!  It only becomes a problem when the buy-side recruiter takes this too far and starts posting your resume, with her contact information instead of yours, on various job boards.  This will eventually come back to hurt you professionally.   

Your resume is easier for a recruiter to alter in Word format

Not all recruiters do this, but most will ALTER your resume before sending it to their client.  THIS PISSES ME OFF.  I put a lot of time and thought into my resume and while others might think my resume is GARBAGE it is MY GARBAGE.  And I don't care to have it altered without my consent.  I have no problem with ANYONE who wishes to provide me with criticism to make my resume better.  There are recruiters who have more experience than I do and I value their inputs.  But ultimately these decisions are mine to make.  

Some alterations are OK.  In most cases the recruiter will insert their firm's logo at the top of your resume and remove your personal contact information.  This is smart since some clients may decide to poach a candidate to avoid paying the recruiter's fees.  If a prospective employer EVER tried to do this with me I would not do business with them.  That shows a total lack of ethics.  Although I find most recruiters to be sleaze-balls, if a recruiter introduces me to a client then that recruiter has earned his fees.  Period.  Most of us believe recruiter fees are outrageously high and we all complain about it, but until our industry cuts out the middleman, the middleman deserves his pay.  So, I certainly would not mind if a recruiter alters my resume to remove my contact information, but I should be notified FIRST, I should APPROVE the change, and that should be the ONLY thing that changes.  

I once had an ambitious recruiter alter my contact information at the top of my resume.  At the face-to-face interview I saw this and asked if I could peruse their copy more closely since this was not my version.  I was surprised to see that my blog information, LinkedIn profile address, and even my NAME were altered.  The recruiter altered my NAME!  That is unacceptable.  Smart clients wish to do their own candidate vetting independently of the recruiting firm and the best way to do this is via reading someone's blog, LinkedIn profile, and googling their legal name.  They can't do this if the recruiter altered this data.  

Blatant falsification of my resume

On another face-to-face interview the interviewer wanted to know about my experience with software XYZ.  I mentioned that I had no experience with XYZ and didn't even know what it was.  The interviewer showed me that it was displayed prominently on my resume in the "Skills" section.  

Huh?

The recruiter actually falsified my resume!  I explained that I had no knowledge that XYZ was on my resume and apologized profusely.  The interviewer explained that XYZ was some industry-specific, esoteric piece of software that maybe a handful of people in the world worked with, but they were hoping to find someone with experience in it.  It was not a requirement, rather a "nice to have" for the position.  The interview proceeded and I think I did ok with everything else, and everyone was cordial and happy, but I never got a call back.  

And I didn't expect one.  

If you were that interviewer, would you recommend your company hire a candidate from a recruiting firm that falsifies documentation?  And then pay them a finder's fee for the privilege?  I wouldn't.  If it wasn't for the recruiter being too ambitious I am positive I would've gotten that job.  The recruiter lied his way right out of a commission.  

You're probably thinking that this kind of behavior is an edge case.  Surely it's not prevalent?  WRONG.  For a number of years I was a team lead at a large ISV and interviewed, on average, about 1 candidate a week for various open positions...DBAs, Java guys, QA folks, etc.  I did a non-scientific study during this time. I asked every candidate to provide me with her copy of her resume.  Post-interview I compared the candidate's version with the recruiter-supplied version.  In more than half of the cases I saw "content" modifications to the resume.  By "content" I mean the change was not:

  • formatting (font change, bold/italic changes, styles)
  • spelling/grammar
  • reordering of sections (skills moved to the top, education to the bottom)
  • removal of candidate contact information
  • insertion of recruiter's logo/contact info

Those changes are OK.  I guess.  I don't approve of someone modifying my stuff, but you could make the case that the recruiter was just being overly "helpful".  Here are some examples of what was changed.  And there is no good reason for it, other than blatant FRAUD:

  • Employment dates were altered to hide periods of unemployment.  
  • Company names were altered.  Usually this was something like "American Bancorp May 2013-May 2014" changed to something like "Large Banking Institution May 2013 - May 2014".  Why would this need to be altered?  I couldn't think of a good reason.  I did some research and in some cases American Bancorp was also one of the recruiter's clients, and the recruiter was trying to avoid being accused of poaching.  Utterly sleazy.  The recruiter was trying to hide the fact that the candidate was someone he had already placed at another client.  
  • Employment responsibilities were rewritten to sound like the candidate was more "senior"
  • And the most prevalent edit...SKILLS.  My company used Rational tools and I saw lots of resumes that were altered to include the word "Rational" somewhere.  This was so unbelievably prevalent with one recruiter that I called him and told him we really needed people with NFC experience.  He asked what NFC was and I replied that it is a standard ITIL tool used in TOGAF analysis.  (BTW, NFC is really "No Fuck1ng Clue"...I made it up).  Can you believe that I started seeing resumes with NFC experience on them?  I now had ammunition to remove that recruiter from my company's "preferred vendor list".  

Need more proof?

If you still think these stories of recruiter fraud are merely edge cases and not indicative of most recruiters, well, perhaps we should ask a recruiter?  Read this article, written by a recruiter on a recruiting forum website.  The author, and frankly most of the commenters (also recruiters), see nothing wrong with recruiters who alter resumes to make them look better in front of a client.  And they do it without seeking the permission of the candidate. Here are some screenshots that should get your blood boiling:

OMG!  This is fraud folks.  It's like a used car salesman saying, "Yeah, GM didn't put air conditioning in this model but it blows out cold air anyway."  Of course it does...when you test drive it in December.  

Here's another:

Thanks for doing me that favor pal!  See my story above where the recruiter did me a FAVOR and cost me the job.  I can't believe the arrogance.  

Let's move to the comments:

Gotta love the author/editor analogy.  Here are the problems with that fallacious analogy.  The (non-sleazeball) editor always runs his edits past the author before publication.  More importantly, an editor may change grammar, spelling, or style, but doesn't change FACTS.  The editor is not the fact-checker.  Note above that Craig also knows his target market better than the candidate.  Yeah, the day a recruiter knows more about my domain than I do is the day I retire.  What an asshat!

Here's a more accurate analogy.  Let's say the local grocery store decides that the makers of Tide have horrendous marketing and the local grocery store could do better.  Does the local grocery store change Tide's marketing material?  Of course not.  In fact, that is trademark infringement and is illegal.  Resume altering is the same thing.  My resume is my "mark" (it isn't a trademark) and I would prefer my mark not be altered.  

Another comment:

"Sure Rebeccah, go ahead and modify my resume.  I promise not to be shocked at the interview when you total misrepresent my skills and experience."  Clearly Rebeccah is contradicting herself in the second sentence.  If Rebeccah asked me I would give her my consent but I would ASK to see the changes BEFORE they were submitted.  This avoids the shock.  I'm appalled that Rebeccah would want any of her candidates to be shocked at an interview.  Isn't that lack of preparation?  

And look at that last sentence...she actually modifies executive resumes.  Words fail me!

Rebeccah is clearly inexperienced.  I agree that Times New Roman is a yukky font, but there are very good reasons why it is used.  It is guaranteed to be an installed font with every copy of Word since...forever.  Clearly Rebeccah does not know how Word works when the font du jour she uses is not installed on her client's laptop.  Then there is the issue of the hiring manager who uses OCR software that can't interpret her nifty font.  The point is, TNR is horrible, but the alternatives can be worse.  Be careful!

If a candidate has a resume professionally prepared then that means they are better prepared.  I like that candidate already.  It shows initiative.  It shows attention to detail.  It shows that they understand their limitations.  I had mine professionally prepared...15 years ago.  Since then I've had other "professionals" try to modernize it, but each attempt looked horrible; I like the style and am sticking with it.  As an interviewer I want to know how prepared the candidate is.  Craig is conflating "professional resume preparer" with "recruiter".  The two are totally different.  And I don't know any professional resume preparer that will alter CONTENT.  

And again, the analogy used by Craig is wrong.  If the candidate submits a professionally-prepared resume then the CANDIDATE gave his explicit consent.  If the recruiter prepares the resume and submits it then the CANDIDATE did NOT give his consent.  It all boils down to consent, ethics, and honesty.  

"But most technical people can't write a coherent English sentence in a resume and I'm helping the candidate by doing this"

I was told this once by a recruiter who reformatted my resume to make it look more modern without my consent.  To any recruiter reading this:  YOU ARE LYING WHEN YOU DO THIS.  If your candidate can't put together a good resume how do you think they are going to represent YOU in front of YOUR important client?  Get permission from the candidate before altering a resume.  And consider coaching the candidate and having them make the recommended edits. This helps the candidate LEARN how to present themselves to a client better.  When you add lipstick to a pig, you still have a pig. 

Some approaches to combat "recruiter resume fraud"

Here are some approaches I've tried to combat overzealous recruiters:

  1. Only work with reputable recruiters.  I've been in IT since 1997 and have always used recruiters.  Exactly 4 of them.  I've worked or consulted for at least 10 different companies and have used 4 recruiters.  These recruiters I trust implicitly.  I know they modify my resume and I approve of what they do.  I TRUST THEM BECAUSE THEY'VE EARNED IT.  There are lots of recruiting firms and many of them are disreputable.  I understand and accept that by not working with some recruiters I am likely missing out on GREAT opportunities.  I'm ok with that.  If a company wants to work with a shady recruiting firm then I don't wish to work at that company.  The recruiters I work with know me, my personal foibles, my experience, and what I want out of a company.  They have a long-term PERSONAL relationship with me.  And I know their relationship with their client is similarly strong.  When you lie down with dogs, don't be surprised when you get fleas.  ONCE I accepted an opportunity with a client of a disreputable firm because it was an amazing opportunity.  It really was an amazing opportunity.  And the recruiting firm was ALWAYS late paying me.  Once they were 5 months in arrears in fees.  I walked out on them, and the client.  It was a great opportunity, but a bad circumstance.
  2. Try sending your resume to a recruiter in pdf format and tell them you do not use Word.  Sometimes this is enough to thwart recruiters from resume fraud.  
  3. Create an online version of your resume and reference "supporting details" on your Word version that point to that online version.  If you do this in enough places the recruiter can't get too sleazy.  
  4. Whenever sending your resume to a recruiter, use the trick I outlined at the beginning of this post where you track subtle changes to your resume that can be traced back to a recruiter.  At least you'll know who is sleazy later.  
  5. Whenever you send your resume ensure you include a DISCLAIMER in the email text.  Here is the version I use.  This won't stop sleazeball recruiters from doing whatever they want, but it doesn't hurt to put them on notice.  If a recruiter balks at this just walk away.  Don't lower your standards because the recruiter claims to have some amazing opportunity.  They never do.  I've been burned so much by this that I've finally learned this truth.  
  6. Make your Word document as protected as possible without being a nuisance.  None of these settings are foolproof, but everything helps.  Here's what I do when I'm done editing the "version" for a given recruiter:
    1. File|Permissions|Restrict Editing.  
    2. Click Edit restrictions and select "No change (read only)"  
    3. Click "Yes, Start Enforcing Protection"
    4. Select Password and choose a simple password.  Remember that the recruiter is going to want to add their logo.  You aren't trying to thwart the NSA with this password.  
    5. File|Protection|Mark As Final
    6. Now when a recruiter opens the document it will be read only.  This won't stop a recruiter from copy/pasting your resume text into another file, but sleazy recruiters will always be sleazy.  

Regarding References...

When a recruiter asks you for references to call BEFORE any contact with a prospective client, tell them NO.  There is at least one HUGE national recruiting firm that does this under the premise of being a value-add for their clients.  On the surface this sounds reasonable.  They want to check your references before embarrassing themselves with an important client if you are a dud.  

Problems with that:

  1. People can (and do) fake their references.  Calling a reference is not a replacement for a background check, it is merely one part of good due diligence.  I've heard of cases where candidates used their spouse as their "manager reference."  So, if references are so darn important, then make the recruiter do a background check on you too prior to divulging your references.  They won't.  Why?  Background checks cost $$$.  
  2. Recruiters will try to recruit your references.  A recruiter from a HUGE national recruiting firm demanded to call my references.  I had a reference who is a BigShot in my industry.  Seriously, he's well known.  The recruiter called the reference because he had an opportunity for me.  The recruiter spent the entire time trying to recruit my reference.  Then he asked my reference if he knew anyone who was interested in being recruited too.  My reference called me and was furious at me for doing business with a sleazeball like that.  How many times can I go back to the well with this reference?  Don't abuse your references by giving them out to recruiters too early in the process.  
  3. Recruiters from the same recruiting firm will ALL want to talk to your references.  I get calls from this HUGE national recruiting firm at least 4 times a year and it is always a different recruiter.  And they always want my references BEFORE they even tell me about their opportunities.  I mention that Recruiter X already called my references a few months ago and that she should go talk to him.  It doesn't work, they still need to talk to my references.  There is no good reason for this other than that this recruiting firm is sleazy.
  4. I've had a recruiter tell me that I had to provide the references first, but that he wouldn't call them until I was prepared to move forward with a face-to-face interview.  In a moment of weakness I handed over my references.  Guess what?  He called them within the hour trying to recruit them.  How do I know this?  Because one of my references was my own HOME PHONE NUMBER with a made-up name.  The sleazeball recruiter didn't even bother to check the number.  (I love doing little social experiments with recruiters...have you noticed that yet?).  
  5. References will always say good things about the candidate...that's why they are the reference.  I'm not going to allow someone to be my reference unless I am POSITIVE that I will get a glowing recommendation.  Therefore, references, by their very nature, are worthless.  
  6. Your references may have loose lips.  Whenever you use someone as a reference, decorum states you should tell her first.  By asking the person to be a reference you are implicitly telling that person that you are looking for new opportunities.  In some circumstances this could be a problem if your reference goes and blabs.   

Don't give your references to any recruiter until you've gotten through at least the phone interview with the prospective client.  There is no good reason to hand out your references prior to that.  You need to be sure that you are a good fit for the opportunity.  And when the recruiter says that this is their NON-NEGOTIABLE policy, then tell them, "Thank you, but I'll look elsewhere."  

As I've said, I've interviewed dozens of candidates and I would never dream of asking for a reference first.  Most candidates are not good fits and I don't have the time or the inclination to do reference checking first.  I get a feel for the other person at the table and I go with my gut.  Then I get a background check.  Those are much harder to fake.  I couldn't care less about a reference.  

Concluding Thoughts

Whenever any industry is experiencing a boom you can rest assured that the percentage of charlatans will rise proportionally.  We see this with IT people that pad their resume to include the hot new skill that is in demand.  We also see recruiters who are just interested in getting asses in chairs and are not interested in ethics.  

My advice to IT people reading this blog post: Be leery of all recruiters until you've built up trust.  Never allow a recruiter to alter your resume's content and never provide references until you've had the opportunity to talk with the client.  

To any recruiters that might be reading this post I offer you my advice:  Never alter someone's resume without permission.  Never recruit someone's references.  Always be honest in your dealings with your clients and your candidates.  If you work at a sleazy recruiting firm get out now.  Life is too short to be a whore.  I know lots of good recruiters who have left these places and are much happier now.  If you can't abide by my rules, don't contact me.  I'm not interested in your business.  Remember: you called me, I didn't call you!  

I wish you the best of luck in life's endeavors.


You have just read "On Recruiters" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

