We Need To Talk About Data Fragmentation


Five years ago, if you had a question about your users you could answer it easily. Your data, while not tracking every client-side CSS element, at least stored your primary metrics.

You could perhaps even answer most questions with a simple SELECT statement.

Today, analysts face a growing ecosystem of third-party tools that provide more detailed data about your users. Your client-side events are tracked in Mixpanel, your billing data is stored in Stripe, and your marketing interactions live in Optimizely.

But the downside to more tools is data fragmentation. There are heterogeneous schemas, difficulties in aggregation, and issues of information extraction. It’s become harder and harder to get a complete picture of who your users are, what they do, and what they need.

We faced such challenges before starting Bolt, when we were systems and engineering managers at Optimizely and Google. So we wanted to see how often data fragmentation occurs at other companies, and analyzed the Alexa Top 1M domains over the past 3 years.

We’ll explore our primary findings below (download the full report here), and show that over 16% of domains used at least one external user or marketing data source, and that 12% of those used both a user and a marketing datastore, a group growing exponentially at an average rate of 2.88X Y/Y.

Exploring User and Marketing Datastores

We explored the issue of data fragmentation via three questions:

  1. External Datastore Usage – how many websites are using external user or marketing datastores?
  2. Multiple Datastore Usage – how many websites are using multiple user datastores or multiple marketing datastores?
  3. Cross-Technology Usage – how many websites are using both a user and marketing datastore technology?

For the purposes of this analysis we focused on six main categories of user and marketing datastores, defined as any third-party tool that may store a user email or ID in some form. For user datastores, we assessed adoption of CRM, Analytics, and Payments technologies; for marketing datastores, we looked at Ad Remarketing, AB Testing, and Survey technologies.

To identify which websites were using which datastores, we used the website profiler BuiltWith.com, which tracks technology adoption by detecting JavaScript snippets. We focused on the top 3 technologies per datastore category (e.g. Salesforce, Marketo, and Zendesk for CRM datastores), and counted the absolute and relative rates of adoption through December 2015, running simple linear regressions to estimate year-over-year growth rates.
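As a rough illustration of the growth estimates reported below, here is a minimal sketch of one way to get a Y/Y growth factor and an "exponential R2" from yearly counts: an ordinary least-squares line fitted to the logarithm of the counts. This is an assumption about the method, not a description of the exact regression we ran, and the function name `fit_exponential_growth` and the input counts are made up for the example.

```python
import math

def fit_exponential_growth(years, counts):
    """Fit counts ~ a * g**year by least squares on log(counts).

    Returns (g, r2): the estimated year-over-year growth factor and the
    R^2 of the straight-line fit in log space.
    """
    n = len(years)
    logs = [math.log(c) for c in counts]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares, both measured on log(counts).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, logs))
    ss_tot = sum((y - mean_y) ** 2 for y in logs)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(slope), r2

# Synthetic counts that double each year (NOT figures from the report).
growth, r2 = fit_exponential_growth([2013, 2014, 2015], [1000, 2000, 4000])
print(f"{growth:.2f}X Y/Y, R2 = {r2:.3f}")
```

With only three yearly data points, an R2 close to 1 is easy to reach, so the growth factor is the more informative number.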

External Datastores are Growing 2X Every Year


Of the Alexa Top 1M websites in 2015, we found 160,178 domains (16%) had an external user or marketing datastore identified from the list of 18 technologies assessed.

Websites using at least one external user datastore totaled 42,394, with the count growing on average 2.40X Y/Y from 2013-2015 (exponential R2 = 0.999).


Of the user datastore types, Payment technologies had the highest count at 18,871 domains, while Analytics technologies showed the strongest exponential growth trend (exponential R2 = 0.998). So while websites are currently more likely to store their user data in external CRM and Payment datastores, Analytics and BI technologies are a growing source of user data, as recently reported by both Forbes and VentureBeat.

For external marketing datastores, we found 136,428 domains in the Alexa Top 1M in 2015, with the count growing on average 1.32X Y/Y from 2013-2015 (exponential R2 = 0.982).


Ad Remarketing technologies made up the large majority of marketing datastores, with 111,075 domains detected. Both Ad Remarketing and AB Testing technologies saw the same average year-over-year growth rate of 1.50X from 2013-2015. This is in line with recent reports of digital ad spend steadily increasing, with Google and Facebook Ads dominating market share.

Domains Are Using Multiple CRM, Analytics and Ad Datastores


Data fragmentation isn’t just indicated by the number of different external datastore types in use. Having several datastores of the same type (e.g. Zendesk + Salesforce, or Google Ad + Facebook Ad) decentralizes data as well.

Exploring the same dataset of Alexa Top 1M domains in 2015, we analyzed which unique domains had two or more technologies of the same external datastore type detected.
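The counting step described above can be sketched roughly as follows. This is an illustrative reconstruction, not our actual pipeline: the `CATEGORY` map, the `multi_usage_rate` function, and the sample domains are all hypothetical, and only the technology names are taken from this post.

```python
from collections import defaultdict

# Hypothetical category map; the technology names appear in the post,
# but this structure is an assumption made for illustration.
CATEGORY = {
    "Salesforce": "CRM", "Marketo": "CRM", "Zendesk": "CRM",
    "Mixpanel": "Analytics", "Heap": "Analytics", "Segment": "Analytics",
}

def multi_usage_rate(detections):
    """detections: dict mapping domain -> set of detected technology names.

    Returns dict mapping category ->
    (domains with 2+ technologies, domains with any technology, share).
    """
    any_use = defaultdict(int)
    multi_use = defaultdict(int)
    for techs in detections.values():
        # Group this domain's detected technologies by datastore category.
        per_category = defaultdict(set)
        for tech in techs:
            if tech in CATEGORY:
                per_category[CATEGORY[tech]].add(tech)
        for category, used in per_category.items():
            any_use[category] += 1
            if len(used) >= 2:
                multi_use[category] += 1
    return {
        category: (multi_use[category], total, multi_use[category] / total)
        for category, total in any_use.items()
    }

# Made-up detections, not data from the report.
sample = {
    "a.example": {"Salesforce", "Zendesk"},
    "b.example": {"Salesforce"},
    "c.example": {"Segment", "Mixpanel", "Marketo"},
    "d.example": {"Heap"},
}
rates = multi_usage_rate(sample)
print(rates["CRM"], rates["Analytics"])
```

Each domain is counted once per category, so a site with three CRM tools still contributes a single "multiple usage" domain.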

We found that Ad Remarketing datastores presented the highest incidence of multiple usage. Of those domains that used an Ad Remarketing technology on Google, Facebook, or AdRoll, 22% used two or more distinct Ad datastores. And of those domains using CRM technologies such as Salesforce, Marketo, or Zendesk, 7% used two or more distinct datastores. This is unsurprising, since websites are likely to advertise on multiple channels and to use different CRM technologies for different stages of their user lifecycle.


We also found a high rate of multiple usage among Analytics technologies such as Mixpanel, Heap, and Segment, with 11% of domains that used an Analytics technology using two or more. This could have several explanations.

Websites may use Segment to pipe their event data to Mixpanel or Heap, resulting in both platforms being detected. It is also possible that larger companies use different analytics platforms on different parts of their site or web vs. mobile platforms, as part of different business units or product teams.

The remaining categories of external datastores (Payments, AB Testing, Survey) had low adoption of multiple technologies. This is not surprising: a website, for instance, is likely to use only one payment gateway.

Combined Datastore Usage Growing by 3X Every Year

[Figure: bubble plot of domains using both user and marketing datastores]

Cross-technology adoption was the last area we analyzed. We assessed the Alexa Top 1M domains for websites that used both an external user datastore and an external marketing datastore.

Between the 42,394 domains that use a user datastore and the 136,428 that use a marketing datastore, 19,404 unique websites (12% of all domains with an external datastore) were found to use both in 2015. Given the higher count of domains using marketing technologies, the distribution of overlapping technologies skewed considerably to those with a marketing datastore (see bubble plot above).
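The overlap itself is a set intersection, and the 12% figure is the overlap divided by the number of domains with any external datastore. A minimal sketch, assuming each group is available as a set of domain names (the `cross_usage` function and the toy domains are hypothetical; the constants at the bottom are the 2015 counts quoted in this post):

```python
def cross_usage(user_domains, marketing_domains):
    """Return (count using both, count using either, share using both)."""
    both = user_domains & marketing_domains
    either = user_domains | marketing_domains
    return len(both), len(either), len(both) / len(either)

# Toy example with made-up domains.
toy = cross_usage({"a", "b", "c"}, {"b", "c", "d", "e"})
print(toy)

# Shares computed from the 2015 counts reported in this post.
USER, MARKETING, BOTH, ANY = 42_394, 136_428, 19_404, 160_178
share_of_any = BOTH / ANY              # share of domains with any datastore
share_of_user = BOTH / USER            # share of user-datastore domains
share_of_marketing = BOTH / MARKETING  # share of marketing-datastore domains
print(f"{share_of_any:.1%} {share_of_user:.1%} {share_of_marketing:.1%}")
```

Read this way, the overlap covers a much larger share of the user-datastore domains than of the marketing-datastore domains, simply because the marketing group is more than three times larger.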

The number of domains using both external datastore types was also found to be growing exponentially. From 2013-2015, websites using both a user and a marketing datastore grew at an average 2.88X Y/Y (exponential R2 = 0.998).


Breaking the growth rates down by specific categories of datastores, we found domains using an Analytics + Ad or AB Testing technology contributed the most to the exponential trend (exponential R2 = 0.990 and 0.991 respectively). By contrast, domains using a CRM + Ad or AB Testing technology seemed to contribute the most to the overall growth rate (2.68X Y/Y and 2.60X Y/Y respectively).

The high rate of external user / marketing datastore adoption may be due to a confluence of factors.

First, user and marketing datastores are often implemented by different parts of the business (e.g. engineering vs. marketing), leading to independent fragmentation of the data between teams.

Second, the pool of potential datastore options has grown significantly, with Forbes reporting that the business intelligence market is expanding. This has spurred the rise of integration platforms, which in turn make heterogeneous datastores easier to maintain. Segment Integrations and Optimizely Technology Partners are two such platforms that have made it easier to connect both user and marketing datastores.

More Data, More Problems

[Infographic: marketing technology, January 2015]

Clearly a company’s data no longer lives in one internal database.

Data is fragmented and spread across external CRM, Analytics, and Payment datastores, as well as Ad Remarketing, AB Testing, and Survey platforms. And the incidence of multiple datastore and cross-technology use is increasing at an exponential rate.

This trend is a natural byproduct of how SaaS tools are implemented. Different teams instrument distinct tools for their discrete problems, resulting in a tech stack that essentially matches your org chart (see infographic above). Reports from VentureBeat indicate data management will consequently be the #1 priority for enterprises in 2016.

Product analytics will become difficult as your user behavior is split between user and marketing datastores. Ad hoc questions become slow to answer, requiring complex queries and aggregations across multiple tables. Questions of conversion rates, affinity analysis, and lifetime value become nearly impossible to answer, as the user lifecycle is fragmented across datastores.
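To make the problem concrete, here is a minimal sketch of what answering one such question involves once the data lives in three places. Everything in it is hypothetical: the record layouts, the function names `unify_by_email` and `conversion_by_variant`, and the sample users are invented for illustration, and real exports from these kinds of tools rarely share a clean key like this.

```python
from collections import defaultdict

# Hypothetical exports from three separate tools, keyed by user email.
analytics_events = [
    {"email": "ana@example.com", "event": "signup"},
    {"email": "ben@example.com", "event": "signup"},
]
billing_charges = [
    {"email": "Ana@example.com", "amount_usd": 30.0},
    {"email": "ana@example.com", "amount_usd": 20.0},
]
ab_test_assignments = [
    {"email": "ana@example.com", "variant": "B"},
    {"email": "ben@example.com", "variant": "A"},
]

def unify_by_email(events, charges, assignments):
    """Merge the three sources into one record per lower-cased email."""
    users = defaultdict(lambda: {"events": [], "revenue_usd": 0.0, "variant": None})
    for e in events:
        users[e["email"].lower()]["events"].append(e["event"])
    for c in charges:
        users[c["email"].lower()]["revenue_usd"] += c["amount_usd"]
    for a in assignments:
        users[a["email"].lower()]["variant"] = a["variant"]
    return dict(users)

def conversion_by_variant(users):
    """Share of users in each test variant with any revenue."""
    totals = defaultdict(lambda: [0, 0])  # variant -> [paying users, all users]
    for user in users.values():
        if user["variant"] is None:
            continue
        totals[user["variant"]][1] += 1
        if user["revenue_usd"] > 0:
            totals[user["variant"]][0] += 1
    return {variant: paid / total for variant, (paid, total) in totals.items()}

users = unify_by_email(analytics_events, billing_charges, ab_test_assignments)
print(users["ana@example.com"]["revenue_usd"], conversion_by_variant(users))
```

Even in this toy case the join needs a normalization step (lower-casing emails) before the sources line up, which hints at the data integrity work a real pipeline takes on.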

More companies may in turn seek out data warehouse solutions. But as every company’s technology stack is different, they may more often just build custom solutions in-house, which in turn leads to high maintenance costs, data integrity issues, and concerns with anomaly detection.

Here at Bolt, we’re interested in helping companies with these challenges. Sign up for our beta to let us know how we can help your company.

And download our full report on data fragmentation, for additional industry specific analysis across B2B, Shopping, Travel and other verticals.

  • See how data sources have grown over 3 years
  • Compare CRM, Analytics, Payment, Ad platform adoption
  • Discover how 9 different industries are affected



About Bilal Mahmood

Cofounder @ Bolt. Formerly at Optimizely, Google, and the Dept of Commerce. Stanford & Gates Cambridge Alum. Pizza addict.