Is big data really such a big deal?
by James Lawson.
14 Jun 2012: We hear a lot about “big data” and its associated storage, processing and analysis challenges at the moment: the torrent of data produced by everything from web analytics to social media appears set to spread chaos through our systems. But does marketing really need to care about big data here in the UK? We talked to the experts to find out.
Elephant in the room
First, some numbers. According to IBM, we create 2.5 quintillion bytes (a quintillion being 10¹⁸, or 1 Exabyte) of data every day and 90% of the data in the world has been created in the last two years. With 24 hours of Twitter feeds equating to about 8 Terabytes, you can see why data volumes are ballooning at some businesses.
“Big data is an in-vogue term coined by the big consultancies,” says Ruaraidh Thomas, Managing Director of Data Lateral. “But it certainly is a challenge for marketing and it’s not just about volume.”
The hype around volumes is not entirely unjustified. Machine-generated data is largely to blame for the big increase in new data production. Unlike traditional formats, which tend to be relatively well described and change slowly, it is produced in far larger quantities: data from everything from jet engine management systems to radio telescopes flows ever faster, and its volume trends remorselessly upwards.
Companies involved in big data are collecting it from many sources and using it for many reasons: running a mobile phone network, managing user-generated text and video, streaming stock market information, handling legal compliance or supporting research.
“There’s a huge compliance element,” notes Thomas. “They have governmental responsibility to know what data they have in their business and store that compliantly.”
So purely on volume, what sort of file sizes do MSPs commonly deal with these days? “We hold around 380m records on our combined file, which covers all B2B and B2C reference data for the UK,” says Antony Allen, Managing Director of Data8. “The data is well structured and very tightly managed, with a minimum of 50,000 daily changes. The file size works out at about 1TB.”
Darron Gregory, Director of Insight and Innovation at Celerity, estimates a “reasonably-sized” hosted client database of two to three million customers would be in the “low Terabytes” today. Only four or five years ago, “it was rare for a marketing database to be larger than a few Gigabytes”. A three-orders-of-magnitude volume change shows that expanding data is a reality in UK marketing – but it’s still a far cry from big data.
So it’s not surprising that, though marketing use of big data is name-checked by vendors as a justification for system investment, real-life case studies are hard to find. Companies such as NICE are handling large data volumes to analyse call centre and web chat, Twitter feeds and other interactions in order to trigger offers or other actions, but examples tend to be more about “lots of data” rather than big data at the Exabyte-and-beyond level.
For marketing at the moment, it’s about “what’s in it for me?”: if new data is now available, how can we derive value from it? In the UK, it’s marketers themselves who are forcing volumes upwards as they demand more detailed information on customer and prospect behaviour. “The number of customers hasn’t changed, but the data around them has,” says Gregory.
So rather than only using derived variables – lifetime value, date of most recent purchase, annual income per customer and so forth – there’s more interest in analysing every single transaction. Instead of just the basket value, each item is recorded. Where they might simply have logged a web visit or an online purchase, companies increasingly want page-level details, such as the products viewed, whether they were added to the basket and whether a purchase resulted.

Building a single customer view is where these larger data sets do challenge marketing today. Putting the four components commonly used to describe big data – volume, velocity, value and variety – into MSP hosting terms translates to storing and processing large amounts of structured and unstructured customer-related data, and presenting it so that it can be swiftly analysed and exploited.
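To make that shift concrete, here is a minimal sketch of deriving the old summary variables from item-level transaction records – the record layout and customer IDs are invented purely for illustration:

```python
# Deriving summary variables from (hypothetical) item-level transactions,
# rather than storing pre-aggregated fields on the customer record.
from datetime import date

transactions = [
    {"customer": "CUST-1", "date": date(2012, 5, 1), "item": "A", "value": 20.0},
    {"customer": "CUST-1", "date": date(2012, 6, 3), "item": "B", "value": 35.0},
]

def derive(customer, txns):
    rows = [t for t in txns if t["customer"] == customer]
    return {
        "lifetime_value": sum(t["value"] for t in rows),
        "most_recent_purchase": max(t["date"] for t in rows),
        "items_bought": [t["item"] for t in rows],  # detail a pre-aggregated field loses
    }

print(derive("CUST-1", transactions))
```

Keeping the raw transactions means any new derived variable can be computed later without going back to source systems – at the cost of the volumes discussed above.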
“Volume is definitely in the top three challenges for any new client solution,” says Andy Grace, Technical Architect at Occam. “A simple rule of thumb for cost and effort is volume multiplied by complexity, where complexity is a measure of the number of data items to be integrated, the number and types of business rules that need to be applied to the data, plus the required transformation of data for final presentation.”
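Read literally – and with numbers invented purely for illustration – Grace’s rule of thumb might be sketched like this:

```python
# A rough, illustrative reading of Grace's volume-times-complexity rule of
# thumb. The inputs and units are assumptions for this sketch, not
# anything Occam actually uses.
def effort_estimate(volume_gb, n_data_items, n_business_rules, n_transforms):
    complexity = n_data_items + n_business_rules + n_transforms
    return volume_gb * complexity

# A 500GB feed with 20 items, 15 rules and 5 transformations scores four
# times a 250GB feed of half the complexity.
print(effort_estimate(500, 20, 15, 5))  # 20000
print(effort_estimate(250, 10, 7, 3))   # 5000
```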
To manage and analyse really big data quickly enough, companies like Google long ago moved away from relational databases to distributed platforms such as MapReduce and Hadoop, which spread work across multiple separate servers. Though most marketing database volumes may still be small in comparison, the move to the Terabyte scale and rising demand for real-time answers has forced changes in the databases used by MSPs. “We have changed the architecture of our marketing database,” says Gregory. “Real time is now the norm for us and the relational structure is no longer adequate.”
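For readers unfamiliar with the model, here is a minimal single-process sketch of the MapReduce pattern; the log records are invented, and a real Hadoop cluster would run the map and reduce phases across many servers rather than in one script:

```python
# The MapReduce pattern in miniature: count page views per visitor.
from collections import defaultdict

logs = [
    ("cookie-a", "/home"), ("cookie-b", "/product/42"),
    ("cookie-a", "/product/42"), ("cookie-a", "/basket"),
]

# Map phase: emit a (key, 1) pair per record. Each record is processed
# independently, which is what makes this phase trivially parallelisable.
mapped = [(cookie, 1) for cookie, _page in logs]

# Shuffle/reduce phase: group by key and sum the counts.
counts = defaultdict(int)
for cookie, n in mapped:
    counts[cookie] += n

print(dict(counts))  # {'cookie-a': 3, 'cookie-b': 1}
```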
Celerity now employs a data structure developed by data warehouse guru Ralph Kimball. Rather than the hierarchical structure you would tend to find in a relational marketing database where everything is tied back to an individual from the start, the company stores data at a transactional level and only later makes the linkages required to form a picture of a group or an individual’s behaviour – the so-called “presentation layer”. That suits online data in particular, where a site visitor might start as a cookie entry in a web log and only become a customer later on. This structure can handle far larger data volumes and is also better suited to the distributed computing model required for top performance.
“Shaping of the data in its storage phase of processing immediately limits its usefulness by fixing relationships between data items,” says Grace. “This fixed structure does not lend itself to the wide variety of analysis and selection techniques that may be required, so there is an increasing trend to provide this data shaping dynamically through the presentation layer.”
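As a rough illustration of storing transaction-level data first and linking it to individuals later, consider the sketch below; the event records and cookie-to-customer mapping are hypothetical, and an MSP platform would do this inside the database rather than in application code:

```python
# Events are stored at transaction level, keyed by whatever identifier was
# available at the time (here a cookie). The presentation layer resolves
# them to an individual on demand instead of fixing the link at load time.
events = [
    {"cookie": "c1", "action": "view", "product": "P42"},
    {"cookie": "c1", "action": "purchase", "product": "P42"},
    {"cookie": "c2", "action": "view", "product": "P7"},
]

# Linkage discovered later, e.g. when the visitor registers or buys.
cookie_to_customer = {"c1": "CUST-1001"}

def customer_view(customer_id):
    """Build an individual's history dynamically at presentation time."""
    cookies = {c for c, cust in cookie_to_customer.items() if cust == customer_id}
    return [e for e in events if e["cookie"] in cookies]

print(customer_view("CUST-1001"))  # both c1 events, now tied to the customer
```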
Storage itself is cheap and is possibly the simplest part of the big data challenge to solve – but that really only covers archiving. Crunching the numbers at the required velocity tends to demand high-end hardware, and dynamically building the required view for analysis demands faster processing still.
“This places much higher levels of workload on the storage, which needs to retrieve and join the data together as demanded,” says Grace. “This, combined with large volumes, leads to specialist solutions such as solid-state storage, appliance-based database engines or the new breed of in-memory solutions.”
This need for computing horsepower on demand is accelerating the shift to cloud-based data centres. “We haven’t bought a server for years,” says Gregory. “Scale and availability are more important now due to the need to send and respond to campaigns in real time. The days when an MSP manages its own infrastructure are almost gone.”

However, the traditional SCV-building challenges of cleansing and merging feeds into a coherent, actionable database still remain. Higher volumes and the variety of new data types simply exacerbate them.
“A lot of data feeds will contain erroneous or spurious data,” says Steven Day, Director at UKChanges. “There’s a learning curve to go through to successfully filter out the junk and filter in the useful stuff. Homing in on the right indicators from the data is likely to be an iterative process.”

Day also pinpoints consistency in the format and content of data feeds as crucial. “That’s especially true if third-party feeds feature, as there’s typically less control, and ‘big data’ types of projects are necessarily heavily automated,” he says. “Appropriate checks and balances are required to ensure everything stays in line.”
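As a sketch of the kind of automated checks and balances Day describes – with an invented feed format and an arbitrary reject-rate threshold – a first-pass feed filter might look like this:

```python
# Filter junk rows out of a (hypothetical) feed and report a reject rate,
# which can itself be monitored as a sanity check on the feed.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def clean_feed(records):
    kept = [r for r in records if r.get("email") and EMAIL_RE.match(r["email"])]
    reject_rate = 1 - len(kept) / max(len(records), 1)
    return kept, reject_rate

feed = [{"email": "a@example.com"}, {"email": "not-an-email"}]
kept, rate = clean_feed(feed)
print(kept, f"reject rate {rate:.0%}")  # alert if the rate drifts out of line
```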
As ever, most companies are well behind the leading edge of marketing practice; for those in the midmarket, big data at the moment really means making better – or any – use of online information. Email response data stagnates on Email Service Providers’ servers, while web analytics are only used to optimise site design.
“At present, big data is a potential requirement but it is not a reality in the day-to-day world of most of our clients,” says Day. “They struggle to decide which attribute, data feed or source to measure, what the data may mean and how to action it.”
If web and call centre data are two of the feeds most responsible for growing marketing database volumes, what about all that social media data? Social media analysis still tends to exist in a silo as it is difficult to bring the data into a marketing database in a meaningful way.
“The need to recognise the same people on and offline is fundamental to making use of social data,” says Thomas. “Sentiment analysis is more like traditional market research. But if you want to engage with someone directly through social media, you will at some point have to bring messages into the database at an individual level.”
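One common way to recognise the same person on and offline is to join the two sides on a normalised – often hashed – email address. The sketch below assumes both sides hold an email; the records and identifiers are invented:

```python
# Linking a (hypothetical) social post to a CRM record via a hashed,
# normalised email address used as the match key.
import hashlib

def match_key(email):
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()

crm = {match_key("Jane.Doe@example.com"): "CUST-77"}
post = {"handle": "@janed", "email": "jane.doe@example.com ", "text": "Love it!"}

print(crm.get(match_key(post["email"])))  # "CUST-77": store the post at individual level
```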
Preparing the way
Just as faster computers with more memory led developers to create larger, more complex software applications and bigger data sets, so marketers will seize the opportunity to store and analyse ever larger data sets in order to track and understand customer behaviour. Big data may be niche today, but the numbers using those kinds of volumes will surely grow.
“The industry may be a good two or three years away from this but it will happen,” says Gregory. “As an industry, we need to make sure we can provide the road map for our clients.”
Cleansing Big Files
For MSPs specialising in list hygiene, client file size is rarely a problem as all other attributes bar name and address are usually stripped out before processing. “We are seeing larger files come across these days,” says Data8’s Antony Allen, whose company specialises in online cleansing services. “We recently processed a 13m record file for one client. But it’s simply a list of names and addresses, and by the time the data is zipped, how big is it going to be?”
Depending on the complexity of validation and the business rules applied, modern cleansing software should be able to match the entire UK population against a set of reference files within a manageable period. But as deadlines tighten, whether that period is hours or days for larger files is becoming more important.
“With tighter schedules and the growth in volumes, our bureau clients are putting pressure on us to get our software to run as fast as possible,” says Mark Dobson, Client Services Director at The Software Bureau. “We’re rewriting Cygnus module by module to take advantage of the latest software innovations.”
That means using multi-threading to allow load sharing across multiple servers, and also offering SaaS delivery for the first time. The software already offers alternative matching techniques that suit different applications, for example, high-volume deduping versus name and address matching to suppression files.
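As a toy illustration of key-based deduping and suppression matching – far cruder than the fuzzy techniques commercial packages such as Cygnus actually employ – consider this sketch with invented records:

```python
# Dedupe and suppression-match name-and-address records on a crude
# normalised key (surname + postcode). Purely illustrative.
def match_key(record):
    return (record["surname"].strip().lower(),
            record["postcode"].replace(" ", "").lower())

records = [
    {"surname": "Smith", "postcode": "M1 1AA"},
    {"surname": "SMITH ", "postcode": "m11aa"},   # duplicate of the first
    {"surname": "Jones", "postcode": "LS1 4DY"},
]
suppression = {match_key({"surname": "Jones", "postcode": "LS1 4DY"})}  # e.g. goneaways

seen, output = set(), []
for rec in records:
    key = match_key(rec)
    if key in seen or key in suppression:  # one pass handles both cases
        continue
    seen.add(key)
    output.append(rec)

print(output)  # only the first Smith record survives
```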
“There’s not much more that we can add in functionality, but clients will be able to choose from an installed version, a service they can use themselves or one that we can deliver,” says Dobson.