For a business in digital transition, data architecture is a big decision. Selecting the right model is one of the first and most important choices of any such initiative. But given the breadth of options and confusing terminology, choosing a solution that meets the company’s needs without blowing its budget is no easy task.
Two of the most popular options are often referred to as “data warehouses” and “data lakes.” Think of a data warehouse like a shopping mall. It has discrete “shops” within it that store structured data — bits that are presorted into formats that database software can interact with.
In contrast, a data lake is like a disorganized flea market. It has “stalls,” but where one stops and the next one begins is not so clear. Unlike data warehouses, data lakes can contain both structured and unstructured data. Unstructured data, as the name implies, refers to “messy” digital information, such as audio, images, and video.
Complicating things further is the “data marketplace.” Unlike the first two concepts, this isn’t an architecture but rather an interface to a data lake that enables those outside of the IT team, such as business analysts, to access its contents. Through a search function, it allows users to fish what they need out of the lake. Think of data marketplaces like personal tour guides for flea markets, showing shoppers where to find the best deals.
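To make the search idea concrete, here is a minimal sketch of a marketplace-style lookup over a catalog of lake contents. It assumes the lake keeps a small metadata record for each data set. The `CatalogEntry` fields, the paths, and the tags are all hypothetical, and a real marketplace product would offer far more than keyword matching.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one data set in the lake (all fields hypothetical)."""
    name: str
    location: str
    kind: str  # "structured" or "unstructured"
    tags: set = field(default_factory=set)


# Illustrative catalog; the names and paths are made up for this example.
CATALOG = [
    CatalogEntry("quarterly sales", "lake/finance/sales.csv",
                 "structured", {"revenue", "finance"}),
    CatalogEntry("support call recordings", "lake/support/calls/",
                 "unstructured", {"audio", "customers"}),
    CatalogEntry("marketing video views", "lake/marketing/views.csv",
                 "structured", {"video", "marketing"}),
]


def _tokens(entry):
    """Lowercase words from the entry's name plus its tags."""
    words = set(re.findall(r"[a-z0-9]+", entry.name.lower()))
    return words | {tag.lower() for tag in entry.tags}


def search(catalog, query, kind=None):
    """Return entries whose name or tags contain every word of the query.

    An empty query matches everything; `kind` optionally restricts the
    results to "structured" or "unstructured" data sets.
    """
    terms = re.findall(r"[a-z0-9]+", query.lower())
    matches = []
    for entry in catalog:
        if kind is not None and entry.kind != kind:
            continue
        if all(term in _tokens(entry) for term in terms):
            matches.append(entry)
    return matches


if __name__ == "__main__":
    for hit in search(CATALOG, "video marketing"):
        print(hit.name, "->", hit.location)
```

A business analyst would never see this code, only the search box in front of it. The point is that the marketplace works from metadata about the lake, not from the raw files themselves.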
Inside the Data Warehouse and Data Lake
For a firm that’s looking to analyze large but structured data sets, a data warehouse is a good option. In fact, if the company is only interested in descriptive analytics — the process of merely summarizing the data one has — a data warehouse may be all it needs.
Let’s say, for example, company leaders want to look at sales figures across a particular time period, the number of inquiries about a product, or the view counts on various marketing videos. A data warehouse would be perfect for those applications because all of the associated figures are stored in the form of structured data.
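As a rough illustration of descriptive analytics on structured data, the sketch below totals sales by month for a chosen time period. The records, field names, and figures are hypothetical, and a real warehouse would typically answer this kind of question with a SQL query rather than Python.

```python
from collections import defaultdict
from datetime import date

# Hypothetical structured sales records (illustrative values only).
SALES = [
    {"day": date(2023, 1, 5), "product": "widget", "amount": 120.0},
    {"day": date(2023, 1, 20), "product": "gadget", "amount": 80.0},
    {"day": date(2023, 2, 3), "product": "widget", "amount": 200.0},
    {"day": date(2023, 3, 14), "product": "gadget", "amount": 50.0},
]


def monthly_totals(records, start, end):
    """Sum sales amounts per (year, month) for records dated within [start, end]."""
    totals = defaultdict(float)
    for record in records:
        if start <= record["day"] <= end:
            key = (record["day"].year, record["day"].month)
            totals[key] += record["amount"]
    return dict(totals)


if __name__ == "__main__":
    summary = monthly_totals(SALES, date(2023, 1, 1), date(2023, 2, 28))
    for (year, month), total in sorted(summary.items()):
        print(f"{year}-{month:02d}: {total:.2f}")
```

Because every record has the same fields, the summary is a simple filter and a sum. That regularity is what makes structured data such a natural fit for a warehouse.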
But for most companies embarking on big data initiatives, structured data is only part of the story. Each year, businesses generate a staggering quantity of unstructured data. In fact, 451 Research in conjunction with Western Digital found that 63 percent of enterprises and service providers are keeping at least 25 petabytes of unstructured data. For those firms, data lakes are attractive options because of their ability to store vast quantities of such data.
What’s more, data lakes allow analysts to go beyond descriptive analytics and into the exciting, and highly rewarding, domain of predictive or prescriptive analytics. Predictive analytics is the practice of using existing data to predict future trends relevant to one’s business, such as next year’s revenue.
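For a sense of what a prediction looks like in its simplest form, the sketch below fits a straight-line trend to past annual revenue and extends it one year ahead. The revenue figures are invented, and a linear trend is only an assumption made for illustration; real predictive models usually draw on many more variables and more sophisticated techniques.

```python
def fit_linear_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical annual revenue, in millions (illustrative values only).
YEARS = [2019, 2020, 2021, 2022]
REVENUE = [10.0, 12.0, 14.0, 16.0]

if __name__ == "__main__":
    slope, intercept = fit_linear_trend(YEARS, REVENUE)
    forecast = predict(slope, intercept, 2023)
    print(f"Forecast for 2023: {forecast:.1f} million")
```

The forecast is only as good as the assumption that the past trend continues, which is why predictive work benefits from the wider range of data a lake can hold.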
Prescriptive analytics goes a big step further, using artificial intelligence technologies to make recommendations in response to predictions. For both predictive and prescriptive analytics, a data lake is a must. Often, leaders manage data lakes using software like Apache Hadoop, a popular ecosystem of analytics tools.
Before springing for either a data lake or a data warehouse, think about who’ll be conducting data analyses and what sort of data they’ll need. Data warehouses are often accessible only by IT teams, while data lakes can be configured for access by analysts and business personnel across the company.
A healthcare organization my company worked with recently, for example, requested a data warehouse solution. Soon, though, it became apparent that the firm would instead require a data lake. Not only was it interested in predictive modeling, but it also sought to input all sorts of unstructured data, such as handwritten doctors’ notes.
Analysts at a healthcare company might pull treatment data from a data lake to predict patient outcomes. They might add a prescriptive layer to then recommend the best course of treatment for each patient’s needs — one that minimizes cost and risk while providing the highest quality of care.
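The prescriptive layer can be pictured as a scoring step that sits on top of the predictions. The sketch below is a toy version: it assumes each option already comes with a predicted quality, a predicted risk, and a cost, and it picks the option with the best weighted score. The option names, numbers, and weights are all hypothetical, and a weighted sum is just one simple way to trade these factors off; it is not a clinical method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One candidate course of action with its predicted outcomes (hypothetical)."""
    name: str
    predicted_quality: float  # 0 to 1, higher is better
    predicted_risk: float     # 0 to 1, lower is better
    cost: float               # currency units, lower is better


def score(option, max_cost, w_quality=0.5, w_risk=0.3, w_cost=0.2):
    """Weighted score: reward quality, penalize risk and relative cost."""
    normalized_cost = option.cost / max_cost
    return (w_quality * option.predicted_quality
            - w_risk * option.predicted_risk
            - w_cost * normalized_cost)


def recommend(options, **weights):
    """Return the option with the highest weighted score."""
    if not options:
        raise ValueError("no options to choose from")
    max_cost = max(option.cost for option in options)
    if max_cost <= 0:
        raise ValueError("at least one option must have a positive cost")
    return max(options, key=lambda option: score(option, max_cost, **weights))


# Illustrative options only; the predictions would come from a separate model.
OPTIONS = [
    Option("option A", predicted_quality=0.9, predicted_risk=0.40, cost=1000.0),
    Option("option B", predicted_quality=0.8, predicted_risk=0.10, cost=500.0),
    Option("option C", predicted_quality=0.5, predicted_risk=0.05, cost=100.0),
]

if __name__ == "__main__":
    print("Recommended:", recommend(OPTIONS).name)
```

Changing the weights changes the recommendation, so deciding how much quality, risk, and cost each matter remains a business decision rather than a technical one.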
Making the Most of the Data Lake
Given their ability to store both types of data and their suitability for future analytics needs, it’s tempting to think that data lakes are the obvious answer. But due to their loose structure, they’re sometimes derided as more of a data “swamp” than a lake.
In fact, Adam Wray, CEO and president of NoSQL database company Basho, described them as “evil because they’re unruly” and “incredibly costly.” In Basho’s experience, “the extraction of value [from data lakes] is infinitesimal compared to the value promised.”
But one shouldn’t count data lakes out just yet. Data marketplaces can rescue the promise of data lakes by organizing them for the end user. Just as the internet was much more difficult to navigate before Google, data marketplaces unlock the powerful data lake architecture.
In the analytics world, there’s no one-size-fits-all system. Data warehouses can give even smaller companies a taste of data analytics, while data lakes (when combined with data marketplaces) can enable enterprises to dive headfirst into big data. These systems aren’t mutually exclusive, either. If its analytics needs change, a company that chooses a warehouse can later add a lake and a marketplace.
What’s most important is starting the journey to a more data-driven business. Many executives will remember that a decade ago, data wasn’t even discussed outside of IT teams. Now, with the range of analytics needs and tools available, it’s executives’ turn to lead the conversation.