Introducing data lakes

Last Updated : Jan 26 2015 | 12:10 AM IST

You have probably heard of data warehousing, but now there's a newer phrase doing the rounds, and it is one you're likely to hear more in the future if you're involved in big data: 'Data Lakes'.

So what is a data lake? Well, the best way to describe it is to compare it to data warehouses, because the difference is very much the same as between storing something in a warehouse and storing something in a lake.

In a warehouse, everything is archived and ordered in a defined way - the products are inside containers, the containers on shelves, the shelves are in rows, and so on. This is the way that data is stored in a traditional data warehouse. In a data lake, everything is just poured in, in an unstructured way. A molecule of water in the lake is equal to any other molecule and can be moved to any part of the lake where it will feel equally at home.

This means that data in a lake has a great deal of agility - another word which is becoming more frequently used these days - in that it can be configured or reconfigured as necessary, depending on the job you want to do with it.

A data lake contains data in its rawest form - fresh from capture, and unadulterated by processing or analysis.

It uses what is known as object-based storage, because each individual piece of data is treated as an object, made up of the information itself packaged together with its associated metadata, and a unique identifier.
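To make the idea of an "object" concrete, here is a minimal Python sketch of a flat store in which each entry bundles the raw data, its metadata and a unique identifier. The names `StoredObject` and `FlatObjectStore` are hypothetical, and an in-memory dictionary stands in for real distributed storage; it illustrates the concept and is not any particular product's API.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    """One object: the raw data, its metadata, and a unique identifier."""
    object_id: str
    data: bytes
    metadata: dict


class FlatObjectStore:
    """Hypothetical in-memory stand-in for an object store (illustration only)."""

    def __init__(self):
        # A single flat namespace: no folders, no hierarchy, just id -> object.
        self._objects = {}

    def put(self, data: bytes, **metadata) -> str:
        """Store raw bytes untouched and return the generated identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = StoredObject(object_id, data, dict(metadata))
        return object_id

    def get(self, object_id: str) -> StoredObject:
        """Fetch an object directly by its identifier."""
        return self._objects[object_id]

    def find(self, **criteria) -> list:
        """Return every object whose metadata matches all the given key/value pairs."""
        return [
            obj for obj in self._objects.values()
            if all(obj.metadata.get(key) == value for key, value in criteria.items())
        ]
```

Because there is no hierarchy to navigate, objects are located either by their identifier or by searching their metadata, for example `store.find(source="crm")`.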

No piece of information is "higher-level" than any other, because it is not a hierarchically archived system like a warehouse - it is basically a big free-for-all, much like the way water molecules exist in a lake.

The term is thought to have first been used in 2011 by Pentaho CTO James Dixon, who didn't invent the concept but gave a name to the type of innovative data architecture solutions being put to use by companies such as Google and Facebook.

It didn't take long for the term to be picked up by other companies - Hortonworks, for instance, includes it in the name of its service, Hortonworks Datalakes. It is a practice which is expected to become more popular in the future, as more organisations become aware of the increased agility afforded by storing data in data lakes rather than in strict hierarchical databases. For example, the way that data is stored in a database (its "schema") is often defined in the early days of the design of a data strategy, yet the needs and priorities of the organisation may well change as time goes on, leaving it with a structure that no longer fits.

One way of thinking about it is that data stored without structure can be shaped into whatever form is needed more quickly than data whose previous structure must first be disassembled and then reassembled.
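This approach is often described as applying the schema when the data is read rather than when it is written. The short Python sketch below illustrates it under stated assumptions: the sample events, the field names and the helper functions `read_with_schema`, `revenue_by_user` and `events_by_region` are all invented for illustration. The same raw records are kept untouched and shaped in two different ways, one per job.

```python
import json

# Raw records kept exactly as captured (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "alice", "action": "purchase", "amount": 30, "region": "north"}',
    '{"user": "bob", "action": "view", "region": "south"}',
    '{"user": "alice", "action": "purchase", "amount": 12.5, "region": "north"}',
]


def read_with_schema(raw_lines, fields):
    """Parse each raw line and keep only the requested fields.

    `fields` maps a field name to the default used when a record lacks it,
    so the shape of the result is decided by the reader, not the writer.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name, default) for name, default in fields.items()}


def revenue_by_user(raw_lines):
    """One job's view: total purchase amount per user."""
    totals = {}
    schema = {"user": None, "action": None, "amount": 0.0}
    for row in read_with_schema(raw_lines, schema):
        if row["action"] == "purchase":
            totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


def events_by_region(raw_lines):
    """A different job's view of the same raw data: event count per region."""
    counts = {}
    for row in read_with_schema(raw_lines, {"region": "unknown"}):
        counts[row["region"]] = counts.get(row["region"], 0) + 1
    return counts
```

Neither view required the stored records to be restructured; if priorities change, a new reader with a different set of fields can be written against the same raw data.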

Another advantage is that the data is available to anyone in the organisation, and can be analysed and interrogated via different tools and interfaces as appropriate for each job.

It also means that all of an organisation's data is kept in one place - rather than having separate data stores for individual departments or applications, as is often the case.

This brings its own advantages and disadvantages - on the one hand, it makes auditing and compliance simpler, with only one store to manage. On the other, there are obvious security implications if you're keeping "all your eggs in one basket."

Data lakes are usually built within the Hadoop framework, as the datasets they are composed of are "big" and need the volume of storage offered by distributed systems. Much of this remains theoretical at the moment, because there are very few organisations which are ready to make the move to keeping all of their data in a lake. Many are bogged down in "data swamps" - hard-to-navigate mishmashes of land and water where their data has been stored in various, uncoordinated ways over the years.

And it has its critics of course - some say that the name itself is a problem (and I am inclined to agree) as it implies a lack of architectural awareness, when a more careful consideration of data architecture is what's really needed when designing new solutions.

The author is Bernard Marr, a big data expert. Re-printed with permission.
Link: https://www.linkedin.com/pulse/big-data-what-heck-lakes-bernard-marr?trk=prof-post
