Big Data Using NoSQL – Where and How?



Most data integration and data management companies, products and tools primarily provide a unified way to combine, view and control data residing in different sources. The basis of this unified way is ETL – Extract, Transform and Load. All data-crunching products and services use ETL processes either on an ad-hoc basis or on a regular schedule (hourly, daily, weekly, monthly). However, the choice of the underlying data platform, i.e. the database engine, is left to data architects and application architects. Can we imagine a platform or a service that can drive this decision for them? At a high level, the question looks like “Which data platforms are most suitable for my transactions or workloads?” But it really contains two questions: i) Where should my data reside? ii) How should my data reside?

Traditionally, enterprise and open-source RDBMS have been the answer to these two questions, and there are thousands of white papers and PowerPoint decks describing which RDBMS is better in which situation and why. These are again based on the internal research of an individual RDBMS shop or on conclusions provided by data architects. But I have not seen a platform or service that helps drive my decision by doing trial runs on my own data and comparing the results. This may be acceptable today, because RDBMS are mature, hardened and widely used data platforms. However, we cannot say the same when it comes to big data using NoSQL tools. I wish to showcase a platform or service that lets me try and compare multiple NoSQL tools against my data-crunching and processing requirements, and finally land on the right tool(s).

Problem Statement

If one searches for “NoSQL vs RDBMS”, there are at least a million articles and papers, and to set expectations right, this blog is not joining that debate at all. Having worked closely for the last 7+ years with large data sets processed using batch-processing and near-real-time techniques, I strongly believe NoSQL databases are not going to replace RDBMS, and I dare say NoSQL is not even targeted at that purpose. The points below highlight my concerns, which collectively form the “Problem Statement”:

1) Many non-transactional data systems, e.g. data warehouses, are big RDBMS shops, but their high cost-to-performance ratio makes NoSQL tools attractive as a scalable alternative for certain workloads – though not for all workloads. The main question here: for those certain workloads, do we really have to pay the high cost of enterprise RDBMS databases?

2) Data systems require data to be retained for a long time, and the data volume grows every month. In theory, scaling out storage and compute nodes is a good solution, but how many times do we struggle with the CIO to get more budget for servers and DB licenses?

3) The performance gap between compute nodes and storage nodes has increased the need for more memory to cache data for faster IO. Can we look at tools that are geared toward in-memory, cache-based processing, and worry less about where the data eventually gets persisted?

4) Many data points are written only once and read multiple times. Can we treat reads (RD) and writes (WR) as two separate requirements, and hence adopt two different solutions?

5) The nature and format of the data itself is becoming dynamic and unstructured, which forces us to go beyond a static schema. Can we get an approach with a programmable schema? (I tried this in the past using RDBMS and saw little success, due to the way the data kept changing.)

6) Even when RD and WR patterns differ, they can be handled using multiple NoSQL technologies. Do we have to invest in the larger support and subject-matter-expert (SME) cost of tuning a jack-of-all-trades enterprise RDBMS? Can we instead dive deep into multiple NoSQL technologies to check which ones match the data workload requirements? Note that the support and SME cost of a NoSQL DB is not zero, but it could turn out lower than traditional enterprise DB support cost.

7) Are we ready to re-examine the cost paid so far for the overall scalability and management of our data systems, and strive to find lower-cost solutions for certain data workloads?
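Point 5 above can be illustrated with a minimal sketch: a document-style store where the “schema” is ordinary data that validation code consults at runtime, so new fields can be admitted (and the schema itself evolved) without a DDL change. All names and fields here are hypothetical, not from the actual project.

```python
# Hypothetical sketch of a "programmable schema": the schema is plain data,
# so the application can evolve it at runtime instead of issuing
# ALTER TABLE statements against a fixed RDBMS schema.

schema = {
    "device_id": str,   # required fields and their expected types
    "timestamp": int,
}

def validate(doc, schema, allow_extra=True):
    """Check required fields and types; optionally admit unknown fields."""
    for field, ftype in schema.items():
        if field not in doc or not isinstance(doc[field], ftype):
            return False
    if not allow_extra:
        return all(k in schema for k in doc)
    return True

# A new metric appears in the device feed -- no schema migration needed:
doc = {"device_id": "dev-42", "timestamp": 1700000000, "rssi_dbm": -67}
print(validate(doc, schema))  # prints True: extra field is admitted

# Later, the schema itself is tightened programmatically:
schema["rssi_dbm"] = int
```

This is exactly the flexibility that was hard to retrofit onto a static RDBMS schema: the shape of incoming documents can drift ahead of the schema, and the schema catches up in application code.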

A few months back, we received a requirement to migrate from one well-known RDBMS to HBase. There was a written analysis of how the existing RDBMS was a bottleneck for certain transactions, but not much conclusive analysis of the target data platform. The reasons were either unknown, or known but not shared. It certainly exposed the need for a tool that drives the “Where to Go” decision and helps with “How to Go”. This would avoid the tendency to ride the NoSQL wave without much analysis.

Where and How to Go?

This section walks through one use-case-driven experiment, which emphasises the need for a platform where such analysis can be done before actually deploying and using NoSQL tools.

Identify Workload and Analyse

Any workload is measured by well-known parameters such as time, number of CPU cores, memory consumption and storage. These workload parameters in turn vary with two further parameters: the volume of the data and the type and frequency of data IO patterns.

For example, one use case required pushing raw data from wireless and wireline devices to the cloud for further analysis and reporting feeds. The parameters measured were the number of devices, data size per device, the different types of data pushed by devices, the frequency of data pushes, the retention period of the data, and the analysis requirements for the data.

Looking at the data patterns, we realised it was write-few-times, read-many-times. The raw data (termed fact data in the data-crunching world) did not need to be retained beyond a 2-month period, and each day’s data was no more than 200 million rows. However, it was critical to store the daily raw data for at least those 2 months.
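Those retention figures translate into a rough storage sizing. The row count and retention period come from the use case above; the average encoded row size is a hypothetical figure chosen only to make the arithmetic concrete.

```python
# Rough sizing from the stated workload: 200M rows/day, 2-month retention.
rows_per_day = 200_000_000
retention_days = 60        # at least 2 months of daily raw data
bytes_per_row = 200        # hypothetical average encoded row size

total_rows = rows_per_day * retention_days
total_tb = total_rows * bytes_per_row / 1e12
print(f"{total_rows:,} rows retained, ~{total_tb:.1f} TB before replication")
# prints: 12,000,000,000 rows retained, ~2.4 TB before replication
```

At the projected scale of 2 billion rows per day, the same arithmetic gives ten times these figures, which is where the scaling question becomes acute.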

To store the raw data, we started with one RDBMS that offered easy replication, and compared it with a persistent cache store. Both performed reasonably well for RD and WR (store) operations at 200 million rows per day; however, when scaling from 200 million to 2 billion rows per day, the persistent cache store fared better in terms of cost-to-performance ratio.
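A trial run of the kind described above can be sketched as a tiny benchmark harness: time a batch of writes and then reads against each candidate store behind a common interface, and compare. The store below is a stand-in in-memory dict; in a real trial it would be a client for the actual RDBMS or persistent cache store.

```python
import time

class DictStore:
    """Stand-in for a real store client (RDBMS driver, cache client, ...)."""
    def __init__(self):
        self.data = {}
    def write(self, key, row):
        self.data[key] = row
    def read(self, key):
        return self.data[key]

def benchmark(store, n_rows):
    """Return (write_seconds, read_seconds) for n_rows round trips."""
    t0 = time.perf_counter()
    for i in range(n_rows):
        store.write(i, {"device": i % 1000, "value": i * 1.5})
    t1 = time.perf_counter()
    for i in range(n_rows):
        store.read(i)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1

wr, rd = benchmark(DictStore(), 100_000)
print(f"write: {wr:.3f}s  read: {rd:.3f}s")
```

Running the same harness against each candidate, at increasing row counts, is what turns “which platform scales better for this workload” from opinion into a measured comparison.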

Identify Data Processing Patterns and Analyse

The data processing patterns were divided into three high-level categories:

1) Aggregation across all devices, per certain dimension and metric combinations.

One day’s aggregated data resulted in 50+ million rows, and a week’s in 180+ million rows. The projection was to reach 300+ million rows per day and 600+ million rows per week.

The aggregation was compared between one enterprise DB and two columnar NoSQL tools, and both columnar NoSQL tools provided a better cost-to-performance ratio.

2) Aggregation per device, per certain dimension and metric combinations.

One day’s aggregated data resulted in 120+ million rows, and a week’s in 300+ million rows. The projection was to reach 360+ million rows per day and 900+ million rows per week.

Again, the aggregation was compared between one enterprise DB and two columnar NoSQL tools, and both columnar NoSQL tools provided a better cost-to-performance ratio.

3) Generate device profiles using the aggregated data per device, and read device profiles.

The RD operations were compared between one enterprise DB and two document-based NoSQL tools. With growing data, the document-based NoSQL tools provided a much better cost-to-performance ratio.
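The two aggregation categories above differ only in whether the device itself is part of the grouping key. A minimal sketch of that distinction, over a few hypothetical fact rows:

```python
from collections import defaultdict

# Hypothetical fact rows; real feeds had many dimensions and metrics.
rows = [
    {"device": "d1", "region": "east", "bytes": 100},
    {"device": "d1", "region": "east", "bytes": 50},
    {"device": "d2", "region": "east", "bytes": 70},
    {"device": "d3", "region": "west", "bytes": 30},
]

def aggregate(rows, key_fields, metric):
    """Sum `metric` grouped by the chosen dimension combination."""
    totals = defaultdict(int)
    for r in rows:
        key = tuple(r[f] for f in key_fields)
        totals[key] += r[metric]
    return dict(totals)

# Category 1: across all devices, per dimension (region) only.
print(aggregate(rows, ["region"], "bytes"))
# Category 2: per device and per dimension -- a wider key, hence more rows.
print(aggregate(rows, ["device", "region"], "bytes"))
```

The wider the grouping key, the more output rows the aggregation produces, which is why category 2 volumes (120M+ rows/day) exceeded category 1 volumes (50M+ rows/day) in the trials.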

The cost-to-performance ratio was calculated per million rows. The cost included the DB license cost, storage cost, compute cost (CPU cores and memory), the time to store/aggregate/read the data, and the cost of replication feature licenses plus overhead.
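As a hypothetical illustration of that metric, the ratio can be computed by summing the cost components named above for a given run and normalising by the rows processed, in millions. The figures below are invented for illustration; they are not the actual numbers from the trials.

```python
def cost_per_million_rows(rows_processed, license_cost, storage_cost,
                          compute_cost, replication_cost, overhead_cost):
    """All-in cost normalised to one million processed rows."""
    total = (license_cost + storage_cost + compute_cost
             + replication_cost + overhead_cost)
    return total / (rows_processed / 1_000_000)

# Invented figures for two candidate platforms over the same daily workload:
rdbms = cost_per_million_rows(200_000_000, 50_000, 8_000, 12_000, 10_000, 5_000)
nosql = cost_per_million_rows(200_000_000, 0, 6_000, 15_000, 0, 4_000)
print(f"RDBMS: ${rdbms:.2f}/M rows, NoSQL: ${nosql:.2f}/M rows")
# prints: RDBMS: $425.00/M rows, NoSQL: $125.00/M rows
```

Normalising per million rows is what makes runs at different scales comparable, so the same formula can be re-applied as the workload grows from 200 million toward 2 billion rows per day.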