Since its introduction in 2011, SAP has been pushing HANA very heavily, and there is a lot of marketing buzz around this new product. For a freelance consultant focused on SAP Sybase database products, like me, it is next to impossible to ignore HANA in 2013. So I decided not to rely on marketing slogans and to check what HANA is, what it can do, and, importantly, what HANA is NOT. I put my first impressions into this blog post; hopefully other HANA-related posts will follow. Note that I'm not a HANA expert (yet!) and I'm writing these lines as a person with a lot of experience with IQ and some other RDBMSs who is trying to learn HANA.
So, why compare HANA and IQ? Both are designed for data warehouse environments, both are column-based (with some support for row-based data), both provide data compression out of the box, and both are highly parallel. Years ago, much like SAP for HANA today, Sybase claimed that IQ processed data so fast that aggregation tables were not really needed, because the aggregations could simply be performed on the fly. Well, experience with a number of big projects showed me how problematic that statement was, and it is only a single example.
According to SAP, the strong point of HANA is its ability to utilize the CPU cache, which is much faster than accessing main memory (0.5-15 ns vs. 100 ns). Currently, IQ and other Sybase RDBMSs lack this capability.
Therefore, I decided to build a test environment that allows running queries that satisfy a number of conditions:
- The query should be performed fully in memory; otherwise it is not fair to compare IQ and HANA. In HANA, queries are executed fully in memory once the relevant columns are loaded into RAM.
- The query should perform a lot of logical I/Os and should be hard to optimize using indexes. Otherwise, the effect of using the CPU cache may not be clear enough.
- The query should take at least a number of seconds to finish. Since both IQ and HANA (very unfortunately) don't report the number of logical I/Os performed by a query, we can compare response times only. If the query finishes in a few milliseconds, the comparison of response times may be problematic.
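To make sure the first condition holds on the HANA side, a table's columns can be preloaded explicitly before measuring anything. A minimal sketch using HANA's LOAD statement and the M_CS_TABLES monitoring view (applied here to the CDRs test table described below):

```sql
-- Preload all columns of the CDRs table into memory, so the first
-- measured run does not pay the cost of lazy column loading.
LOAD CDRs ALL;

-- Check the load status of the table (TRUE / PARTIALLY / NO)
SELECT table_name, loaded
  FROM M_CS_TABLES
 WHERE table_name = 'CDRS';
```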
Some notes about the test environment:
For IQ, I used a 16-core RHEL server with hyper-threading turned on (32 cores visible to the OS) and 140 GB RAM available. I used IQ 16.0 SP01 for my tests.
For HANA, I had to use HANA SPS6 Developer Edition on a Cloudshare VM, which provides HANA on a Linux server with 24 GB RAM. However, only 19.5 GB is actually available from the Linux point of view (free -m output), and most of this memory is allocated by various HANA processes. In fact, less than 3 GB of RAM is available for user data in HANA. I only wish that SAP would allow us to download HANA and install it on any server that meets HANA's CPU requirements, but it seems that SAP's policy is to distribute HANA as part of appliances only, so I don't expect a free HANA download any time soon.
This brings us to an additional requirement for the test: the test dataset should be relatively small, because of the severe RAM restrictions imposed by the HANA Developer Edition on Cloudshare.
Finally, I decided to base my tests on a relatively narrow table that represents information about phone calls (for those involved in the Telecom industry, it is like short and very much simplified CDRs). Here is the structure of the table:
create table CDRs (
    CDR_ID    unsigned bigint, -- Phone conversation ID
    CC_ORIG   varchar(3),      -- Country code of the call originator
    AC_ORIG   varchar(2),      -- Area code of the call originator
    NUM_ORIG  varchar(15),     -- Phone number of the call originator
    CC_DEST   varchar(3),      -- Country code of the call destination
    AC_DEST   varchar(2),      -- Area code of the call destination
    NUM_DEST  varchar(15),     -- Phone number of the call destination
    STARTTIME datetime,        -- Start time of the conversation
    ENDTIME   datetime,        -- End time of the conversation
    DURATION  unsigned int     -- Duration of the conversation in seconds
);
I developed a stored procedure that fills this table in SAP Sybase ASE row by row according to some meaningful logic, and prepared delimited files for IQ and HANA. The input files are available upon request. At first, I planned to run the tests on a dataset with 900 million rows, but I finally discovered that I had to go down to 15 million rows because of the VM memory limitations mentioned above.
An important note about terminology. In IQ, inserting data from a delimited file into a database table is called LOAD, and retrieving data from a table into a delimited file is called EXTRACT. In HANA, the inserting is called IMPORT and the retrieving is called EXPORT. The term LOAD in HANA has a totally different meaning: it means loading a whole table, or some of its columns, into memory from disk, when the data is already in the database.
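To illustrate the terminology, here is roughly how the same delimited file would be brought into each product. The file path and delimiter are assumptions for the sketch, not the actual files from my tests:

```sql
-- IQ: LOAD reads a delimited file into the table
LOAD TABLE CDRs
    (CDR_ID, CC_ORIG, AC_ORIG, NUM_ORIG,
     CC_DEST, AC_DEST, NUM_DEST,
     STARTTIME, ENDTIME, DURATION)
FROM '/data/cdrs.csv'
DELIMITED BY ','
ESCAPES OFF
QUOTES OFF;

-- HANA: the same operation is called IMPORT
IMPORT FROM CSV FILE '/data/cdrs.csv' INTO CDRs
WITH RECORD DELIMITED BY '\n'
     FIELD DELIMITED BY ',';
```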
The IMPORT functionality in HANA is not similar to IQ's at all. It actually consists of two phases: IMPORT and MERGE. During the first phase, the data is imported into a "delta store" in an uncompressed form. Then, the data from the "delta store" is merged into the "main store", where the table data actually resides. The merge is performed automatically when a configurable threshold is crossed (for example, the size of the "delta store" becomes too big). To ensure that the imported data is fully inside the "main store", a manual MERGE may be required. The memory requirements during the MERGE process are quite interesting; maybe I will write about them in a different post. It is quite possible that you will be able to IMPORT the data but will not have enough memory to MERGE it; it happened to me a number of times during my tests. I would recommend reading more about the HANA architecture here: http://www.saphana.com/docs/DOC-1073, Chapter 9.
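For reference, a manual merge can be triggered with a single statement, and the delta store size can be watched through the M_CS_TABLES monitoring view. A sketch for the CDRs table:

```sql
-- Force the delta store of CDRs to be merged into the main store
MERGE DELTA OF CDRs;

-- How many rows are still sitting in the delta store?
SELECT table_name, raw_record_count_in_delta, memory_size_in_delta
  FROM M_CS_TABLES
 WHERE table_name = 'CDRS';
```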
Given the significant difference between the test systems (a powerful dedicated server for IQ vs. a small VM for HANA), I didn't plan to compare data load performance between IQ and HANA. However, so far I have seen HANA performing the IMPORT using no more than 1.5 of the 4 available cores, thus underutilizing the available hardware. The MERGE phase, though, is executed in a much more parallel way. The bottom line is that IQ seems to outperform HANA in data loading, possibly by far. I will probably return to this topic in one of the following posts; additional tests with a larger dataset are required.
Now we come to data compression. Since IQ and HANA approach indexing quite differently, I chose to compare compression without non-default indexes in both IQ and HANA. It appears that IQ provides better data compression: it needs 591 MB to store 15,000,000 rows, while HANA needs 748 MB to store the same data. HANA provides a number of compression algorithms for columns, which are chosen automatically according to the data type and data distribution. However, it seems that none of the compression algorithms offered by HANA includes the LZW-like compression used by IQ. I'd prefer to test the compression on a more representative dataset (15,000,000 rows is way too small) and to play with different HANA compression algorithms. I hope one of the future posts will be dedicated to this topic.
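The per-column choices HANA makes can be inspected through the M_CS_COLUMNS monitoring view. A sketch, assuming the SPS6-era column names:

```sql
-- Which compression algorithm did HANA pick for each column,
-- and how much memory does each column occupy?
SELECT column_name, compression_type, memory_size_in_total
  FROM M_CS_COLUMNS
 WHERE table_name = 'CDRS'
 ORDER BY memory_size_in_total DESC;
```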
Finally, the data is inside the database and we are ready to query it. To satisfy the test conditions mentioned above, I chose the following query:
select
    a.CDR_ID CDR_ID_1, b.CDR_ID CDR_ID_2,
    a.NUM_ORIG NUM_A, a.NUM_DEST NUM_B, a.STARTTIME STARTTIME_1, a.ENDTIME ENDTIME_1,
    a.DURATION DURATION_1,
    b.NUM_DEST NUM_C, b.STARTTIME STARTTIME_2, b.ENDTIME ENDTIME_2,
    b.DURATION DURATION_2
from CDRs a, CDRs b
where a.NUM_DEST = b.NUM_ORIG
and datediff(ss, a.ENDTIME, b.STARTTIME) between 5 and 60
order by a.STARTTIME;
This query finds cases when person A called person B and then person B called person C almost immediately (within 60 seconds). This query has to perform a lot of logical I/O by its very definition. With my test dataset, this query returns 31 rows.
In IQ, this query takes 6.6 seconds when executed fully in memory and with all relevant indexes in place. The query uses a sort-merge join and runs with a relatively high degree of parallelism, utilizing about 60% of the 32 available CPU cores.
In HANA, the same query takes only 1 second with no indexes in place! Remember that in my tests HANA is running on a small VM with just 4 virtual CPU cores! The query finishes so fast that I cannot measure the degree of parallelism. Creating indexes on NUM_ORIG and NUM_DEST reduces the response time to 900 ms.
A note about indexes in HANA: HANA offers only two index types and, by default, it chooses the index type automatically. In my tests, I found that indexes improve query performance in HANA, sometimes significantly. Unfortunately, I have not found any indication of index usage in HANA query plans, even when some indexes were certainly used by the query. The role of optimizer statistics in query plan generation is also not very clear to me. I hope to prepare a separate post about query processing in HANA; stay tuned!
Another amazing and totally unexpected finding in HANA: creating an index on NUM_DEST (varchar(15)) takes 194 ms. An index on DURATION (int) is created in 12 ms!
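For completeness, indexes like the ones mentioned above can be created with plain CREATE INDEX statements, letting HANA pick the index type automatically; the chosen type can then be read from the INDEXES system view. A sketch (the index names are my own):

```sql
-- Let HANA choose the index type automatically
CREATE INDEX idx_cdrs_num_orig ON CDRs (NUM_ORIG);
CREATE INDEX idx_cdrs_num_dest ON CDRs (NUM_DEST);

-- See which index type HANA actually picked
SELECT index_name, index_type
  FROM INDEXES
 WHERE table_name = 'CDRS';
```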
My conclusions so far:
- HANA in-memory processing is not just about caching; it is much more than that. HANA allows us to achieve incredible performance for resource-intensive queries. Things that seem impossible with other databases, column-based or row-based, may become possible with HANA.
- Loading data into HANA requires careful resource and capacity planning. Merging the inserted data with the rest of the table may require much more memory than you probably thought. In particular, to perform the merge, both the old and the new version of the table should fit into memory.
- It is quite possible that storing aggregations in HANA is indeed not required, at least in most cases. Of course, I need a more representative result set to verify this.
- IQ and HANA can be used together in the same system, where they can solve different problems and store different data. HANA is very good for real-time data processing, or for queries that must be executed very quickly. However, in most cases it is not feasible to store a whole multi-terabyte data warehouse in HANA's memory, at least not in 2013. At this point, IQ enters the game. It is very efficient at massive data loading and data storage, and can answer queries with less strict response time requirements very efficiently. In some scenarios, the raw data can be loaded into IQ and then, after some refining inside IQ, imported into HANA.
Update: see the IQ query plan for my test case here:
Download ABC_15mln_fully_in_memory
Disclosure: I am a real user, and this review is based on my own experience and opinions.