Consultant at a tech consulting company with 51-200 employees
Column Oriented, Open-Source Analytic Database
Interesting article about big data and column-oriented engines, written by a student…
… Students majoring in software engineering are required to take a course named “Data Storage & Information Retrieval Systems” (DS&IRS) as a prerequisite for Databases. DS&IRS mainly focuses on optimized storage and retrieval of data on peripheral storage such as an HDD or even tape! (I did one of my most sophisticated and enjoyable projects during this course: we had to implement a search engine which, compared to Google’s boolean model, is supposed to be more accurate. More information about this project can be found in my older posts.) During these courses, students work on specific projects designed to help them gain a better perspective on, and intuition for, the problems involved.
I don’t know about other universities, but at ours, seeing students’ performance on such projects is a real disappointment. While doing fun projects like these as part of a course is quite an opportunity to learn more, students beg to differ. The prevailing attitude is that our professors are torturing us, and that we should resist doing any homework or projects! You have no idea how hard it is to escape that dogma when you have to live among such students. It is unfortunate how reluctant most students are toward any level of studying. For such students, learning only happens when they are faced with a real problem or task.
So here’s the problem. Suppose you are doing your internship at a data analysis company. You are given 100 GB of data, consisting of 500 million records, or observations. How would you manage to use that amount of data? If you recall the DS&IRS course, you’d know that a single pass through all the records would take at least 30 minutes, assuming all of your hardware is average consumer level. Now imagine you have a typical machine learning optimization problem (a simple unimodal function) that may require at least 100 iterations to converge. Roughly estimated, you’d need at least 50 hours to optimize your function! So what would you do?
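The arithmetic behind that estimate can be sketched in a few lines; the figures are the rough ones assumed in the text, not measurements:

```python
# Back-of-envelope estimate using the assumed figures from the text:
# one sequential scan of all 500 million records takes ~30 minutes,
# and the optimizer needs ~100 iterations (one scan each) to converge.
seconds_per_pass = 30 * 60        # one full scan of the 100 GB dataset
iterations = 100                  # rough iteration count until convergence
total_hours = seconds_per_pass * iterations / 3600
print(total_hours)                # -> 50.0 hours just for one optimization run
```

The point is that the cost is dominated by repeated full scans, which is exactly what the indexing discussion below tries to avoid.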
That kind of problem has nothing to do with your method of storage: a simple contiguous block of data, which minimizes seek time on the hard disk and is read sequentially, is the best you can get. Such problems are tackled with an optimization method that minimizes hard-disk access while still finding a decent, near-optimal solution.
Now imagine you could reduce the amount of data needed on each iteration by selecting only records with a specific feature. What would you do? The former problem doesn’t even need a database to perform its job. But now that you need to select records with a specific attribute (not necessarily a specific value), you shouldn’t just iterate through the data and test every record against your criteria. You need to organize the data on disk and build a sensible index that reduces disk access and answers your query exactly (or close enough). That’s when databases come in handy.
Now the question is, what kind of database should I use? I’m a Macintosh user with limited RAM, a small and slow hard disk, and a modest processor! Is Oracle the right choice? The answer is no: you have a specific need, and these general-purpose databases may not be the logical choice, not to mention their price. So what kind of service do we require? In general, users may need to update records, alter a table’s schema, and so on. To provide such services, databases sacrifice speed, memory, and even processor time. Long story short, I found an open-source alternative that was perfect for my need.
Infobright is an open-source database which is claimed to “provide both speed and efficiency. Coupled with 10:1 average compression”. According to their website, the main features (for my use) are:
- Ideal for data volumes up to 50TB
- Market-leading data compression (from 10:1 to over 40:1), which drastically reduces I/O (improving query performance) and results in significantly less storage than alternative solutions.
- Query and load performance remains constant as the size of the database grows.
- Runs on low cost, off-the-shelf hardware.
Even though they don’t offer a native Mac build, they provide a virtual machine running Ubuntu, prepared for use with Infobright. And here’s the best part: even though the virtual machine’s allocations were pretty low (650 MB of RAM, one CPU core), it was actually able to answer my queries in about a second! The same query on a server (quad-core, 16 GB of RAM, running MS SQL Server) took the same amount of time. My query was a simple select, but according to the documentation, Infobright is highly optimized for business intelligence and data warehousing queries. I imported only 9 million records, and they consumed just 70 MB of my hard disk! Amazing, isn’t it? Importing all 500 million records would take only about 3.8 GB of my disk!!
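Extrapolating the disk-usage figure from the review’s own numbers is a one-liner:

```python
# Linear extrapolation of the observed compression:
# 9 million records consumed ~70 MB on disk, so scale up to 500 million.
records_imported = 9_000_000
disk_mb = 70
total_records = 500_000_000
projected_gb = disk_mb * (total_records / records_imported) / 1024
print(round(projected_gb, 1))     # roughly 3.8 GB for the full dataset
```

This assumes compression stays roughly constant as the table grows, which matches the vendor’s “performance remains constant” claim but is still an assumption.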
Infobright is essentially an optimized MySQL server with a storage engine called Brighthouse. Since its interface is SQL, you can easily use Weka or Matlab to fetch the necessary data from your database and integrate it into your learning process with a minimal amount of code.
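To illustrate that “minimal amount of code” claim: because Infobright speaks the MySQL wire protocol, any standard MySQL client library can pull rows out of it. The sketch below uses Python’s `mysql-connector-python` package; the host, credentials, database, table, and column names are hypothetical placeholders, not taken from the original post:

```python
# Minimal sketch of fetching a labeled subset of records from Infobright.
# Placeholder schema: table `observations` with feature and label columns.
QUERY = ("SELECT feature_a, feature_b, label "
         "FROM observations "
         "WHERE label = %s LIMIT %s")

def fetch_training_rows(label, limit=10000):
    # Lazy import: requires `pip install mysql-connector-python`.
    import mysql.connector
    conn = mysql.connector.connect(
        host="localhost", port=5029,   # 5029 is Infobright's usual port
        user="root", password="", database="analytics")
    try:
        cur = conn.cursor()
        cur.execute(QUERY, (label, limit))
        return cur.fetchall()          # rows ready for Weka, Matlab, or NumPy
    finally:
        conn.close()
```

Since only the subset matching the predicate crosses the wire, each optimizer iteration touches far less data than a full scan would.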
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Project Manager at a tech company with 51-200 employees
Big Data Chronicles: Infobright as column-based database for analytics system
For the last several months I was involved in the design and implementation of a statistics/analytics system for a game on a social network. There are many users online at the same time, all of them producing a huge volume of events. One of the standard requirements for analytics systems is fast queries over the collected data. Logically, I used OLAP cubes to collect all the kinds of events our team needs to analyze. Technically, the best fit in our case is column-based storage, and I use Infobright. A regular RDBMS (SQL) store or a document-based DB like MongoDB was not enough for us because of performance; those should be used for OLTP rather than OLAP. On the other hand, a big gun like a Hadoop-based solution would be overkill. So Infobright is exactly the right fit, and choosing it was one of the best decisions I have made as a software architect in the last several months:
- As it’s a pure OLAP solution, I’m able to implement any ETL/storage/query scheme;
- As Infobright is column-based storage, even my most sophisticated queries over very large recordsets have extremely short execution times;
- As heavy functionality like aggregation/filtering is hidden inside Infobright’s internals, I can concentrate on my business task and design/implement/add a new module/scheme/query very quickly.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Buyer's Guide
Download our free Infobright DB Report and get advice and tips from experienced pros
sharing their opinions.
Updated: June 2025