A medium or large application may accumulate a lot of MySQL data, and by large data I mean millions of records.
How can we handle very large data in MySQL? Many methods can be found on the internet, but at the time of writing I could not find a generic answer for all situations and requirements other than the common ‘Limit’ clause.
So I did a little research of my own, and I want to share it here.
I use the mysql.exe command prompt instead of phpMyAdmin, because mysql.exe talks to the server directly, which makes the timings more objective. I use Windows, a Core 2 Duo CPU, and 2 GB of RAM.
My MySQL server version is 5.5.8 Community Server (GPL).
I use the MySQL Employees sample database; please follow this link to get it: http://dev.mysql.com/doc/employee/en/index.html.
The Employees sample database has tables with many records: the employees table has 300,024 rows, titles has 443,308 rows, and salaries has 2,844,047 rows.
Using ‘Limit’ keyword
A common answer for handling large data is to limit the number of returned records so that MySQL does not retrieve all rows. This method is usually used by web pages for paging.
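The usual offset-paging pattern behind this can be sketched as follows (a sketch only; the page size of 70 mirrors the examples later in this post, and the column list is taken from the Employees schema):

```sql
-- Classic offset paging: offset = (page_number - 1) * page_size.
-- Page 1 with 70 rows per page:
SELECT emp_no, first_name, last_name
FROM employees
LIMIT 0, 70;

-- Page 4 with 70 rows per page: offset = (4 - 1) * 70 = 210.
SELECT emp_no, first_name, last_name
FROM employees
LIMIT 210, 70;
```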
I combine employees, titles, and salaries with ‘JOIN’ to retrieve millions of rows. The record count of my query is 4,638,507, which is quite good for an example.
The query is:
SELECT count(*)
FROM `employees`
LEFT JOIN titles ON employees.emp_no = titles.emp_no
LEFT JOIN salaries ON employees.emp_no = salaries.emp_no
Using Limit has a weakness: getting the hundred-thousandth or millionth record is much slower than getting the first or hundredth records.
Please take a look at the queries below and the time each one requires, in seconds.
SELECT birth_date, first_name, last_name, gender, hire_date,
       title, titles.from_date, salary, salaries.from_date
FROM `employees`
LEFT JOIN titles ON employees.emp_no = titles.emp_no
LEFT JOIN salaries ON employees.emp_no = salaries.emp_no
LIMIT 0, 70
After that I change the limit to
.... limit 200, 70 and it still takes 0.00 seconds. Again I change it to
.... limit 15000, 70, and it takes 0.11 seconds.
So getting 70 rows starting from the 15,000th row needs more time; it is slower.
I change it again:
.... limit 100000, 70 needs 0.39 seconds, and
.... limit 2000000, 70 takes 6.03 seconds.
The bigger the starting offset, the slower MySQL retrieves the records. Getting rows around the two-millionth position takes far more time than getting the first rows.
What about a more complex query? Real data may involve more tables to join, more rows, filters, sort orders, and so on. A more complex query means even more time.
So I need a better method to speed up the retrieval process.
I came to think of using a ranking method. Ranking works in SQL Server, so I gave it a try.
Using Record Ranking
The record-ranking method assigns a sequential number to each row. I got my first reference about record ranking from this link: http://jimmod.com/blog/2008/09/displaying-row-number-rownum-in-mysql.
The basic example is below:
SELECT @row_no := @row_no + 1 AS rank,
       birth_date, first_name, last_name, gender, hire_date,
       title, titles.from_date AS title_from,
       salary, salaries.from_date AS salary_from
FROM `employees`
LEFT JOIN titles ON employees.emp_no = titles.emp_no
LEFT JOIN salaries ON employees.emp_no = salaries.emp_no,
     (SELECT @row_no := 0) AS temp1
LIMIT 70
That query must be wrapped in a subquery in order to be able to filter on the ‘rank’ column. I want to get the result page by page by filtering ‘rank’, so the behaviour is like the ‘Limit’ method. Please look at the query below:
SELECT *
FROM (
    SELECT @rownum := @rownum + 1 AS rank,
           birth_date, first_name, last_name, gender, hire_date,
           title, titles.from_date AS title_from,
           salary, salaries.from_date AS salary_from
    FROM `employees`
    LEFT JOIN titles ON employees.emp_no = titles.emp_no
    LEFT JOIN salaries ON employees.emp_no = salaries.emp_no,
         (SELECT @rownum := 0) AS temp1
) AS tbl_temp
WHERE rank >= 200
LIMIT 70
I filter the rank with
... where rank>=200 ... and put a limit of just 70 rows. This looks the same as the ‘Limit’ method’s
.... limit 200, 70.
Unfortunately, the rank method requires 26.61 seconds, much slower than using ‘Limit’.
Curious, I change the rank filter to
.... where rank>=15000 ..., and it takes 26.80 seconds, about the same as the filter of 200.
I change it to
.... where rank>=100000 ..., which takes 26.67 seconds, again very close to the scenarios above. One more change to
.... where rank>=2000000 ... takes 26.73 seconds.
So every ‘rank’ filter gives the same performance, but this method is consistently slower than ‘Limit’.
I think this is because MySQL materializes and numbers all the rows of the subquery first and only filters them afterwards, which makes it much slower.
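This materialization can be checked with EXPLAIN (a sketch; the exact plan output depends on the server version):

```sql
EXPLAIN
SELECT *
FROM (
    SELECT @rownum := @rownum + 1 AS rank,
           birth_date, first_name, last_name, gender, hire_date,
           title, titles.from_date AS title_from,
           salary, salaries.from_date AS salary_from
    FROM `employees`
    LEFT JOIN titles ON employees.emp_no = titles.emp_no
    LEFT JOIN salaries ON employees.emp_no = salaries.emp_no,
         (SELECT @rownum := 0) AS temp1
) AS tbl_temp
WHERE rank >= 200
LIMIT 70;
-- The subquery shows up as a DERIVED table in the plan: MySQL numbers
-- and materializes all ~4.6 million rows into a temporary table
-- before the WHERE filter runs, so every rank filter costs the same.
```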
Apparently this method is not an option for me. I like ‘Limit’ better because it is much faster, but I still want to try another method.
I think I should create a table that contains the data to retrieve, with an additional sequence column like ‘rank’. So I created a table named ‘tmpTable1’, for experimentation only, gave it the same structure as the record-ranking query result, and dumped the retrieved rows into it.
Using Dump Table
The structure of the dump table ‘tmpTable1’ is:
CREATE TABLE IF NOT EXISTS `tmptable1` (
  `rank` bigint(20) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(50) NOT NULL,
  `last_name` varchar(50) NOT NULL,
  `gender` varchar(1) NOT NULL,
  `hire_date` date NOT NULL,
  `title` varchar(50) NOT NULL,
  `title_from` date NOT NULL,
  `salary` int(11) NOT NULL,
  `salary_from` date NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I insert the data into tmpTable1 with this query:
INSERT INTO tmptable1
SELECT @rownum := @rownum + 1 AS rank,
       birth_date, first_name, last_name, gender, hire_date,
       title, titles.from_date AS title_from,
       salary, salaries.from_date AS salary_from
FROM `employees`
LEFT JOIN titles ON employees.emp_no = titles.emp_no
LEFT JOIN salaries ON employees.emp_no = salaries.emp_no,
     (SELECT @rownum := 0) AS temp1
The data in tmptable1 is the same as in the previous example; it has 4,638,507 rows.
Let's run an experiment to retrieve data from tmpTable1 with this query:
select * from tmptable1 where rank>=200 limit 70;
It takes only 0.00 seconds, almost no time at all. This query is basically the same as
.. limit 200, 70 .., but because I have already dumped the data into tmpTable1, I only need to query with
.. rank>=200 limit 70 ...
Changing the query above to
.. rank>=15000 limit 70 .. takes 0.00 sec,
.. rank>=100000 limit 70 .. takes 0.27 sec, and
.. rank>=2000000 limit 70 .. takes 4.22 sec.
This method is faster than both ‘Limit’ and the ranking method. Its weaknesses are that I have to dump/insert the data before any retrieval operation, and if I want to sort by another column, e.g. ‘hire_date’, I have to re-dump with the appropriate ORDER BY clause.
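Such a re-dump for a different sort order could look like this (a sketch; it numbers an ordered subquery so that ‘rank’ follows the new order):

```sql
-- Rebuild the dump table ordered by hire_date.
TRUNCATE TABLE tmptable1;

INSERT INTO tmptable1
SELECT @rownum := @rownum + 1 AS rank, t.*
FROM (
    SELECT birth_date, first_name, last_name, gender, hire_date,
           title, titles.from_date AS title_from,
           salary, salaries.from_date AS salary_from
    FROM `employees`
    LEFT JOIN titles ON employees.emp_no = titles.emp_no
    LEFT JOIN salaries ON employees.emp_no = salaries.emp_no
    ORDER BY hire_date
) AS t, (SELECT @rownum := 0) AS temp1;
```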
I can schedule the dump operation every night or during off-peak hours, so the data is not real time. If I do not really need real-time data and the record count is in the millions, I recommend this method because it is the fastest.
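On MySQL 5.1 and later, the nightly refresh can even be scheduled inside the server with an event (a sketch; the event name and start time are my own, and it assumes the event scheduler is enabled with SET GLOBAL event_scheduler = ON):

```sql
DELIMITER $$
CREATE EVENT refresh_tmptable1
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 2 HOUR  -- an off-peak hour
DO
BEGIN
    SET @rownum := 0;
    TRUNCATE TABLE tmptable1;
    INSERT INTO tmptable1
    SELECT @rownum := @rownum + 1, birth_date, first_name, last_name,
           gender, hire_date, title, titles.from_date,
           salary, salaries.from_date
    FROM `employees`
    LEFT JOIN titles ON employees.emp_no = titles.emp_no
    LEFT JOIN salaries ON employees.emp_no = salaries.emp_no;
END $$
DELIMITER ;
```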
In my experiments above, the fastest method to retrieve millions of MySQL records is to dump the data with a sequential order number like ‘rank’ and retrieve it with a filter on ‘rank’.
Because of the nature of the dump method, the data itself is not real time. I can also add an index on the ‘rank’ column to make the retrieval process even faster.
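Adding that index is a one-liner (the index name is my own; once it exists, the rank filter can use a range scan instead of a full table scan):

```sql
ALTER TABLE tmptable1 ADD INDEX idx_rank (`rank`);

-- The paging query can then seek directly to the starting rank:
SELECT * FROM tmptable1 WHERE `rank` >= 2000000 LIMIT 70;
```

Since ‘rank’ is unique and sequential, it could even be made the primary key, which works at least as well as a secondary index here.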
Otherwise, I use the common ‘Limit’ method to retrieve real-time data, and I try to add filters so that the returned records do not reach millions of rows.
This post is a short journal of my experiments with large numbers of MySQL rows. If you have better ideas, please feel free to leave a comment.
Thanks and Regards,