Searching 10GB of data As A Service - lessons learned

Tags: programming, webdevelopment, databases.
By lucb1e on 2013-11-05 11:48:36 +0100

The day before yesterday I launched a service where you can check whether you were included in the Adobe accounts hack. I had the file, it could be grepped for stuff in about 30 seconds, and I thought "hey, others might want to do this too". And so I started coding.

My parents would be home soon and we'd go out for dinner, but I wanted it done. With the LinkedIn breach someone else put up the same kind of service, so this doesn't seem to be an uncommon thing to do. Since I didn't want anyone else to launch the same thing before I could get mine done (and my work to be wasted), I was kind of on a schedule. This is probably largely responsible for the issues it had at first.

The first codemash was pretty crude. The front-end was very similar to what it is now: a simple form featuring a search field and a submit button, with some explanation. It posted to a database table storing: id, uid, query, result, ip, performed and a timestamp. A daemon that I wrote on the server would then load all pending queries from the database, read through each line of the hacked Adobe file, run every search query against each line, and post the results back to the database. The front-end page would refresh and load the results from the database.
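
To give an idea of the approach, here is a rough sketch of that first daemon. It is not the actual code: the 'queries' table name, connection details and the dump filename are placeholders; only the column names come from the description above.

    <?php
    // Rough sketch of the first daemon, not the actual code.
    $db = new mysqli('localhost', 'user', 'password', 'adobesearch');

    // Load all pending searches and prepare an empty result buffer for each.
    $pending = [];
    $hits = [];
    $res = $db->query("SELECT id, query FROM queries WHERE performed = 0");
    while ($row = $res->fetch_assoc()) {
        $pending[] = $row;
        $hits[$row['id']] = '';
    }

    // Stream through the dump line by line, checking every query against every line.
    $file = fopen('cred.txt', 'r');
    while (($line = fgets($file)) !== false) {
        foreach ($pending as $q) {
            if (stripos($line, $q['query']) !== false) {
                $hits[$q['id']] .= $line;
            }
        }
    }
    fclose($file);

    // Write the results back so the front-end can pick them up on its next refresh.
    $stmt = $db->prepare("UPDATE queries SET result = ?, performed = 1 WHERE id = ?");
    foreach ($pending as $q) {
        $stmt->bind_param('si', $hits[$q['id']], $q['id']);
        $stmt->execute();
    }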

As it turned out, my server was *way* too slow to read that huge file for potentially hundreds of queries at a time. I was going to submit it to Hacker News, and the top few spots (something you should always prepare for, you never know) can bring thousands of simultaneous users.

But really, how hard could it be? Disk speeds are close to a gigabit per second nowadays, and I'm reading sequentially here. Searching should take no more than 10 seconds in theory, excluding CPU time and such. But the CPU is many, many times faster than hard drives, so that's no issue, right? Right?!

As it turned out, we went to dinner and I wasn't finished. From my phone I managed to get the daemon configured and running on my desktop instead of the server, which is not a very fun thing to do at 2G data rates (T-Mobile limits me to 64 kbps), but ssh is very stable and it was at least doable. It would have been a total impossibility with Windows and remote desktop.

Returning home, I tested my daemon some more and found that a run actually took 5 minutes instead of the 30 seconds grep took earlier (let alone the theoretical 10 seconds), but I just wanted to get it out there. I could improve it while it ran.

Training wave

And so it ran. I could see in my console output that it was handling between 1 and 3 queries per run, but runs were taking 10 minutes. Forever, basically. And so it got no upvotes on Hacker News. At least, I'm guessing the waiting time was the reason it did so badly.

Having made a few improvements, and having transferred the 10GB file plus the daemon to my laptop (which has an SSD), I rebooted my PC and overclocked it a bit. It's usually underclocked, but it turned out my CPU was actually most of the bottleneck, not the disk speed. The disk was still running at nearly 100% utilization, but so was my CPU.

Wave 2

With a faster service, and the help of an SSD now and then, I posted it to Reddit on /r/technology. There it did get quite a few upvotes, and in the end I got a couple hundred users. Through the night it went from 1-3 queries per run to about 7. In the morning I read and replied to some comments, and someone suggested putting the data in a database, which would be much faster.

Indeed! Databases are not for storing huge files; I think that's why it hadn't even crossed my mind. But this is not just a huge file, it's a huge file with lots of rows. Isn't it better to run through the rows with a database engine instead of running through the entire file with a daemon written in PHP? I mashed together a script that converted the cred file into INSERT queries, put up a database on my laptop (an SSD should be much faster for converting than a hard drive), and started converting.
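
Something along these lines; a sketch rather than the script I actually ran. The 'creds' table with its single 'line' column is a made-up name, and splitting each line into proper fields is left out since the exact layout of the dump isn't shown here.

    <?php
    // Sketch of the conversion: read the dump line by line, build multi-row INSERTs.
    $db = new mysqli('localhost', 'user', 'password', 'adobesearch');
    $db->query("CREATE TABLE IF NOT EXISTS creds (line TEXT)");

    $file = fopen('cred.txt', 'r');
    $values = [];
    while (($line = fgets($file)) !== false) {
        $values[] = "('" . $db->real_escape_string(trim($line)) . "')";
        if (count($values) >= 1000) {       // batch 1000 rows per INSERT statement
            $db->query("INSERT INTO creds (line) VALUES " . implode(',', $values));
            $values = [];
        }
    }
    if ($values) {
        $db->query("INSERT INTO creds (line) VALUES " . implode(',', $values));
    }
    fclose($file);

In hindsight, MySQL's LOAD DATA INFILE would probably have been quite a bit faster than building INSERT statements in PHP.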

Since my laptop was in use and not on AC power all the time, I paused the script now and then (another win for GNU/Linux; Windows can't do that natively), but all in all it ran for a total of five hours, finishing around 15:15.
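
Pausing a running process like that is just shell job control or a pair of signals; generic usage, nothing specific to my script (convert.php is a placeholder name):

    $ php convert.php      # running in the foreground
    ^Z                     # Ctrl+Z suspends it (SIGTSTP)
    $ fg                   # resume it later in the same shell

    $ kill -STOP <pid>     # or pause any process by pid from another shell
    $ kill -CONT <pid>     # and let it continue afterwards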

Wave 3

By 13:45 I was heading home again and ssh'd in to check on the status of things. HOLY SHIT, 175 QUERIES RUNNING AT ONCE. Uh-oh. This was trouble.

I had anticipated the traffic, but what I had not anticipated was the CPU time it takes to loop over all the queries. My daemon reads a block of data, say 10MB, then searches it for each query. Reading the data fast enough should have been the bottleneck, but instead it was running the queries that hogged the system.

Using the internet on the train (which is roughly as fast as a Ferrari Enzo after you've driven it off a cliff), I attempted to remedy the situation and put my laptop to work. I added a 'started' column to the database, which allowed me to run multiple daemons simultaneously. Each daemon could now check not only which rows weren't done yet, but also which hadn't been started.
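
Roughly like this, in plain SQL. The 'started' and 'performed' columns are the real ones mentioned above; the table name and the ids are made up, and as the comment says, this naive version still has a small race window:

    -- Each daemon grabs a handful of rows that nobody has started yet...
    SELECT id, query FROM queries WHERE performed = 0 AND started = 0 LIMIT 10;

    -- ...and immediately marks them as claimed so the other daemons skip them.
    UPDATE queries SET started = 1 WHERE id IN (3, 17, 42);

    -- (Two daemons could still grab the same row between those two statements;
    --  claiming rows with a single UPDATE that tags them with a worker id would
    --  close that gap, at the cost of an extra column.)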

This brought the load down only very, very slightly. I updated the page to say we had performance issues. All queries were in the database and would run just fine, except users would potentially have to wait over half an hour for their result. Most wouldn't do that.

Then all hell broke loose. The database started throwing "too many connections" errors. That had never happened before, not even when I hit one of the top spots on Hacker News. It meant that my desktop at home was not only being hammered, it also couldn't insert any results into the database.

At this point I was at the train station and able to use decent internet. The only issue was that my connection dropped as soon as a train pulled into the station, blocking the signal. And I needed that signal to restart MySQL.

Fortunately I had just enough time to get MySQL to restart. At the same time I increased the refresh interval of the front-end page from 7 seconds to something like 55. It wasn't like anyone would get results sooner by refreshing more often, so a refresh time of nearly a minute should be fine. I then also took out the "persistent connection" flags in the daemons on my desktop and laptop, used the raw mysql client over ssh to reset all started=1 entries back to 0, and restarted the search daemons.
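
The reset itself was a one-liner along these lines (same made-up table name as in the earlier sketches):

    -- Put every claimed-but-unfinished search back into the queue.
    UPDATE queries SET started = 0 WHERE started = 1 AND performed = 0;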

So now we had 200 queries queued and a massive performance issue. My laptop only had internet for short periods of time, so I couldn't rely on it to upload results even when it found any. My desktop was at it alone. So alone. (lol). The file-to-database conversion was also getting close to finishing, but my battery was running low. I had to pause the inserts and suspend my laptop to prevent it from dying mid-run.

Fast-forward 15 minutes to about 15:00. I got home and instantly put my laptop to work. On my laptop, too, the CPU was the bottleneck, but while driving the final 5km home I had realized multithreading might be a good idea now that I had this started=0|1 column.

It was. It didn't go much quicker, but it was quicker. The issue seemed manageable: I could throw more computers at it, let things run in parallel, and it would be doable. I also increased the MySQL connection limit from 175 to 250. I didn't know how much my server could take, but with only 200MB of free RAM, I didn't dare go much higher.
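
For reference, that limit is a single MySQL setting; roughly (the config file location varies per distribution):

    # In my.cnf (often /etc/mysql/my.cnf), under the [mysqld] section:
    max_connections = 250

    # Or at runtime, without a restart, from the mysql client:
    #   SET GLOBAL max_connections = 250;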

I now started working on a much-needed feature that I had planned but not fully implemented the night before: emailing the result. If a result is found, I can send you an email so that you don't have to wait with the tab open. I encouraged people to use it, and also wrote on the results page (which was now showing "it'll probably take 40 more minutes, get some coffee") that, if need be, they could leave and come back later. If you enter the same search twice, the old results pop up.
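
The mail part itself is nothing fancy. A sketch of the idea, assuming an 'email' column was added to the queries table for this feature (that column and the wording are my illustration, not necessarily what runs on the site):

    <?php
    // Hedged sketch: notify the user once their search has been performed.
    function notifyIfRequested(array $row)
    {
        if (empty($row['email'])) {
            return;                         // the user didn't ask for an email
        }
        $subject = 'Your Adobe leak search has finished';
        $body = ($row['result'] === '')
            ? "No results were found for your search \"{$row['query']}\"."
            : "Results were found for \"{$row['query']}\"; open your results page to view them.";
        mail($row['email'], $subject, $body);
    }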

Meanwhile my file-to-database conversion script had finished, and before expanding the search daemons to other computers at home, I tested how fast database queries were. 30 seconds. Okay, that's about the same performance grep was getting. Much better than 10 minutes anyway... but what about multiple queries? This test was only 1 query. It turns out multiple queries take an additional few seconds, but: twice as many queries take less than twice as long. So that's a big improvement.

I started mashing together another script which connected to the server, fetched some search queries, ran them in batches against my laptop's local database, and posted the results back to the server.
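
A sketch of that worker, not the actual script: one connection goes to the queue on the server, one to the local copy of the dump ('creds' with its single 'line' column, as in the conversion sketch); whether the queue is reached directly over MySQL or through a small HTTP endpoint is glossed over here. Batching like this is also one way to explain why twice as many queries took less than twice as long: a single scan of the table can answer the whole batch.

    <?php
    // Sketch of the batch worker; hostnames and credentials are placeholders.
    $server = new mysqli('server.example.org', 'user', 'password', 'adobesearch');
    $local  = new mysqli('localhost', 'user', 'password', 'adobesearch');

    while (true) {
        // Claim a batch of searches that nobody has started yet.
        $batch = [];
        $res = $server->query("SELECT id, query FROM queries WHERE performed = 0 AND started = 0 LIMIT 35");
        while ($row = $res->fetch_assoc()) {
            $batch[] = $row;
            $server->query("UPDATE queries SET started = 1 WHERE id = " . (int)$row['id']);
        }
        if (!$batch) {
            sleep(5);
            continue;
        }

        // One scan of the local table answers the whole batch: OR the terms together.
        $conditions = [];
        foreach ($batch as $q) {
            // Escape LIKE wildcards as well; the site doesn't allow wildcard searches.
            $term = addcslashes($local->real_escape_string($q['query']), '%_');
            $conditions[] = "line LIKE '%" . $term . "%'";
        }
        $res = $local->query("SELECT line FROM creds WHERE " . implode(' OR ', $conditions));

        // Sort every matching line back to the query (or queries) it belongs to.
        $hits = array_fill_keys(array_column($batch, 'id'), '');
        while ($row = $res->fetch_assoc()) {
            foreach ($batch as $q) {
                if (stripos($row['line'], $q['query']) !== false) {
                    $hits[$q['id']] .= $row['line'] . "\n";
                }
            }
        }

        // Post the results back to the server so the front-end (or the email) picks them up.
        $stmt = $server->prepare("UPDATE queries SET result = ?, performed = 1 WHERE id = ?");
        foreach ($batch as $q) {
            $stmt->bind_param('si', $hits[$q['id']], $q['id']);
            $stmt->execute();
        }
    }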

This was epically fast.

Then I multithreaded it.

This was super epically fast.

After finishing the script around 16:10, it took 30 minutes for the queue to go from 200 back to 0. ZERO, DO YOU HEAR ME? Yay! Searches now average around 30 seconds rather than 50 minutes. I totally feel ready for the Hacker News homepage.

Having an SSD also hugely improved the speed. I wrote 81GB of data to it that day and probably burned through a few weeks of its lifetime, but oh well.

Wave 4

One of the search daemons crashed. Someone had submitted the query "hotmail.com", which of course returns millions of results. Since there was no real queue anymore, I reduced the number of searches per batch to 1. Still multithreaded, but only 1 search per run. I also limited the number of search results per query to 1 so that even absurd queries couldn't crash a single daemon. I had already disabled wildcards, but with generic email addresses you could still get absurdly large result sets. I don't think this was done on purpose, though.
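
On the local database that cap is just a LIMIT on the search (same made-up table as in the earlier sketches):

    -- Never pull more than one matching row back per search, so even a query
    -- like "hotmail.com" can't balloon into millions of result lines.
    SELECT line FROM creds WHERE line LIKE '%hotmail.com%' LIMIT 1;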

I let this run through the evening and night mostly unattended and was pretty happy with the result. In the morning things were still holding up, and traffic had slowed to 18 queries per hour. This is now Tuesday morning (having launched on Sunday evening, with most of the story happening on Monday).

Wave 5

While I was tweaking the MySQL settings and discovering that 400 connections is still easily doable for my small server, another wave hit. This time a German blog linked to me, and once again the queue went to 175 (though I noticed it almost instantly). My one-query-per-run setup was not keeping up anymore. I reverted it to 35 per run, and soon enough someone had the bright idea to try another "@hotmail.com" query. The queue was still about 30 entries long, but it was all being processed very quickly.

This wave hit as I was writing this post. I'm manually tweaking the database, taking out these absurd queries and resetting the other affected searches back to started=0. I think I have an idea of how to solve this, but I really should get dressed and do some more important things. Like school, you know.

I'm quite lucky to have had yesterday afternoon and the whole of today off, but there are still things I need to do. So I'm going to stop writing, publish this post, and maybe later add how I solved this final issue (which I'm handling manually right now). Perhaps you have good suggestions? Update: I've found one.

Lessons learned

1. The thundering herd problem is real. I hit it with the 7-second refresh interval on the front-end: hundreds of waiting tabs all polling every 7 seconds is part of what maxed out the database connections.

2. Always benchmark first, then draw conclusions. With "big data" there is no "most of the time": rules of thumb, like hard drives being slow and CPUs being fast, suddenly run backwards. I know this isn't actually "big data"; it's very small, but for the horsepower I personally own as a 20-year-old student, it is significant.

3. Test and think before putting something out there. Don't rush too much. As it turned out, someone else made an attempt at this service independently (at about the same time as I did) and remarked that 10GB was also too large for him to keep in RAM. So I wasn't alone in having this issue. He used a bloom filter to solve it, which I think is a poor fit here since I wouldn't want false positives, but it's nevertheless a smart solution to the RAM issue; a toy sketch of the idea is below.
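
For the curious, the bloom filter idea in a toy sketch (my illustration of the general technique, nothing to do with his actual implementation): every entry is hashed to a few positions in a fixed-size bit array; at lookup time, if any of those bits is unset the entry is definitely absent, and if all are set it is only probably present, hence the false positives.

    <?php
    // Toy bloom filter: k hash positions per item in a bit array of m bits.
    // Lookups can return false positives, but never false negatives.
    class BloomFilter
    {
        private $bits;
        private $m;
        private $k;

        public function __construct($m = 1 << 24, $k = 4)   // m must be a power of two here
        {
            $this->m = $m;
            $this->k = $k;
            $this->bits = str_repeat("\0", $m >> 3);         // m bits packed into bytes
        }

        private function positions($item)
        {
            $pos = [];
            for ($i = 0; $i < $this->k; $i++) {
                // Cheap way to get k different hash functions: salt with the index.
                $pos[] = crc32($i . ':' . $item) & ($this->m - 1);
            }
            return $pos;
        }

        public function add($item)
        {
            foreach ($this->positions($item) as $p) {
                $this->bits[$p >> 3] = chr(ord($this->bits[$p >> 3]) | (1 << ($p & 7)));
            }
        }

        public function mightContain($item)
        {
            foreach ($this->positions($item) as $p) {
                if ((ord($this->bits[$p >> 3]) & (1 << ($p & 7))) === 0) {
                    return false;   // definitely not in the set
                }
            }
            return true;            // in the set, or a false positive
        }
    }

At roughly 10 bits per entry a bloom filter keeps the false-positive rate around 1%, so a set of this size (on the order of a hundred million entries) fits in a couple hundred MB of RAM, which is exactly its appeal here.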