How To Handle Large Amounts of Data Quickly and Easily

So the last few weeks have been crazy busy. I have been completely swamped with client work, and while that is a good problem to have, it does take away time to do other stuff (like blogging!).

One of the things I have been working on deals with handling large blobs of data. Its a bit of a tangent since it has absolutely nothing to do with UI, but I figured that since there are a lot of developers reading this site, it might be helpful.

So what do I mean by large amounts of data? Well, for the app I am building we were testing our theoretical limits for performance reasons. Imagine a grid or an Excel spreadsheet that has 700 columns (Excel itself only allows 256) and 30,000 rows. That basically equates to 1.65 million cells in the grid. Now if you compare that lump of data to what, say, Google throws around, its pretty insignificant; however, when it comes to keeping it in memory and accessing it quickly it is a bit beefy.

Most of the time you would get around this problem by paging through the data somehow, or having some sort of filter. Unfortunately for us, we didn’t have this luxury. The application needed the whole kit and caboodle accessible to it at all times. With that as the challenge we went to work.

In the beginning…
Before I started testing the theoretical limits of the app, I was using a run-of-the-mill XML Serialized cache file. It was kinda big with the small sample data I was initially using, but I didn’t think anything of it. Once I started dealing with lumpy up there, the XML file grew to 550 Megs of pure, unadulterated crap. I don’t know about you, but I sure don’t want to throw that bad boy into memory (we actually did a few times and watched my dev box cry uncle…kinda funny and sad at the same time).

So the original solution was out the door…time for plan B. Well my partner is a big database guy so we went with his idea and tried to throw el lumpo into a table called “tblCache”. Ever try throwing 1.6 million inserts at Sql Server Express? Even using BulkCopy it still performed like a one-legged man in a butt-kicking competition. The end result of plan B was a database that quintupled in size to over a gig and an app that performed so poorly that it made the Xml file seem speedy…

That is when the crying began…

My friend came up with the idea of “lets try and find an object based database and just put the whole cache object in there!” Fortunately we both realized that that was a horrendous idea about as soon as it came out of his mouth. However, the idea to cache the object itself stuck so my buddy began researching that a bit and came up with a solution.

The Answer
So how do you deal with a large amount of data quickly and easily? Two words: Binary Serialization. It sounds fancy, but what it basically means is that you take your big fancy object and store it in a bunch of 1s and 0s. This took lumpy and turned him into a 100 Meg blob of goodness (a full 5 times smaller than the XML file). When we were talking about our solution to the problem to another guy on the team he suggested looking up in memory compression. After we got that working our data file shrunk down to a trim 1 meg (my buddy owes that guy dinner for that idea by the way).

Couple that with a few database tricks (like doing a few smaller hits rather than one huge one) and we went from loading our data in a crawling 138 seconds down to a svelte 4. Sounds too good to be true? Its not…and what is better is how easy the code is once you get your mind around it.

That’s Nice…Can I Do It?
So by now you are probably thinking…nice story…but can I do the same thing with my app? The answer is resoundingly yes! and what is even better is, you can do it completely free (i.e. no third party tools) using only the standard libraries of .NET. You could spend some money on a 3rd party tool if you need something a bit specialized (i.e. higher compression than the gzip stuff), but you definitely don’t need to.

Lets get started…

First we need some imports (this project is in VB.NET, but could easily be converted to C#).

Imports System.IO
Imports System.IO.Compression
Imports System.Runtime.Serialization
Imports System.Runtime.Serialization.Formatters.Binary

Nothing too strange there, but there are probably a few you haven’t seen before.

The code itself is relatively simple as well.

Dim compressedzipStream As GZipStream = Nothing
Dim ms As MemoryStream = Nothing
Dim b As BinaryFormatter = Nothing
Dim fs As FileStream = Nothing

Try
   ms = New MemoryStream
   b = New BinaryFormatter
   b.Serialize(ms, cache)

   Dim buffer(ms.Length - 1) As Byte
   ms.Position = 0
   ms.Read(buffer, 0, buffer.Length)

   fs = New FileStream("C:\FileGoesHere\FileName.whatever", FileMode.Create)
   compressedzipStream = New GZipStream(fs, CompressionMode.Compress, True)
   compressedzipStream.Write(buffer, 0, buffer.Length)

Finally
   compressedzipStream.Close()
   fs.Close()
   ms.Close()
End Try

Got all that? Lets break it down.

The first few lines are just initializing some variables we are gonna use later. The GZipStream is the compression class of .NET. It is pretty good, but if you need some serious compression you might want to go with a third party tool.

Next you notice is a classic Try/Catch. This is simply so we can be sure that no matter what happens our streams will be closed. The first line to really notice is b.Serialize(ms, cache). This is basically where the magic happens, or rather where your object (in our case it is called cache) is changed from what you built to a bunch of gobbledygook that only the computer can read. The good news is, its pretty efficient when compared to other serialization stuff (i.e. xml). The bad news is that you can’t read it like you can Xml. Its not a big deal, but I do find myself missing that little feature.

So what is happening is your object (which can be as simple as a string, or a custom object) is being squooshed down and stored into the memory stream ms.

The compression class needs a byte array to process so that is what we build next. Buffer is our little array that we create to mimic the size of the memory stream we just created After he is initialized we read the stream in. Note: Make sure to set the position of the memory stream to 0 before reading, otherwise nothing happens.

Once our buffer is loaded we create a file stream object that points to the place on your hard drive that you want to store your data. Next we create a new GZipStream class and point our new file stream at it. Finally write the buffer into the compressed stream. Voila! Your data is now saved and zipped up in a nice neat bow.

So its on the disk now…the next thing you need to know is how to access it right? No problem.

dim reader as New StreamReader(cacheFile.FullName)

'this stream has been squooshed so we unsquoosh it here
Dim decomp as New GZipStream(reader.BaseStream, CompressionMode.Decompress, True)

Dim b As New BinaryFormatter
cache = b.Deserialize(decomp)

This looks pretty similar to what we had before. Basically we use a StreamReader to open up the file we saved earlier, then we use the GZipStream to decompress (notice the compression mode). Finally we use our handy dandy BinaryFormatter to deserialize the now unsquooshed data into our object.

Now that you know how to use binary serialization here is the golden rule…

Keep your objects simple

The more complex the object you are trying to serialize is, the larger your file will be and the slower it will be to access. For instance…when I started using this my data file was 17 megs (umcompressed it would have been over 500 Megs!!). The reason for this was because I was storing most of my data in a generic list. Now I love generic lists, and I use them where ever I can, but in this instance, they are absolute memory pigs. The reason for this is when the serializer crunches down objects it creates some overhead. That means for each item in a list you get a little overhead. Serializing a few objects is no big deal, but when you are dealing with hundreds of thousands of little ones, it starts to create a big problem.

In our case we took the same data and changed from a generic list to a comma delimited string and the data file shrunk down to just under 1 meg. From a loading perspective time I went from 24 seconds to 4, so it is a big difference. When we decompress it I change the string back into a list so my code didn’t have to change at all.

Now if you want to go a step further, you can make your app really fly by moving the entire save process to a seperate thread. It is out of scope for this article, but it isn’t as difficult as some people would lead you to believe (*cough* job security *cough*). If you use threading the whole process will seem instant to your user. Can’t beat that!

So there you have it. Dealing with large blobs of data isn’t all that uncommon nowadays. I hope this gives you a different approach to use when speed is of the essence.

17 Comments so far »

  1. Adam Jordens said,

    Wrote on December 4, 2007 @ 3:59 am

    Were there any optimizations done to actually reduce the in-memory footprint of the cached data?

    You mentioned that in your deserialization function that you’re converting from a denormalized csv to an in-memory list.

    Basically, from what I’ve read, you’ve managed to optimize the size on disk of data by transitioning from xml to binary (w/ gzip thrown on top). I agree that this is probably a good thing provided the data is small enough that the overhead of decompression doesn’t hurt. That being said, there are some very fast XML parsers available that should be able to rip through 500M of data in well under 100s.

    However, I’d be curious to know if the in-memory footprint is still upwards of 500M?

    Partitioning the data is often a good position to take with problems like this. It’s one of the core tenets behind Map/Reduce and Google has obviously shown it to scale well. Even with much smaller data sets it’s nice to be able to have the flexibility to say ‘I haven’t used this data in awhile I’m going to evict it from memory’, choosing to pay the penalty to reload at a later time. It’s unfortunate that this strategy doesn’t work for your particular application.

    Cheers.

  2. David Baxter said,

    Wrote on December 4, 2007 @ 10:07 am

    I can’t say exactly why, but when everything is decompressed into memory my app only takes up about 250 megs or so of RAM. So for some reason it is smaller than when it is on disk…

    I have been an xml guy for years, and I love working with it because it is so functional. With that said, I did not want to move away from xml because I am so familiar with it. In .NET if you want to be able to edit an xml object you have to use a XmlDocument object. This slows things down and the XPath we used to search through the lump was just too slow.

    If you only need read access to an xml file, then you can use faster parsers like an XPathDocument and XPathNavigator…unfortunately, we couldn’t go “read only”.

    I haven’t heard of Map/Reduce before…gonna have to look that one up. Thanks!

    David

  3. Taufik said,

    Wrote on December 4, 2007 @ 2:28 pm

    Hmm…what about binary xml format (with built in compression)?

  4. David Baxter said,

    Wrote on December 4, 2007 @ 5:20 pm

    Can’t say that I have ever heard of that Taufik. Those two seem mutually exclusive. Do you have a link or something that talks about binary xml?

    David

  5. Canta said,

    Wrote on January 4, 2008 @ 3:59 pm

    Please, excuse my english, as i speak spanish.

    About the 250MB vs 500MB… could it be an encoding related issue?

    I mean, doesn’t Unicode encoded strings (like UTF-8) use double bytes for characters, when other encodings (like ISO-8859-1) use only one?

    If i’m understanding it right, that means any utf-8 encoded file has double data space used than any same iso-8859-1′d file.

    Maybe when serialized that penalty is lost if your XML file is unicode.
    I’m just a high level programmer, not a byte expert, so just guessing.

  6. David Baxter said,

    Wrote on January 4, 2008 @ 4:06 pm

    Hey Canta, I am not a byte junkie either, but what you are saying is interesting and worth looking into.

    I do know some encoding formats are better than others, but I have never really sat down and thought about it.

    In my case, I just trust the .NET serializer to make the right choice, but if space is at a premium, that might be a good way to be more efficient.

    Thanks for the advice!

    David

  7. Canta said,

    Wrote on January 8, 2008 @ 10:38 am

    Hi David.

    Please, let me tell you a few tips about that encoding thing i’ve learned working with XML in Microsoft enviroments.

    As i told you before, we speak spanish here. So, we got a few non-US characters, like the “ñ” and the tilded vowels. Those characters gave us always encoding problems. That’s why and where UTF-8 comes in: it’s supposed to be universal. Therefore, it’s spected that contemporary XML parsers work in UTF-8 by default, meaning that your file may be utf-8 encoded (and have a size of 500M when using 250M chars).

    It’s easy to know this, by opening the file in Notepad, selecting “Save As”, and checking what file encoding does it shows by default; that’s the encoding of your file.
    Unfortunately, that’s not that “easy” to check that way with a 500MB text file (notepad will surely hang).

    Visual Studio 6, and windows up to server 2003 (don’t know about vista or 2008), handle files in ANSI by default (wich is but an ASCII with MSDOS characters and 16 colours). To write an UTF-8 file, you gotta do a few things.
    Therefore, if you save an UTF-8 encoded XML into an ANSI file, it does not work (the system missuderstand some chars).
    Also, by default, MSXML up to version 3 (maybe 6 as well, gotta try) wrote XML code in UTF-16 encoding (wich is ANSI compatible, as well as ISO-8859-1).
    That’s how the things worked before .NET in the Microsoft’s XML world: you HAD to put this parameters by hand, and know how to do it.

    I assume .NET parsers do work in UTF-8 by now, because of the mess that was all that encoding thing, and therefore you might have an utf-8 file.

    Don’t know about your version of Visual Studio, nor you framework version, but is good to know this kind of issues everytime you work with an XML file and Microsoft technologies. I learned all this stuff in the hard way. :(

    Best.

  8. Elmo said,

    Wrote on February 5, 2008 @ 11:36 am

    hello.

    Now, i’m a bit new on this, so i got 1 quick question.

    I get all this code to work, but i need my program to write to the file, close it, then continue on the same file later in the code.

    when i use it as it is it only overwrited the previous data when i try to continue on it.

    I tried changing the “FileMode.Create” to “FileMode.Open” but didn’t change anything.

    so, how do i make it continue on an exsisting file?

  9. David Baxter said,

    Wrote on February 11, 2008 @ 11:45 am

    Hmmm…FileMode shoulda done it I would think. There should be something about Appending when it comes to messing with files. I would take a look at that. Let me know if that doesn’t help.

    David

  10. David said,

    Wrote on June 18, 2008 @ 12:17 pm

    Worst recommendation for my project ever. If you want a small file size, then binary formatter and compression may be the way to go. But if you doing data mining or analysis on huge numbers of records or huge arrays I just figured out how to write 700MBs of data to a disk and load it back in in 35 seconds! The binary formatter would run out of memory or take many times longer. You could also write your own algorithm for compressing it and making the file smaller with my method. The way you do it is using a BufferedStream which you pass your fileStream to. Then, you read and write your values using a binaryWriter and BinaryReader. Its so fast. For my application its like have all the memory in the world because if your algorithm takes 30 minutes to hours, needing only seconds to load in all the data into memory is super super fast. Its especially useful for creating the algorithm and debugging on 64 bit since there is no way to keep the data in memory when you want to make changes. 64-bit doesn’t allow edit and continue… the bastards! :) well, enjoy! hope this helps someone.

  11. David Baxter said,

    Wrote on June 18, 2008 @ 1:18 pm

    Wow…worst ever? Alrighty then.

    Sorry it didn’t work out for you. Our files are about 100 MB if uncompressed and this method works wonders. However, for super huge data mining ops, then it sounds like you have a good way of doing it. Congrats.

    David

  12. Ramki said,

    Wrote on July 5, 2008 @ 6:01 am

    Hi,

    I have got to read a text file(size > 500MB to GBs also) in c#. I have to parse this text file.
    the file consists of lines with comma sepearated and some lines not required to parse.

    I need to read the log file ( right now doing b stream reader ReadLine() method )
    1. then form a data row to build data table.
    2. I need to bind this huge data table to grid view.
    3. Then Export this grid to Excel Sheet.

    Am using stream reader to read the file.
    the application hangs or sometimes becomes not responding…

    I tried to do the operation in seperate thread then also the behavior didc’t change much (Hanging)

    and also Binding to Grid is a good thing to do ?

    Thanks,
    Ramki

  13. David Baxter said,

    Wrote on July 7, 2008 @ 10:11 am

    Hey Ramki, when you are dealing with that much data you definitely don’t want to bind it all to the grid at once.

    We have a grid that can be as large as 3,000 rows and 1,500 columns (STUPID big)…its a beast, but the key is to use a virtual grid. Most third party grids has some form of virtual grid.

    What that does is tells the grid…here is your data source over here (which points to the cache file I talk about in this article), but I only want you to load what the user can see on the screen. To say this dramatically speeds things up is an understatement.

    We use the Infragistics grid on our project and it has been really solid.

    Hope that helps.

    David

  14. Ramki said,

    Wrote on July 14, 2008 @ 3:53 am

    Hi David,

    Thanks for your reply.

    I tried to add to grid row by row after every Readline() of the log.
    But this is also consuming time as binding altogether at once after whole reading operation had completed.

    Yes having an option to view a bunch of rows is good design. And user can see only a bunch of records at any time. But this is not the requirement.

    virtual grid can be done in v 2005 without 3rd party tools?
    ->infragists: to use these tools we need to buy right?

    Ramki

  15. David Baxter said,

    Wrote on July 14, 2008 @ 9:52 am

    I believe that VS has a virtual grid, but I also am pretty sure it sucks :P

    If you want to use infragistics, it does cost money. If you are a freelancer, then it is probably a bit expensive, but if you are working with a company or a good client, it is money well spent…

    David

  16. David said,

    Wrote on August 28, 2008 @ 6:02 pm

    David,

    I’m trying to wrap my mind about your whole process here. Tell me if this is correct:

    1) Client app communicates to server app requesting data.
    2) Server app requests data from database.
    3) Server app compresses data and ships off to client app.
    4) Client app decompresses data for use.

    If I’m way off here, please let me know. I may be working on a project that involves several gigs of data and I’m looking for a solid solution.

    David

  17. Betzie said,

    Wrote on September 7, 2008 @ 1:44 am

    Hi everyone

    Good Algorithm, like i read here above, i find it difficult to handle amount large data, like > 500 MB - X GB big from an database.
    But the nice thing is we dosent need all that data at once, but i want to have it have “on-the-fly”, is it better to index the database and hope that the sql query will quick be enough to send over data in realtime when the program ask for it or is it better to cache it up and compress it in-memory, or the last thing, maby is better to cache it up to an virtual file, then using filemapping.

    clue: this data is for an own Chart program, where i want to show one or more lines, so the problem here is when i want to show data on the chart, is that,it dosent have to show all the data at once, but when you zoom in you want some connection to more detail-data that can be shown on the chart.
    But what technic can be quick enough for that?

Comment RSS · TrackBack URI

Leave a Comment

Name: (Required)

E-mail: (Required)

Website:

Comment: