After the issues we ran into with immutable structures, I decided that I still wanted the same guarantees that immutable data structures gave us, but with the speed of mutable structures.
I managed to do that by making different design choices when building my implementation. While the BCL immutable collections are based on binary trees, and attempt to spare GC pressure in favor of CPU cycles, I find that for a lot of the things that we do, we can spare the memory in favor of actually being faster overall. In particular, this is true since the immutable collections are an order of magnitude slower for reads than their mutable counterparts. I could have dealt with the slow writes, somehow. But we do a lot of reads, and that is utterly unacceptable.
Here is how I implemented SafeList:
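The original implementation isn't reproduced in this version of the post. A minimal sketch of the copy-on-write idea (the class name matches the text, but the exact members are my illustration, not the original code) could look like this:

```csharp
using System;
using System.Collections.Generic;

// Copy-on-write list: reads go straight to a plain array,
// writes create a new array and publish it atomically.
public class SafeList<T>
{
    private T[] _items = new T[0];

    public int Count => _items.Length;

    // Reads are exactly as fast as reading from a plain array.
    public T this[int index] => _items[index];

    public void Add(T item)
    {
        // Copy the whole backing array, then swap in the new one.
        var current = _items;
        var copy = new T[current.Length + 1];
        Array.Copy(current, copy, current.Length);
        copy[current.Length] = item;
        _items = copy;
    }

    public IEnumerator<T> GetEnumerator()
    {
        // Snapshot semantics: iterate over whatever array was
        // current when enumeration started.
        foreach (var item in _items)
            yield return item;
    }
}
```

Anyone holding a reference to the old array keeps a consistent snapshot, which is where the immutability guarantee comes from.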
The implementation for SafeDictionary is pretty similar as well. Note that I have dedicated & optimized methods for the type of things that I need in my code.
But the key part here is that I gain the safety by always creating another copy of the underlying implementation. We are never actually mutating anything.
Sure, this results in us having to deal with a lot more allocations, but we get the same read speed as the standard mutable collections. And more importantly, we are still faster for writes.
Why is the performance of an immutable list over 16 times slower than a standard list? I took a peek at what it was actually doing, and it made a lot of sense. In order to maintain efficient index access, the actual storage of the data in the immutable list is a binary tree, with the index being used as the key.
This results in a much higher cost for pretty much everything. Let us look at the following:
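The code being measured is not shown in this version of the post; a sketch of the kind of loop that produces the numbers below (the item count and method names are my reconstruction) would be:

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Diagnostics;
using System.Linq;

static class IndexerBench
{
    // Identical loop for both collections; only the indexer cost differs.
    public static long SumByIndex(IReadOnlyList<int> items)
    {
        long sum = 0;
        for (int i = 0; i < items.Count; i++)
            sum += items[i];
        return sum;
    }

    public static void Run()
    {
        var list = Enumerable.Range(0, 10 * 1000 * 1000).ToList();
        var immutable = ImmutableList.CreateRange(list);

        var sw = Stopwatch.StartNew();
        SumByIndex(list);            // O(1) array indexing
        Console.WriteLine("List<T>: " + sw.Elapsed);

        sw.Restart();
        SumByIndex(immutable);       // O(log n) tree walk per access
        Console.WriteLine("ImmutableList<T>: " + sw.Elapsed);
    }
}
```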
With a for loop, List&lt;T&gt; takes 0.23 seconds, while ImmutableList&lt;T&gt; takes 1.28 seconds. When I use foreach instead, we get 0.22 seconds vs. 2.29 seconds.
As you can see from the blog post describing them, the immutable collections are mostly implemented as binary trees, so I don't really think that there is a good way to approach this as is. The problem is that the immutable collections actually need to do a lot more than what I need them to do.
Now, it might have been acceptable to use them if the performance hit was limited to just writing to them. But as you can see, we have the same problem when we are reading, and that is quite unacceptable.
After finding out that a lot of our costs in Voron are actually related to immutable collections, I decided to run some isolated perf tests.
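The test code itself isn't included here; something along these lines (the 10 million item count comes from the text below, the details are my reconstruction) reproduces the shape of the measurement:

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Diagnostics;

static class WriteBench
{
    // Plain mutable dictionary: writes mutate the hash table in place.
    public static TimeSpan TimeMutableWrites(int count)
    {
        var sw = Stopwatch.StartNew();
        var dic = new Dictionary<int, int>();
        for (int i = 0; i < count; i++)
            dic[i] = i;
        return sw.Elapsed;
    }

    // Immutable dictionary: every write allocates new tree nodes
    // and returns a brand new dictionary instance.
    public static TimeSpan TimeImmutableWrites(int count)
    {
        var sw = Stopwatch.StartNew();
        var dic = ImmutableDictionary<int, int>.Empty;
        for (int i = 0; i < count; i++)
            dic = dic.SetItem(i, i);
        return sw.Elapsed;
    }
}
```

With 10 million items, the mutable version finishes in a few seconds; the immutable one is the result quoted below.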
1 minute and 58 seconds
Yes, that is correct, the immutable version is over thirty times slower. And given that we are only using 10 million items, that is a ridiculously high rate.
I decided to do some tests to see if it is just the number of calls that we are making here:
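The exact test isn't shown here; one way to check this (my reconstruction) is to replace the per-item calls with a single bulk call, so the call count drops to one while the work stays the same:

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;

static class CallCountTest
{
    // One AddRange call that adds everything at once,
    // instead of one SetItem/Add call per item.
    public static ImmutableDictionary<int, int> AddInBulk(int count)
    {
        var pairs = new List<KeyValuePair<int, int>>(count);
        for (int i = 0; i < count; i++)
            pairs.Add(new KeyValuePair<int, int>(i, i));

        // The per-node tree allocation still happens internally,
        // which is why this doesn't help much.
        return ImmutableDictionary<int, int>.Empty.AddRange(pairs);
    }
}
```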
So it appears that it isn’t the number of calls, but something intrinsic to the way it works. And how about lists?
I think we are going to need to do something about this. Even though I love the idea of immutable collections, being 16 – 32 times slower is just utterly unacceptable.
But this is for writes, how about reads?
I had written the following for both of them:
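The read test isn't reproduced in this version; a sketch of running the same lookup loop over both dictionary types (names and counts are my reconstruction) would be:

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Diagnostics;

static class ReadBench
{
    // Identical read loop for both dictionary types;
    // the lookup delegate is the only thing that changes.
    public static long SumLookups(Func<int, int> lookup, int count)
    {
        long sum = 0;
        for (int i = 0; i < count; i++)
            sum += lookup(i);
        return sum;
    }

    public static void Run(int count)
    {
        var mutable = new Dictionary<int, int>();
        for (int i = 0; i < count; i++) mutable[i] = i;
        var immutable = ImmutableDictionary.CreateRange(mutable);

        var sw = Stopwatch.StartNew();
        SumLookups(i => mutable[i], count);    // hash bucket lookup
        Console.WriteLine("Dictionary: " + sw.Elapsed);

        sw.Restart();
        SumLookups(i => immutable[i], count);  // tree descent per lookup
        Console.WriteLine("ImmutableDictionary: " + sw.Elapsed);
    }
}
```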
So the summary is, the performance for our workloads is probably going to be simply unacceptable. We’ll have to find another way to handle this. The way to handle this nicely is to basically copy the values. So I wanted to know how long it would take to fully clone a 10 million item dictionary. The answer: 0.6 seconds. Doing the same with an immutable dictionary? 16 seconds.
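The clone itself is a one-liner in both cases; the cost difference comes from the Dictionary copy constructor doing a bulk copy of its internal arrays, while the immutable version has to rebuild a tree node by node (this framing is mine, inferred from the numbers above):

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;

static class CloneBench
{
    // Bulk copy of buckets and entries: close to memcpy speed.
    public static Dictionary<int, int> CloneMutable(Dictionary<int, int> source)
        => new Dictionary<int, int>(source);

    // Rebuilds the whole balanced tree, allocating per node.
    public static ImmutableDictionary<int, int> CloneToImmutable(Dictionary<int, int> source)
        => ImmutableDictionary.CreateRange(source);
}
```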
So we are going to need to do something else.
Earlier this month I posted our results for the initial performance of Voron. We learned a lot from that, and it led us to do some major refactoring to the codebase. Now, we are back on a more or less stable ground, so it is time to actually see what is going on. Where before we had a baseline against other storage engines, now we are testing against the previous version as well as Esent.
To get good, fixed numbers, I decided to run the tests on an EC2 machine, in particular an m1.large instance. The idea is that this gives us a simple way to get a standard machine config that anyone can use to test out as well.
For the purpose of this benchmark, we are doing writes to one of the temporary storage disks, rather than a permanent EBS drive. This will presumably be somewhat faster than a real world configuration, but I think that running in an environment where we are actually using a common production machine (rather than a dedicated development machine) should give us better numbers, certainly more realistic ones.
The first thing to check, of course, is the simplest. Sequential writes:
Here we can see that we are actually worse off than we were before. That is actually kind of sad. But we’ll get to that later on.
For sequential reads, we are still very good, and it is obvious that we can easily make use of all of the resources that the machine gives us. Here we are consistently slightly faster than the older version, but that is something that is probably accidental. We certainly did no work on improving read performance (there was very little need to do so, after all).
Random writes were a big surprise:
As it happens, we are significantly faster than Esent in this scenario, which was quite a surprise. For some reason, we are also faster in the new code than in the old version. Which is quite interesting.
Finally, we have random reads.
No, it isn’t an error. We see a rate of ~1,600 reads/sec for Esent single threaded. I am not quite sure why this is behaving in this fashion, but I am willing to say that there is something wrong and just discount this result.
I mentioned before that it appears that we are actually slower than the old code. So I decided to check what is actually going on there. My tool of choice, the profiler. Here is the old version:
You can see that we have done two runs (one sequential, one random) of 10,000 transactions, writing a total of 2 million items. And roughly 26% of the time is spent committing the transaction.
Digging into that, a lot of that time is spent just flushing to disk:
Note, however, that a decidedly non trivial amount of time is actually spent on ImmutableDictionary, but I’ll touch on that later on.
With our changes, the new code looks like:
The one thing that jumps at me is that the cost of creating a transaction is now very significant. And here is why:
The actual reason for this is likely that we also have the background journal flushing, that now also needs to take the transaction lock. Which is introducing contention into the mix. We’ll need to look into how to resolve that.
Let us dig into the commit. The change there was pretty much the whole reason for what we were doing.
You can see that we are using WriteFileGather. Unlike before, where syncing took about 15% of the overall time, it is now taking just 7%. So we have better than a 50% improvement on that score.
But the really interesting bit? Look at what was by far the most expensive operation there. It was the calls to immutable dictionary. In fact, let us look at the results in the profiler for the immutable collections:
So now we have two new things that we need to investigate. One is reducing contention, and the second is checking how we can optimize our usage of the immutable collections.
I’ll report on our findings…
I was asked to review the book (and received a free electronic copy).
As someone that is very into storage engines, I was quite excited about this. After going over the leveldb codebase, I would finally get to read a real book about how it works.
I was disappointed, badly.
This book isn’t really about leveldb. It contains pretty much no background, explanation, history or anything much at all about how leveldb works. Instead, it is pretty much a guide of how to use LevelDB to write iOS applications. There are a lot of chapters dealing with Objective-C, NSString and its variants, how to do bindings, and how to handle drag and drop.
However, the things that I would expect, such as explanations of how it works, what it does, alternative use cases, etc., are very rare, if present at all. Only chapter 10 is really worth reading, and even so, I got the feeling that it only made sense to me because I already knew quite a lot about leveldb. I can’t imagine actually starting from scratch and being able to understand leveldb from this book.
If you are working on iOS / OS X apps, I guess that this might be a good choice, but only if you want to know about actually using leveldb in them. You’ll need to do your actual leveldb learning elsewhere.
The book does contain some interesting tidbits. Chapter 10 talks about tuning and key policies, and it did have some interesting things to say, but it also contains wrong information* (and if I could spot it, with my relatively little experience with leveldb, I’m pretty sure that there are other things there that are wrong as well).
* The book said that it is better to write keys in order, to reduce I/O. But leveldb writes to a skip list in memory, then flushes that entire thing to disk in sorted order. Your writes have to be bigger than the buffer size for that to actually matter, and even then it won’t help you much.
In short, feel free to skip this book, unless you are very focused on writing leveldb apps on iOS. In which case it might be worth it, but I don’t think so. You are better off reading the docs or any of the tutorials.
I got an interesting question from Thomas:
How can the OS actually ensure that the write through calls go to disk before returning, when it cannot do the same for fsync. I mean couldn't there still be some driver / raid controller / hard disk doing some caching but telling windows the data is written?
And I think that the answer is worth a post.
Let us look at the situation from the point of view of the operating system. You have an application that issues a write request. The OS will take the data to be written and write it to its own buffers, or maybe it will send it to the disk, with instructions to write the data, but nothing else. The disk driver is then free to decide what the optimal way to actually do that would be. In many cases, that means not writing the data right now, but placing it in its own buffer, and doing a lazy write when it feels like it. This is obviously a very simplified view of how it works, but it is good enough for what we are doing.
Now, when we call fsync, we have to do that with a file handle. But as it turned out, that isn’t quite as useful as you might have thought it would be.
The OS is able to use the file handle to find all of the relevant data that has been written to this file and wasn’t sent to the disk yet. And it will call the disk and tell it, “hi, how about writing those pieces too, if you don’t mind*”. However, that is only part of what it needs to do. What about data that has already been written by the OS to the disk drive, but is still in the disk drive cache?
* It is a polite OS.
Well, we need to force the drive to flush it to the actual physical media, but here we run into an interesting problem. There is actually no way for the OS to tell a disk drive “flush just the data belonging to file X”. That is because at the level of the disk drive, we aren’t actually talking about files, we are talking about sectors. Therefore, there isn’t any way to say: flush just this data. And since the disk drive won’t tell the OS when it actually flushed the data to disk, the OS has no way of telling (nor does it need to track) what specific pieces need to actually be flushed.
Therefore, what it does is go to the disk driver and tell it: flush everything that is in your cache, and tell me when you are done. As you can imagine, if you are currently doing any writes and someone calls fsync, that can be a killer for performance, because the disk needs to flush its entire cache. It is pretty common for disks to come with 64MB or 128MB caches, which means that when fsync is called, it might be doing a lot of work. The Firefox fsync issue is probably the most high profile case where this was observed. There have been a lot of people looking into that, and you can read a lot of fascinating information about it.
Now, what about Write Through? Well, for that the OS does something slightly different. Instead of just handing the data to the disk driver and telling it to do whatever it wants with it, it tells the disk driver that it needs to write this data right now. Because we can give the disk driver specific instructions about what to flush to disk, it can do that without having to flush everything in its cache. That is the difference between writing a few KB and writing tens of MB.
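In .NET terms (this mapping is my own addition, not from the post), the two behaviors correspond to opening the file with `FileOptions.WriteThrough` versus calling `FileStream.Flush(true)`, which ends up invoking FlushFileBuffers on Windows:

```csharp
using System.IO;

static class DurableWrites
{
    // Write-through: each write is pushed past the drive cache
    // (FILE_FLAG_WRITE_THROUGH under the covers), so only this
    // data has to reach the media.
    public static FileStream OpenWriteThrough(string path)
        => new FileStream(path, FileMode.Create, FileAccess.Write,
                          FileShare.None, 4096, FileOptions.WriteThrough);

    // fsync-style: buffered writes, then one expensive call that
    // forces the drive to flush its entire cache.
    public static void WriteThenFlush(string path, byte[] data)
    {
        using (var fs = new FileStream(path, FileMode.Create))
        {
            fs.Write(data, 0, data.Length);
            fs.Flush(flushToDisk: true);
        }
    }
}
```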
I said that this is a great oversimplification. There are some drivers that choose to just ignore fsync. They can do that, and they should, under certain circumstances. For example, if you are using a disk that comes with its own battery backed memory, there is no reason to actually wait for the flush; we are already assured that the data cannot go away if someone pulls the plug. However, there are plenty of drives that will just ignore fsync (or honor only every 3rd fsync, or whatever) because it leads to better performance.
This also ignores things like the various intermediaries along the way. If you are using hardware RAID, for example, you also have the RAID cache, etc. And yes, I think that there are drivers there that would ignore write through as well.
At the low level Write Through uses SCSI commands with Force Unit Access, and fsync uses SYNCHRONIZE_CACHE for SCSI and FLUSH_CACHE for ATAPI. I think that ATAPI 7 has Force Unit Access, as well, but I am not sure.
Also known as: Please check the subtitle of this blog.
This benchmark doesn’t actually provide much useful information. It is too short and compares fully featured DBMS systems to storage engines. I always stress very much that people should never make decisions based on benchmarks like this.
These paint the fully featured DBMS systems in a negative light that isn’t a fair comparison. They are doing a LOT more work. I’m sure the FoundationDB folks will not be happy to know they were roped into an unfair comparison in a benchmark where the code is not even available.
This isn’t a benchmark. This is just an interim step along the way of developing Voron. It is a way for us to see where we stand and where we need to go. A benchmark includes full details about what you did (machine specs, running environment, full source code, etc.). This is just us putting stress on our machine and comparing where we are at. And yes, we could have done it in isolation, but that wouldn’t really give us any major advantage. We need to see how we compare to other databases.
And yes, we compare apples to oranges here when we compare a low level storage engine like Voron to SQL Server. I am well aware of that. But that isn’t the point. For the same reason that we are currently doing a lot of micro benchmarks rather than the 48 hours ones we have in the pipeline.
I am trying to see how users will evaluate Voron down the road. A lot of the time, that means users doing micro benchmarks to see how good we are. Yes, those aren’t very useful, but they are a major way people make decisions. And I want to make sure that we come out in a good light under that scenario.
With regards to FoundationDB, I am sure they are as happy about it as I am about them making silly claims about RavenDB transaction support. And the source code is available if you really want it; in fact, we got FoundationDB in there because we had an explicit customer request, and because they contributed the code for that.
Next, let us move to something else equally important. This is my personal blog. I publish here things that I do on a daily basis. And if I am currently in a performance boost stage, you’re going to be getting a lot of details on that. Those are the results of performance runs, they aren’t benchmarks. They don’t get anywhere beyond this blog. When we’ll put the results on ravendb.net, or something like that, then it will be a proper benchmark.
And while I fully agree that making decisions based on micro benchmarks is a silly way to go about doing so, the reality is that many people do just that. So one of the things that I’m focusing on is exactly those things. It helps that we currently see a lot of places to improve in those micro benchmarks. We already have a plan (and code) to see how we do on a 24 – 48 hours benchmark, which would also allow us to see all sort of interesting things (mixed reads & writes, what happens when you go beyond physical memory size, longevity issues, etc).
Those are interim results, but they give me some hope. This is running on a different machine than the numbers I posted before, but I have enough to compare to each other. You can see it here. All numbers are in operations / sec.
Writing 100,000 sequential items (100 items per transaction in 1,000 transactions):
And writing 100,000 random items:
And here is what we really care about. Comparing Voron to Esent:
This is actually pretty amazing, because we are now at about 25% of the write speed of Esent for sequential writes (we used to be at less than 5%). When talking about random writes (what we really care about), we are neck and neck.
I am currently in the States, and I can’t go anywhere without seeing a lot of signs for Black Friday. Since it seems that this is a pretty widespread attempt to do a load test on everyone’s servers (and physical stores, for some reason), I decided that I might as well join the fun and see how we handle the load.
You can use one of the following coupon codes (each one has between 4 – 16 uses) to get a 20% discount for any of our products, if you buy in the next 48 hours.
I keep getting asked by people: “What is the configuration option to make NHibernate run faster?”
People sort of assume that NHibernate is configured to be slow by default because it amuses someone.
Well, while there isn’t a “Secret_incantation” = “chicken/sacrifice” option in NHibernate, there is this one:
And it pretty much does the same thing.
No, I won’t explain why. Go read the docs.
I have a method that does a whole bunch of work, then submits an async write to the disk. The question here is how should I design it?
In general, all of the method’s work is done, and the caller can continue doing whatever they want to do with their life, all locks are released, all the data structures are valid again, etc.
However, it isn’t until the async work completes successfully that we are actually done with the work.
I could do it like this:
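The signatures from the original post aren't included here; the first option, as I understand it, is the conventional async-suffixed shape (names and bodies are illustrative placeholders, not the real Voron code):

```csharp
using System.Threading.Tasks;

public class Transaction
{
    // Option 1: looks like any other async method, which wrongly
    // suggests the caller must await before touching anything.
    public Task CommitAsync()
    {
        DoAllTheSynchronousWork();      // locks released, structures valid
        return SubmitAsyncDiskWrite();  // only durability is still pending
    }

    private void DoAllTheSynchronousWork() { /* ... */ }

    // Placeholder standing in for the real async disk write.
    private Task SubmitAsyncDiskWrite() => Task.CompletedTask;
}
```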
But this implies that you have to wait for the task to complete, which isn’t true. The commit function looks like this:
So by the time you have the task from the method, all the work has been done. You might want to wait on the task to ensure that the data is actually durable on disk, but in many cases, you can get away with not waiting.
I am thinking about doing it like this:
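Again, the actual code is not reproduced here; my reading of the alternative is a plainly synchronous `Commit`, with the pending flush handed back separately (the property name is my invention):

```csharp
using System.Threading.Tasks;

public class Transaction
{
    // Option 2: Commit is plainly synchronous; the pending disk
    // flush is exposed separately for callers that need durability.
    public Task DiskFlush { get; private set; }

    public void Commit()
    {
        DoAllTheSynchronousWork();          // everything is usable after this
        DiskFlush = SubmitAsyncDiskWrite(); // wait on this only for durability
    }

    private void DoAllTheSynchronousWork() { /* ... */ }

    // Placeholder standing in for the real async disk write.
    private Task SubmitAsyncDiskWrite() => Task.CompletedTask;
}
```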
To make it explicit that there isn’t any async work done that you have to wait for.
I’ll be in Ireland and the UK next week presenting at several events. Below are details on the talks I’ll be doing if you want to come along and hear them:
Dublin: Monday Dec 2nd
I’m doing two separate free events in Dublin on Monday:
- Windows Azure and the Cloud at Mon 1-3pm. This event is free to attend, and I’ll be doing a two hour keynote/overview session on Windows Azure as part of it. This will be a great talk to attend if you are new to Windows Azure and are interested in learning more about what you can do with it. Later sessions at the event also cover VS 2013, building iOS/Android apps with C# using Xamarin, and F# with Data and the Cloud. Learn more here and sign-up for free.
- Building Real World Application using Windows Azure at Mon 6:00-9:00pm. This event is also free to attend, and during it I’ll walkthrough building a real world application using Windows Azure and discuss patterns and best practice techniques for building real world apps along the way. The content is intermediate/advanced level (my goal is to melt your brain by the end) but doesn’t assume prior knowledge of Windows Azure. Learn more here and sign-up for free.
There is no content overlap between the two talks – so feel free to attend both if you want to!
London: Wed Dec 4th and 5th
I’m presenting at the NDC London Conference on Dec 4th and Dec 5th as well. This is a great developer conference being hosted in the UK for the first time. It has a great line up of speakers attending.
I’m presenting 2 separate two-part talks:
- Introduction to Windows Azure (Part 1 and 2) at Wed from 1:30-4pm. This will be a great talk to attend if you are new to Windows Azure and are interested in learning more about what you can do with it.
- Building Real World Applications using Windows Azure (Part 1 and 2) at Thursday from 9am-11:20am. I’ll walkthrough building a real world application using Windows Azure and discuss patterns and best practice techniques for building real world apps along the way. The content is intermediate/advanced level (my goal is to melt your brain by the end) but doesn’t assume prior knowledge of Windows Azure.
Hope to see some of you there!
P.S. In addition to blogging, I am also now using Twitter for quick updates and to share links. Follow me at: twitter.com/scottgu
We recently shared our plans to produce a 6.0.2 patch release to address some performance issues (and other bugs) in the 6.0.0 and 6.0.1 releases. Today we are making Beta 1 of the 6.0.2 release available.
Why the beta?
We were originally planning to go straight to RTM and have the 6.0.2 patch available in the month of November. Some of the fixes are proving harder to implement and test/verify than we expected, so we need a bit more time to finish the fixes and ensure that performance is improved. In order to keep our commitment to have a release available this month, we’ve opted to release the current code base – which includes a number of improvements – as a beta.
Can I use it in production?
Yes, with some caveats. The license does not prevent you from using the release in production. We’re still testing the changes we’ve made and there are more changes still to come. Microsoft does not guarantee any particular level of support on this beta.
Where do I get the beta?
The runtime is available on NuGet. If you are using Code First then there is no need to install the tooling. Follow the instructions on our Get It page for installing the latest pre-release version of Entity Framework runtime.
The tooling for Visual Studio 2012 and 2013 is available on the Microsoft Download Center. You only need to install the tooling if you want to use Model First or Database First.
Note: If you are installing the tools for Visual Studio 2012, you will need to uninstall the existing Entity Framework Tools for Visual Studio 2012 (via Add/Remove Programs) before installing the new MSI. This is due to a temporary issue with the Beta 1 installer that will be fixed for RTM.
When can I expect the RTM?
Getting the 6.0.2 patch release to RTM is our team’s top priority. We expect to have it available during December.
What if I find an issue in the beta?
Make sure it’s not something we already know about that is tracked to be fixed in 6.0.2. If it’s not, please file a new issue – be sure to include detailed steps on how to reproduce it, preferably including source code.
What’s in the beta?
Fixes to the following issues are included in Beta 1. We haven’t finished verifying all these issues.
- [Performance] Startup time is bad with debugger attached
- [Performance] Buffered queries are executed in non-sequential mode
- [Performance] Revert buffering by default
- [Performance] Remove calls to BufferedDataReader.AssertFieldIsReady
- Memory leak in case of usage an external connection
- “The given key was not present in the dictionary.” when adding Association to Entities - all mapped to Views
- Code first stored procedure mapping fails when the stored procedures have more than 25 parameter mappings
- EF6 regression: identity pattern not applied to key by convention in simple inheritance scenario
- [UpForGrabs] Remove EF.PowerShell.dll from SqlServerCe package
- Race condition in InitializeMappingViewCacheFactory
- Migrating from EF5 to EF6: InversePropertyAttribute broken?
- Code First: TPC with joins from base class to Identity ApplicationUser Fails
- Moving from EF5 -> EF6: Invalid Column Name error
- Having System.ComponentModel.DataAnnotations.MaxLength() applied to a string property leads to InvalidCastException exception
- Unable to cast Anonymous Type from 'System.Linq.IQueryable' to 'System.Data.Entity.Core.Objects.ObjectQuery'
- Designer: Can not reverse engineer EDMX when FKs in different order than PKs (composite keys)
- EF Configuration Cause Null Reference Exception.
- Stored Procs :: Concurrency check on value types in conjunction with stored procedures doesn't work properly
- TypeLoadException thrown for a class nested in an internal class
I really like the manner in which C# async tasks work. And while building Voron, I ran into a scenario in which I could really make use of the Windows async API. This is exposed via Overlapped I/O. The problem is that those are pretty different models, and they don’t appear to want to play together very nicely.
Since I don’t feel like having those two cohabitate in my codebase, I decided to see if I could write a TPL wrapper that would provide a nice API on top of the underlying Overlapped I/O implementation.
Here is what I ended up with:
Note that I create the file with overlapped enabled, as well as write_through & no buffering (I need them for something else, not relevant for now).
It is important to note that I bind the handle (which effectively issues a BindIoCompletionCallback under the covers, I think), so we won’t have to use events, but can use callbacks. This is a much more natural manner of working when using the TPL.
Then, we can just issue the actual work:
As you can see, all the actual details are handled in the helper functions, we can just run the code we need, passing it the overlapped structure it requires. Now, let us look at those functions:
The complexity here is that we need to handle 3 cases:
- Successful completion
- Error (no pending work)
- Error (actually success, work is done in an async manner).
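The helper functions themselves are not reproduced in this version of the post; the core of mapping those three cases onto a task via a TaskCompletionSource might look like this (the error constant is the standard ERROR_IO_PENDING value; the shapes of these helpers are my reconstruction, not the original code):

```csharp
using System.ComponentModel;
using System.Threading.Tasks;

static class OverlappedTasks
{
    private const int ErrorIoPending = 997; // Win32 ERROR_IO_PENDING

    // Called right after the native WriteFile/ReadFile returns,
    // with its boolean result and Marshal.GetLastWin32Error().
    public static Task HandleIssueResult(bool succeeded, int lastError,
                                         TaskCompletionSource<object> tcs)
    {
        if (succeeded)
        {
            // Case 1: completed synchronously; the completion callback
            // still fires for a bound handle, so just hand the task back.
            return tcs.Task;
        }
        if (lastError == ErrorIoPending)
        {
            // Case 3: an "error" that actually means the work is queued
            // and will complete asynchronously via the callback.
            return tcs.Task;
        }
        // Case 2: a real error; no completion callback is coming,
        // so fault the task ourselves.
        tcs.TrySetException(new Win32Exception(lastError));
        return tcs.Task;
    }

    // Invoked from the I/O completion callback bound to the handle.
    public static void OnCompletion(uint errorCode, TaskCompletionSource<object> tcs)
    {
        if (errorCode == 0)
            tcs.TrySetResult(null);
        else
            tcs.TrySetException(new Win32Exception((int)errorCode));
    }
}
```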
But that seems to be working quite nicely for me so far.
For the past few days I have been talking about our findings with regards to creating an ACID storage solution. And mostly I’ve been focusing on how it works with Windows, using Windows specific terms and APIs.
The problem is that I am not sure if those are still relevant if we talk about Linux. I know that fsync perf is still an issue (if only because both Win & Lin are running on the same hardware). But would the same solutions apply?
For example, the nearest thing that I can find to FILE_FLAG_NO_BUFFERING is O_DIRECT, and FILE_FLAG_WRITE_THROUGH appears to be similar to O_SYNC. But I am not sure if they actually behave in the same fashion.
Any ideas? Does anyone have something like Process Monitor for Linux who can look at the actual commit behavior of industry grade databases?
From my exploring, it appears that PostgreSQL uses fdatasync() as the default approach, but it can use O_DIRECT and O_DSYNC as well, so that is promising. But I would like to have someone who actually knows Linux intimately tell me whether I am even in the right direction.