We recorded the sessions at the recent RavenDB conference, and we are starting to make them available now. The first to show up is Mauro Servienti’s intro talk to RavenDB.

https://www.youtube.com/watch?v=0WEMNKk3HHU

 

Ayende Rahien - Corax: The problem of space - Posted at 4/23/2014 9:00:00 AM

Corax is a research project that we have, to see how we can build a full text search library on top of Voron. Along the way, we take the chance to find out how Lucene does things, and what we can do better. Pretty much from the get go, Corax is likely to use more disk space than Lucene, probably significantly so. I would be happy if we could keep it to merely a 50% increase over Lucene.

The reason for this is that Lucene goes to great lengths to save disk space: from storing all integers in variable length format, to prefix compression, to implicitly referencing data in other files. For example, you can see that when you try reading term positions:

TermPositions are ordered by term (the term is implicit, from the .tis file).

Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).

The downside of saving every little bit is that it is a lot more complex to read the data, requiring multiple caches and complex code paths to actually get it properly. It made a lot of sense when Lucene was created; disk space was at a premium. I won’t go as far as to say that disk space doesn’t matter, but given a trade-off of using more disk space vs. using more memory / complexity, it is much easier to justify disk space usage today*.

* The caveat here is that you need to be careful, because just accessing the disk can be very slow.

One of the major things that we wanted to deal with in Corax is reducing index corruption issues, and seeing if we can simplify things into a transactional system. As a side effect of that, we don’t need to have index segments, and we don’t need to do merges to free disk space. The problem is that in order to handle this, we need to track additional information that Lucene doesn’t need to.

Let us look at the actual data we keep. Here is a very simple index:

using (var fullTextIndex = new FullTextIndex(StorageEnvironmentOptions.CreateMemoryOnly(), new DefaultAnalyzer()))
{
    using (var indexer = fullTextIndex.CreateIndexer())
    {
        indexer.NewIndexEntry();
        indexer.AddField("Name", "Oren Eini");
        indexer.AddField("Email", "Ayende@ayende.com");

        indexer.NewIndexEntry();
        indexer.AddField("Name", "Arava Eini");
        indexer.AddField("Email", "Arava@houseof.dog");

        indexer.Flush();
    }
}

For each field, we are going to create a multi tree. And for each unique term in the field we have a list of (Index Entry Id, term frequency, boost).

  • @fld_Name
    • arava
      • { 2, 1, 1.0 }  (index id 2, freq 1, boost 1.0)
    • eini
      • { 1,1,1.0 }
      • { 2,1,1.0 }
    • oren
      • { 1,1,1 }
  • @fld_Email
    • arava@houseof.dog
      • { 2,1,1.0 }
    • ayende@ayende.com
      • { 1,1,1.0 }

This is pretty much equivalent to the way Lucene stores things. Possible space optimizations here include not storing default values (term freq or boost of 1), storing index entry ids as variable ints, etc.
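To make the variable int idea concrete, here is a minimal sketch (not Corax code) of the classic varint encoding that Lucene uses for its VInt format; small index entry ids cost a single byte instead of eight:

// Sketch: 7 bits of payload per byte, high bit set while more bytes follow.
// Works for non-negative values, which is all we need for index entry ids.
static int WriteVarInt(byte[] buffer, int offset, long value)
{
    var pos = offset;
    while (value >= 0x80)
    {
        buffer[pos++] = (byte)((value & 0x7F) | 0x80);
        value >>= 7;
    }
    buffer[pos++] = (byte)value;
    return pos - offset; // number of bytes written
}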

The problem is that while this is actually enough for the way Lucene does things, it is not enough for the way Corax does things. Let us consider the case of deleting a document. How would you go about doing this using the information above?

Lucene does this by marking a document id as deleted, and will purge its details on the next segments merge. That works, but only because a segment merge actually reads & writes all of the relevant segments’ data. Without a segments merge, deleting a document is actually something that would require us to scan all the data in the entire database. This is not really practical. Therefore, we need to store additional data so we can delete it later on. In this case, we have the Docs tree, which has keys for (index entry id, field id and term num). This looks like this:

  • Docs
    • [1,1,1]: oren
    • [1,1,2]: eini
    • [1,2,1]: ayende@ayende.com
    • [2,1,1]: arava
    • [2,1,2]: eini
    • [2,2,1]: arava@houseof.dog

Using this information, we can now remove all traces of a document when it is deleted. However, the problem here is that we need to also keep the terms per document in the index. That really blows up the index size, obviously.
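To show why the Docs tree earns its keep, here is a purely illustrative sketch (the helper names are hypothetical, this is not the Corax code) of how a delete becomes a targeted walk instead of a full index scan:

// Sketch: delete index entry 2 by walking only its own Docs keys.
void DeleteIndexEntry(long entryId)
{
    // Docs keys are (index entry id, field id, term num) -> term, so a prefix
    // scan on the entry id yields exactly the postings that have to be undone.
    foreach (var posting in ScanDocsByEntry(entryId))                 // hypothetical prefix scan
    {
        RemoveFromFieldTree(posting.FieldId, posting.Term, entryId);  // drops the { entryId, freq, boost } entry
        RemoveDocsKey(entryId, posting.FieldId, posting.TermNum);     // hypothetical
    }
}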

The reason for storing the document fields in this peculiar manner is that we also want to reuse this information for sorting. When Lucene needs to sort data, it has to read all of the data from the fields, then recreate the values for all relevant documents. Corax can just serve the data already there.

A pretty obvious step to save space would be to track the terms separately, and use an id in the Docs tree, not the full term. That leads to an interesting problem, because we are going to need to be able to go from term –> id and id –> term, which pretty much requires storing them twice, unfortunately.
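A minimal sketch of that bidirectional mapping (in memory here; in Corax those would presumably be two Voron trees), which is why the term text effectively gets stored twice:

class TermIds
{
    private readonly Dictionary<string, long> _byTerm = new Dictionary<string, long>();
    private readonly Dictionary<long, string> _byId = new Dictionary<long, string>();
    private long _next;

    public long GetOrAdd(string term)
    {
        long id;
        if (_byTerm.TryGetValue(term, out id))
            return id;
        id = ++_next;
        _byTerm[term] = id; // term -> id, needed when indexing and querying
        _byId[id] = term;   // id -> term, needed when resolving Docs entries for deletes / sorting
        return id;
    }

    public string GetTerm(long id)
    {
        return _byId[id];
    }
}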

Final note, Corax is a research project.

During the RavenDB courses* in the past few weeks, I was talking with one of the attendees and I came up with what I think is a great analogy.

* Yes, courses, in the past 4 weeks, I’ve given the RavenDB course 3 times. That probably explains why I don’t remember which course it was.

What are your success metrics? From Opening a Champagne Bottle To Hiding Under the Bed with Said Bottle?

The first success metric is when you have enough users (and, presumably, revenue) to cross the threshold to the Big Boys League. Let us call this the 25,000 users range. That is the moment when you throw a party, go to the store and grab a whole case of champagne bottles and make fancy speeches. Of course, the problem with success is that you can have too much of it. A system that does just fine (maybe creaks a little) with 25,000 users is going to behave pretty differently when you have 100,000 users. That is the moment when you find your engineers under the bed, with a half empty bottle of champagne, muttering things about Out Of Capacity errors and refusing to come out until we fire all the users.

In just about any system, you need to define the success points. Twitter was very lucky that it managed to grow even though it had so many problems when its user base exploded. It is far more likely that users will figure out that your service is… well, that your engineers are drunk and hiding under the bed, and the service looks accordingly.

And yes, you can try to talk to people about SLA, and metrics and capacities. But I have found that an image like that tends to give you a lot more focused answers. Even if a lot of the time the answer is “I don’t know”. That is a place to start, because it makes the question a lot more acute than just “how many req/sec do we need to support?”.

There is a lot of stuff that is hard to do when you are working on a large team. But the really nice thing is the velocity at which you can move.

I just started the morning with the following commands:

[Image: the commands]

And I have another PR pending for a different branch that I’m going to have to look at. Overall, I think that this is a pretty cool thing to have. We can push forward in many directions at once, and it can be pretty awesome to look at all the good things that are coming our way.

Ayende Rahien - Optimizing Voron and the cost of Multi Trees - Posted at 4/18/2014 9:00:00 AM

One of the nice features that Voron got from LMDB is the notion of multi trees. If I recall correctly, LMDB calls them duplicate items, or something like that. Basically, it is the ability to store multiple values for a single key. Those items are actually stored as a nested tree, which effectively gives us a great way to handle duplicates. From an API standpoint, this looks like this:

// store
tree.MultiAdd(tx, "Eini", "users/1");
tree.MultiAdd(tx, "Eini", "users/3");
tree.MultiAdd(tx, "Eini", "users/5");

// read
using (var it = tree.MultiRead(tx, "Eini"))
{
    if (it.Seek(Slice.BeforeAllKeys) == false)
        yield break;
    do
    {
        yield return it.CurrentKey; // "users/1", "users/3", "users/5"
    } while (it.MoveNext());
}

Internally, we handle this in the following fashion:

  • If a multi add operation is the very first such operation, we’ll add it as a simple key/value pair in the tree.
  • If a multi add operation is the 2nd such operation, we’ll create a new tree, and add both operations to the new tree. The original tree will have the key/nested tree reference stored.

That leads to a very easy to use API, which is quite useful in many cases. However, it also has space usage implications. In particular, a tree means that we have to allocate at least one page to it. And that means that we have to allocate 4KB to hold any multi tree with more than a single value. Since there are actually a lot of scenarios where we would want to store a small number of small values, that can often lead to a lot of waste on our part. We use 4KB to store data that we could have stored in just 64 bytes, for example. That means that when we want to store things that might be duplicated, we actually need to take the space cost into account as well.

I think that this is a pretty bad idea, because it leads us to do things with composite keys under some scenarios. That is something that Voron should be able to just handle for me. So we changed that; in fact, what we did was change the way we approach multi trees in general. Now, for every multi value, we can store up to 1KB of items in the same page as the key, before we devolve into a full blown multi tree.

The idea is that we can use the same Page mechanism we use elsewhere, and just have a nested page inside the parent page. As long as the nested page is small enough, we can just store it embedded. That results in some nice space savings, since usually we have items in the following counts: zero, one, few, lots.
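Conceptually, the write path now looks something like this (a sketch of the decision only, with made-up helper names, not the actual Voron code):

// Sketch: keep small multi values embedded next to the key; only promote to a
// real nested tree once the embedded values would grow past the 1KB threshold.
void MultiAdd(Transaction tx, Slice key, Slice value)
{
    var existing = ReadMultiValue(tx, key);            // hypothetical lookup
    if (existing.IsTree)
    {
        AddToNestedTree(tx, existing.Tree, value);     // already a full multi tree
        return;
    }

    if (existing.EmbeddedSize + value.Size <= 1024)
    {
        AppendToEmbeddedPage(tx, key, value);          // stays inside the parent page
    }
    else
    {
        var tree = CreateNestedTree(tx, key);          // promote: allocate a real page
        CopyEmbeddedValues(tx, existing, tree);
        AddToNestedTree(tx, tree, value);
    }
}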

We need this to help with Corax, and in general, that would reduce the amount of space we need to use in many cases.

C# FAQ - Visualizing Roslyn Syntax Trees - Posted at 4/17/2014 7:30:00 PM

Hello everyone! I hope you had a chance to catch the recent announcements around the .NET Compiler Platform (“Roslyn”). If not, I encourage you to view Anders’s presentation at Build 2014 (skip to 1:10:28). If you haven’t already, download the previews and take them for a spin!

What’s included?

The Roslyn compiler codebase is now open source. Check out Matt’s recent post for details around how to enlist in the open source project and modify the compiler. In addition to the End User Preview that Matt discussed at the beginning of his post, we are also releasing an SDK Preview. The End User Preview allows you to try out the new Roslyn IDE experience as well as some new language features that we are planning to include in the next version of C#. The SDK Preview on the other hand, contains tools and samples that will help you build your own applications, code refactoring extensions, quick fixes etc. atop the Roslyn APIs.

In this post, I’d like to introduce to you the Syntax Visualizer tool included in the SDK Preview. Detailed documentation for this tool is available on the Roslyn CodePlex site.

If you are new to Roslyn, I recommend starting with the Roslyn Overview document which provides a nice overview of how C# syntax and semantics are modelled in the Roslyn APIs. Also check out this talk that Dustin and Mads presented at Build 2014 for a tour of what’s in the above previews. More documentation and learning resources are available on the Roslyn CodePlex site.

What is the Syntax Visualizer?

The Syntax Visualizer is a Visual Studio extension that allows you to inspect and explore Roslyn syntax trees. This tool will be your faithful companion as you build your own applications atop the Roslyn APIs.

Why do I need a Syntax Visualizer?

In Roslyn, each type of syntactic construct in the language is modelled using a type named SyntaxNode. For example, C# class declarations are modelled using the type ClassDeclarationSyntax that is a derived type of SyntaxNode. A SyntaxNode can contain other SyntaxNodes. For example, the SyntaxNode for a class declaration contains SyntaxNodes for each method declaration in this class. The SyntaxNode for a method declaration in turn contains SyntaxNodes corresponding to each variable declaration inside this method and so on.

The source code in each file for a given program is thus modelled by a tree of SyntaxNodes. SyntaxNodes also contain SyntaxTokens and SyntaxTrivia which form the leaf nodes of syntax trees. SyntaxTokens represent significant text in the code (such as keywords, identifiers, literals and punctuation) while SyntaxTrivia represent text that is largely insignificant for normal understanding of the code (such as whitespace and comments).

The following image depicts what the syntax tree for a small snippet of C# code would look like. You can read more about SyntaxNodes, SyntaxTokens and SyntaxTrivia in the Roslyn Overview document.

[Image: the syntax tree for a small snippet of C# code]
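If you would rather poke at the same structure from code, here is a minimal sketch using the public Roslyn APIs (method names follow the released packages and may differ slightly in the older previews, where Kind() was CSharpKind()):

using System;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

class DumpTree
{
    static void Main()
    {
        var tree = CSharpSyntaxTree.ParseText("class C { void M() { int x = 1; } }");
        var root = tree.GetRoot();

        // Every node in the tree, with its kind and the text it spans.
        foreach (var node in root.DescendantNodesAndSelf())
            Console.WriteLine("{0}: {1}", node.Kind(), node);

        // Tokens (and their attached trivia) form the leaves of the tree.
        foreach (var token in root.DescendantTokens())
            Console.WriteLine("Token {0}: '{1}'", token.Kind(), token.Text);
    }
}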

Now the C# language supports a large variety of syntactic constructs (type declarations, method declarations, lambdas, variable declarations, method calls etc. etc.). So we have a corresponding large variety of SyntaxNodes in Roslyn and a variety of configurations in which these SyntaxNodes can appear within different syntax trees.

SyntaxNodes also happen to be the primary form of currency in the world of Roslyn APIs. Most tasks in the Roslyn APIs begin with SyntaxNodes. For example, you pass the SyntaxNode for the construct that you are interested in to other APIs such as SemanticModel to get richer semantic information about the construct. You specify which types of SyntaxNodes you wish to “handle” when you use the Roslyn Diagnostics and Code Fixes APIs to implement your own code fixes and code refactoring extensions.

When you combine these two aspects (i.e. a large surface area of syntactic constructs and the need to communicate in terms of these constructs all the time), it is clear that in order to be productive when building applications atop Roslyn, you need some easy way to inspect and navigate syntax trees. You need some easy way to answer questions like – What does the syntax tree for this piece of code look like? What is the type / Kind of the SyntaxNode that represents this language construct? This is where the Syntax Visualizer tool comes in handy.

What can I do with the Syntax Visualizer?

Once installed, the Syntax Visualizer allows you to inspect the syntax tree for any C# or VB code file that is open inside the Visual Studio IDE. You can select any piece of code in your file and the Syntax Visualizer will tell you what SyntaxNode it corresponds to and what the properties of this SyntaxNode look like. You can also select nodes from the Visualizer’s tree view and it will tell you what piece of code each node corresponds to.

[Screenshot: the Syntax Visualizer tool window in Visual Studio]

The Syntax Visualizer also allows you to view a Directed Syntax Graph for selected sub-trees in case you want a more detailed graphical view of the nodes in this sub-tree.

[Screenshots: Directed Syntax Graph views of a selected sub-tree]

Finally, the Syntax Visualizer also has some rudimentary support for inspection of Symbols and semantic information. You can read more about Symbols and APIs for performing semantic analysis in the Roslyn Overview document.

[Screenshot: symbol and semantic information in the Syntax Visualizer]

As I mentioned above, detailed documentation around how to install and use the Syntax Visualizer tool is available on Roslyn CodePlex site. Do try it out and use it as you explore and build rich code applications and extensions atop Roslyn! :)

Feedback

We would love to hear your feedback, ideas and suggestions. A list of the different ways in which you can provide feedback for the preview is available here. You can provide general feedback about the Syntax Visualizer here and report any issues you find over here.

Have fun with Roslyn! :)

--Shyam Namboodiripad on behalf of the Roslyn team

Ayende Rahien - The Corax Experiment: API - Posted at 4/17/2014 9:00:00 AM

I posted before about design practice for how I would approach building a search engine library. I decided to bite the bullet and actually try to do this. Using Voron, that turned out to be a really simple thing to do. Of course, this doesn’t do a tenth of what Lucene does, but it actually does quite a lot. The code is available here, and I want to emphasize again, this is purely experimental / research project.

The entire thing comes to less than 500 lines of code. And it is pretty functional even at this stage.

Corax is composed of:

  • Analysis
  • Indexing
  • Querying

Analysis of the documents is handled via analyzers:

public interface IAnalyzer
{
    ITokenSource CreateTokenSource(string field, ITokenSource existing);
    bool Process(string field, ITokenSource source);
}

An analyzer creates a token source, which accepts a TextReader and produces tokens. For each token, the Process method is called, and it is used to do things to the relevant token. For example, here is the default analyzer:

public class DefaultAnalyzer : IAnalyzer
{
    readonly IFilter[] _filters =
    {
        new LowerCaseFilter(),
        new RemovePossesiveSuffix(),
        new StopWordsFilter(),
    };

    public ITokenSource CreateTokenSource(string field, ITokenSource existing)
    {
        return existing ?? new StringTokenizer();
    }

    public bool Process(string field, ITokenSource source)
    {
        for (int i = 0; i < _filters.Length; i++)
        {
            if (_filters[i].ProcessTerm(source) == false)
                return false;
        }
        return true;
    }
}

The idea here is to match, fairly closely, what Lucene is doing, but hopefully with clearer code. This analyzer will take a stream of text, break it up into discrete tokens, lower case them, remove the possessive ’s suffix and clear stop words. Note that each of the filters is actually modifying the token in place. And the tokenizer is pretty simple, but it does the job for now.
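To illustrate the filter contract, here is a hedged sketch of what a filter could look like; the post doesn’t show ITokenSource’s members, so the assumption that the current token is readable as a string is mine, and this is not the actual Corax StopWordsFilter:

public class SimpleStopWordsFilter : IFilter
{
    static readonly HashSet<string> StopWords = new HashSet<string> { "a", "an", "the", "of" };

    public bool ProcessTerm(ITokenSource source)
    {
        // Returning false drops the token entirely; returning true keeps it,
        // possibly after mutating it in place (as LowerCaseFilter would).
        return StopWords.Contains(source.ToString()) == false;
    }
}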

Now, let us move to indexing. With Lucene, the recommendation is that you’ll reuse your document and field instances, to avoid creating garbage for the GC. With Corax, I took it a step further:

using (var indexer = fullTextIndex.CreateIndexer())
{
    indexer.NewDocument();
    indexer.AddField("Name", "Oren Eini");

    indexer.NewDocument();
    indexer.AddField("Name", "Ayende Rahien");

    indexer.Flush();
}

There are a couple of things to note here. An index can create indexers; it is intended to have multiple concurrent indexers running at the same time. Note that unlike Lucene, we don’t have Document or Field classes. Instead, we call methods on the indexer to create a new document and then add fields to the current document. When you are done with a document, you start a new one, or flush to complete the entire operation. For long running indexing, the indexer will flush itself automatically for you.

I think that this API gives us the best approach. It guides you toward using a single instance, with internal optimizations to make it memory efficient. Multiple instances can be used concurrently to speed up indexing time. And it knows when to flush itself for you, so you don’t have to worry about that. Although you do have to complete the operation by calling Flush() at the end.

How about searching? That turned out to be pretty similar as well. All you have to do is create a searcher:

using (var searcher = fti.CreateSearcher())
{
    QueryResults queryResults = searcher.QueryTop(new TermQuery("Name", "Arava"), 10);
    Console.WriteLine(queryResults.TotalResults);
    foreach (var match in queryResults.Results)
    {
        Console.WriteLine(match.DocumentId + " - " + match.Score);
    }
}

We create a searcher, and then we can utilize it to perform queries.

So far, this has been all about the API we have, I’ll talk about the actual implementation in my next post.

RavenDB is a server product; as such, we are obviously going to have to talk over the network. That is great, except that it means we need a system wide resource, a TCP port. And that isn’t so great, because it turns out that there are a lot of things that can try to grab a port and use it.

And I’m not talking about things like Skype taking over port 80. I’m talking about the use of dynamic or ephemeral ports. That means that during the build, we can have something (for example, Chrome, or DropBox, or anything that uses the web) make a call and suddenly that port is busy and you can’t bind to it.

The default ports we use for tests are in the 8070 – 8090 range, but we have had consistent build failures because something is taking the ports. Finally I took the time to investigate, and I discovered that:

C:\Work\ravendb-3.0 [master]> netsh int ipv4 show dynamicport tcp

Protocol tcp Dynamic Port Range
---------------------------------
Start Port      : 1025
Number of Ports : 64510

So, on the two machines I tested this on, we had the dynamic port range cover everything from port 1025 and up. This sucks. And probably explains the issue. This is how we fixed it (I hope):

netsh int ipv4 set dynamicportrange tcp startport=10000 numberofports=55535

Basically, everything below port 10,000 is free for us to use. And yes, this post is here mostly so I’ll not have to remember it in the future.
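As a side note, another cheap guard in test code (a sketch, not something from the RavenDB codebase) is to ask the OS for a free port instead of hard-coding one:

using System.Net;
using System.Net.Sockets;

static class PortFinder
{
    // Bind to port 0 so the OS picks a free port, then release it for the test server.
    // There is still a small race window before the real listener binds, but it beats
    // hard-coding a port that the dynamic range may already have grabbed.
    public static int GetFreePort()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        var port = ((IPEndPoint)listener.LocalEndpoint).Port;
        listener.Stop();
        return port;
    }
}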

Note: This is done purely as a design practice. We don’t have any current plans to implement this, but I find that it is a good exercise in general.

How would I go about building a search engine for RavenDB to replace Lucene? Well, we have Voron as the basis for storage, so from the get go, we have several interesting changes. To start with, we inherit the transactional properties of Voron, but more importantly, we don’t have to do merges, so we don’t have any of those issues. In other words, we can actually generate stable document ids.

But I’m probably jumping ahead a bit. We’ll start with the basics. Analysis / indexing is going to be done in very much the same way. Users will provide an analyzer and a set of documents, like so:

var index = new Corax.Index(storageOptions, new StandardAnalyzer());

var scope = index.CreateIndexingScope();

long docId = scope.Add(new Document
{
    {"Name", "Oren Eini", Analysis.Analyzer},
    {"Name", "Ayende Rahien", Analysis.Analyzer},
    {"Email", "ayende@ayende.com", Analyzed.AsIs, Stored.Yes}
});

docId = scope.Add(new Document
{
    {"Name", "Arava Eini", Analysis.Analyzer},
    {"Email", "arava@doghouse.com", Analyzed.AsIs, Stored.Yes}
});

index.Submit(scope);

Some notes about this API. It is modeled quite closely after the Lucene API, and it would probably need additional work. The idea here is that you are going to need to get an indexing scope, which is single threaded. And you can have multiple indexing scopes running at the same time. You can batch multiple writes into a single scope, and it behaves like a transaction.

The idea is to push all of the work associated with indexing a document into a single threaded unit of work, which makes it easier for us. Note that we immediately get the generated document id, but that the document will only be available for searching when you have submitted the scope.

Under the hood, this does all of the work at the time of calling Add(). The analyzer will run on the relevant fields, and we will create the appropriate entries. How does that work?

Every document has a long id associated with it. In Voron, we are going to have a ‘Documents’ tree, with the long id as the key, and the value is going to be all the data about the document we need to keep. For example, it would have the stored fields for that document, or the term positions, if required, etc. We’ll also have a Fields tree, which will have a mapping of all the field names to an integer value.

Of more interest is how we are going to deal with the terms and the fields. For each field, we are going to have a tree called “@Name”, “@Email”, etc. Those are multi trees, with the keys in that tree being the terms, and the multi values being the document ids that have those terms. In other words, the code above is going to generate the following data in Voron (a sketch of the corresponding writes follows the listing):

  • Documents tree:
    • 0 – { “Fields”: { 1: “ayende@ayende.com” } }
    • 1 – { “Fields”: { 1: “arava@doghouse.com” } }
  • Fields tree
    • 0 – Name
    • 1 – Email
  • @Name tree
    • ayende – [ 0 ]
    • arava – [ 1 ]
    • oren – [ 0 ]
    • eini – [ 0, 1 ]
    • rahien – [ 0 ]
  • @Email tree
    • ayende@ayende.com – [ 0 ]
    • arava@doghouse.com – [ 1 ]
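Here is a rough sketch of how an Add() call might translate into those writes, using the MultiAdd style API shown in the multi trees post; GetTree, Analyze, SerializeStoredFields and NextDocumentId are made-up helpers, not the actual Voron or Corax API:

long Add(Transaction tx, Document doc)
{
    long docId = NextDocumentId(tx);

    // Stored fields (and term positions, if requested) go into the Documents tree.
    GetTree(tx, "Documents").Add(tx, docId.ToString(), SerializeStoredFields(doc));

    foreach (var field in doc.Fields)
    {
        var fieldTree = GetTree(tx, "@" + field.Name);        // e.g. "@Name", "@Email"
        foreach (string term in Analyze(field))               // run the field's analyzer
            fieldTree.MultiAdd(tx, term, docId.ToString());   // term -> set of document ids
    }
    return docId;
}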

Given this model, we can now turn to the actual querying part of the business. Let us assume that we have the following query:

var queryResults = index.Query("Name: Oren OR Email: ayende@ayende.com");
foreach (var result in queryResults.Matches)
{
    Console.WriteLine("{0}: {1}", result.Id, result["Email"]);
}

How does querying work? The query above would actually be something like:

new BooleanQuery(
    Match.Or,
    new TermQuery("Name", "Oren"),
    new TermQuery("Email", "ayende@ayende.com")
    )

The actual implementation of this query would just access the relevant field tree and load the values in the particular key. The result is a set of ids for both parts of the query, and since we are using an OR here, we will just union them and return the list of results back.

Optimizations here can be to run those queries in parallel, or to just cache the results of a particular term query, so we can speed things up even more. This looks simple, and it is, because a lot of the work has already been done in Voron. Searching is pretty much complete; we only need to tell it what to search with.
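For completeness, a sketch of how the OR could be evaluated with the MultiRead iterator pattern from the multi trees post; GetTree and ParseDocId are hypothetical helpers:

HashSet<long> QueryOr(Transaction tx, TermQuery left, TermQuery right)
{
    var results = new HashSet<long>();
    foreach (var q in new[] { left, right })
    {
        var fieldTree = GetTree(tx, "@" + q.Field);         // hypothetical tree lookup
        using (var it = fieldTree.MultiRead(tx, q.Term))    // iterate the doc ids stored for the term
        {
            if (it.Seek(Slice.BeforeAllKeys) == false)
                continue;
            do
            {
                results.Add(ParseDocId(it.CurrentKey));     // hypothetical parse of the stored id
            } while (it.MoveNext());
        }
    }
    return results; // the union of both terms' document ids
}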

It has been a really busy last 10 days for the Azure team. This blog post quickly recaps a few of the significant enhancements we’ve made.  These include:

  • Web Sites: SSL included, Traffic Manager, Java Support, Basic Tier
  • Virtual Machines: Support for Chef and Puppet extensions, Basic Pricing tier for Compute Instances
  • Virtual Network: General Availability of DynamicRouting VPN Gateways and Point-to-Site VPN
  • Mobile Services: Preview of Visual Studio support for .NET, Azure Active Directory integration and Offline support;
  • Notification Hubs: Support for Kindle Fire devices and Visual Studio Server Explorer integration
  • Autoscale: General Availability release
  • Storage: General Availability release of Read Access Geo Redundant Storage
  • Active Directory Premium: General Availability release
  • Scheduler service: General Availability release
  • Automation: Preview release of new Azure Automation service

All of these improvements are now available to use immediately (note that some features are still in preview).  Below are more details about them:

Web Sites: SSL now included at no additional charge in Standard Tiers

With Azure Web Sites you can host up to 500 web-sites in a single standard tier hosting plan.  Azure web-sites run in VMs isolated to host only your web applications (giving you predictable performance and security isolation), and you can scale-up/down the number of VMs either manually or using our built-in AutoScale functionality.  The pricing for standard tier web-sites is based on the number of VMs you run – if you host all 500 web-sites in a single VM then all you pay for is for that single VM, if you scale up your web site plan to run across two VMs then you’d pay for two VMs, etc.

Prior to this month we charged an additional fee if you wanted to enable SSL for the sites.  Starting this month, we now include the ability to use 5 SNI based SSL certificates and 1 IP based SSL certificate with each standard tier web site hosting plan at no additional charge.  This helps make it even easier (and cheaper) to SSL enable your web-sites.

Web Sites: Traffic Manager Support

I’ve blogged in the past about the Traffic Manager service we have built into Azure. 

The Azure Traffic Manager service allows you to control the distribution of user traffic to applications that you host within Azure. This enables you to run instances of your applications across different azure regions all over the world.  Traffic Manager works by applying an intelligent routing policy engine to the Domain Name Service (DNS) queries on your domain names, and maps the DNS routes to the appropriate instances of your applications (e.g. you can setup Traffic Manager to route customers in Europe to a European instance of your app, and customers in North America to a US instance of your app).

You can use Traffic Manager to improve application availability - by enabling automatic customer traffic fail-over scenarios in the event of issues with one of your application instances.  You can also use Traffic Manager to improve application performance - by automatically routing your customers to the application instance nearest them.

We are excited to now provide general availability support of Traffic Manager with Azure Web Sites.  This enables you to both improve the performance and availability of your web-sites.  You can learn more about how to take advantage of this new support here.

Web Sites: Java Support

This past week we added support for an additional server language with Azure Web Sites – Java.  It is now easy to deploy and run Java web applications written using a variety of frameworks and containers including:

  • Java 1.7.0_51 – this is the default supported Java runtime
  • Tomcat 7.0.50 – the default Java container
  • Jetty 9.1.0

You can manage which Java runtime you use, as well as which container hosts your applications using the Azure management portal or our management APIs.  This blog post provides more detail on the new support and options.

With this announcement, Azure Web Sites now provides first class support for building web applications and sites using .NET, PHP, Node.js, Python and Java.  This enables you to use a wide variety of language + frameworks to build your applications, and take advantage of all the great capabilities that Web Sites provide (Easy Deployment, Continuous Deployment, AutoScale, Staging Support, Traffic Manager, outside-in monitoring, Backup, etc).

Web Sites: Support for Wildcard DNS and SSL Certificates

Azure Web Sites now supports the ability to map wildcard DNS and SSL Certificates to web-sites.  This enables a variety of scenarios – including the ability to map wildcard vanity domains (e.g. *.myapp.com – for example: scottgu.myapp.com) to a single backend web site.  This can be particularly useful for SaaS based scenarios.

Scott Cate has an excellent video that walks through how to easily set this support up.

Web Sites: New Basic Tier Pricing Option

Earlier in this post I talked about how we are now including the ability to use 5 SNI and 1 IP based SSL certificate at no additional cost with each standard tier azure web site hosting plan.  We have also recently announced that we are also including the auto-scale, traffic management, backup, staging and web jobs features at no additional cost as part of each standard tier azure web site hosting plan as well.  We think the combination of these features provides an incredibly compelling way to securely host and run any web application.

New Basic Tier Pricing Option

Starting this month we are also introducing a new “basic tier” option for Azure web sites which enables you to run web applications without some of these additional features – and at 25% less cost.  We think the basic tier is great for smaller/less-sophisticated web applications, and enables you to be successful while paying even less. 

For additional details about the Basic tier pricing, visit the Azure Web sites pricing page.  You can select which tier your web-site hosting plan uses by clicking the Scale tab within the Web Site extension of the Azure management portal.

Virtual Machines: Create from Visual Studio

With the most recent Azure SDK 2.3 release, it is now possible to create Virtual Machines from directly inside Visual Studio’s Server Explorer.  Simply right-click on the Azure node within it, and choose the “Create Virtual Machine” menu option:

[Screenshot]

This will bring up a “Create New Virtual Machine” wizard that enables you to walkthrough creating a Virtual Machine, picking an image to run in it, attaching it to a virtual network, and open up firewall ports all from within Visual Studio:

[Screenshot]

Once created you can then manage the VM (shutdown, restart, start, remote desktop, enable debugging, attach debugger) all from within Visual Studio:

[Screenshot]

This makes it incredibly easy to start taking advantage of Azure without having to leave the Visual Studio IDE.

Virtual Machines: Integrated Puppet and Chef support

In a previous blog post I talked about the new VM Agent we introduced as an optional extension to Virtual Machines hosted on Azure.  The VM Agent is a lightweight and unobtrusive process that you can optionally enable to run inside your Windows and Linux VMs. The VM Agent can then be used to install and manage extensions, which are software modules that extend the functionality of a VM and help make common management scenarios easier. 

At last week’s Build conference we announced built-in support for several new extensions – including extensions that enable easy support for Puppet and Chef.  Puppet and Chef allow developers and IT administrators to define and automate the desired state of infrastructure configuration, making it effortless to manage 1000s of VMs in Azure.

Enabling Puppet Support

We now have a built-in VM image within the Azure VM gallery that enables you to easily stand up a puppet-master server that you can use to store and manage your infrastructure using Puppet.  Creating a Puppet Master in Azure is now easy – simply select the “Puppet Enterprise” template within the VM gallery:

[Screenshot]

You can then create new Azure virtual machines that connect to this Puppet Master.  Enabling this with VMs created using the Azure management portal is easy (we also make it easy to do with VMs created with the command-line).  To enable the Puppet extension within a VM you create using the Azure portal simply navigate to the last page of the Create VM from gallery experience and check the “Puppet Extension Agent” extension within it:

[Screenshot]

Specify the URL of the Puppet master server to get started. Once you deploy the VM, the extension will configure the puppet agent to connect to this Puppet master server and pull down the initial configuration that should be used to configure the machine.

This new support makes it incredibly easy to get started with both Puppet and Chef and enable even richer configuration management of your IaaS infrastructure within Azure.

Virtual Machines: Basic Tier

Earlier in this blog post I discussed how we are introducing a new “Basic Tier” option for Azure Web Sites.  Starting this month we are also introducing a “Basic Tier” for Virtual Machines as well.

The Basic Tier option provides VM options with similar CPU + memory configuration options as our existing VMs (which are now called “Standard Tier” VMs) but do not include the built-in load balancing and AutoScale capabilities.  They also cost up to 27% less.  These instances are well-suited for production applications that do not require a built-in load balancer (you can optionally bring your own load balancer), batch processing scenarios, as well as for dev/test workloads.  Our new Basic tier VMs also have similar performance characteristics to AWS’s equivalent VM instances (which are less powerful than the Standard tier VMs we have today).

Comprehensive pricing information is now available on the Virtual Machines Pricing Details page.

Networking: General Availability of Azure Virtual Network Dynamic Routing VPN Gateways and Point-to-Site VPN

Last year, we previewed a feature called DynamicRouting Gateway and Point-to-Site VPN that supports Route-based VPNs and allows you to connect individual computers to a Virtual Network in Azure. Earlier this month we announced that the feature is now generally available. The DynamicRouting VPN Gateway in a Virtual Network will now carry the same 99.9% SLA as the StaticRouting VPN Gateway.

[Image]

Now that we’re in General Availability mode, DynamicRouting Gateway will automatically incur standard Gateway charges which will take effect starting May 1, 2014. 

For further details on the service, please visit the Virtual Network website.

Mobile Services: Visual Studio Support for Mobile Services .NET Backend

With Visual Studio 2013 Update 2, you can now create your backend Mobile Service logic using .NET and the ASP.NET Web API framework in Visual Studio, using Mobile Services templates and scaffolds. Mobile Services support for .NET on the backend offers the following benefits:

  1. You can use ASP.NET Web API and Visual Studio together with Mobile Services to add a backend to your mobile app in minutes
  2. You can publish any existing Web API to Mobile Services and benefit from authentication, push notifications and other capabilities that Mobile Services provides. You can also take advantage of any Web API features like OData controllers, or 3rd party Web API-based frameworks like Breeze.
  3. You can debug your Mobile Services .NET backend using Visual Studio running locally on your machine or remotely in Azure.
  4. With Mobile Services we run, manage and monitor your Web API for you. Azure will automatically notify you if we discover you have a problem with your app.
  5. With Mobile Services .NET support you can store your data securely using any data backend of your choice: SQL Azure, SQL on Virtual Machine, Azure Table storage, Mongo, et al.

It’s easy to get started with Mobile Services .NET support in Visual Studio. Simply use the File-New Project dialog and select the Windows Azure Mobile Service project template under the Cloud node.

[Screenshot]

Choose Windows Azure Mobile Service in the New ASP .NET Project dialog.

[Screenshot]

You will see a Mobile Services .NET project; notice this is a customized ASP.NET Web API project with additional Mobile Service NuGet packages and sample controllers automatically included:

[Screenshot]

Running the Mobile Service Locally

You can now test your .NET Mobile Service project locally. Open the sample TodoItemController.cs in the project. The controller shows you how you can use the built-in TableController<T> .NET class we provide with Mobile Services. Set a breakpoint inside the GetAllTodoItems() method and hit F5 within Visual Studio to run the Mobile Service locally.

[Screenshot]
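For context, the scaffolded controller looks roughly like this; this is reconstructed from memory of the preview templates, so exact names and namespaces may differ slightly:

// Sketch of the TodoItemController scaffolded by the Mobile Services .NET backend template.
public class TodoItemController : TableController<TodoItem>
{
    protected override void Initialize(HttpControllerContext controllerContext)
    {
        base.Initialize(controllerContext);
        var context = new MobileServiceContext(); // the EF data context generated by the template
        DomainManager = new EntityDomainManager<TodoItem>(context, Request, Services);
    }

    // GET tables/TodoItem
    public IQueryable<TodoItem> GetAllTodoItems()
    {
        return Query();
    }
}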

Mobile Services includes a help page to view and test your APIs. On the help page, click on the try it out link and then click the GET tables/TodoItem link. Then click try this out and send on the GET tables/TodoItem page. As you might expect, you will hit the breakpoint you set earlier.

[Screenshot]

Add APIs to your Mobile Service using Scaffolds

You can add additional functionality to your Mobile Service using Mobile Service or generic Web API controller scaffolds through the Add Scaffold dialog (right click on your project and choose Add -> New Scaffolded Item… command)

[Screenshot]

Publish your Mobile Services project to Azure

Once you are done developing your Mobile Service locally, you can publish it to Azure. Simply right click on your project and choose the Publish command. Using the publish wizard, you can publish to a new or existing Azure Mobile Service:

[Screenshot]

Remote debugging

Just like Cloud Services and Websites, you can now remote debug your Mobile Service to get more visibility into how your code is operating live in Azure. To enable remote debugging for a Mobile Service, publish your Mobile Service again and set the Configuration to Debug in the Publish wizard.

[Screenshot]

Once your Mobile Service is published and running live in the cloud, simply set a breakpoint in local source code. Then use Visual Studio’s Server Explorer to select the Mobile Service instance deployed in the cloud, right click and choose the Attach Debugger command.

[Screenshot]

Once the debugger attaches to the mobile service, you can use the debugging capabilities of Visual Studio to instantly and in-real time debug your app running in the cloud.

To learn more about Visual Studio support for the Mobile Services .NET backend, follow the tutorials at:

This new .NET backend support makes it easy to create even better mobile applications than ever before.

Mobile Services: Offline Support

In addition to the above support, we are also introducing a preview of a new Mobile Services Offline capability with client SDK support for Windows Phone and Windows Store apps.

With this functionality, mobile applications can create and modify data even when they are offline/disconnected from a network. When the app is back online, it can synchronize local changes with the Mobile Services Table APIs. The feature also includes support for detecting conflicts when the same record is changed on both the client and the backend.

To use the new Mobile Services offline functionality, set up a local sync store. You can define your own sync store or use the provided SQLite-based implementation.  The Mobile Services SDK provides a new local table API for the sync store, with a symmetrical programming model to the existing Mobile Services Table API. You can use Optimistic Concurrency along with the offline feature to detect conflicting changes between the client and backend.

The preview of the Mobile Services Offline feature is available now as part of the Mobile Services SDK for Windows Store and Windows Phone apps. In the future, we will support all client platforms supported by Mobile Services, including iOS, Android, Xamarin, etc.

Mobile Services: Support for Azure Active Directory Sign On

We now support Azure Active Directory Single Sign On for Mobile Services.  Azure Active Directory authentication is available for both the .NET and Node.js backend options of Mobile Services.

To take advantage of the feature, first register your client app and your Mobile Service with your Azure Active Directory tenant using the Applications tab in the Azure Active Directory management portal.

[Screenshot]

In your client project, you will need to add the Active Directory Authentication Library (ADAL), currently available for Windows Store, iOS, and Android clients.

From there on, the token retrieved from ADAL library can be used to authenticate and access Mobile Services.  The single sign-on features of ADAL also enables your mobile service to make calls to other resources (such as SharePoint and Office 365) on behalf of the user.  You can read more about the new ADAL functionality here.

These new updates make Mobile Services an even more attractive platform for building powerful employee facing apps.

Notification Hub: Kindle Support and Visual Studio Integration

I’ve previously blogged about Azure Notification Hubs, a high scale cross platform push notification service that allows you to instantly send personalized push notifications to segments of your audience or individuals containing millions of iOS, Android, Windows, Windows Phone devices with a single API call.

Today we’ve made two important updates to Azure Notification Hubs: adding support for Amazon Kindle Fire devices, and Visual Studio support for Notification Hubs.

Support for Amazon’s Kindle

With today’s addition you can now configure your Notification Hubs with Amazon Device Messaging (ADM) service credentials on the configuration page for your Notification Hub in the Azure Management portal, and start sending push notifications to your app on Amazon’s Kindle device, in addition to iOS, Android, or Windows.

[Screenshot]

Testing Push Notifications with Visual Studio

Earlier I blogged about how we enabled debugging push notifications using the Azure Management Portal. With today’s Visual Studio update, you can now browse your notification hubs and send test push notifications directly from Visual Studio Server Explorer as well.

Simply select your notification hub in the Server Explorer of Visual Studio under the Notifications Hubs node.  Then right click, and choose the Send Test Notifications command:

[Screenshot]

In the notification hub window, you can then send a message either to a particular tag or all registered devices (broadcast). You can select from a variety of templates - Windows Store, Windows Phone, Android, iOS, or even a cross platform message using the Custom Template. After you hit Send, you’ll receive the message result instantly to help you diagnose if your message was successfully sent or not.

[Screenshot]

To learn more about Azure Notification Hubs, read tutorials here.

AutoScale: Announcing General Availability of Autoscale Service

Last summer we announced the preview release of our Autoscale service. I’m happy to announce that Autoscale is now generally available!  Better yet, there's no additional charge for using Autoscale.

We've added new features since we first released it as a preview version: support for both performance- and schedule-based autoscaling, along with an API and .NET SDK so you can programmatically scale using any performance counters that you define.

Autoscale supports all four Azure compute services: Cloud Services, Virtual Machines, Mobile Services and Web Sites. For Virtual Machines and Web Sites, Autoscale is included as a feature in the Standard pricing tiers, and for Mobile Services, it's included as a part of both Basic and Standard pricing tiers.

Storage: Announcing General Availability of Read Access Geo Redundant Storage (RA-GRS)

In December, we added the ability to allow customers to achieve higher read availability for their data. This feature called Read Access - Geo Redundant Storage (RA-GRS) allows you to read an eventually consistent copy of your geo-replicated data from the storage account’s secondary region in case of any unavailability to the storage account’s primary region.

Last week we announced that RA-GRS feature is now out of preview mode, and generally available. It is available to all Azure customers across all regions including the users in China.

RA-GRS SLA and Pricing

The benefit of using RA-GRS is that it provides a higher read availability (99.99+%) for a storage account over GRS (99.9+%). When using RA-GRS, the write availability continues to be 99.9+% (same as GRS today) and read availability for RA-GRS is 99.99+%, where the data is expected to be read from secondary if primary is unavailable. In terms of pricing, the capacity (GB) charge is slightly higher for RA-GRS than GRS, whereas the transaction and bandwidth charges are the same for GRS and RA-GRS. See the Windows Azure Storage pricing page here for more details about the SLA and pricing.

You can find more information on the storage blog here.

Active Directory: General Availability of Azure AD Premium

Earlier this month we announced the general availability of Azure Active Directory Premium, which provides additional identity and access management capabilities for enterprises. Building upon the capabilities of Azure AD, Azure AD Premium provides these capabilities with a guaranteed SLA and no limit on directory size. Additional capabilities include:

  • Group-based access assignment enables administrators to use groups in AD to assign access for end users to over 1200 cloud applications in the AD Application Gallery. End users can get single-sign on access to their applications from their Access Panel at https://myapps.microsoft.com or from our iOS application.
  • Self-service password reset that enables end users to reset forgotten passwords without calling your help desk.
  • Delegated group management that enables end users to create security groups and manage membership in security groups they own.
  • Multi-Factor Authentication that lets you easily deploy a Multi-Factor Authentication solution for your business without deploying new software or hardware.
  • Customized branding that lets you include your organization’s branding elements in the experiences that users see when signing in to AD or accessing their Access Panel.
  • Reporting, alerting, and analytics that increase your visibility into application usage in your organization, and potential security concerns with user accounts.

Azure AD Premium also includes usage rights for Forefront Identity Manager Server and Client Access Licenses.

To read more about AD Premium, including how to acquire it, read the Active Directory Team blog.

Active Directory: Public Preview of Azure Rights Management Service

Earlier this month we announced the public preview of the ability to manage your Azure Rights Management service within the Azure Management Portal. If your organization has Azure Rights Management either as a stand-alone service or as part of your Office 365 or EMS subscriptions you can now manage it by signing into the Azure Management Portal. Once in the Portal, select ACTIVE DIRECTORY in the left navigation bar, navigate to the RIGHTS MANAGEMENT tab, then click on the name of your directory.

[Screenshot]

With this preview you can now create custom rights policy templates that let you define who can access sensitive documents, and what permissions (view, edit, save, print, and more) users can have on those documents.  To begin creating a rights policy template, in the Quick Start page, click on Create an additional rights policy template option and follow the instructions on the page to define a name and description for the template, add users and rights and define other restrictions.

[Screenshot]

Once your template has been created and published, it will become available to users in your organization in their favorite applications.

[Screenshot]

To learn more about managing Azure Rights Management and the benefits it offers to organizations, see the Information Protection group’s blog.

Scheduler: General Availability Release of the Scheduler Service

This month we’ve also delivered the General Availability release of the Azure Scheduler service.  Scheduler allows you to run jobs on simple or complex recurring schedules that can invoke HTTP/S endpoints or post messages to storage queues. Scheduler has built-in high availability and can reliably call services inside or outside of Azure.

During preview customers have used it for a wide set of scenarios including for invoking services in their backend for Hadoop workloads, triggering diagnostics cleanup, and periodically checking that partners have submitted content on time. ISVs have used it to empower their applications to add scheduling capabilities such as report generation and sending reminders.

In the Scheduler portal extension you can easily create and manage your scheduler jobs. Since the initial release, Scheduler has also added the ability to update HTTP jobs with custom headers and basic authentication. It has also exposed the ability to change the recurrence schedule which will allow you to also choose to limit the execution of a job or allow the job to run infinitely.

With the general availability, new Azure Scheduler cmdlets have been released with Azure PowerShell and the Scheduler .NET API has been included in WAML 1.0.

I highly encourage you to try out the Scheduler today. You might find the following links helpful:

It makes scheduling recurring tasks really easy.

Automation: Announcing Microsoft Azure Automation Preview

Last week we announced the preview of a new Microsoft Azure service: Automation.

Automation allows you to automate the creation, deployment, monitoring, and maintenance of resources in your Azure environment using a highly scalable and reliable workflow execution engine. The service can be used to orchestrate the time-consuming, error-prone, and frequently repeated tasks you’d otherwise accomplish manually across Microsoft Azure and third-party systems to decrease operational expense for your cloud operations.

To get started with Automation, you first need to sign-up for the preview on the Azure Preview page. Once you have been approved for the preview, you can sign in to the Management Portal and start using it. Automation is currently only available in the East-US data center, but we will add the ability to deploy to additional data centers in the future.

Authoring a Runbook

Once you have the Automation preview enabled on your subscription, you can easily get started automating by following a few simple steps:

Step 1: In the Microsoft Azure management portal, click New->App Services->Automation->Runbook->Quick Create to create a new runbook. Runbooks are collections of activities that provide an environment for automating everything from diagnostic logging, to applying updates to all instances of a virtual machine or web role, to renewing certificates, to cleaning storage accounts. Enter a name and description for the runbook, and create a new Automation account which will store your Runbooks, Assets, and Jobs.

Next time you create a runbook you can either use the same Automation account you just created or create a separate one if you’d like to maintain separation between a few different collections of runbooks / assets.

[Screenshots]

Step 2: Click on your runbook, then click Author->Draft. Type some PowerShell commands in the editor, then hit ‘Publish’ to make this runbook draft available for production execution.

[Screenshot]

Starting a Runbook and Viewing the Job

1. To start the runbook you just published, go back to the ‘Runbooks’ tab, click on your newly-published runbook, and hit ‘Start.’ Enter any required parameters for the runbook, then click the checkmark button.

[Screenshot]

2. Click on your runbook, then click on the ‘Jobs’ tab for this runbook. Here you can view all the instances of a runbook that have run, called jobs. You should see the job you just started.

[Screenshot]

3. Click on the job you just started to view more details about its execution. Here you can see the job output, as well as any exceptions that may have occurred while the job was executing.

[Screenshot]

Once you get familiar with the service, you’ll be able to create more sophisticated runbooks to automate your scenarios. I encourage you to try out Microsoft Azure Automation today.

For more information, click through the following links:

Summary

This most recent release of Azure includes a bunch of great features that enable you to build even better cloud solutions.  If you don’t already have an Azure account, you can sign-up for a free trial and start using all of the above features today.  Then visit the Azure Developer Center to learn more about how to build apps with it.

Hope this helps,

Scott

P.S. In addition to blogging, I am also now using Twitter for quick updates and to share links. Follow me at: twitter.com/scottgu

Miguel de Icaza - Documentation for our new iOS Designer - Posted at 4/14/2014 2:53:00 PM

The team has put together some beautiful getting started documentation for our iOS User Interface Designer.

In particular, check out a couple of hot features on it:

Miguel de Icaza - Mono on PS4 - Posted at 4/14/2014 2:50:00 PM

We have been working with a few PlayStation 4 C# lovers for the last few months. The first PS4 powered by Mono and MonoGame was TowerFall:

We are very excited about the upcoming Transistor, by the makers of Bastion, coming out on May 20th:

Mono on the PS4 is based on a special branch of Mono that was originally designed to support static compilation for Windows’ WinStore applications [1].

[1] Kind of not very useful anymore, since Microsoft just shipped static compilation of .NET at BUILD. Still, there is no wasted effort in Mono land!

Ayende Rahien - The dark sides of Lucene - Posted at 4/14/2014 9:00:00 AM

I’ve been using Lucene for the past six or seven years, and after my last post, I thought it would be a good idea to talk a bit about the kind of things that it isn’t doing well. We’ve been using it extensively in RavenDB for the past 5 years, and I think that I have a pretty good understanding of it. We used to have one of the Lucene.NET committers working at Hibernating Rhinos, so I’ve a high level of confidence that I’m not just stupidly misusing it, too.

Probably the part that caused us the most pain with Lucene was the fact that it isn’t transactional. That is, it is quite easy to get into situations where the indexes are corrupted. That makes it… challenging to use in a database that needs to ensure consistency. The problem is that it is really not a use case that Lucene is well suited for. In order to ensure that data is saved, we have to commit often; the problem is that in order to ensure good performance, we want to commit less often, but then we will lose the changes if we crash. For that matter, Lucene doesn’t make any attempt to actually flush the data properly, relying on the OS to do that, so a system crash can cause you to lose data even though you “committed” it.

Fun times, I can tell you that.

Next, we have the issue of what Lucene calls updates. Updates in Lucene are actually just a delete followed by an add, and they don’t maintain the same document id (more on that later). Because of that, you usually have to have an additional field in the index that acts as your primary key, and you handle updates by first deleting and then adding. That is quite strange, to be fair, and it means that you can’t “extend” an index entry; you have to build it from scratch every time.
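In practice that looks something like the following sketch (again Lucene.NET-flavored; the field names and values are made up). UpdateDocument is just shorthand for “delete everything matching this term, then add the new document”, so the entry is rebuilt from scratch and gets a brand new document id:

var doc = new Document();
doc.Add(new Field("Id", "users/1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Name", "Oren Eini", Field.Store.NO, Field.Index.ANALYZED));

// There is no "update this field in place": we key the entry by our own
// primary key field and replace the whole thing, delete + add.
writer.UpdateDocument(new Term("Id", "users/1"), doc);
writer.Commit();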

Speaking of this, let us talk a bit about deletes. Ignoring for the moment the absolutely horrendous decision to do deletes through the reader, let us talk about how they are actually done. Deletes are recorded in a separate file, and that means that the moment you have any deletes (or, as I mentioned, updates), all the internal statistics are wrong. We run into this quite often with RavenDB when we are doing things like facets or suggestions. For example, if you request a suggestion for a user name, it will happily give you suggestions based on deleted users, even though we deleted them in Lucene.
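A quick sketch of the problem (Lucene.NET-flavored, with hypothetical field values): the delete only marks the document in a separate file, so the term statistics that drive suggestions and facets keep counting it until a merge actually rewrites the segment.

// Deletes only set a bit in a separate deletions file.
reader.DeleteDocuments(new Term("Name", "arava"));

// The document no longer comes back from searches, but DocFreq still counts
// it, so anything built on top of the statistics keeps "seeing" the deleted
// user until the next merge.
int stillCounted = reader.DocFreq(new Term("Name", "arava"));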

It will go away eventually, when Lucene gets around to optimizing the index by merging all the files, but in the meantime it makes for interesting bug reports. Speaking of merging, that is another common issue that you have to deal with. In order to ensure optimal performance, you have to be on top of the merge policy. This results in some interesting issues. For RavenDB’s purposes, we do a writer commit after every indexing batch. That means that if you are writing to RavenDB slowly enough, we do a commit after every document write. That results in a lot of segments, and the merge policy has to do a lot of merges. The problem here is that merges have two distinct costs associated with them.

First, and obviously, you are going to need to write (again) all of the documents in all of the segments you are merging. That is very similar to doing merges in LevelDB (indeed, in general Lucene’s file format is remarkably similar to SSTables). Next, and arguably more interesting / problematic from our point of view, is the fact that it also kills all of the caches. Let me try to explain: Lucene uses a lot of caches to speed things up; most of the sorting is done using the caches, for example. That works really well when we are querying normally, because segments are immutable, which makes for great caching. But on a merge, not only have we just invalidated all of our caches, we now need to read, again, all of the data that we just wrote, so that we can use it. That can be… costly. And both things can introduce stalls into the system.

The major external problem with merges is that document ids change, and that means that you cannot rely on them. It would be much easier if you could send an id out into the world, get it back later, and do something with it, but that isn’t possible with Lucene.

Next, and not really an operational issue like the rest, Lucene’s multi-threaded behavior is… a hammer to an egg, in most cases. By that I mean code like this:

[Screenshot of Lucene’s multi-threading code, from the original post]

I mean, it is certainly functional, but it is pretty ugly.
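The original screenshot isn’t reproduced here, but the flavor is roughly this kind of lock-and-wait juggling (an illustrative C# sketch, not the actual Lucene source):

class FlushControl
{
    private readonly object syncLock = new object();
    private bool flushPending;

    public void DoFlush()
    {
        lock (syncLock)
        {
            // Spin in a wait loop until no other thread is flushing.
            while (flushPending)
                Monitor.Wait(syncLock);
            flushPending = true;
        }
        try
        {
            // ... the actual flush work ...
        }
        finally
        {
            lock (syncLock)
            {
                flushPending = false;
                Monitor.PulseAll(syncLock); // wake up anyone waiting
            }
        }
    }
}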

Now, don’t get me wrong, I think that Lucene is pretty neat. But there are some really dark corners there. For example, the actual searching: go ahead and try to find where that is done in Lucene. It is very easy to get lost between all of the different aspects: weights, sorters, queries and various enumerators. For fun, a lot of that runs at hard-to-predict times, which makes it interesting to try to figure out what actually happens when a query runs.

As a good example, let us take the simplest possible query, TermQuery. Go ahead, try to find where it is actually doing the query for matching terms in this code: https://github.com/apache/lucene/blob/LUCENE_2_1/src/java/org/apache/lucene/search/TermQuery.java

That actually happens here: https://github.com/apache/lucene/blob/LUCENE_2_1/src/java/org/apache/lucene/search/TermScorer.java#L79, and it is effectively a side effect of calling reader.termDocs(term), which limits the matches to documents containing that term. Trying to track down where exactly things happen can be… interesting.
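Stripped of the weights and scorers, what actually happens boils down to roughly this (a Lucene.NET-flavored sketch): asking the reader for TermDocs(term) hands back an iterator that is already positioned on the postings list for that one term, so only matching documents ever come out of it, and the scorer just walks it.

// The "query" really happens here: the iterator only ever yields documents
// that contain this exact term.
var termDocs = reader.TermDocs(new Term("Name", "oren"));
while (termDocs.Next())
{
    int docId = termDocs.Doc;  // a matching document
    int freq = termDocs.Freq;  // how many times the term appears in it
    // TermScorer turns (docId, freq) into a score; the matching itself was a
    // side effect of the TermDocs call above.
}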

Anyway, this post is getting too long, and I want to get back to figuring out how Lucene does its thing without dwelling too much in the dark…

Ayende RahienWhat Lucene does, a look under the hood Posted at 4/11/2014 9:00:00 AM

Lucene is a search engine library, which is great. But as it turns out, there is a lot going on there. After working with it for several years, I can say with confidence that it is a pretty awesome library. But surprisingly, a lot of the effort that went into it doesn’t seem to be talked about or visible to people who aren’t trawling through the code. I think that this is a pretty good testament to how successful it is. That, and the fact that it is now the baseline against which all other search libraries & engines are compared.

What I wanted to talk about today was the kind of things that Lucene is doing that doesn’t seem to get much publicity. I think that Spolsky said it best:

Back to that two page function. Yes, I know, it's just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I'll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn't have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.

I remember just how much of an impact that article made on me at the time. And Lucene’s codebase bears those words out. Lucene is a search engine library, which basically means that it does:

  • Indexing
  • Querying
  • Maintenance

One of the major areas of maturity in Lucene is how it has optimized indexing. You can see it in the code. For example, Lucene goes to a great deal of trouble to avoid allocating memory willy-nilly. Instead, pretty much everything there is done via object pools. This helps reduce memory pressure when doing a lot of indexing and can save a lot of GC cycles.

Another is the concept of multiple threads for indexing. A lot of Lucene is built around this idea; it has a lot of per-thread state that is meant to ensure that you don’t have to deal with concurrency yourself. The idea is that you can take an IndexWriter and write to it concurrently, then call commit. A lot of the work involved in indexing is CPU intensive, so that makes a lot of sense, and Lucene nicely isolates you from all of that work. There is a DocumentsWriterPerThread, so you can see really nice scaling effects as you throw more threads & hardware at the problem.
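A rough sketch of what that enables (Lucene.NET-style API; the directory, analyzer and documents collection are placeholders): several threads hammer the same IndexWriter, it keeps per-thread state internally, and a single commit at the end publishes everything.

var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

// IndexWriter is safe for concurrent AddDocument calls; the CPU heavy
// analysis work scales out across the threads.
Parallel.ForEach(documents, doc => writer.AddDocument(doc));

// One commit publishes everything the threads wrote.
writer.Commit();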

Usually, when people start messing with Lucene, they do so by writing analyzers, and there you are sort of exposed to the memory constraints by being encouraged to use ReusableTokenStream, etc. It also has a nice pipeline architecture for doing the indexing work with filters.
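A minimal custom analyzer, just to show the shape of that pipeline (Lucene.NET 3.x-style; the choice of tokenizer and filter here is only an example):

public class LowerCaseOnlyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // A tokenizer at the front, filters chained behind it - the pipeline
        // architecture mentioned above.
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
        return new LowerCaseFilter(stream);
    }
}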

On the querying side, Lucene does a lot of work to ensure that things just work. It has a Boolean Model for searches and a Vector Space Model for ranking. Writing your own Query classes is pretty easy too, once you understand how things work, and again, this is another common place for people to extend Lucene. But there is a lot going on behind the scenes. Lucene does a lot of caching on a per-segment basis, and it is quite nice; since segments are immutable, you can get pretty good mileage out of those caches.
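Composing the existing pieces is usually where people start; a small sketch (Lucene.NET 3.x-style, field names made up): the Boolean Model decides which documents match at all, and the Vector Space Model ranks whatever it let through.

var query = new BooleanQuery();
query.Add(new TermQuery(new Term("Name", "oren")), Occur.MUST);    // must match
query.Add(new TermQuery(new Term("Name", "eini")), Occur.SHOULD);  // improves the ranking if it matches

// tf-idf scoring ranks the documents the boolean model selected.
TopDocs results = searcher.Search(query, 10);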

That gives it a lot of its speed, and it means that over time things actually get faster, because more parts of the segments are in memory and cached.

Finally, we have all of the other work that Lucene does. In practice, that means things like merging segments (hopefully in the background) and keeping the overall system humming along. Unfortunately, that is also one of the most common places for people to start tinkering when they run into perf problems. It is anything but trivial, and optimizing it requires a lot of expertise and understanding about the specific scenario you have.

And on top of that, you have everything else that already works on top of Lucene. Which is quite a lot.

As I said earlier, that is a very impressive piece of technology. That doesn’t mean that it doesn’t have its own set of problems, but that is something that I’ll discuss in detail in my next post.

Ayende RahienSorting with Lucene Posted at 4/10/2014 9:00:00 AM

I talked about the Lucene formula and how it calculates things using tf-idf to find the best matches. Now I want to talk about the actual sorting implementation. As it turns out, the default sorting (by relevancy) is really simple. All you need is to get the relevant score for a query, then shove the results through a heap with a specified size. The heap takes care of maintaining the top results.
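The idea is simple enough to sketch in a few lines (an illustration of the approach only, using modern .NET’s PriorityQueue as a stand-in for Lucene’s internal hit queue): keep a bounded min-heap keyed by score, and any document that scores higher than the current minimum pushes the minimum out.

class TopScoresCollector
{
    // element = document id, priority = score; the minimum score sits at the top
    private readonly PriorityQueue<int, float> heap = new PriorityQueue<int, float>();
    private readonly int size;

    public TopScoresCollector(int size) { this.size = size; }

    public void Collect(int docId, float score)
    {
        if (heap.Count < size)
            heap.Enqueue(docId, score);                         // heap not full yet
        else if (heap.TryPeek(out _, out var minScore) && score > minScore)
            heap.EnqueueDequeue(docId, score);                  // new doc in, current minimum out
    }
}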

So far, that is pretty simple to understand. But the question is, how do you do sorting on a field value? The answer is, not easily.

[Screenshot of the string field sorting code, from the original post]
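Roughly, what the missing screenshot boils down to is this (a Lucene.NET-flavored sketch; the field name is made up, and the exact FieldCache entry point varies between versions):

// Load the string index for the field over the whole reader.
var index = FieldCache_Fields.DEFAULT.GetStringIndex(reader, "Name");

// Compare two documents by the ordinal of their field value.
Comparison<int> byName = (docA, docB) => index.order[docA] - index.order[docB];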

GetStringIndex() does something very interesting. It returns a string index, which gives us:

  • A string array containing all the distinct (sorted) values for this field.
  • An int array with an entry for every document in the index, holding the position of that document’s field value in the string array.

Now we can compare documents by the position of their value in that sorted array, which gives us pretty good sorting. Unfortunately, this also requires us to load all the values into memory. Let us see another example, which is probably easier to follow:

Sorting by an integer is done like this:

[Screenshot of the integer field sorting code, from the original post]
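Again, a rough Lucene.NET-style sketch of what the missing screenshot showed (field name made up): the FieldCache hands back one int per document, indexed by document id.

// One entry per document in the reader, loaded into memory in one go.
int[] ages = FieldCache_Fields.DEFAULT.GetInts(reader, "Age");

// The document id is the array index, so comparing two documents is just two
// array lookups.
Comparison<int> byAge = (docA, docB) => ages[docA].CompareTo(ages[docB]);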

We get an array (whose size matches the number of documents), and we can then sort things easily, because accessing the relevant field value only requires the document id to index into the array.

The reason Lucene does this is that it uses an inverted index, and it has no easy way of going from the field values to the list of documents that contain them. So it is easier to read all the values into memory and work with them there. I don’t like it, but offhand, I can’t think of a better way to handle this.