Wednesday, June 12, 2002

While glancing through the decompiled SortedList code I noticed there is a serious bug waiting to happen. When a key/value pair is added, a reference to the key object, not a copy or clone of it, is stored in the key array. This means that if a reference to the key object is also held outside the SortedList (or obtained via the GetKey method), the object can be modified, thereby corrupting the index. 

...

HashTable suffers from the same problem. In this case the documentation states: Key objects must be immutable as long as they are used as keys in the Hashtable.

An example of an immutable object would be an instance of String. Alternatively if a value type is passed in as the first parameter of Add, an implicitly boxed value is created which cannot be modified outside the SortedList instance. [Cook Computing]

Great stuff Charles! This is sure to lead to a lot of people staring at what looks like perfectly valid code for hours, tearing their hair out and wondering why they're experiencing such strange side effects. I believe there was a discussion about this very same topic on the DOTNET list late last year. While I don't think I'd call it a bug, it certainly is an oversight in the documentation of that class. The only hint the documentation gives you is:

A SortedList is a hybrid between a Hashtable and an Array.

The problem lies in the fact that the internals of a Hashtable use Object::GetHashCode, by default, to efficiently distribute the keys into hash buckets. You can also optionally supply an IHashCodeProvider which it will use to generate hash codes for the keys instead (a quick sketch of one follows the list below). The three basic properties of a hash function imply immutability:

A hash function must have the following properties:

  • If two objects of the same type represent the same value, the hash function must return the same constant value for either object.
  • For the best performance, a hash function should generate a random distribution for all input.
  • A hash function should be based on an immutable data member. The hash function should return exactly the same value regardless of any changes that are made to the object. Basing the hash function on a mutable data member can cause serious problems, including never being able to access that object in a hash table.
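Here's that IHashCodeProvider sketch I mentioned: a hypothetical case-insensitive provider (the framework actually ships CaseInsensitiveHashCodeProvider for this exact case, but this shows the shape of the interface):

<codeSnippet language="C#">
  using System.Collections;

  // Hashes string keys case-insensitively, reading only immutable state
  // (a string's characters never change), so it plays by the rules above.
  class CaseInsensitiveProvider : IHashCodeProvider
  {
    public int GetHashCode(object obj)
    {
      return ((string)obj).ToLower().GetHashCode();
    }
  }
</codeSnippet>

You'd pair it with a matching IComparer when constructing the table, e.g. new Hashtable(new CaseInsensitiveProvider(), new CaseInsensitiveComparer()).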

The last bullet point is where it's at. As Charles also points out, if you use a value type you'll be safe, because the data is copied as it's boxed to/unboxed from the heap, ensuring that you can never actually change the key instance.

Bottom line: make sure that the type used as a key in a Hashtable is immutable, or that the instance is simply not mutated while it's still a key in the Hashtable.
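To see the corruption in action, here's a minimal sketch (MutableKey is a hypothetical type invented for this demo) of how mutating a key after it's been added strands the entry in the wrong bucket:

<codeSnippet language="C#">
  using System;
  using System.Collections;

  // Hypothetical key type whose hash code depends on a mutable field.
  class MutableKey
  {
    public int Id;
    public MutableKey(int id) { Id = id; }
    public override int GetHashCode() { return Id; }
    public override bool Equals(object o)
    {
      MutableKey other = o as MutableKey;
      return other != null && other.Id == Id;
    }
  }

  class Demo
  {
    static void Main()
    {
      Hashtable table = new Hashtable();
      MutableKey key = new MutableKey(1);
      table.Add(key, "some value");

      key.Id = 2; // mutate the key *after* it's been added

      // The entry still exists, but it's filed under the old hash code,
      // so neither the old nor the new identity can find it.
      Console.WriteLine(table[key]);               // null
      Console.WriteLine(table[new MutableKey(1)]); // null
      Console.WriteLine(table.Count);              // still 1!
    }
  }
</codeSnippet>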

11:46:37 PM    

I just started going through my news and see that Justin also discovered the for/foreach compiler optimization on his own, after a little investigation, before I even got to post my follow-up.

...Will use the same type of looping instructions that a for loop does.  You can take a look at the for loop IL here and the foreach loop IL here.  The one thing to notice is that the IL generated by C# for "foreach" uses a local array instead of the argument that is passed in ("a" in this case).  I'm not sure what the rationale is for that.  But if you look at instructions at IL_0004 and IL_0005, you'll see where they are initializing the index to 0. [Justin Rudd's Radio Weblog]

Your test code is a little different from mine, but both illustrate the same strange compiler behavior. Either way, the compiler generates another local array variable, copies the original array reference into it, and loops over that. Strange, eh?

11:16:32 PM    

Follow-up on for vs. foreach on arrays

So I posted this quick blurb on foreach'ing vs. for'ing arrays earlier today. When I got home I opened an email from Fumiaki Yoshimatsu pointing out that the C# compiler was actually smart enough to detect that the variable myStructs was an array type and emitted IL (almost) equivalent to that of a handwritten for loop automagically on the developer's behalf. First off, thanks for pointing this out, Fumiaki! Next, I don't know when this happened, but I remember Eric Gunnerson discussing this compiler optimization and coulda sworn it wasn't going in until after v1. Obviously somewhere between beta2 and v1 they must have snuck it in there. I guess I just haven't looked at the scenario for a while.

I checked out the VB.NET compiler and it uses this optimization as well. However, both languages do something very strange: they create another local array variable of the same type, assign the array reference to it, and then loop over that reference as opposed to your own. This happens in both debug and release builds. I don't really see why they needed to do this, and I'm not sure whether the JIT will optimize that stack allocation and assignment away, but it's not the end of the world and it sure beats IEnumerable/IEnumerator.
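For the curious, here's roughly the C# equivalent of what the compiler emits for a foreach over an int[] (my reconstruction from the IL, so treat it as a sketch rather than gospel):

<codeSnippet language="C#">
  // foreach (int x in a) { ... } comes out looking something like:
  int[] local = a; // the mysterious extra local; one guess is that it keeps
                   // the loop stable even if "a" is reassigned mid-loop
  for(int index = 0; index < local.Length; index++)
  {
    int x = local[index];
    // ... loop body ...
  }
</codeSnippet>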

So in the end, using either a manual for or foreach in C#, and For or For Each in VB.NET, will emit IL that allows the JIT to optimize away bounds checking. For the sake of being thorough, click here to see the different approaches coded in C# and the IL that is emitted in a release build.

10:43:30 PM    

Sam pointed out that Dr. GUI has put up Part #6 of his .NET series: "Arrays in the .NET Framework".

He covers a lot of good stuff, but the most important thing I think he points out is how allocation differs between value types and reference types. He also mentions that Array implements IEnumerable so you can foreach/For Each over it; however, it's important to note that you can actually get better performance by for'ing the array manually, like so:

<codeSnippet language="C#">
  for(int index = 0; index < myArray.Length; index++)
  {
    object o = myArray[index];
    ...
  }
</codeSnippet>

Not only does this avoid the overhead of IEnumerable and IEnumerator, but it also takes advantage of a little-known Microsoft JIT trick which optimizes away bounds checking on the array. The only place I've ever seen the aforementioned JIT optimization documented is here in this article, under the section titled "Use For Loops for String Iteration—version 1". From the section:

"The JIT is smart enough (in many cases) to optimize away bounds-checking and other things inside a For loop, but is prohibited from doing this on foreach walks. The end result is that in version 1, a For loop on strings is up to five times faster than using foreach. This will change in future versions, but for version 1 this is a definite way to increase performance."

3:54:46 PM    

I was just reading over my last post (something I probably should have done before I posted it) about CMS and realized I really screwed the pooch on some of the tech-terms. I blame it on acronyms... there's just too damn many of them! ;)

I said:

"Well, I think what Justin is probably interested in keeping is the CMA half of Radio's CMS: the template and macro execution/expansion features, which constitute a major portion of Radio's functionality. The CDA half, I agree, is pretty basic."

But given the simple definition of CMA as content authoring tools and CDA as content generation and publishing tools, I should have said:

Well, I think what Justin is probably interested in is throwing out the CMA half of Radio's CMS and keeping the CDA half: the template and macro execution/expansion features. The publishing part of CDA, I agree, is pretty basic.

3:24:46 PM    

I don't want to replace Radio completely.  I want it to continue being my CMS.  I want it to continue publishing my blog.  What I want to replace is the news aggregator and editing tools. [Justin Rudd's Radio Weblog]

What else is there? There's no other content management to speak of, except blog & story editing. You drop files in a directory and it pushes them out for you? Bah, that's not CM. All I use Radio for is news and posting. If you replace news and posting, what's left? [The .NET Guy]

Well, I think what Justin is probably interested in keeping is the CMA half of Radio's CMS: the template and macro execution/expansion features, which constitute a major portion of Radio's functionality. The CDA half, I agree, is pretty basic.

In my dream weblog software, CMA is handled via XML templates which are processed using XSLT. When publishing content, the data is pushed through a pre-processor, not unlike that of ASP.NET, where all <%= %> occurrences are translated into xsl:value-of calls. Macros would be written using VSA and/or as external .NET components (registered somehow with the system) which are then passed as extension objects to an XslTransform for processing.
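To sketch how the macro half might look with today's framework classes (every name here, BlogMacros, template.xsl, post.xml, the urn, is made up for illustration):

<codeSnippet language="C#">
  using System;
  using System.Xml;
  using System.Xml.XPath;
  using System.Xml.Xsl;

  // Hypothetical macro component exposed to the stylesheet.
  public class BlogMacros
  {
    public string Now() { return DateTime.Now.ToString("r"); }
  }

  public class Publisher
  {
    public static void Main()
    {
      XslTransform transform = new XslTransform();
      transform.Load("template.xsl");

      // Register the macro object under a namespace the template binds to,
      // e.g. xmlns:m="urn:blog-macros" and then <xsl:value-of select="m:Now()"/>.
      XsltArgumentList args = new XsltArgumentList();
      args.AddExtensionObject("urn:blog-macros", new BlogMacros());

      XPathDocument content = new XPathDocument("post.xml");
      XmlTextWriter output = new XmlTextWriter("post.html", null);
      transform.Transform(content, args, output);
      output.Close();
    }
  }
</codeSnippet>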

11:27:59 AM