SCO Scoop.

When you want to know more
about the story
but don't know where to look.

IANAL. I am a paralegal, so if you have a legal problem
and want advice, this isn't the place. Hire an attorney
instead. Research is, however, what paras do, so here
I am sharing things I have found in my research.

Subscribe to "GROKLAW" in Radio UserLand.

Click to see the XML version of this web page.

Click here to send an email to the editor of this weblog. Dictionary

Friday, August 29, 2003

Small World Dept.-- SCO and MS
Have a Mutual Friend

There's more than one way to fund a company, it seems. An alert reader noticed that Integral Capital Management companies just filed a 13G with the SEC regarding its shares in SCO. Here's their SEC filing. Here's the SEC Search page for Integral. Here's a list of institutional owners of SCO, and you can see that as of June, Integral was already number one on the list with 4.0%. Now it's presumably 5% or more, hence the 13G.

Who are these people? I thought I'd just check and see who thinks SCO is worth buying right along about now. I was particularly interested because last Friday I noticed a surge in buying, and all week there was a spike of institutional sell messages, and then this Friday, it suddenly stopped. No sell or buy messages, as you can see on this chart. If you switch the chart to the entire month, you can easily see that after this huge and atypical increase, suddenly nothing. So I went digging.

At first, there didn't seem much to find. Yahoo lists Integral as one of the "Top Institutional Holders" of, the top holder, as a matter of fact. And you can read the stock purchase agreement the parties signed on Findlaw. Here's how they describe themselves in the SEC filing:

"This statement is being filed by Integral Capital Management V, LLC, a Delaware limited liability company ('ICM5'), ICP Management V, LLC, a Delaware limited liability company ('ICP Management 5') and Integral Capital Management VI, LLC, a Delaware limited liability company ('ICM6'). The principal business address of ICM5, ICP Management 5 and ICM6 is 3000 Sand Hill Road, Building 3, Suite 240, Menlo Park, California 94025.

"ICM5 is the general partner of Integral Capital Partners V, L.P., a Delaware limited partnership ('ICP5'). ICP Management 5 is the general partner of Integral Capital Partners V Side Fund, L.P. ("Side Fund") and the Manager of Integral Capital Partners V SLP Side Fund, LLC ('SLP Side Fund'). ICM6 is the general partner of Integral Capital Partners VI, L.P., a Delaware limited partnership ('ICP6'). With respect to ICM5, ICP Management 5 and ICM6, this statement relates only to ICM5's, ICP Management 5's and ICM6's indirect, beneficial ownership of shares of Common Stock of the Issuer (the 'Shares'). The Shares have been purchased by ICP5, Side Fund, SLP Side Fund and ICP6, and none of ICM5, ICP Management 5 or ICM6 directly or otherwise hold any Shares. Management of the business affairs of ICM5, ICP Management 5 and ICM6, including decisions respecting disposition and/or voting of the Shares, resides in a majority of the managers of ICM5, ICP Management 5 and ICM6, respectively, such that no single manager of ICM5, ICP Management 5, or ICM6 has voting and/or dispositive power of the Shares."

Intricate, no? Why, I asked myself, would they be buying now? Google didn't have much to show.

Just when I thought I'd hit a dead end, I started to read's 10Q, their most recent one, just filed this month. Because this is another company Integral has invested in heavily, maybe I'd find something there, I thought.

And lo and behold, guess who was just elected a director of Melinda French Gates. Yes, that Mrs. Gates. Well, that got my attention. I switched to some other search engines and really started digging, and I went to the SEC to see what I could find.

As of last December, Integral Capital Management V didn't own any SCO stock, according to this SEC filing. They did own Microsoft stock back in November. But they didn't the previous May of 2002. So the chain of investment timeline appears to go like this: First, they invested in, then Microsoft, and then in SCO.

Small world, isn't it? But why? A venture capital firm is investing in Microsoft? Doesn't it seem like it should be the other way around?

Here's a description of Integral on VC Directory:

"20 portfolio companies brought to successful IPO 1997 - 99. This venture capitalist was incubated within Kleiner Perkins Caufield & Byers in 1991. Integral's partnerships are five years in duration. Integral IV began in March of 1998 with over $300 million of capital and the company currently has over $1.2 billion in capital. In 1999, Integral Capital Partners co-founded and now operates as a managing principal of Silver Lake Partners, a $2.2 billion buyout fund focused on technology and related growth businesses.

"Focus: Integral invests in technology businesses at all stages of company development beyond the start-up phase. . . .

"Integral typically invests between 20% and 50% of the original capital of each fund in private companies and buyout opportunities. Integral is an active investor, both in the venture and public stages. Their greatest contribution typically relates to corporate strategy, business development opportunities, and maximizing market value (either through an IPO or merger). Integral does not take board seats. Integral Capital Partners 2750 Sand Hill Rd., Menlo Park, CA 94025 . . ."

Here's their home page.

Think this is just a coincidence? Could be, but take a look at this article from June 6, 2003, entitled "Best friends -- VC buddies do many of their deals with pals", which explains that in a tough economic market, VC capitalist firms like Integral do business primarily with their friends, as its Roger McNamee explains:

"'People's time is limited,' said McNamee. Quoting a favorite saying in the VC biz, he added: 'When you have a choice, always do business with your friends. When you don't have a choice, your friends are the only ones who will do business with you.'"

There goes the coincidence theory. Well, they appear to know each other, according to this Fast Company article linked to on Integral's web site:

"As early as 1997, he says, 'I realized that we had escaped the earth's gravitational pull -- that we were in the midst of a true mania. The next question was, What do you do after you crash and burn? You need a strategy for investing in a long-term bear market.' So, along with a set of superstar partners, McNamee assembled the first large-scale private-equity fund focused on tech. Silver Lake Partners, which launched in May 1999, raised $2.3 billion in a matter of months, attracting a who's who of Silicon Valley and Wall Street, including Bill Gates, Michael Dell, Larry Ellison, major investment banks, and big institutions such as CalPERS and the World Bank. In the four years since its launch, Silver Lake has invested approximately $1.6 billion of the fund in nine megatransactions -- which included involvement in a landmark $20 billion leveraged buyout of legendary disk-drive maker Seagate in late 2000 and that company's IPO two years later."

So there you have it, folks. And what's the plan? Here's their investment strategy page:

"The Internet drives all three of Integral's major information technology investment themes. Cheaper computing power, more bandwidth at lower prices, and server-centric software are leading to a world of information-rich Internet services that empower decision-makers with accurate, real-time information. By utilizing the exponential forces of Moore's law and Metcalfe's law, the Internet encourages rapid development of low-cost business relationships. The Internet also liberates emerging competitors by allowing them to undermine traditional information control, price discrimination, customer relationships, and distribution channels. The Internet is delivering both buyers and suppliers better information and lower cost transactions."

This seems to dovetail with SCO's new web services model, and it also ties in with my long-held belief that the plan was to dump the GPL off a cliff, write a new kernel for UNIX, which will also do Windows, and then steal all the open source software applications they can find and let you run them as binaries on top of their kernel. In a word, yuck. Brand X Linux, for which you will pay a pretty penny, my friend. It seems there is a plan, and it looks to me now like Microsoft may really be part of it.

comment [] 4:00:24 PM    

Data Mining, Spectral Analysis, and All that Jazz

We're not born knowing what spectral analysis is. So when SCO said that spectral analysis is one of the methods they used to find "infringing" code, I had no idea what they were talking about. When Sontag compared it to finding a needle in a pile of needles, I figured it wasn't much use. And it turns out, on further investigation, that my intuitive conclusion may be about right, at least when it comes to using it for software code data mining for infringing code in this case.

An alert reader noticed something interesting. One of Canopy Group's companies is called DataCrystal. Could that be at least one of the three groups SCO hired to try to sort through the code of UNIX SystemV and the Linux kernel? DataCrystal does "advanced pattern recognition" and "AI systems". They actually claim to do a great deal more besides. One of the things listed on their what-we-do page is data mining. Presumably, that's what SCO wanted to do. And a look at their About page indicates that if you are the RIAA, you probably would want to have a company like DataCrystal to hunt down pirates for you. Another reader noticed this page about a DataCrystal, and he wondered if it might be the same company.

It isn't, because this DataCrystal is the name of a project at USC, not a company. While I don't know if the Canopy Group company DataCrystal was hired by SCO, or whether there is any connection between the company and the project, it did make me start to wonder about the field in general. If you really wanted to know if two piles of code had identical or similar code, can data mining find out? And would matches be reliable for use in the way SCO apparently is using them? Judging by the SCOForum demo, we might think no. And we might be right.

I asked a Groklaw resource person, a man who worked for over a decade doing basic and exploratory research for the US DoD and the Canadian Ministry of Defence on topics related to secure communications and signals intelligence, including cryptology, statistical processing of natural language, signal processing, and computational learning, if he'd be willing to explain it in general and understandable terms, so we can follow along. Very likely this subject is going to be a very significant part of the case when it goes to trial. Here is what he explained to me:

"Data mining is looking for patterns or similarities in large quantities of information. Google is a good example of data mining-on-demand where the pattern is supplied by the user and the large quantity of information is the entire set of webpages on the internet. But data mining in general is potentially much broader. For example, a typical data-mining task might be to take a sample document and look for other documents in a database that might be similar to it. But even beyond that, data mining can be applied to other kinds of data -- pictures, for example, or sound recordings.

"There are lots of different ways to approach problems like this. Beyond the most elementary, what all the techniques have in common is that they rely on mathematical models and transformations of the data. Part of the reason is efficiency, since turning the problem into math usually means there's a computationally clever way to do it. Another part of the reason is that, by transforming the problem into math, you make it possible to find and grade a continuum of approximate matches -- in short, to find ranges of similarities rather than just identities. Note very well that 'similarity' here is completely dependent on the particular flavor of math you've chosen as your technique. This is extremely important.

"OK, so you've taken your document or picture or whatever, and you've mined your database for similar items. Those items will be graded for similarity to your original, just as some search engines will rate their returned items in terms of probable pertinence. The most sophisticated and respectable data-mining systems will be using grades based on probabilities. This is because the underlying math will be using probability models. Many times the grade will reflect not merely the strength of a match in terms of probability, but also the likelihood that such a match would be found at random searching any old data at all. This also is extremely important, since 'any old data at all' can be subject to a wide range of interpretations. (This could pertinent in the SCO case, since, if data-mining techniques are used, it's a reasonable question whether any contamination discovered this way is real, or whether it's spurious, i.e., capable of being found to the same degree in other, unrelated data.)

"Now the DataCrystal webpage consists mostly of a laundry list of any and all of the subjects ever associated with data mining, artificial intelligence, knowledge discovery, or machine learning. But the .pdf white papers all focus on using data-mining techniques for indexing and retrieving digital video and audio. What's more, they're offering not just indexing and retrieval services, but also housing, protecting, and distributing the data itself.

"It outlines an enhanced technique for expanding data-mining coverage. It's a technique for building patterns out of patterns and data mining on the derived metapatterns in turn."

Not being a rocket scientist, I wanted to be sure I'd understood, so I wrote back and asked these followup questions, and got this reply:

"Q: I have two questions to follow up:
" 1. . .the results would depend on how you programmed the software? In other can look for similarities, but it can't evaluate them?"

"ANS: Absolutely correct.

"Q:..there might be in actuality no common code at all?"

"You know how Google sometimes matches all the words in your query, but not necessarily conjointly or in the same order?

"In the case of computer code, especially code written in C expressing similar or common algorithms, it would be astounding if there weren't pattern similarities at some level. If nothing else such things are enforced by the design of the language and commonly-held notions about good coding style."

"Q: simply would have to be the case that some of the code is close enough that they might have a case?"

"ANS: Just the contrary. As with the 1st slide example, the ancestry of that memory-management code is known to virtually anybody who's studied C from Kernighan and Ritchie's book. A similarity like that would stand out like Devil's Tower, but what it indicates is exactly the opposite of what they contend: it shows that everybody knows the pattern."

"Q: And can they program the math to increase "matches"? Pls. explain a bit more this part."

"ANS: Here's an example. Suppose you came up with a hitherto-unknown page of blank verse. The question is, was it written by Shakespeare or not?

"Data mining your way through that problem, you'd get one level of certainty if your database contained the Bible, Goethe, Racine, Pushkin, and the New York Times. You'd get a different level of certainty if your database were confined to Elizabethan dramatists. The scores for putative Shakespeare against the mixed database would probably be huge just for matching any English. The scores against Elizabethan dramatists would probably be quite a bit weaker, but clearly more conclusive. The mixed-database test -- the one with the Bible, Goethe, etc. -- will probably say 'Shakespeare indeed!' but it's expressing the idea that 'if it's English it's Shakespeare.'

"On the other hand, the Elizabethan dramatist test might say yes, might say no, but the answer will be based on such things as a small number of very subtle differences between, say, Shakespeare's and Marlowe's vocabulary. It expresses perhaps the idea that 'in any 1000-line chunk of Shakespeare and any 1000-line chunk of Marlowe, Shakespeare is likely to use the word 'ope' once and Marlowe not at all. This example doesn't use 'ope' at all therefore it's probably Marlowe.

"You can see it's still a matter of interpretation and probability, but the second test is simply more credible on grounds that are external to the data-mining method itself.

"Here's another point of view. How does a data-mining search for SVr4 code look if you run it against all C programs? In all likelihood you're going to find some matches. Are the matches against Linux actually any stronger than matches against an arbitrary body of C code? Against other Unix-like kernels? etc.

"These are interpretive issues, but there are statistical grounds for deciding them, and speaking strictly for myself, I seriously doubt they've been fielded satisfactorily. For my money you couldn't even start taking the matter seriously unless exactly the same tests were run against every body of  other kernel code like all the BSDs, and a chunk of the SVr4 kernel against the rest of that same kernel. And even then, you've only generated the raw information to start the business of verifying and refining the procedure."

"Q: Also, what is spectral analysis? Is that what this is?"

"No. In general, spectral analysis refers to breaking things down into component frequencies -- sort of like how a prism breaks white light into colors, and so on.

"In this case it refers to using the periodicities of the individual characters of program text as frequencies to look for a very specific set of 'colors' associated with a particular swatch of program code. It's not determinative either. It may also refer to a kind of computational trick using spectral-based techniques to look for certain kinds of approximate matches very quickly."

So, there you have it. At least now we know in general what they are talking about. As the case goes forward, and more is revealed, no doubt it'll be interesting to meaningfully follow along.

His analogy to Google made it all come clearer to me. On top of all that he wrote, I know with Google, input affects output. And input means humans, imperfect humans. I certainly know that I get different results from Google if I plug in the identical search terms, but in a different order, for example. So I totally get how results could be skewed by what you tell the software to do. For example, I get different results if I search for "Dave Farber" and IP than I do if I search for IP and "Dave Farber", and it's different still if I search for just IP or just "Dave Farber" or just Farber or just Farber and IP. And that's using the same pile of data. Input affects output.

Obviously they would argue that their methods are so refined, blah blah. But that human element can't be removed, because humans write the software, no matter how sophisticated. So how reliable are the matches? You use Google. What do you think? Doesn't a human at some point have to interpret the value of the results?

"A continuum of approximate matches" does not infringement prove, on its face. As he says, it's an interpretive issue. And data mining seems to be a better match with something like matching amino acid strings than figuring out if someone stole somebody's code, which requires knowing who has or doesn't have a valid copyright, which way matching code travelled, who had the code first, etc.

If I've understood what my friend has written, it means that if SCO swapped out Linux and searched Windows 2000 code instead, it'd likely find instances that looked like "infringing code" also. That's the same as saying that so far, they are holding maybe nothing. It all reinforces in my mind that, once again, nothing has been proven to date by their claims of similarity, derivative or obfuscated code matches, and nothing can be proven using data mining techniques, until this case goes to trial and the experts speak, followed by a decision by a judge.

If you are interested, here is a white paper, "Text Mining -- Automated Analysis of Natural Language Texts" that explains the process of searching just for simple text, and while it does the explaining, it also shows just how much human input goes into structuring your search before you begin the search and why the results still may not be what you want. It is hard to see how such techniques could answer the question: "Is this infringing code?" At best, it could show you where to begin to investigate. And here are the DataCrystal project's white papers.

Oh, one other thing I found out in my investigation. Guess where most of the cutting-edge brains working on such data-mining techniques work? . . . No, really. Guess. . .

That's right: at IBM.

comment [] 1:45:21 AM    

Click here to visit the Radio UserLand website. © Copyright 2003 PJ.
Last update: 8/31/03; 1:42:43 AM. Creative Commons License
This work is licensed under a Creative Commons License.

August 2003
Sun Mon Tue Wed Thu Fri Sat
          1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
Jul   Sep