Friday, December 25, 2009

A Memory of Vic Chesnutt

After I graduated from high school, I got on a bus and went to Athens for two weeks, and had the best possible experience an 18-year-old music junkie could have had at the time. It's unbelievable but true. I met Michael Stipe (how exactly is a story in itself) and ended up spending almost a half day with him. We hung out in the R.E.M. office in downtown Athens while listening to Pylon rehearse in the basement. We went to his house and had a drink. We went to a cafe and later, before he dropped me off at the city park where I was camping, a bar. The next morning I got kicked out of the park and ended up camping uninvited on the roof of the R.E.M. office. The woman who ran the fan club noticed me through a window sleeping in the sun. I woke up with a bag next to me; it had a tube of sunscreen and some fruit. Turns out her boyfriend, Armistead Wellford, was the bass player in Love Tractor. She arranged for me to stay in a big old house where a lot of the band members lived. All of that was more than I imagined would happen when I got on the Greyhound bus in Fargo, North Dakota.

But in retrospect the most significant thing I saw there happened when Stipe and I were at the cafe. He introduced me to some people at a table -- two women I don't remember, and a man in a wheelchair, Vic Chesnutt. A cassette player was playing a recording of one of Vic's live performances. While we sat at the table and talked, Stipe told Chesnutt that he'd like to help him record a record. It was the summer of 1988. Two years later Texas Hotel Records released Chesnutt's Little, which Stipe produced.

Tuesday, December 01, 2009

Wiki terminology, like Wikipedia, not optimized for search.

In MediaWiki terms, including a page in another page is called transclusion, an accurate word, but not one that automatically comes to mind. I searched for "macros" and "include" and, after a few minutes, found a page that mentions transclusion.
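For the record, the syntax is simple once you know the word. A minimal sketch (the page names are invented):

```wikitext
{{SomeTemplate}}   <!-- transcludes Template:SomeTemplate -->
{{:SomePage}}      <!-- the leading colon transcludes a main-namespace page -->
```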

More papers to read

From a comment on the Natural Language Processing blog, classic papers in NLP:

  • [Bahl et al., 1983] L.R. Bahl, F. Jelinek and R.L. Mercer. "A Maximum Likelihood Approach to Continuous Speech Recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [Charniak, 1983] Eugene Charniak. Passing Markers: A Theory of Contextual Influence in Language Comprehension, Cognitive Science, 7, pp. 171-190.
  • [Charniak, 1973] Eugene Charniak. Jack and Janet in Search of a Theory of Knowledge. In Proceedings of the International Joint Conference on Artificial Intelligence (1973).
  • [Charniak, 1977] Eugene Charniak. Ms. Malaprop, A Language Comprehension Program. In Proceedings of the International Joint Conference on Artificial Intelligence (1977).
  • [Cohen et al. 1982] Philip R. Cohen, C. Raymond Perrault, and James F. Allen. Beyond Question Answering. Strategies for Natural Language Processing, pp. 245-274.
  • [Grosz, Joshi, and Weinstein, 1995]. Centering: A Framework for Modeling the Local Coherence of Discourse. Computational Linguistics, 21 (2), pp. 203-226.
  • [Grosz and Sidner, 1986]. Attention, Intention, and the Structure of Discourse. Computational Linguistics, 12 (3), pp. 175-204, 1986.
  • [Hobbs et al., 1993]. Interpretation as Abduction. Artificial Intelligence, vol 63. pp. 69-142.
  • [Hobbs, 1979] Jerry Hobbs. Coherence and Coreference, Cognitive Science 3(1), pp. 67-90.
  • [Hovy, 1988] Hovy, E.H. 1988. Planning Coherent Multisentential Text. Proceedings of 26th ACL Conference. Buffalo, NY.
  • [Karttunen, 1969] Lauri Karttunen. 1969. Pronouns and variables. In CLS 5: Proceedings of the Fifth Regional Meeting, pages 108-116, Chicago, Illinois. Chicago Linguistic Society.
  • [Kay, 1986] Martin Kay. Parsing in functional unification grammar.
  • [Lakoff & Johnson, 1980] George Lakoff and Mark Johnson. Metaphors We Live By, Chapters 1-4. (short - a total of 21 pages).
  • [Lehnert, 1981] Wendy G. Lehnert. Plot units and narrative summarization. Cognitive Science, Volume 5, Issue 4, October-December 1981, Pages 293-331
  • [Lehnert, 1977] Wendy Lehnert. Human and Computational Question Answering. Cognitive Science, Vol. 1, No. 1, pp. 47-73.
  • [Mann and Thompson, 1988]. Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8 (3), pp. 243-281, 1988.
  • [Martin et al., 1986] P. Martin, D. Appelt and F. Pereira. Transportability and generality in a natural-language interface system.
  • [McKeown 1986] Kathleen McKeown. Discourse strategies for generating natural-language text.
  • [Rosch and Mervis, 1975] Eleanor Rosch and Carolyn B. Mervis. Family Resemblances: Studies in the Internal Structure of Categories, Cognitive Psychology, 7, 573-605.
  • [Schank, 1986] Roger Schank. Language and memory.
  • [Schubert and Pelletier, 1986] L Schubert and F J Pelletier. From English to logic: context-free computation of "conventional" logical translations.
  • [Wilks, 1975] Yorick Wilks. An Intelligent Analyzer and Understander of English, CACM 1975.
  • [Woods, 1986] W.A. Woods. Semantics and quantification in natural language question answering.

Sunday, November 29, 2009

Independent Study: Concolic testing for web applications

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when a paper was published several years ago.)

This week, two papers about web applications from the International Symposium on Software Testing and Analysis '08:

Dynamic Test Input Generation for Web Applications

Here the authors use the concolic testing method pioneered in the seminal paper on Directed Automated Random Testing (PDF) to automatically generate tests for web applications written in PHP.
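As a refresher on the underlying method, here is a toy sketch of the concolic loop in Python. This is not the authors' system, just the general DART-style idea: run the program on a concrete input, record the branch conditions taken, then negate them one prefix at a time and solve for inputs that drive execution down new paths. The program under test and the brute-force "solver" are invented for illustration; real tools use an SMT solver.

```python
def program(x):
    # Toy program under test: returns which path was taken and records
    # the branch conditions (the "path constraints") along the way.
    constraints = []
    if x > 10:
        constraints.append(("x > 10", True))
        if x % 2 == 0:
            constraints.append(("x % 2 == 0", True))
            return "even-big", constraints
        constraints.append(("x % 2 == 0", False))
        return "odd-big", constraints
    constraints.append(("x > 10", False))
    return "small", constraints

def solve(constraints):
    # Brute-force stand-in for a constraint solver: find any input
    # satisfying every (predicate, expected-truth-value) pair.
    preds = {"x > 10": lambda x: x > 10, "x % 2 == 0": lambda x: x % 2 == 0}
    for cand in range(-100, 100):
        if all(preds[c](cand) == want for c, want in constraints):
            return cand
    return None

def concolic(seed):
    # Start from a concrete seed input; repeatedly negate the branch
    # condition at each prefix of the path to steer execution elsewhere.
    inputs, seen, paths = [seed], set(), []
    while inputs:
        x = inputs.pop()
        path, constraints = program(x)
        if path in seen:
            continue
        seen.add(path)
        paths.append(path)
        for i in range(len(constraints)):
            flipped = constraints[:i] + [(constraints[i][0], not constraints[i][1])]
            new_input = solve(flipped)
            if new_input is not None:
                inputs.append(new_input)
    return sorted(paths)

print(concolic(0))  # explores all three paths of the toy program
```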

The authors use a taint-tracking PHP runtime environment as the test oracle (the component that determines whether a failure occurs). The purpose of automatically generating tests is to automatically identify bugs in the program; narrowly, the type of bug the authors are trying to identify is the SQL injection vulnerability. To that end, they iteratively construct what they call an approximate backward slice of the PHP program by (loosely speaking):
  1. identifying statements where such vulnerabilities may cause undesired behavior (viz. database library calls, where injected SQL can ultimately do harm),
  2. adding the functions where such statements occur to a set of functions to be analyzed,
  3. executing the program by loading it in a browser,
  4. resolving control dependencies by recording the stack trace at the beginning,
  5. analyzing data dependencies, and
  6. repeating (with some variations) until all data dependencies are resolved.
The purpose of the preceding steps is to exclude from the analysis the parts of the program that won't help identify SQL vulnerabilities in the code.

Section 3 of the paper discusses the authors' algorithm for generating constraints for PHP.

Section 4 evaluates the system. Constraint generation is accomplished by a plugin to phc, a PHP compiler front-end. The plugin "wraps each statement in a function call"; the function call logs a trace of the program's execution to a file. They deal with eval by passing the string to be eval-ed through the plugin, so each statement in the eval-ed string is wrapped in a function call which logs a trace to the same file. Constraints are resolved by reading and symbolically executing the trace file. The result is "a list of Boolean control expressions where each subexpression is annotated with a concrete value from the execution."
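The wrap-and-replay scheme can be sketched in Python. This is not the authors' implementation, just an invented toy showing the shape of it: each "statement" logs itself with its concrete values, and a replay pass collects the branch entries as control expressions annotated with those values.

```python
# Invented toy sketch of trace logging and replay; names and the "program"
# are made up for illustration.
trace = []

def log(kind, expr, env):
    # Stand-in for the wrapper function the phc plugin inserts around
    # each statement; it records the statement and a snapshot of values.
    trace.append((kind, expr, dict(env)))

def run_program(user):
    # Toy "PHP program": one assignment and one branch.
    env = {"user": user}
    log("assign", "name = user", env)
    env["name"] = env["user"]
    log("branch", 'name == "admin"', env)
    return env["name"] == "admin"

run_program("guest")

# Replay: keep the control expressions, each annotated with the concrete
# values observed during the run.
constraints = [(expr, env) for kind, expr, env in trace if kind == "branch"]
print(constraints[0][0])          # the Boolean control expression
print(constraints[0][1]["name"])  # the concrete value it saw
```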

An interesting aspect of the authors' evaluation is the overhead of the tracing process. When they evaluated an entire PHP program, the trace file for loading a single web page was almost 3 GB and the page load timed out. So the iterative process of limiting the scope of their analysis mentioned above was necessary for obvious practical reasons.
Previous work on leveraging symbolic and runtime values for input test generation falls back on concrete values when the symbolic value being addressed lies outside the theory of the resolution algorithm’s decision procedure. Our constraint resolution algorithm generates constraints only based on one variable instance per value. Therefore it may underapproximate the symbolic values of variables when program predicates depend on multiple variables, and it may miss paths that other resolution algorithms would find. In principle our constraint resolution algorithm could be enhanced to include multivariate constraints in some cases, but we leave that to future work.
An object-oriented web test model for testing web applications, Kung, et al, may be interesting reading:
This paper describes an Object-Oriented test model that captures both structural and behavioral test artifacts of Web applications. The model represents the entities of Web applications as objects and describes their structures, relationships, and dynamic behaviors. Based on the test model, test methods are presented to derive test cases automatically for ensuring the structures and behaviors of Web applications.
Finding Bugs in Dynamic Web Applications

The previous paper focused on web application security. This paper focuses on web application reliability. Where the previous paper's goal was to identify vulnerabilities to SQL-injection attacks, this paper's goal is to identify bugs that cause web applications to crash or generate invalid HTML. (Web application crashes that can be triggered by user input become denial-of-service attack vulnerabilities once they become known to bad actors.) Similarly, where the test oracle of the previous paper is a PHP runtime environment that supports checking strings for taintedness (failure being defined as the use of an untainted string in an SQL statement), the test oracle of this paper is an HTML validator (failure being defined as the web application generating invalid HTML).

Testing whether a web application generates valid HTML is hard for dynamic web pages. Systems exist for validating dynamically-generated web pages, but they require the tester to create tests manually. Here the authors present a system, Apollo, for automatically generating tests for dynamic pages.

Something I don't understand. Here's a passage from the paper:
The HTML pages generated by a PHP applications may contain buttons that—when pressed by the user—result in the loading and execution of additional PHP source files. We simulate such user input by transforming the source code. Specifically, for each page h that contains N buttons, we add an additional input parameter p to the PHP program, whose values may range from 1 through N. Then, at the place where page p is generated, a switch statement is inserted that includes the appropriate PHP source file, depending on the value supplied for p. The steps of the user input simulator are fully mechanic, and the required modifications are minimal, but for the evaluation we performed the program transformation by hand (due to time constraints).
Normally, submit buttons result in an HTML form being POST'ed to the web application. From the context, it's not clear why the system wouldn't simply POST the form. An additional passage
The stand-alone component of the User Input Simulator performs a transformation of the program that models interactive user input by way of additional parameters.
Still a little confused. :-)

Ah, I get it: the New Account control is a plain button, not a submit button. Its onClick handler loads a different PHP file directly, so simply POSTing the form wouldn't exercise that code path:

<?php echo "<h2>WebChess ".$Version." Login</h2>"; ?>
<form method="post" action="mainmenu.php">
Nick: <input name="txtNick" type="text" size="15"/><br/>
Password: <input name="pwdPassword" type="password" size="15"/>
<input name="login" value="login" type="submit"/>
<input name="newAccount" value="New Account"
type="button" onClick="window.open('newuser.php', '_self')"/>
</form>
Nothing else very interesting here. They evaluate their system, present the results, related work, etc.

Other reading:

Improving test case generation for web applications using automated interface discovery, Halfond, et al, focuses on JavaScript.

Thursday, November 26, 2009

Notes for Information Retrieval Quiz #3

Some things to review before the Information Retrieval quiz:

Lecture 18, Collaborative Filtering and Recommender Systems

Pearson correlation
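A quick refresher sketch in Python of the Pearson correlation as used for user-user similarity in collaborative filtering; the ratings are made up:

```python
from math import sqrt

def pearson(a, b):
    # Correlate two users' ratings over the items they have both rated.
    common = [i for i in a if i in b]
    n = len(common)
    if n == 0:
        return 0.0
    mean_a = sum(a[i] for i in common) / n
    mean_b = sum(b[i] for i in common) / n
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 3}     # same preferences, shifted down
carol = {"m1": 1, "m2": 5, "m3": 2}   # roughly opposite tastes

print(round(pearson(alice, bob), 3))  # 1.0 -- perfectly correlated
print(pearson(alice, carol) < 0)      # True -- negatively correlated
```

Note that Pearson correlation is insensitive to each user's rating scale and offset, which is why a harsh grader and a generous grader with the same taste still come out perfectly correlated.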

Lecture 19, Information Extraction

Named entity recognition: find and classify (i.e. determine the category of) all the named entities in a text. Two approaches to named entity recognition:

Rule-based (regular expressions)
  • Lists of names
  • Patterns to match things that look like names
  • Patterns to match the environments that classes of names tend to occur in
Supervised machine learning
  • Get annotated training data
  • Extract features
  • Train systems to replicate the annotation
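A toy illustration in Python of the rule-based style listed above; the gazetteer and the patterns are invented:

```python
import re

# Invented gazetteer of organization names (the "lists of names" item).
known_orgs = {"Google", "IBM"}

# Invented environment pattern: a capitalized word following a personal
# title tends to be a person's name.
title_pattern = re.compile(r"\b(?:Mr\.|Dr\.|Prof\.)\s+([A-Z][a-z]+)")

def find_entities(text):
    entities = []
    # List lookup: exact matches against the gazetteer.
    for org in known_orgs:
        if org in text:
            entities.append((org, "ORG"))
    # Environment pattern: classify by the context the name occurs in.
    for m in title_pattern.finditer(text):
        entities.append((m.group(1), "PER"))
    return entities

print(find_entities("Dr. Smith left IBM last year."))
```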
Relation analysis consists of two tasks:
  1. determine if two entities are related
  2. if they are, classify the relation

Features in relation analysis (for each of the above tasks) are:

  1. Features of the named entities involved (their types, the concatenation of their types, and the headwords of the entities)
  2. Features derived from the words between and around the named entities (+- 1, 2, 3; bag of words between)
  3. Features derived from the syntactic environment that governs the two entities (constituent path through the tree from one entity to the other; base syntactic chunk sequence from one to the other; dependency path)
Template filling
  1. Rules and cascades of rules
  2. Supervised ML as sequence labeling
    1. One sequence classifier per slot
    2. One big sequence classifier
Lecture 20, Sentiment Analysis

Classification in sentiment analysis
  • Coarse-grained classification of sentiment
    • Document-level classification according to some simple (usually binary) scheme
      • Political bias
      • Likes/hates
  • Fine-grained classification of sentiment-bearing mentions in a text
    • Positive/negative classification of opinions about entities mentioned in a text
    • Perhaps with intensity
Choosing a vocabulary
  • Essentially feature selection
  • Previous examples used all words
  • Can we do better by focusing on a subset of words?
  • How to find words, phrases, patterns that express sentiment or polarity?
  • Adjectives
    • positive: honest important mature large patient
    • negative: harmful hypocritical inefficient insecure
  • Verbs
    • positive: praise, love
    • negative: blame, criticize
  • Nouns
    • positive: pleasure, enjoyment
    • negative: pain, criticism
Lecture 21, Sentiment Analysis (cont.)

Identifying polarity words
  • Assume that generating exhaustive lists of polarity words is too hard
  • Assume contexts are coherent with respect to polarity
  • Fair and legitimate, corrupt and brutal
  • But not: fair and brutal, corrupt and legitimate
  • Example:
    • Extract all adjectives with > 20 frequency from WSJ corpus
    • Label them for polarity
    • Extract all conjoined adjectives
    • A supervised learning algorithm builds a graph of adjectives linked by the same or different semantic orientation
    • A clustering algorithm partitions the adjectives into two subsets
  • Mixed sentiment: The steering is accurate but feels somewhat anesthetized.
  • Sentiment inverters: ... never seen any RWD cars can handle well on snow even
    just few inches.
  • Anaphora and meronymy:
    • It's a great car for just about anything. The mkVI is pretty
      much a mkv but ironing out all the small problems.
    • Hey is the back seat comfortable? In my MkV it feels like
      you're sitting on a vat of acid.
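The conjunction idea above can be sketched in a few lines of Python. This is a toy version of the method, with an invented "corpus" of conjoined adjective pairs: "and"-links propagate the same polarity label, "but"-links flip it.

```python
# Invented conjoined-adjective observations, as if extracted from a corpus.
conjunctions = [
    ("fair", "and", "legitimate"),
    ("corrupt", "and", "brutal"),
    ("fair", "but", "brutal"),
    ("legitimate", "but", "corrupt"),
]

def cluster(conjunctions, seed=("fair", "+")):
    # Propagate polarity labels over the conjunction graph until fixpoint:
    # "and" edges keep the label, "but" edges invert it.
    labels = {seed[0]: seed[1]}
    changed = True
    while changed:
        changed = False
        for a, conj, b in conjunctions:
            for x, y in ((a, b), (b, a)):
                if x in labels and y not in labels:
                    if conj == "and":
                        labels[y] = labels[x]
                    else:
                        labels[y] = "-" if labels[x] == "+" else "+"
                    changed = True
    return labels

print(cluster(conjunctions))
```

The real method labels a seed set by hand and uses a clustering step rather than naive propagation, but the partition into two polarity subsets is the same idea.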

Sunday, November 22, 2009

Independent Study: Concepts and Experiments in Computational Reflection

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when a paper was published several years ago. I've written most of these posts in the form of a review of a conference paper, but I'm going to use a free-form style this time.)

Pattie Maes's wonderful paper on reflection is a perfect follow-on to my reading about traits. My motivation for reading the traits paper was that Perl's Moose (Perl 6, too!) has traits (it calls them roles). I've been using Moose a lot at work and I wanted to catch up with the research behind traits to prepare for a short talk I gave about it to my colleagues. After the talk, a colleague who was at Bell Labs in the 80's mentioned a paper about reflection by Pattie Maes at OOPSLA in '87 which led to the MetaObject Protocol in the Common Lisp Object System (CLOS), which in part inspired Moose, so Maes's paper closes the loop, ices the cake, etc.

Many programmers know of reflection from the java.lang.reflect package in the Java API. Suffice it to say for the moment that Maes' reflection is expansive and Java's reflection is by comparison quite limited. However, reflection in Java does serve as a good starting point for understanding Maes. What the Java API does provide is programmatic access to information about objects at runtime by exposing an interface to what Maes calls the "self-representation of the system ... which makes it possible for the system to answer questions about itself and support actions on itself." To wit, java.lang.reflect allows the Java programmer to find out which methods or constructors, etc., exist for a given object, and to invoke them.

However, Maes goes a step further (or, more appropriately, Java didn't go as far as Maes imagined, for better or worse) by asserting that "a computational system can actually bring modifications to itself by virtue of its own computation." The modifications Maes envisions are pervasive. In her experimental object-oriented language, 3-KRS, reflection on an object occurs by way of a meta-object associated with the object. (The motivation for performing reflection on an object through a separate entity, its meta-object, may not have been obvious before Maes' paper, but in retrospect it's a clear case of separation of concerns.) Since everything is an object in a pure object-oriented language, meta-objects are everywhere:

[T]he self-representation of an object-oriented system is uniform. Every entity in a 3-KRS system is an object: instances, classes, slots, methods, meta-objects, messages, etc. Consequently every aspect of a 3-KRS system can be reflected upon. All these objects have meta-objects which represent the self-representation corresponding to that object.
For the uninitiated (including me), a slot is, loosely, an instance variable. And I suspect that the difference between a method and message is that a method defines a method (pardon the circularity) and a message "calls" a method. (The notion of a message likely goes back to Smalltalk.)

Further, meta-objects are manipulable at runtime, so — to borrow an example from Maes — a language that supports reflection in all its glory allows the programmer to modify meta-objects to provide support for multiple inheritance. Intuitively, a complete reflective system is an API for the semantics of a programming language. (See also: "A metaobject protocol (MOP) is an interpreter of the semantics of a program that is open and extensible.")

Two aspects of Maes's reflection which are not true for popular statically-typed languages, then, are:
  • meta-objects are pervasive
  • meta-objects are mutable
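Python, as it happens, comes closer to Maes's picture than Java does: classes are themselves objects (instances of their metaclass) that can be inspected and mutated at runtime. A small illustration, with invented names:

```python
class Greeter:
    def greet(self):
        return "hello"

g = Greeter()

# Inspection, roughly what java.lang.reflect offers:
print(type(g).__name__)                      # Greeter
print([m for m in dir(g) if m == "greet"])   # ['greet']

# Mutation, which Java's reflection does not offer: replace a method on
# the live class, and existing instances pick up the change. This is the
# "trace execution without modifying the program text" use Maes mentions.
def traced_greet(self):
    print("calling greet")  # tracing inserted from outside the program
    return "hello"

Greeter.greet = traced_greet
print(g.greet())  # prints the trace line, then returns "hello"
```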
For languages like Java and C#, these limitations are deliberate. Enterprises of all kinds are often hard-pressed to find programmers who can understand the source code of their larger applications. Some managers are left in the lurch when a key employee departs. Self-modifying code, one might say, is job security. So clarity, explicitness, consistency, readability — these language properties are desirable for production software, and a complete reflective system violates them. The flip side of this is expressiveness. Domain-specific languages (DSLs) come to mind. Being able to define little languages to solve a particular problem, and to compose larger applications from modules written in little languages, is an attractive idea. I'd like to say more about DSLs and how they relate to reflective systems, but I don't know much more than that they've been seeping into programming culture for quite some time.

Incidentally, Aspect-Oriented Programming (AOP) exists to address the fact that cross-cutting concerns — any application requirements that cause code to be scattered throughout a code base (e.g. logging, security) — violate modularity. However, it also bridges the gap between the limited form of reflection that exists in Java and some of the more interesting uses of reflection Maes mentions — such as being able to trace the execution of a program (with e.g. print statements) without modifying the program itself. This is no coincidence. Gregor Kiczales, the author of a book about the MetaObject Protocol, which was inspired by Maes' paper, is a coauthor of the earliest paper on AOP. In a sense, implementations of AOP attempt to augment a language with a meta-object protocol without changing the language itself.

Thursday, November 12, 2009

A Little History of Electronic SEC Filings

In 1986, the Securities and Exchange Commission started to accept SEC filings — 10-K forms and the like — electronically. Between '86 and '92, only a handful of companies filed their 10-K electronically. The companies?

  • Medical Monitors, Inc.
  • Medical Monitors, Inc.
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Jilco Industries, Inc.
  • Whitney American Corp.
  • Filmagic Entertainment Corp.
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1989-2
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1989-3
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1989-5
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Jilco Industries, Inc.
  • Filmagic Entertainment Corp.
  • Xanthic Enterprises, Inc.
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1988-1
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1988-2
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Jilco Industries, Inc.
  • Filmagic Entertainment Corp.
  • Xanthic Enterprises, Inc.
  • Admiral Financial Corp.
  • Quad Metals Corp.
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1988-4
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Jilco Industries, Inc.
  • Filmagic Entertainment Corp.
  • Xanthic Enterprises, Inc.
  • Admiral Financial Corp.
  • Quad Metals Corp.
  • American Housing Partners
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1992-3
While I get the obvious kick out of Fast Eddie Racing Stables, Inc. — that it's publicly traded and was an early adopter of electronic SEC filing — my hunch is that these early companies are there not necessarily because of who runs them (although it's certainly possible that Medical Monitors, Inc. has a tech-savvy CEO) but more likely because '86-'92 was a period of controlled introduction. In the following years, the number of electronically-filed 10-Ks was:
Year # of filings
1993 1305
1994 1249
1995 3460
1996 6482
I suspect that the SEC opened the floodgates only partially in '93-'94, a bit more in '95, and completely in '96. On the other hand, this was the era during which the Internet was really taking off, so it's hard to distinguish between what the SEC mandated and what it merely allowed.

If you're wondering what those First Boston Mortgage things are, they're probably asset-backed securities, which are by far the largest class of securities regulated by the SEC. Commercial state banks are up there, too. But it's interesting that the largest group of securities aren't even operating companies — they're just a publicly-traded piece of paper that represents a share of ownership in some asset.

Sunday, November 08, 2009

Modeling the World Wide Web

Just so I don't forget, some papers by Filippo Menczer, who appears to be doing work related to an idea I've been mulling over for a while.

Informally, consider the World Wide Web as a graph with web pages as nodes and hyperlinks as edges, label each node with some value derived from the contents of the web page (e.g. the length of the page, the set of terms in the page, or the term vector for the page), then define the value of an edge as the difference between the nodes it connects. This basically yields a geometric representation of the web graph by mapping it into some metric space. (If the idea is still fuzzy, imagine a hyperlink as a function from the document that contains it to the document to which it points.)
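A minimal sketch of the idea in Python: label each page (node) with a term vector and define each hyperlink's (edge's) value as the distance between the vectors it connects. The pages and link structure here are made up.

```python
from collections import Counter
from math import sqrt

# Invented toy "web": page contents and hyperlinks.
pages = {
    "home": "welcome to my site about music and the web",
    "music": "music reviews and notes about albums and bands",
    "code": "notes on compilers type systems and program analysis",
}
links = [("home", "music"), ("home", "code")]

def cosine_distance(a, b):
    # Distance between two pages' term vectors (1 - cosine similarity).
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return 1 - dot / (na * nb)

# The geometric labeling of the web graph: an edge's value is the
# distance between the pages it connects.
edge_values = {(s, t): cosine_distance(pages[s], pages[t]) for s, t in links}

# home and music share more vocabulary, so that edge is "shorter".
print(edge_values[("home", "music")] < edge_values[("home", "code")])  # True
```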

Now that I've found some already-published work on this model, I'll have to spend next semester learning what people have already done so I can do something new.

Friday, November 06, 2009

Independent Study: Traits: Composable Units of Behaviour

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when a paper was published several years ago.)

Submission: Traits: Composable Units of Behaviour [PDF]

Please give a brief, 2-3 sentence summary of the main ideas of this paper:

If you start with the reasonable assumption that code reuse improves programmer productivity, an important question is how to increase code reuse. Historically, inheritance — both single and multiple — has been a mechanism for code reuse, in as much as it has allowed classes to be composed at least in part from other classes. Mixins solve some of the problems of single and multiple inheritance, but they don't work well when a class uses two or more mixins containing identically-named methods. Traits solve the same problems as mixins, but don't suffer from their limitations.
What is the strength of this paper (1-3 sentences):
Traits solve the problems of previous attempts to facilitate code reuse. Further, they suggest a style of designing software — as collections of traits (which ideally define only a single method) instead of collections of classes — that may be useful in its own right.
What is the weakness of this paper (1-3 sentences):
None noted.
Novelty:
The authors note that traits are inspired by mixins, and in one (arguably incorrect) sense, traits are merely an incremental improvement on mixins; however, the deficiencies the authors identify in mixins are real (and especially important for large, complex applications) and require solving, so the fact that traits aren't light years ahead of mixins is irrelevant, as the improvements traits provide are necessary.
Yes. The refactoring example helps the authors make a strong case for traits.
Worth solving:
See my response about novelty above.
I'm confident in the material in this paper.
Detailed comments:

The notion that code reuse improves programmer productivity is non-controversial. A library of well-tested and widely-used classes in a language's ecosystem provides not only implementations of common functionality (thereby relieving programmers of having to implement and test that functionality themselves) but also (because classes and methods are named) a common language for communication among programmers — both of which reduce "friction" during software development.

A problem for API designers is that code reuse in most common object-oriented languages is done at the level of the class definition — a client class T of some API class U reuses U either by extending U or by otherwise referring to U (e.g. statically, as the type of a local variable, as the type of a member variable). This can lead to an undesirable decoupling of generic functionality, making it hard for programmers to discover reusable code. For example, in Java, to sort an object of the type of a class which implements java.util.List, one needs to know to use the sort method of the Collections class. Acquiring this knowledge isn't necessarily onerous for a programmer, but it likely leads novice programmers to implement their own sorting methods. A language with mixins or traits does not have this problem. In such a language, a class C that can be sorted uses the mixin or trait that provides a generic sorting routine (as long as C provides whatever methods the mixin or trait requires [e.g. something akin to the compareTo of Java's Comparable]), and discovering that C can be sorted is a matter of invoking an IDE's method autocomplete feature or, absent IDE support, a cursory examination of the definition of C. This is similar to the problem of jungloid navigation, where a programmer knows what kind of object she wants, but doesn't know how to get it, except in this case a programmer knows what she wants to do with an object (i.e. sort a list), but doesn't know how to do it (i.e. call Collections.sort).
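The discoverability point can be sketched in Python, with a mixin standing in for a trait (all names invented): SortableMixin provides sorting generically and requires only that the using class define key(), playing the part of compareTo.

```python
class SortableMixin:
    def sorted_items(self):
        # Generic behavior provided by the mixin; relies on the using
        # class to supply key() (the "required method" of a trait).
        return sorted(self.items, key=self.key)

class Playlist(SortableMixin):
    def __init__(self, items):
        self.items = items

    def key(self, item):
        # Case-insensitive ordering, supplied by the using class.
        return item.lower()

p = Playlist(["Banana", "apple", "Cherry"])
print(p.sorted_items())  # ['apple', 'Banana', 'Cherry']
```

Note that Python mixins still suffer the composition problems the traits paper identifies (two mixins defining the same method name are resolved silently by linearization order); traits make such conflicts explicit.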

A little history on multiple inheritance, interfaces, and mixins. So far as I've been able to find, interfaces first appeared in Modula-2 (1978), where they're called definition modules. From the Modula-3 report (I wasn't able to find one for Modula-2): "An interface is a group of declarations. Declarations in interfaces are the same as in blocks, except that any variable initializations must be constant and procedure declarations must specify only the signature, not the body." Modula-3 (1980s) also had multiple inheritance. Mixins first appeared in an OO extension to LISP (called Flavors [circa 1980]). Flavors fed into the Common Lisp Object System, where the concept of a Meta-Object Protocol was first implemented. Perl's Moose is built atop Class::MOP, which was inspired by the CLOS MOP.

Sunday, November 01, 2009

It's Expensive Being Rich

When you open a web page in a browser, the browser loads the page and all other resources to which the page refers. Some of those resources are files containing JavaScript code, and those files keep getting larger. Some of them can be large enough to noticeably delay the complete loading of the web page, which is a real problem for web site operators. Their dilemma is that users demand features, and JavaScript is a way to provide lots of features, but users also demand that pages load quickly, and adding more and more JavaScript increases the time it takes a page to load. Without changing the underlying technology, it's akin to a zero-sum game, or a game of Whack-a-Mole, or whatever notion you prefer for identifying a situation like this. The two requirements — provide a rich end-user experience, provide it quickly — are to some degree at odds with one another.

Recently James Hamilton pointed out a cool research project which transformed JavaScript source files into a form that allows the source code to be loaded only when needed. Another approach to dealing with this problem — an approach that is complementary to the approach mentioned by Hamilton — is to modify the HTML script element to indicate whether the JavaScript source file needs to be loaded immediately or can be loaded "lazily." This approach exists in the script element section of the HTML 5 draft specification.
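The attributes in question are presumably defer and async on the script element. A minimal sketch (file names invented): defer delays execution until the document has been parsed, and async lets the script download without blocking parsing and run as soon as it arrives.

```html
<script src="app.js" defer></script>
<script src="analytics.js" async></script>
```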

Meditations on Meat

Exhibit #1: "My childhood was typical, summers in Rangoon, luge lessons. In the spring we'd make meat helmets."

Exhibit #2: Turducken

Exhibit #3: Report: Meat Now America's No. 2 Condiment


Avoid the Lippis Report

My last job was in the telecommunications sector and I had an interest in keeping abreast of developments in the industry. At the time some people at work were subscribed to Nick Lippis's mailing list, so I subscribed too. Gotta know what's known by those in the know. Now I work in a different business, but I'm finding that the cost of unsubscribing from the list is torture (whether it's eternal torture, I don't know yet, as I've still not been able to unsubscribe).

First thing I tried was clicking the "unsubscribe" link at the bottom of his emails. All that did was load the URL. Here's a screenshot:

The next logical thing to try was the instructions right above the unsubscribe link, which read:

Login to your Lippis account at with username "ndronen," or delete your account by clicking the link below.
Since it's been quite a while since I subscribed, I didn't remember my password, so I opted to have the site change my password. A new password arrived shortly thereafter, but logging in with it yielded no joy. Another screenshot:

So I'm stuck — can't unsubscribe with the unsubscribe link, can't login and unsubscribe — so I can only resort to emailing the site directly. Thanks for wasting my time, Mr. Lippis.

Incidentally, the message "You do not have sufficient permissions to access this page" does not mean that I entered an invalid password. As you can see here, entering an invalid password causes an entirely different message.

Update (9:49 MST): It gets better. I sent email to the contact address listed in the "Lippis Report" email, and it bounced.

So now I'm sending email using the web form on the site.

Update (8:41 MST 11/02/2009): I've received a gracious email from Nick Lippis. They've removed me from the mailing list and are looking into the problem with the unsubscribe process.

Friday, October 30, 2009

Notes on Content Security Policy

It seems I learn better when I write things down, so I'm taking notes as I read the spec for Content Security Policy.

  • It's opt-in on a per-site basis.
  • It is initially activated in the browser by the presence of an X-Content-Security-Policy header field in an HTTP response. The value of the header field must either contain a policy specification or a policy-uri directive denoting the URI from which the browser should fetch the policy.
  • The header field must not appear in the trailer headers (i.e. it must be at the top of the HTTP response). I surmise the reason for this constraint is that existing browsers may evaluate inline JavaScript as soon as they can, so if the X-Content-Security-Policy field arrives in the trailer, it's too late.
  • There are two URI types in CSP: policy-uri and report-uri. The former defines a URI from which a security policy must be fetched. The latter defines a URI to which violations of the policy should be reported (using e.g. an HTTP POST).
  • This is interesting. If there's more than one X-Content-Security-Policy in a response, the browser complies with the intersection of the policies.
  • If there's more than one report-uri, the browser reports violations to each unique URI — duplicate URIs receive only a single report.
  • A policy-uri or report-uri is only legal if it complies with the conventional same-origin policy — that is, if the URI refers to the same scheme/host/port as the page itself.
  • Inline JavaScript won't execute when CSP is enabled. The presence of inline JavaScript in a page for which CSP is in effect is a violation and causes a report to be sent to each report-uri.
  • Eval and any other mechanism for creating code from data (e.g. new Function("i'm evil code masquerading as data")) are not allowed to execute. They trigger a report to the report-uri, too.
  • CSP has options for stating different sources for different content types (e.g. img-src for images, media-src for audio/video, script-src for JavaScript, object-src for applets and the like, frame-src for frame and iframe elements, font-src for fonts, xhr-src for XMLHttpRequest, style-src for stylesheets).
The spec also contains examples of policy definitions.
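For concreteness, here's roughly what such a header might look like, pieced together from the directives noted above (the host name and report path are hypothetical; check the draft for the exact grammar):

```http
X-Content-Security-Policy: allow 'self'; img-src *; script-src trustedscripts.example.com; report-uri /csp-violation-report
```

With this policy in effect, scripts load only from trustedscripts.example.com, images from anywhere, everything else from the page's own origin, and violations get POSTed to /csp-violation-report.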

Friday, October 16, 2009

Independent Study: Web Application Security

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when the paper was published at least several years ago.)

Last week and the week before, I read papers which analyzed and proposed solutions for injection attacks in dynamic languages. This week, instead of reading a paper, I'm digging around for security trends related to JavaScript in the browser and dynamic languages on the server. Not much seems to have changed since the Spectator paper was published. There are still JavaScript worms, and the attackers are still using fancy tricks to subvert the filters of the web site operators. Compare, for instance, these technical descriptions of the original MySpace worm and the quite recent Reddit worm. So all-in-all, not much is new. However, reading this report from the Web Application Security Consortium, I did run across what is, to me, a new kind of attack — HTTP response splitting — which may warrant further investigation. I suspect existing taint-mode techniques can be appropriately applied to HTTP response splitting, but it would be worthwhile to verify.


For Dynamic Languages

Douglas Crockford's slides on JavaScript security
Ruby on Rails Security Project
Python Security Advisories

General Internet Security

Web Application Security Consortium
Security Focus
Common Vulnerabilities and Exposures
SANS Storm Center

To get a flavor of US-CERT data, here are 2009 current activity reports:

Web Application Security Consortium feed
Security Focus feeds
Common Vulnerabilities and Exposures
SANS Internet Storm Center
SANS: @RISK: The Consensus Security Vulnerability Alert

File Under Very Useful

The Web Application Security Consortium has statistics (which appear to be actively maintained) on website vulnerabilities. The WASC describes the data as the result of "a collaborative industry wide effort to pool together sanitized website vulnerability data and to gain a better understanding about the web application vulnerability landscape."

For tracking web application (i.e. web app framework and web browser) security vulnerabilities, SANS's @RISK: The Consensus Security Vulnerability Alert seems to be quite useful. It compiles reports from a number of commercial security sources. Here's an example of the web application section of a recent report:
Web Application - Cross-Site Scripting
Web Application - SQL Injection
Web Application


The advice everyone who knows anything gives to anyone who wants to be safer online is to use NoScript. I use it. You should too. But it's only for Firefox, not all the other browsers out there, and discretionary plugins only get adopted so far. That leaves a whole lot of browsers (many of which have unpatched vulnerabilities) running on the desktops and laptops of the world. Further, most Internet users aren't sophisticated enough to know when they need to enable JavaScript, and since there's not a WWW cop to enforce the unobtrusive use of JavaScript, it's just easier for people to allow JavaScript from every site on the web, which gets us back to square one.

A few weeks ago I reviewed a system for detecting and containing JavaScript worms which mentions the MySpace JavaScript worm. Here are some more recent incidents. The Reddit incident was only a few weeks ago.
Somebody's made a javascript worm
source code for the reddit/firefox [sic] exploit

JavaScript worm from late 2007 happily frolicking in 2008
JavaScript worm still spreading, infection origin unknown

More on Orkut worm

JavaScript worm targets Yahoo!

I'm Popular
Technical explanation of the MySpace Worm
Buffer Overflows, Oh My!

Because they manage memory on behalf of the programmer, dynamic languages may be thought of as being invulnerable to buffer overflow attacks. However, the runtimes of some dynamic languages are implemented in C, which is itself subject to buffer overflow attacks, so programs executing in such runtimes may be vulnerable after all. This is illustrated by these Ruby, Perl, Python, and PHP vulnerabilities, all reported in 2008.

The same is true of JavaScript running in Firefox, Internet Explorer, and WebKit/Safari.

Friday, October 09, 2009

Independent Study: Defending against Injection Attacks through Context-Sensitive String Evaluation

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when the paper was published at least several years ago.)

Submission: Defending against Injection Attacks through Context-Sensitive String Evaluation [PDF]

Please give a brief, 2-3 sentence summary of the main ideas of this paper:

What is the strength of this paper (1-3 sentences):

Its strength is that the authors present a simple but general analysis of all kinds of injection attacks (e.g. SQL, shell, and others), and implement a system for preventing and detecting those attacks. Their system is completely automated, requiring the programmer to make no decisions (and hence make no mistakes), and its generality is extremely appealing.
What is the weakness of this paper (1-3 sentences):
The runtime overhead of CSSE is a bit steep.
Excellent. This paper should be presented at POOPSLA '99!
All told, the approach the authors take to solving the problem of injection attacks is similar to Perl's taint mode, but the context-appropriate escaping mechanism is unique to CSSE.
I'm convinced.
Worth solving:
Definitely. And I love the way they've solved it. It's general and it requires no programmer input.
Detailed comments:

One thing I should note about this paper is that it is beautifully written. Killer paragraph (but you have to ignore the superfluous comma after "prevention method"):
This paper introduces Context-Sensitive String Evaluation (CSSE), which is an intrusion detection and prevention method, for injection attacks. It offers several advantages over existing techniques: it requires no knowledge of the application or application source code modifications and can therefore also be used with legacy applications. It is highly effective against most types of injection attacks, not merely the most common ones. It does not rely on the application developer, which makes it less error-prone. Finally, it is not tied to any programming language and can be implemented on a variety of platforms.
Asking programmers to help validate the security of their application -- as in last week's paper's attempt to disambiguate the purpose of regular expressions by prompting the programmer for input -- is invariably bound to fail.

The authors analyze injection attacks in general (e.g. SQL and shell), focusing on how these attacks exploit assumptions about the syntactic content of user input.

More great paragraphs:
A common property of injection vulnerabilities is the use of textual representations of output expressions constructed from user-provided input. Textual representations are representations in a human-readable text form. Output expressions are expressions that are handled by an external component (e.g., database server, shell interpreter).

User input is typically used in the data parts of output expressions, as opposed to developer-provided constants, which are also used in the control parts. Therefore, user input should not carry syntactic content. In the event of an injection attack, specially crafted user input influences the syntax, resulting in a change of the semantics of the output expression. We will refer to this process as mixing of control and data channels.
The authors define a framework for understanding the sundry injection attacks in more general terms, identifying sets of input and output vectors. For most web applications, there's only a single input vector, HTTP operations. The output vectors for SQL injection attacks are the execution of SQL statements against a database, and for command injection attacks, the output vector is a call to execute a command, such as with system() or exec().

They describe existing approaches to this problem as either safe ad-hoc serialization or serialization APIs. Safe ad-hoc serialization includes manual input validation (i.e. the programmer is solely responsible for validating the safeness of the input), automated input validation (e.g. MagicQuotes in PHP), variable tainting (e.g. Perl's -T flag), and, lastly, the approach of SQLrand, which requires that all SQL commands executed by an application be encoded as constants in the application. Serialization APIs include DOM APIs for XML and, for SQL, any API which requires prepared statements. Examples of the latter are Java's PreparedStatement and the prepare_statement method of Perl's DBI module.
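To see the difference between ad-hoc serialization and a serialization API, here's a small Python sketch (the sqlite3 module stands in for a real database driver; the table and payload are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

payload = "alice' OR '1'='1"  # a classic injection payload

# Ad-hoc serialization gone wrong: the payload becomes SQL syntax.
unsafe_sql = "SELECT role FROM users WHERE name = '%s'" % payload
unsafe_rows = conn.execute(unsafe_sql).fetchall()  # matches every row

# Serialization API: the driver treats the payload strictly as data.
safe_rows = conn.execute(
    "SELECT role FROM users WHERE name = ?", (payload,)
).fetchall()  # matches nothing
```

The parameterized query never mixes the control channel (SQL keywords and operators) with the data channel (user input), which is exactly the mixing the paper identifies as the root of injection vulnerabilities.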

The authors propose to assign metadata to all strings in a program in order to track their origin. Strings read from a TCP/IP socket are tagged as untrusted. Strings that are constants in the source code are tagged as trusted. Their system, Context-Sensitive String Evaluation (CSSE), tracks the untrusted string fragments at runtime. When an untrusted fragment is included in an expression passed to a function which interacts with external resources (e.g. mysql_query(), exec()), CSSE can escape the untrusted fragment in a context-appropriate way (e.g. SQL escaping in the case of mysql_query(), shell escaping in the case of exec()), block the request, or raise an alarm.
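The mechanism can be sketched in Python. This is a toy model, not the authors' implementation: TaintedStr, tag_input, sql_escape, and mysql_query_sink are all names I've invented for illustration.

```python
class TaintedStr(str):
    """A string subclass carrying origin metadata, as CSSE tags strings."""
    tainted = True

def tag_input(s):
    # Strings arriving from the network get marked untrusted.
    return TaintedStr(s)

def sql_escape(fragment):
    # Context-appropriate escaping for the SQL sink (toy version).
    return fragment.replace("'", "''")

def mysql_query_sink(parts):
    # At the sink, untrusted fragments are escaped; trusted constants
    # pass through untouched.
    return "".join(sql_escape(p) if getattr(p, "tainted", False) else p
                   for p in parts)

query = mysql_query_sink(["SELECT * FROM t WHERE name = '",
                          tag_input("x' OR '1'='1"),
                          "'"])
```

The appeal of this design is that the escaping decision is made automatically at the output boundary, using the metadata, so the programmer never has to remember to escape anything.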

This feature can be implemented using Aspect-Oriented Programming (AOP), but the authors note that at the time of their writing the AOP library for PHP did not support the interception of string operations, which is necessary to implement CSSE.

Sunday, October 04, 2009

Independent Study: Static detection of security vulnerabilities in scripting languages

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when the paper was published at least several years ago.)

Submission: Static detection of security vulnerabilities in scripting languages [PDF]

Please give a brief, 2-3 sentence summary of the main ideas of this paper:

SQL injection and other string-based exploits to which web applications are vulnerable can be detected by performing static analysis on web applications written in dynamic languages. The static analysis is supplemented with information gleaned from the symbolic execution of the source code.
What is the strength of this paper (1-3 sentences):
Techniques for automatically detecting SQL injection attacks in web applications written in dynamic languages are sorely needed.
What is the weakness of this paper (1-3 sentences):
I am skeptical of the usefulness of the interactive mode of the checker -- which is triggered when regular expressions are used to validate unsafe data -- for the average PHP programmer. Also, while the authors refer to Perl's taint mode (man perlsec) as an alternative way of sanitizing data, it would be useful if they were to compare the effectiveness of their approach to Perl's built-in approach.
Symbolic execution has been used before in the DART paper, but there the purpose was to determine what input values would cause a statically-typed program to take certain paths during automated testing; here the purpose is to determine whether any memory locations are untrusted.
The authors describe a checker that is effective at detecting SQL injection vulnerabilities.
Worth Solving
This problem is worth solving. It is all too easy for programmers to fail to untaint input received from the user of a web application, so a reliable, automated way of detecting such exploits is necessary.
Reasonably confident
Detailed Comments
Analysis starts with block-level symbolic execution, which generates a block summary. Intraprocedural analysis takes block summaries (a six-tuple, described below) as input and generates a four-tuple, which is consumed by the interprocedural analysis phase.

The use of symbolic execution here reminds me of the DART paper. Here symbolic execution is used to understand the functioning of a program written in a dynamic language. In the DART paper, it was used to force a statically typed language to take different paths during automated testing.

Block Analysis

At the block level, the code is executed symbolically, and the resulting summary is used to perform analysis at intra- and interprocedural levels. Using a summary at the higher levels expedites the analysis.

The authors define a language to model what they believe is an appropriate subset of PHP for detecting SQL injection attacks with their simulator (i.e. the component of their checker which symbolically executes blocks of PHP code).

They devote particular attention to how they model strings, because strings are such essential types in dynamic languages: "Strings are typically constructed through concatenation. For example, user inputs (via HTTP get and post methods) are often concatenated with a pre-constructed skeleton to form an SQL query.... String values are represented as an unordered concatenation of string segments, which can be one of the following: a string constant, the initial value of a memory location on entry to the current block (l_0), or a string that contains initial values of zero or more elements from a set of memory locations (contains(sigma))." The latter part of the definition allows the checker to track the flow of tainted data through a web application.

The motivation for and definition of untaint (as related to the definition of the Boolean type) in the modelling language is unclear to me.

The untainting of strings "occur[s] via function calls, casting to safe types (e.g. int, etc), regular expression matching (!), and other types."

The result of the block-level analysis is a six-tuple consisting of an error set ("the set of input variables that must be sanitized before entering the current block"), definitions ("the set of memory locations defined in the current block"), value flow ("the set of pairs of [memory] locations (l_1, l_2) where the string value of l_1 on entry becomes a substring of l_2 on exit"), termination predicate (whether the current block causes the program to exit), return value (undefined if and only if the termination predicate is true), and an untaint set (the set of [memory] locations that are sanitized by the current block, for each of the block's successors).
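For my own reference, the six-tuple might be modeled like this. The field names are my paraphrases of the paper's descriptions, not its notation, and the example values are invented:

```python
from dataclasses import dataclass

@dataclass
class BlockSummary:
    error_set: set        # inputs that must be sanitized before this block
    definitions: set      # memory locations defined in the block
    value_flow: set       # pairs (l1, l2): value of l1 on entry becomes
                          # a substring of l2 on exit
    terminates: bool      # whether the block exits the program
    return_value: object  # undefined iff terminates is True
    untaint_set: dict     # per-successor: locations the block sanitizes

# A block that builds a query string from an unsanitized input:
summary = BlockSummary(error_set={"$name"},
                       definitions={"$query"},
                       value_flow={("$name", "$query")},
                       terminates=False,
                       return_value=None,
                       untaint_set={})
```

The value_flow pairs are what let the higher-level phases propagate taint: if $name flows into $query here, and $query later reaches a database call, $name ends up in the error set.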

Intraprocedural Analysis

This phase of the analysis uses the six-tuple block summaries generated by the previous phase to generate a four-tuple consisting of an error set ("the set of memory locations ... whose value may flow into a database query, and therefore must be sanitized before invoking the current function"), return set ("the set of parameters or global values that may be a substring of the return value" of the function), sanitized values ("the set of parameters or global variables that are sanitized on function exit"), and program exit ("whether the current function terminates program execution on all paths").

Interprocedural Analysis

This phase involves using the previously-generated function-level tuple to substitute actual for formal parameters in the error set and marking memory locations as safe when they are unconditionally untainted. It also involves the use of the Boolean-related notion of untaint that I still don't understand.

In what order are functions analyzed? "Our algorithm analyzes the source codebase in topological order based on the static function call graph." Recursion isn't resolved by computing a fixed point; instead, the system inserts a no-op summary when it encounters a recursive call.

Since regular expressions are self-contained automata, little computational devices, I was surprised when the authors remarked that strings can be marked as untainted when they are checked by regular expressions. It sounded almost magical. It's not quite that. "Some regular expressions match well-formed input while others detect malformed input; assuming one way or the other results in either false positives or false negatives.... To make it easy for the user to specify the sanitization effects of regular expressions, the checker has an interactive mode where the user is prompted when the analysis encounters a previously unseen regular expression and the user's answers are recorded for future reference."

The authors mention the built-in Perl taint mode (man perlsec). This suggests that the proper way of implementing a checker like the one described here is to integrate it into the language runtime.

Sunday, September 27, 2009

Independent Study: Ripley: Automatically Securing Web 2.0 Applications Through Replicated Execution

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when the paper was published at least several years ago.)

Submission: Ripley: Automatically Securing Web 2.0 Applications Through Replicated Execution

Please give a brief, 2-3 sentence summary of the main ideas of this paper:

AJAX or Web 2.0 applications provide a richer end-user experience by moving computations to the browser, but the results of those computations are vulnerable to manipulation by known tools. Since client-side computations often cause server-side events, such as modifying or removing data in a database, it's important for the server to be able to validate the integrity of the client's computations. The authors present RIPLEY, a system for replaying client-side computations on the server to reestablish the lower bound on computational integrity that was lost by moving code to the client.
What is the strength of this paper (1-3 sentences):
The architecture of RIPLEY is sound for preventing the problem the authors identify, and the empirical results of deploying the system are sufficiently promising for internet application providers to consider investigating this approach to the problem.
What is the weakness of this paper (1-3 sentences):
The authors don't provide a compelling case that this problem needs to be solved. In their paper on Spectator, they refer to specific JavaScript worms that have caused damage to Web site operators. Here they don't refer to specific large-scale exploits that their approach solves. This makes the problem they're solving seem less pressing.
It's an interesting solution, but the problem, while real, doesn't seem to have been exploited in any significant way. On the other hand, because the server-side code can anticipate client actions and pre-compute their results, it can send those results to the client before the client even acts, increasing client-side responsiveness. That's a lovely way to improve the user experience, and one that should pique the interest of Web site operators making heavy use of client-side JavaScript.
RIPLEY is novel. The authors note that an alternative approach (explored here and here) is to require the client to send to the server a computed proof of the correctness of its current state (e.g. the client could intermittently send a stack trace to the server). Unlike RIPLEY, these approaches only provide a degree of assurance, not a guarantee, of the correctness of the client-side computations.
The architecture, implementation details, and runtime performance measurements make the case convincingly.
Worth Solving
The problem is worth solving, but not as pressing as the problem of JavaScript worms that the authors solve in their work on Spectator.
95% +- 2
Detailed Comments

See citations 27, 29 for vulnerabilities to code modification in browser.

See citation 20 for, more generally, AJAX vulnerabilities and exploits.

Volta is a distributing compiler -- in some respects similar to the Google Web Toolkit, which generates JavaScript code from Java. The idea of GWT is to let the developer write in a statically-typed language what would otherwise be dynamically-typed JavaScript -- the GWT compiler can reason about the statically-typed code and emits code that's known to be safe. While Volta appears to be designed to work in a more general way than GWT -- generating code for multiple languages (more than just JavaScript) via the Microsoft Common Language Runtime intermediate representation, for example -- the basic idea is similar. One feature of Volta is that, because it can reason about the statically-typed program, it can divide the computations in the statically-typed code between the server and the client.

A developer who wishes to use RIPLEY applies class-level annotations to identify which code runs on the client and which on the server. The code is then converted to .NET bytecode, at which point Volta reads the bytecode (which contains the annotations) and generates the appropriate client- or server-side code, introducing RPCs between client and server when necessary. RIPLEY is implemented as a tweak to this phase of Volta's execution. The authors added a feature to Volta to generate additional server-side code for validating the client-side computations.

RIPLEY is intended to enforce the integrity of the original application in the face of the vulnerability of the client-side code. It accomplishes this by reproducing each client-side computation on the server; if the results of a client-side computation cannot be reproduced on the server, this is reason to believe that something faulty or malicious occurred in the client.

In the RIPLEY architecture, a browser emulator runs on the server, validating the client-side computations by replaying them. Since the browser emulator runs in a .NET virtual machine, which uses JIT, sometimes the server-side validations can anticipate client-side events, and the server can push pre-computed results to the client.
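The core check can be sketched as follows. This is a toy model of the idea only: RIPLEY replays .NET bytecode in a server-side browser emulator, not Python functions, and the shopping-cart events are invented for illustration.

```python
def apply_event(state, event):
    # A deterministic client-side computation: adding items to a cart.
    item, qty = event
    state = dict(state)
    state[item] = state.get(item, 0) + qty
    return state

def validate(initial_state, events, client_claimed_state):
    # Server-side replay: rerun every event the client reported and
    # compare with the state the client claims to have reached.
    state = initial_state
    for ev in events:
        state = apply_event(state, ev)
    return state == client_claimed_state

events = [("book", 1), ("book", 1)]
honest = {"book": 2}
tampered = {"book": 2, "discount": 1}  # client-side manipulation

ok = validate({}, events, honest)      # replay matches: accept
bad = validate({}, events, tampered)   # replay diverges: reject
```

Because replay is deterministic, a mismatch is sufficient evidence that something faulty or malicious happened on the client, which is the lower bound on integrity the paper is after.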

Sunday, September 20, 2009

Independent Study: Spectator: Detection and Containment of JavaScript Worms

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when the paper was published at least several years ago.)

Submission: Spectator: Detection and Containment of JavaScript Worms [PDF]

Please give a brief, 2-3 sentence summary of the main ideas of this paper:

Detecting JavaScript worms can be accomplished by adding a tag to content uploaded to a Web server, associating the tag with the IP address of the client that originated the upload, and using tags to identify propagation chains. A long chain of propagation is a signature of JavaScript worms, so identifying a long chain (where the length threshold is user-defined) should be sufficient for detecting them. Once a worm has been identified, containment is a matter of disallowing further uploads along the chain of known-infected clients until an administrator tells the Spectator proxy that the chain is safe.
What is the strength of this paper (1-3 sentences):
The strength of this paper is that it proposes a solution which can be implemented simply as a proxy server in the domain of a Web site operator, and which doesn't require any modifications or plug-ins in the web browser. With a reasonable amount of time and capital, a large Web site operator can implement this solution today.
What is the weakness of this paper (1-3 sentences):
Not to be too generous, but I don't find any problems with it.
This is on the whole an excellent paper. I would like to see it in SIGFOO this year.
I'm not sure how novel this is, as XSS attacks and JavaScript worms aren't my specialty, but naively, it seems novel. In the spirit of many useful papers in computer science, it combines a number of sound techniques to solve an important problem.
I'm convinced not only that the detection and containment algorithms work correctly and efficiently, but also that the author of a JavaScript worm would have a hard time subverting the system. The authors' approach is essentially invisible to the browser, with the exception of some JavaScript injected by the Spectator proxy (Figure 5) (which appears to be designed so as to prevent subversion by malicious code).

While the empirical data in section 5.2 (Overhead and Scalability) is encouraging, a Web site operator would obviously want to subject any implementation of this system to rigorous testing before deploying it.
Worth Solving
Very much so!
Detailed Comments
I'm curious whether it's possible for each page downloaded from a site to contain JavaScript code to validate that only Web site operator-provided JavaScript is executing in that page. Something like an unmodifiable onload event handler similar to the unload event handler in Figure 5.
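The detection scheme summarized above can be sketched like this. It's a toy model: the proxy's tagging, the causality links, and the threshold are all simplified from the paper, and the names are mine.

```python
THRESHOLD = 3  # site-defined: chains longer than this look like a worm

# parent[tag] = tag carried by the content the uploading client had
# viewed just before uploading (None for an organic, uncaused upload)
parent = {}

def record_upload(new_tag, causing_tag):
    # The Spectator-style proxy tags each upload and links it to the
    # upload that apparently caused it, forming a propagation chain.
    parent[new_tag] = causing_tag

def chain_length(tag):
    n = 0
    while tag is not None:
        n += 1
        tag = parent.get(tag)
    return n

def is_worm(tag):
    return chain_length(tag) > THRESHOLD

record_upload("t1", None)   # a user posts some content
record_upload("t2", "t1")   # a viewer of t1 uploads next
record_upload("t3", "t2")   # ...and so on down the chain
record_upload("t4", "t3")   # chain length 4 exceeds the threshold
```

Containment then amounts to refusing further uploads whose causing tag lies on a chain flagged by is_worm, until an administrator clears it.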

Some papers to follow up on:

S. Meschkat. JSON RPC: Cross site scripting and client side Web services. In 23rd Chaos Communication Congress, 12 2006.

T. Pietraszek and C. V. Berghe. Defending against injection attacks through context-sensitive string evaluation [PDF]. In Proceedings of the Recent Advances in Intrusion Detection, Sept. 2005.

Y. Xie and A. Aiken. Static detection of security vulnerabilities in scripting languages [PDF]. In Proceedings of the Usenix Security Symposium, pages 271–286, Aug. 2006.

Y.-W. Huang, F. Yu, C. Hang, C.-H. Tsai, D.-T. Lee, and S.-Y. Kuo. Securing Web application code by static analysis and runtime protection. In Proceedings of the Conference on World Wide Web, pages 40–52, May 2004.

Wednesday, September 16, 2009

Fibonacci Strings and Performance

Fibonacci strings are related to Fibonacci numbers in that the length of the nth Fibonacci string is the nth Fibonacci number. Here's pseudocode for a function that generates them:

fibstr(n): int -> string
    if n is 0
        return the empty string
    else if n is 1
        return the string "b"
    else if n is 2
        return the string "a"
    else
        return fibstr(n-1) concatenated with fibstr(n-2)
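The pseudocode translates directly into Python. This is the naive recursion, which is the point for benchmarking: it deliberately does an exponential amount of concatenation, so don't memoize it if you want to stress the string machinery.

```python
def fibstr(n):
    # The length of fibstr(n) is the nth Fibonacci number.
    if n == 0:
        return ""
    if n == 1:
        return "b"
    if n == 2:
        return "a"
    return fibstr(n - 1) + fibstr(n - 2)
```

For example, fibstr(5) is "abaab" (length 5, the 5th Fibonacci number), and chopping the last two letters off fibstr(7) leaves the palindrome "abaababaaba".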

The strings generated by this function for n > 2 have the interesting property that if you delete the last two letters, the resulting string is a palindrome. Anyway, for kicks I decided to use Fibonacci strings to test the string performance of popular programming languages. The test was simple: run fibstr(31) 31 times. I was lazy in writing the tests, but the results were consistent across many cups of Bhakti chai and many executions of the test yesterday morning, so as far as I'm concerned they're valid representations of the relative performance of the languages with respect to string concatenation and, to some extent, memory management.

Language                        real       user       sys
Java (StringBuilder w/JIT)      2.807s     2.652s     0.084s
Java (String w/JIT)             2.819s     2.692s     0.076s
JavaScript (Rhino 1.6.r5-3)     4.240s     4.100s     0.080s
Python (2.4.4)                  4.708s     3.992s     0.012s
Perl (5.8.8)                    7.528s     7.484s     0.008s
Java (StringBuilder w/o JIT)    8.171s     8.053s     0.040s
Java (String w/o JIT)           11.707s    11.553s    0.068s
Ruby (1.8.6)                    21.348s    18.565s    2.488s

Off-the-cuff ... Ruby's performance speaks for itself. I expected Perl to do better, since it's been around quite a while. I can see why Sun responded to early complaints about Java's performance with JIT. It's clearly effective, but I'd like to know more about how the JRE decides to compile to native code. With JIT, StringBuilder doesn't buy you anything if it's compiled to native code; without it, StringBuilder can make a noticeable (on paper) difference over plain String, but I wonder whether the difference shows up much in real workloads. If a string-manipulating function is a hot spot, it'll probably be compiled to native code. I suppose one would have to examine the memory impact of String v. StringBuilder as well.

I wouldn't be surprised if most of the time a language spends in system mode comes from calls to brk(2).

Sunday, September 13, 2009

Independent Study: A Universal Modular ACTOR Formalism for Artificial Intelligence

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format of this and subsequent papers will be similar to that of a review of a paper submitted to a conference. These are already-published papers, so I'll be writing with the obvious benefit of hindsight.)

Submission: A Universal Modular ACTOR Formalism for Artificial Intelligence [PDF]

Reviewer: Nicholas Dronen

Please give a brief, 2-3 sentence summary of the main ideas of this paper:
The authors present their notion of an actor, a formalism which unifies control flow and data flow. Actors allow programs to achieve a high degree of parallelism without the explicit use of semaphores. Actors accomplish this by being free of side effects and communicating with other actors only by passing messages.
What is the strength of this paper (1-3 sentences):
The paper's greatest strength lies in the scalability (i.e. degree of parallelism) and reliability (discussed below) that actors afford.
What is the weakness of this paper (1-3 sentences):
While the ambition of the paper is admirable, it nonetheless casts too wide a net, covering topics from hardware design to formal representations of knowledge and belief. Historically, the value of actors as a way to reason about the properties of programs isn't clear, considering that other approaches, such as the pi calculus, have seen wider adoption.
In retrospect, a great deal of this paper could be removed without harm. Specifically, the idea of actors as an appropriate abstraction for hardware (e.g. removing interrupts) and the discussion of using actors to represent knowledge could be completely removed. The remaining material could then be expanded, ideally covering in more depth the concurrency-related aspects of actors.
Some aspects of actors -- such as message passing and perhaps pattern matching -- are not original. What is novel in this paper is the combination of these techniques into a scheme that provides a high degree of concurrency.
Worth Solving
I am very confident in my evaluation of this paper. I've been thinking about actors for a year or so.
Detailed Comments
To understand actors, consider a thread. Its stack is finite, but how much stack space a function running in a thread requires is undecidable, so system designers allocate a constant amount of stack space for each thread -- enough for well-behaved programs, but still too much to allow even tens of thousands of threads on a reasonably equipped machine. Conventionally, 2 MB is allocated per thread on a 32-bit machine running Linux. If a process is limited to 2 GB of memory, and assuming for the sake of discussion that no other implementation issues (such as the placement of the heap) interfere, a process can have at most ~1000 threads. Actors solve this admittedly modest problem, and the solution pays off elsewhere: it enables the high degree of concurrency that programs will require in order to exploit the large number of cores of future processors.

An actor is a user space thread, the stack of which doesn't grow -- essentially a tail-recursive function. So the system designer only has to allocate an extremely small amount of stack space -- basically enough for an activation record and local variables -- which means that a single process can contain many thousands of actors. For example, a benchmark has run with 20 million Erlang processes (the Erlang term for an actor).

An actor contains only local state, so it has no side effects. Since there's no shared state, the programmer doesn't have to concern herself with error-prone synchronization primitives, such as semaphores. Actors communicate with each other by sending messages. The incoming messages for an actor are placed in an actor's mailbox (this term doesn't appear in the paper but is in common use today). When an actor receives a message, it uses pattern matching (in the style of Haskell and ML) to determine the type of the message.

One would be wrong to infer, however, that synchronization primitives are unnecessary in a runtime system that supports actor-based programs in a single operating system process on an SMP or multicore machine. Imagine a program running on a multicore machine with 10,000 actors, 9,999 of which are sending messages to the remaining one (whom we'll call A). A's mailbox will be filling up with messages, and two actors executing concurrently may send a message to A at the same moment, which implies that the scheduler has to handle both deliveries at once without corrupting A's mailbox.

The paper's authors found inspiration for actors in packet-switched networks, which in part explains why a program written in an actor-based language seems so much like a distributed system. Where the program has actors sending messages to each other, the distributed system has nodes exchanging packets. In the domain of software architecture, there is an interesting analogy between REpresentational State Transfer (REST) and the actor model, which is mentioned during the question-and-answer period of a recent presentation by REST advocate Steve Vinoski.

Actors have additional benefits: since they contain only local state, an actor can simply be terminated if an error occurs. This effectively isolates errors; a small amount of data may be lost, but the program continues to run. And since actors consume so little memory, each actor can be monitored by another actor. The monitoring actor is notified when the "child" actor dies and can restart it, which increases the reliability of programs. The lack of global state in actor-based programs also makes it easy to reload code into a running program, which increases availability -- especially important in systems that are supposed to be fault tolerant.