Sunday, November 29, 2009

Independent Study: Concolic testing for web applications

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when a paper was published several years ago.)

This week, two papers about web applications from the International Symposium on Software Testing and Analysis '08:

Dynamic Test Input Generation for Web Applications

Here the authors use the concolic testing method pioneered in the seminal paper on Directed Automated Random Testing (DART) (PDF) to automatically generate tests for web applications written in PHP.

They use a taint-tracking PHP runtime environment as the test oracle (the component that determines whether a failure occurs). The purpose of automatically generating tests is to automatically identify bugs in the program; the specific class of bugs the authors target is SQL injection vulnerabilities. To that end, they iteratively construct what they call an approximate backward slice of the PHP program by (loosely speaking):
  1. identifying statements where such vulnerabilities may cause undesired behavior (viz. database library calls where injected SQL can ultimately do harm),
  2. adding the functions where such statements occur to the set of functions to be analyzed,
  3. executing the program by loading it in a browser,
  4. resolving control dependencies by recording the stack trace at the beginning,
  5. analyzing data dependencies, and
  6. repeating (with some variations) until all data dependencies are resolved.
The purpose of the preceding steps is to exclude from the analysis aspects of the program that won't help identify SQL vulnerabilities in the code.
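The iterative scoping the authors describe can be sketched as a worklist algorithm. This is my own loose reconstruction, not the paper's implementation; the data structures and names are invented for illustration:

```python
# Hypothetical sketch of the scoping step: starting from "sink"
# statements (database calls where injected SQL could do harm), pull
# in every function whose data flows into a sink, until the set of
# functions to analyze stops growing.

def functions_to_analyze(sinks, enclosing_function, data_deps):
    """sinks: statements where tainted SQL would do harm.
    enclosing_function: maps a statement to the function containing it.
    data_deps: maps a statement to the statements it depends on."""
    analyzed_fns = set()
    seen_stmts = set()
    worklist = list(sinks)
    while worklist:
        stmt = worklist.pop()
        if stmt in seen_stmts:
            continue
        seen_stmts.add(stmt)
        analyzed_fns.add(enclosing_function[stmt])
        # follow data dependencies backwards into other functions
        worklist.extend(data_deps.get(stmt, []))
    return analyzed_fns
```

Everything outside the returned set can be ignored, which is the point of the exercise.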

Section 3 of the paper discusses the authors' algorithm for generating constraints for PHP.

Section 4 evaluates the system. Constraint generation is accomplished by a plugin to phc, a PHP compiler front-end. The plugin "wraps each statement in a function call"; the function call logs a trace of the program's execution to a file. They deal with eval by passing the string to be eval-ed through the plugin, so each statement in the eval-ed string is wrapped in a function call which logs a trace to the same file. Constraints are resolved by reading and symbolically executing the trace file. The result is "a list of Boolean control expressions where each subexpression is annotated with a concrete value from the execution."
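As I understand it, replaying the trace yields, for each branch, a control expression annotated with the concrete values observed during execution. A toy illustration (the trace format here is invented, not phc's):

```python
# Each logged branch records the source expression, the concrete
# values of its operands, and whether the branch was taken. Replaying
# the log produces the list of annotated path constraints.

def path_constraints(trace):
    """trace: list of (expr, bindings, taken) tuples logged by the
    instrumented program."""
    constraints = []
    for expr, bindings, taken in trace:
        cond = expr if taken else f"!({expr})"
        constraints.append((cond, dict(bindings)))
    return constraints

# A made-up two-branch trace:
trace = [
    ("$user != ''", {"$user": "alice"}, True),
    ("$id > 0",     {"$id": -1},        False),
]
```

Negating one of these constraints and solving gives an input that drives execution down a new path, which is the concolic loop in miniature.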

An interesting aspect of the authors' evaluation is the overhead of the tracing process. When they evaluated an entire PHP program, the trace file for loading a single web page was almost 3 GB and the page load timed out. So the iterative process of limiting the scope of their analysis mentioned above was necessary for obvious practical reasons.
On the limits of their constraint resolution, the authors write:
Previous work on leveraging symbolic and runtime values for input test generation falls back on concrete values when the symbolic value being addressed lies outside the theory of the resolution algorithm’s decision procedure. Our constraint resolution algorithm generates constraints only based on one variable instance per value. Therefore it may underapproximate the symbolic values of variables when program predicates depend on multiple variables, and it may miss paths that other resolution algorithms would find. In principle our constraint resolution algorithm could be enhanced to include multivariate constraints in some cases, but we leave that to future work.

An object-oriented web test model for testing web applications, Kung, et al, may be interesting reading:
This paper describes an Object-Oriented test model that captures both structural and behavioral test artifacts of Web applications. The model represents the entities of Web applications as objects and describes their structures, relationships, and dynamic behaviors. Based on the test model, test methods are presented to derive test cases automatically for ensuring the structures and behaviors of Web applications
Finding Bugs in Dynamic Web Applications

The previous paper focused on web application security. This paper focuses on web application reliability. Where the previous paper's goal was to identify vulnerabilities to SQL-injection attacks, this paper's goal is to identify bugs that cause web applications to crash or generate invalid HTML. (Web application crashes that can be triggered by user input become denial-of-service attack vulnerabilities once they become known to bad actors.) Similarly, where the test oracle of the previous paper is a PHP runtime environment that supports checking strings for taintedness (failure being defined as the use of a tainted string in an SQL statement), the test oracle of this paper is an HTML validator (failure being defined as the web application generating invalid HTML).

Testing whether a web application generates valid HTML is hard for dynamic web pages. Systems exist for validating dynamically-generated web pages, but they require the tester to create tests manually. Here the authors present a system, Apollo, for automatically generating tests for dynamic pages.
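In outline, the oracle works something like the following sketch. This is my own illustration of the idea, not Apollo's actual API; `run_php` and `validate_html` are stand-ins:

```python
# Run the application on a generated input; a crash or invalid HTML
# output counts as a failure, matching the paper's two failure modes.

def run_test(run_php, validate_html, php_file, test_input):
    """run_php: executes the page and returns its HTML output.
    validate_html: returns a list of validation errors (empty = valid)."""
    try:
        html = run_php(php_file, test_input)
    except Exception as crash:          # crashes are failures too
        return ("crash", str(crash))
    errors = validate_html(html)
    return ("ok", []) if not errors else ("invalid-html", errors)
```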

Something I don't understand. Here's a passage from the paper:
The HTML pages generated by a PHP application may contain buttons that—when pressed by the user—result in the loading and execution of additional PHP source files. We simulate such user input by transforming the source code. Specifically, for each page h that contains N buttons, we add an additional input parameter p to the PHP program, whose values may range from 1 through N. Then, at the place where page h is generated, a switch statement is inserted that includes the appropriate PHP source file, depending on the value supplied for p. The steps of the user input simulator are fully mechanic, and the required modifications are minimal, but for the evaluation we performed the program transformation by hand (due to time constraints).
Normally, submit buttons result in an HTML form being POST'ed to the web application. From the context, it's not clear why the system wouldn't simply POST the form. An additional passage:
The stand-alone component of the User Input Simulator performs a transformation of the program that models interactive user input by way of additional parameters.
Still a little confused. :-)

Ah, I get it:

<?php echo "<h2>WebChess ".$Version." Login</h2>"; ?>
<form method="post" action="mainmenu.php">
Nick: <input name="txtNick" type="text" size="15"/><br/>
Password: <input name="pwdPassword" type="password" size="15"/>
<input name="login" value="login" type="submit"/>
<input name="newAccount" value="New Account"
type="button" onClick="window.open('newuser.php', '_self')"/>
</form>
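The "New Account" button navigates via JavaScript rather than submitting the form, so the loaded PHP file depends on which button is pressed, and that is what the synthetic parameter models. A Python analogue of the inserted switch (names invented for illustration):

```python
# Each of a page's N buttons is mapped to a value of a synthetic
# parameter p (1..N); a switch on p stands in for including the
# corresponding PHP source file.

def dispatch(p, button_targets):
    """button_targets: the PHP files the page's buttons would load,
    e.g. ['mainmenu.php', 'newuser.php']."""
    if not 1 <= p <= len(button_targets):
        raise ValueError("p out of range")
    return button_targets[p - 1]   # stand-in for PHP's include
```

With this transformation, "which button was pressed" becomes an ordinary input the test generator can enumerate.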
Nothing else very interesting here. They evaluate their system, present the results, related work, etc.

Other reading:

Improving test case generation for web applications using automated interface discovery, Halfond, et al, focuses on JavaScript.

Thursday, November 26, 2009

Notes for Information Retrieval Quiz #3

Some things to review before the Information Retrieval quiz:

Lecture 18, Collaborative Filtering and Recommender Systems

Pearson correlation
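For reference, user-to-user similarity in collaborative filtering is typically the Pearson correlation of two rating vectors:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors:
    covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

It ranges from -1 (opposite tastes) to 1 (identical tastes), and unlike raw cosine similarity it corrects for each user's mean rating.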

Lecture 19, Information Extraction

Named entity recognition: find and classify (i.e. determine the category of) all the named entities in a text. Two approaches to named entity recognition:
Rule-based (regular expressions)
  • Lists of names
  • Patterns to match things that look like names
  • Patterns to match the environments that classes of names tend to occur in
Supervised machine learning
  • Get annotated training data
  • Extract features
  • Train systems to replicate the annotation
Relation analysis consists of two tasks:
  1. determine if two entities are related
  2. if they are, classify the relation

Features in relation analysis (for each of the above tasks) are:

  1. Features of the named entities involved (their types, the concatenation of their types, the headwords of the entities)
  2. Features derived from the words between and around the named entities (±1, 2, 3 windows; bag of words between)
  3. Features derived from the syntactic environment that governs the two entities (constituent path through the tree from one entity to the other; base syntactic chunk sequence from one to the other; dependency path)
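A toy extractor for the first two feature groups (entirely my own construction, just to make the feature list concrete; the syntactic features would require a parser):

```python
# Word-level features for a candidate entity pair: the entity types,
# their concatenation, the headwords, and the bag of words between
# the two mentions.

def relation_features(tokens, e1, e2):
    """e1, e2: (start, end, type, headword) spans, e1 before e2;
    end indices are exclusive."""
    s1, t1, type1, h1 = e1
    s2, t2, type2, h2 = e2
    return {
        "types": (type1, type2),
        "type_concat": type1 + "-" + type2,
        "headwords": (h1, h2),
        "bow_between": sorted(set(tokens[t1:s2])),
    }
```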
Template filling
  1. Rules and cascades of rules
  2. Supervised ML as sequence labeling
    1. One sequence classifier per slot
    2. One big sequence classifier
Lecture 20, Sentiment Analysis

Classification in sentiment analysis
  • Coarse classification of sentiment
    • Document-level classification according to some simple (usually binary) scheme
      • Political bias
      • Likes/hates
  • Fine-grained classification of sentiment-bearing mentions in a text
    • Positive/negative classification of opinions about entities mentioned in a text
    • Perhaps with intensity
Choosing a vocabulary
  • Essentially feature selection
  • Previous examples used all words
  • Can we do better by focusing on subset of words?
  • How to find words, phrases, patterns that express sentiment or polarity?
  • Adjectives
    • positive: honest important mature large patient
    • negative: harmful hypocritical inefficient insecure
  • Verbs
    • positive: praise, love
    • negative: blame, criticize
  • Nouns
    • positive: pleasure, enjoyment
    • negative: pain, criticism
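A minimal lexicon-based scorer using a subset of the word lists above (my own toy example, not from the lecture):

```python
# Count positive and negative lexicon hits; the sign of the difference
# gives the document's polarity.

POSITIVE = {"honest", "important", "mature", "praise", "love", "pleasure"}
NEGATIVE = {"harmful", "insecure", "blame", "criticize", "pain"}

def polarity(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```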
Lecture 21, Sentiment Analysis (cont.)

Identifying polarity words
  • Assume that generating exhaustive lists of polarity words is too hard
  • Assume contexts are coherent with respect to polarity
  • Fair and legitimate, corrupt and brutal
  • But not: fair and brutal, corrupt and legitimate
  • Example:
    • Extract all adjectives with > 20 frequency from WSJ corpus
    • Label them for polarity
    • Extract all conjoined adjectives
    • A supervised learning algorithm builds a graph of adjectives linked by the same or different semantic orientation
    • A clustering algorithm partitions the adjectives into two subsets
  • Mixed sentiment: The steering is accurate but feels somewhat anesthetized.
  • Sentiment inverters: ... never seen any RWD cars can handle well on snow even
    just few inches.
  • Anaphora and meronymy:
    • It's a great car for just about anything. The mkVI is pretty
      much a mkv but ironing out all the small problems.
    • Hey is the back seat comfortable? In my MkV it feels like
      you're sitting on a vat of acid.
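The conjunction heuristic above can be sketched as label propagation over a graph of adjectives. This is my simplified stand-in for the supervised-learning-plus-clustering pipeline in the example, not a faithful reimplementation:

```python
from collections import deque

# "and" links adjectives of the same orientation, "but" links opposite
# ones; propagate polarity labels outward from a few seed words.

def propagate(seeds, edges):
    """seeds: {adjective: +1 or -1}; edges: list of (a, b, conjunction)."""
    neighbors = {}
    for a, b, conj in edges:
        flip = -1 if conj == "but" else 1
        neighbors.setdefault(a, []).append((b, flip))
        neighbors.setdefault(b, []).append((a, flip))
    labels = dict(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for other, flip in neighbors.get(node, []):
            if other not in labels:
                labels[other] = labels[node] * flip
                queue.append(other)
    return labels
```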

Sunday, November 22, 2009

Independent Study: Concepts and Experiments in Computational Reflection

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when a paper was published several years ago.) (I've written most of these posts in the form of a review of a conference paper, but I'm going to use a free-form style this time.)

Pattie Maes's wonderful paper on reflection is a perfect follow-on to my reading about traits. My motivation for reading the traits paper was that Perl's Moose (Perl 6, too!) has traits (it calls them roles). I've been using Moose a lot at work and I wanted to catch up with the research behind traits to prepare for a short talk I gave about it to my colleagues. After the talk, a colleague who was at Bell Labs in the 80's mentioned a paper about reflection by Pattie Maes at OOPSLA in '87 which led to the MetaObject Protocol in the Common Lisp Object System (CLOS), which in part inspired Moose, so Maes's paper closes the loop, ices the cake, etc.

Many programmers know of reflection from the java.lang.reflect package in the Java API. Suffice it to say for the moment that Maes' reflection is expansive and Java's reflection is by comparison quite limited. However, reflection in Java does serve as a good starting point for understanding Maes. What the Java API does provide is programmatic access to information about objects at runtime by exposing an interface to what Maes calls the "self-representation of the system ... which makes it possible for the system to answer questions about itself and support actions on itself." To wit, java.lang.reflect allows the Java programmer to find out which methods or constructors, etc., exist for a given object, and to invoke them.

However, Maes goes a step further (or, more appropriately, Java didn't go as far as Maes imagined, for better or worse) by asserting that "a computational system can actually bring modifications to itself by virtue of its own computation." The modifications Maes envisions are pervasive. In her experimental object-oriented language, 3-KRS, reflection on an object occurs by way of a meta-object associated with the object. (The motivation for performing reflection on an object through a separate entity, its meta-object, may not have been obvious before Maes' paper, but in retrospect it's a clear case of separation of concerns.) Since everything is an object in a pure object-oriented language, meta-objects are everywhere:

[T]he self-representation of an object-oriented system is uniform. Every entity in a 3-KRS system is an object: instances, classes, slots, methods, meta-objects, messages, etc. Consequently every aspect of a 3-KRS system can be reflected upon. All these objects have meta-objects which represent the self-representation corresponding to that object.
For the uninitiated (including me), a slot is, loosely, an instance variable. And I suspect that the difference between a method and message is that a method defines a method (pardon the circularity) and a message "calls" a method. (The notion of a message likely goes back to Smalltalk.)

Further, meta-objects are manipulable at runtime, so — to borrow an example from Maes — a language that supports reflection in all its glory allows the programmer to modify meta-objects to provide support for multiple inheritance. Intuitively, a complete reflective system is an API for the semantics of a programming language. (See also: "A metaobject protocol (MOP) is an interpreter of the semantics of a program that is open and extensible.")
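Python (not 3-KRS, of course, but handy) gives a taste of both introspection and runtime modification, since classes themselves are mutable objects:

```python
# Introspection and intercession in Python: the program can both ask
# questions about its own structure and modify that structure at
# runtime, a limited analogue of Maes's mutable meta-objects.

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# Introspection: ask what slots an instance has.
assert set(vars(Point(1, 2))) == {"x", "y"}

# Intercession: add a method to the class while the program runs.
Point.norm2 = lambda self: self.x ** 2 + self.y ** 2
```

Java's reflection permits the first kind of query but not this kind of structural modification, which is the gap the post describes.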

Two aspects of Maes's reflection which are not true for popular statically-typed languages, then, are:
  • meta-objects are pervasive
  • meta-objects are mutable
For, say, Java or C#, these omissions are by design. Enterprises of all kinds are often hard-pressed to find programmers who can understand the source code of their larger applications. Some managers are left in the lurch when a key employee departs. Self-modifying code, one might say, is job security. So clarity, explicitness, consistency, readability — these language properties are desirable for production software. A complete reflective system violates them. The flip side of this is expressiveness. Domain-specific languages (DSLs) come to mind. Being able to define little languages to solve a particular problem, and to compose larger applications from modules written in little languages, is an attractive idea. I'd like to say more about DSLs and how they relate to reflective systems, but I don't know much more than that they've been seeping into programming culture for quite some time.

Incidentally, Aspect-Oriented Programming (AOP) exists to address the fact that cross-cutting concerns — any application requirements that cause code to be scattered throughout a code base (e.g. logging, security) — violate modularity. However, it also bridges the gap between the limited form of reflection that exists in Java and some of the more interesting uses of reflection Maes mentions — such as being able to trace the execution of a program (with e.g. print statements) without modifying the program itself. This is no coincidence. Gregor Kiczales, the author of a book about the MetaObject Protocol, which was inspired by Maes' paper, is a coauthor of the earliest paper on AOP. In a sense, implementations of AOP attempt to augment a language with a meta-object protocol without changing the language itself.
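A minimal version of that tracing use case in Python, wrapping a function without touching its source (the AOP-style "advice" is just a decorator):

```python
import functools

# Trace calls to a function without modifying the function itself:
# wrap it, log the call, delegate to the original.

def traced(fn, log):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.append((fn.__name__, args))
        return fn(*args, **kwargs)
    return wrapper

def add(a, b):
    return a + b

log = []
add = traced(add, log)   # rebinding stands in for AOP's weaving step
```

Real AOP systems automate the weaving across a whole codebase; the point here is only that the traced function's source never changes.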

Thursday, November 12, 2009

A Little History of Electronic SEC Filings

In 1986, the Securities and Exchange Commission started to accept SEC filings — 10-K forms and the like — electronically. Between '86 and '92, only a handful of companies filed their 10-K electronically. The companies?

  • Medical Monitors, Inc.
  • Medical Monitors, Inc.
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Jilco Industries, Inc.
  • Whitney American Corp.
  • Filmagic Entertainment Corp.
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1989-2
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1989-3
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1989-5
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Jilco Industries, Inc.
  • Filmagic Entertainment Corp.
  • Xanthic Enterprises, Inc.
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1988-1
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1988-2
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Jilco Industries, Inc.
  • Filmagic Entertainment Corp.
  • Xanthic Enterprises, Inc.
  • Admiral Financial Corp.
  • Quad Metals Corp.
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1988-4
  • Medical Monitors, Inc.
  • Fast Eddie Racing Stables, Inc.
  • Jilco Industries, Inc.
  • Filmagic Entertainment Corp.
  • Xanthic Enterprises, Inc.
  • Admiral Financial Corp.
  • Quad Metals Corp.
  • American Housing Partners
  • First Boston Mortgage Sec. Corp. Con Mor Pas Thr Cer CR 1992-3
While I get the obvious kick out of Fast Eddie Racing Stables, Inc. — that it's publicly traded and was an early comer to electronic SEC filing — my hunch is that these early companies are there not necessarily because of who runs them (although it's certainly possible that Medical Monitors, Inc. has a techno-savvy CEO) but more likely because '86-'92 was a period of controlled introduction. In the following years, the number of electronically-filed 10-Ks was:
Year   # of filings
1993           1305
1994           1249
1995           3460
1996           6482
I suspect that the SEC opened the flood gates only partially in 93-94, a bit more in 95, and completely in 96. On the other hand, this was the era during which the Internet was really taking off, so it's hard to distinguish between what the SEC mandated and what it allowed.

If you're wondering what those First Boston Mortgage things are, they're probably asset-backed securities, which are by far the largest class of securities regulated by the SEC. Commercial state banks are up there, too. But it's interesting that the largest group of securities aren't even operating companies — they're just a publicly-traded piece of paper that represents a share of ownership in some asset.

Sunday, November 08, 2009

Modeling the World Wide Web

Just so I don't forget, some papers by Filippo Menczer, who appears to be doing work related to an idea I've been mulling over for a while.

Informally, consider the World Wide Web as a graph with web pages as nodes and hyperlinks as edges, label each node with some value derived from the contents of the web page (e.g. the length of the page, the set of terms in the page, or the term vector for the page), then define the value of an edge as the difference between the nodes it connects. This basically yields a geometric representation of the web graph by mapping it into some metric space. (If the idea is still fuzzy, imagine a hyperlink as a function from the document that contains it to the document to which it points.)
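A toy version of the idea, labeling pages with term vectors and weighting each hyperlink by the cosine distance between its endpoints (all names here are invented; real term vectors would use tf-idf weights):

```python
from math import sqrt

# Map the web graph into a metric space: each page is a term-vector
# dict, and each hyperlink gets the cosine distance between the pages
# it connects (0 = identical content, 1 = no terms in common).

def cosine_distance(u, v):
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in terms)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return 1 - dot / (nu * nv)

def edge_weights(pages, links):
    """pages: {url: term-vector dict}; links: (src, dst) pairs."""
    return {(s, d): cosine_distance(pages[s], pages[d]) for s, d in links}
```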

Now that I've found some already-published work on this model, I'll have to spend next semester learning what people have already done so I can do something new.

Friday, November 06, 2009

Independent Study: Traits: Composable Units of Behaviour

(This is one of a series of posts about papers I'm reading for an independent study with Prof. Evan Chang at the University of Colorado, Boulder. The format is similar to that of a review of a paper submitted to a computer science conference. These are already-published papers, so I'll be writing with the benefit of hindsight, especially when a paper was published several years ago.)

Submission: Traits: Composable Units of Behaviour [PDF]

Please give a brief, 2-3 sentence summary of the main ideas of this paper:

If you start with the reasonable assumption that code reuse improves programmer productivity, an important question is how to increase code reuse. Historically, inheritance — both single and multiple — has been a mechanism for code reuse, in as much as it has allowed classes to be composed at least in part from other classes. Mixins solve some of the problems of single and multiple inheritance, but they don't work well when a class uses two or more mixins containing identically-named methods. Traits solve the same problems as mixins, but don't suffer from their limitations.
What is the strength of this paper (1-3 sentences):
Traits solve the problems of previous attempts to facilitate code reuse. Further, they suggest a style of designing software — as collections of traits (which ideally define only a single method) instead of collections of classes — that may be useful in its own right.
What is the weakness of this paper (1-3 sentences):
None noted.
Novelty:
The authors note that traits are inspired by mixins, and in one (arguably incorrect) sense, traits are merely an incremental improvement on mixins; however, the deficiencies the authors identify in mixins are real (and especially important for large, complex applications) and require solving, so the fact that traits aren't light years ahead of mixins is irrelevant, as the improvements traits provide are necessary.
Convincing:
Yes. The refactoring example helps the authors make a strong case for traits.
Worth solving:
See my response about novelty above.
Confidence:
I'm confident in the material in this paper.
Detailed comments:

The notion that code reuse improves programmer productivity is non-controversial. A library of well-tested and widely-used classes in a language's ecosystem provides not only implementations of common functionality (thereby relieving programmers of having to implement and test that functionality themselves) but also (because classes and methods are named) a common language for communication among programmers — both of which reduce "friction" during software development.

A problem for API designers is that code reuse in most common object-oriented languages is done at the level of the class definition — a client class T of some API class U reuses U either by extending U or by otherwise referring to U (e.g. statically, as the type of a local variable, as the type of a member variable). This can lead to an undesirable decoupling of generic functionality, making it hard for programmers to discover reusable code. For example, in Java, to sort an object of a class which implements java.util.List, one needs to know to use the sort method of the Collections class. Acquiring this knowledge isn't necessarily onerous for a programmer, but it likely leads novice programmers to implement their own sorting methods. A language with mixins or traits does not have this problem. In such a language, a class C that can be sorted uses the mixin or trait that provides a generic sorting routine (as long as C provides whatever methods the mixin or trait requires [e.g. something akin to the compareTo of Java's Comparable]), and discovering that C can be sorted is a matter of invoking an IDE's method autocomplete feature or, absent IDE support, a cursory examination of the definition of C. This is similar to the problem of jungloid navigation, where a programmer knows what kind of object she wants but doesn't know how to get it, except in this case the programmer knows what she wants to do with an object (i.e. sort a list) but doesn't know how to do it (i.e. call Collections.sort).
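A Python approximation of the sorting example (invented names; Python mixins lack traits' conflict resolution, but the discoverability point carries over):

```python
import functools

class SortableMixin:
    """Provides sorting generically for any class that exposes `items`
    and a `compare(a, b)` method -- the required method, playing the
    role of compareTo in Java's Comparable."""
    def sorted_items(self):
        return sorted(self.items, key=functools.cmp_to_key(self.compare))

class NumberBag(SortableMixin):
    def __init__(self, items):
        self.items = list(items)

    def compare(self, a, b):          # satisfies the mixin's requirement
        return (a > b) - (a < b)
```

Here `sorted_items` shows up directly on `NumberBag`, so a programmer (or an IDE's autocomplete) discovers the capability on the class itself rather than in a separate utility class like Collections.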

A little history on multiple inheritance, interfaces, and mixins. So far as I've been able to find, interfaces first appeared in Modula-2 (1978), where they're called definition modules. From the Modula-3 report (I wasn't able to find one for Modula-2): "An interface is a group of declarations. Declarations in interfaces are the same as in blocks, except that any variable initializations must be constant and procedure declarations must specify only the signature, not the body." Modula-3 (late 1980s), notably, supported only single inheritance. Mixins first appeared in an OO extension to LISP (called Flavors [circa 1980]). Flavors fed into the Common Lisp Object System, where the concept of a Meta-Object Protocol was first implemented. Perl's Moose is built atop Class::MOP, which was inspired by the CLOS MOP.

Sunday, November 01, 2009

It's Expensive Being Rich

When you open a web page in a browser, the browser loads the page and all other resources to which the page refers. Some of those resources are files containing JavaScript code, and those files keep getting larger. Some can be large enough to noticeably delay the complete loading of the web page, which is a real problem for web site operators. Their dilemma is that users demand features, and JavaScript is a way to provide lots of features, but users also demand that pages load quickly, and adding more and more JavaScript increases the time it takes a page to load. Without changing the underlying technology, it's akin to a zero-sum game (or a game of Whack-a-Mole, if you prefer). The two requirements — provide a rich end-user experience, provide it quickly — are to some degree at odds with one another.

Recently James Hamilton pointed out a cool research project which transformed JavaScript source files into a form that allows the source code to be loaded only when needed. Another approach to dealing with this problem — an approach that is complementary to the approach mentioned by Hamilton — is to modify the HTML script element to indicate whether the JavaScript source file needs to be loaded immediately or can be loaded "lazily." This approach exists in the script element section of the HTML 5 draft specification.
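Concretely, the script element in the HTML 5 draft has defer and async attributes for this; for example:

```html
<!-- Both fetch the script without blocking the parser: defer runs the
     script after parsing finishes (in document order), async runs it
     as soon as it arrives. -->
<script src="features.js" defer></script>
<script src="analytics.js" async></script>
```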

Meditations on Meat

Exhibit #1: "My childhood was typical, summers in Rangoon, luge lessons. In the spring we'd make meat helmets."

Exhibit #2: Turducken

Exhibit #3: Report: Meat Now America's No. 2 Condiment


Avoid the Lippis Report

My last job was in the telecommunications sector and I had an interest in keeping abreast of developments in the industry. At the time some people at work were subscribed to Nick Lippis's mailing list, so I subscribed too. Gotta know what's known by those in the know. Now I work in a different business, but I'm finding that the cost of unsubscribing from the list is torture (whether it's eternal torture, I don't know yet, as I've still not been able to unsubscribe).

First thing I tried was clicking the "unsubscribe" link at the bottom of his emails. All that did was load a URL. Here's a screenshot:

The next logical thing to try was the instructions right above the unsubscribe link, which read:

Login to your Lippis account at with username "ndronen," or delete your account by clicking the link below.
Since it's been quite a while since I subscribed, I didn't remember my password, so I opted to have the site change my password. A new password arrived shortly thereafter, but logging in with it yielded no joy. Another screenshot:

So I'm stuck — can't unsubscribe with the unsubscribe link, can't login and unsubscribe — so I can only resort to emailing the site directly. Thanks for wasting my time, Mr. Lippis.

Incidentally, the message "You do not have sufficient permissions to access this page" does not mean that I entered an invalid password. As you can see here, entering an invalid password causes an entirely different message.

Update (9:49 MST): It gets better. I sent email to the contact address listed in the "Lippis Report" email, and it bounced.

So now I'm sending email using the web form on the site.

Update (8:41 MST 11/02/2009): I've received a gracious email from Nick Lippis. They've removed me from the mailing list and are looking into the problem with the unsubscribe process.