Scripting with C++
by Steve Donovan, author of C++ by Example: "UnderC" Learning Edition
Mar 22, 2002
Over the last 10 years, people have been making a distinction between scripting and programming. Scripts are short programs written by people wanting to get something done without the bother of using "real" programming languages; they are often power users, not professional programmers. End-user customization is a catch phrase that describes the spreadsheet user writing a macro, an engineer writing a script to prepare data in a more useful format, and so on. This empowers users, and they will then be less bother to programmers. So modern packages often have a scripting back door, such as a COM interface that someone could access using VBA or Windows Scripting Host.
Of course, programming remains programming. The game is the same, whether the players are barefoot in the yard or under the stadium lights. Scripting is informal programming, and scripting languages are more informal than the usual languages such as C++ and Java, which are often called system programming languages. Scripting languages such as Python or JavaScript relax the usual demand that you specify the type of variables in advance; a variable's type can change, depending on what kind of data you put into it. Here is an example of the style:

    // test.js
    function out(s) { WScript.Echo(s) }
    function sum(x,y) { return x + y }
    v = 23.3
    out(sum(v,1.2))
    v = "hello dolly"
    out(sum(v," you're so fine"))

This is a short JScript program that runs under Windows Scripting Host (you can type cscript test.js at the command prompt to see it in action). The variable v is first a number and then a string. Sometimes these languages are called typeless, but it's more accurate to call them dynamically typed, or tolerantly typed.
As well as being informal, these languages are interpreted, not compiled. This is not an essential difference, really. Old compilers such as Turbo C, running on modern machines, can go through a compile-and-run cycle as fast as any interpreter. Also, there is a continuum between a true interpreter (which reads program text and performs actions immediately) and a true compiler (which generates native machine code). Both Python and MAWK generate an intermediate code ("pcode") that is later executed, but MAWK throws away this code after each run. Generally, such programs will be slower than optimized compiled code, but they start executing immediately, so testing small programs becomes faster. It used to be true that these languages were kept simple because that made them easier to interpret; since then, it has been recognized that the simplicity is for the humans, not the computer. Which is how it should be.
A more fundamental difference is that scripting languages can usually evaluate strings containing expressions in that language. The facility is usually called eval(), and it is tremendously powerful. The CINT/ROOT [1] scripting environment for C++ has plotting functions that expect strings containing C++ expressions.
For example, the following C++ statements typed at the ROOT interactive prompt plot the function y = sin(x)/x over the range 0 to 10.

    root [1] TF1 f1("f1","sin(x)/x",0,10);
    root [2] f1.Draw();

Tcl scripts can generate other Tcl scripts, which are then executed. In general, all Tcl expressions are strings, so code is data, as with LISP. Generally, the use of dynamic evaluation prevents scripting languages from being directly compiled as an executable.
Scripting languages operate at a higher level than system programming languages. AWK is typical: There are associative arrays and control statements to access their elements, and there are built-in regular-expression matching and powerful string-manipulation functions. Because these languages are more expressive than traditional programming languages, programmers find that they need to write less code for a particular task. The basic strategy in scripting is to sacrifice program efficiency for programmer efficiency. Interpreted programs generally run slower than compiled ones, but programmers can develop small programs more quickly. Dynamically typed languages are less efficient, but are easier to write. (In the JScript example, note that the function sum() can handle string and numerical parameters with equal ease.) As machines get faster and programmers more pressured, this becomes an attractive trade-off.
Working programmers often automate small tasks using scripting languages—they write shell scripts or use a specialized tool such as AWK. For me, nothing can beat AWK at processing data files organized in columns: The program '{ print $1,$4*$2 }' prints out the first column, followed by the product of the fourth and the second column. Most scripts are very robust; this AWK script can be applied to arbitrary data and will not break—it only fails to give sensible numbers. You have to go to some trouble to build an equivalent C++ program that is as fault-tolerant.
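To give a feeling for the trouble involved, here is a minimal sketch of a C++ counterpart to the AWK program above. The helper names (split_fields, field, to_number, first_and_product, process) are made up for this illustration, and the tolerance is obtained by assuming that a missing field is empty and that text with no leading number counts as zero.

```cpp
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split a line into whitespace-separated fields, as AWK does by default.
std::vector<std::string> split_fields(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string word;
    while (in >> word) fields.push_back(word);
    return fields;
}

// Return field n (1-based, like $n in AWK), or "" if the line is too short.
std::string field(const std::vector<std::string>& f, std::size_t n) {
    return (n >= 1 && n <= f.size()) ? f[n - 1] : std::string();
}

// Convert a field to a number; text with no leading number counts as 0.
double to_number(const std::string& s) {
    return std::strtod(s.c_str(), nullptr);
}

// Build the output for one input line: the first field, then $4*$2.
std::string first_and_product(const std::string& line) {
    std::vector<std::string> f = split_fields(line);
    std::ostringstream out;
    out << field(f, 1) << ' '
        << to_number(field(f, 4)) * to_number(field(f, 2));
    return out.str();
}

// Apply the transformation to every line of a stream.
void process(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) out << first_and_product(line) << '\n';
}
```

Calling process(std::cin, std::cout) from main() would turn this into a filter. Short lines and non-numerical columns do not stop it; they only produce a zero product, which is the AWK behaviour described above.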
The term scripting actually covers both specialized languages such as AWK (which is not extendable) and extendable glue languages. Glue languages are intended to combine components written in a lower-level language such as C. This is the "two languages" approach to systems integration popularized by John Ousterhout [2], the creator of Tcl/Tk. For example, the UNIX philosophy is that shell scripts are used to build applications out of small specialized programs. The glue doesn't need to be fast because it's assumed that the components encapsulate all serious processing. The consequences of using two languages are interesting.
There is a syntactical gap, which is often at the boundary between the component writers and the glue writers. Some like using two distinct languages because (a) they are always aware at what level they are operating, (b) the glue language is more forgiving, and (c) the semantic gap means that component interfaces are forced to be very straightforward. But why can't the same language be used in both roles?
There has been a surprising amount of scripting using C-like languages in recent years. For example, PHP is commonly used for server-side scripting, and JavaScript is the standard for client-side scripting. C itself is probably too low-level to make a good scripting language (although the EiC project believes otherwise). Typically, people take C syntax and add the features they always wanted in C (for instance, proper string types or garbage collection). The fact that these languages look like C does not make them trivial to learn for a C/C++ programmer. Most of the hassle of learning a language is getting a feeling for the semantics and the available libraries. You are left with yet another member of the C family with no proper compiler, so it must be extended with plain old C. My problem with this linguistic forking is that it produces more incompatible C dialects, which are usually too focused on doing a particular job.
C++ appears to be an unlikely choice for a scripting language because it is a classic strongly typed systems language. It has a complex syntax, subtle semantics, and a fearsome reputation; and it's slow to compile optimally. C++ interpreters make it possible to change and execute code quickly, but surely they don't change the nature of the beast? Yes, but the nature of the beast is adaptability. Unlike Java, which was designed to restrain programmers into making "correct" object-oriented design choices, C++ can adapt to any paradigm. Whether this is a good or bad thing I can't say; natural languages have the same problems. The C++ programmer is restrained by a freely chosen style, not by the language designer's prejudices. Or (to be more realistic) the project lead can enforce a style appropriate to the project, without being constrained by linguistic prescription. I believe this capability to adapt to the task makes C++ a very good glue language, with the advantage that there is no unbridgeable gap between the glue and component language.
C++ interpreters such as CINT and UnderC, which try to be close to the standard, run fast enough to be useful, especially because it is straightforward to call fully compiled code. CINT is more of a true interpreter, so large source files are loaded very quickly, but with the clever twist that any repeated execution causes the loop and any called functions to be translated into pcode. (This is, of course, the JIT (Just In Time) strategy used by the Java JVM, and one can further imagine a second-order JIT compiler that then translates the pcode into native machine code.) UnderC translates all the source into pcode, meaning that full type-checking takes place, which is important for its primarily educational purpose.
Here is a funny story that illustrates the usual C++ mindset confronted with scripting. A local Unix guru was working at a company in which a programmer needed to search for certain patterns at the end of a log file. The programmer took several days to do it in C++, and showed the guru, who said, "Why didn't you just pipe the output of tail through grep?" Apparently, the response was unprintable. This is classic script-thinking: build a new tool by connecting two simpler tools together. I found this amusing, and thought about whether C++ could emulate this style. Well, it can be done by suitable operator overloading; you can write programs that look like this, courtesy of C++ syntactical sugar. The pipe symbol, which we would like to represent by operator|, has unfortunately got the wrong associativity, so we have to use operator|| instead.

    #include <cshell.h>
    int main() {
        return tail() || grep("[a-z][1-9]*");
    }

At the time, I was amusing myself, but since then I have come to see it as an example of C++ being enriched by other programming styles. Why use this C++ style instead of the shell script equivalent? Such code can be interpreted just as quickly as the shell script, and the program can then be compiled and becomes a portable component that can be more easily deployed. Better still, this style can be embedded within more traditional C++ code.
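The contents of cshell.h are not shown here, so the following is only a sketch of how such an overloaded operator|| might be built, not the actual header. It assumes that a pipeline stage is a function from one list of lines to another; the Stage and Lines types, and the arguments taken by tail() and grep(), are inventions for this illustration.

```cpp
#include <cstddef>
#include <functional>
#include <regex>
#include <string>
#include <vector>

typedef std::vector<std::string> Lines;

// A stage is any function that turns one list of lines into another.
struct Stage {
    std::function<Lines(const Lines&)> run;
};

// Overloaded || composes two stages: the output of a feeds b.
// Unlike the built-in ||, an overloaded || never short-circuits.
Stage operator||(Stage a, Stage b) {
    return Stage{[a, b](const Lines& in) -> Lines {
        return b.run(a.run(in));
    }};
}

// Hypothetical tail: keep the last n lines.
Stage tail(std::size_t n) {
    return Stage{[n](const Lines& in) -> Lines {
        std::size_t skip = in.size() > n ? in.size() - n : 0;
        return Lines(in.begin() + static_cast<Lines::difference_type>(skip),
                     in.end());
    }};
}

// Hypothetical grep: keep the lines in which the pattern is found.
Stage grep(const std::string& pattern) {
    std::regex re(pattern);
    return Stage{[re](const Lines& in) -> Lines {
        Lines out;
        for (const std::string& s : in)
            if (std::regex_search(s, re)) out.push_back(s);
        return out;
    }};
}
```

With these definitions, (tail(3) || grep("error")).run(lines) keeps the last three lines and then only those that mention "error". A fuller version would read from a file or standard input and would need some way to start the pipeline, as the return statement in the listing above suggests.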
C++ remains a strongly typed language, but it is possible to define a tolerant type that freely converts between strings and numbers. You still have to declare variables, except all variables are of type var. This is equivalent to the JScript script shown earlier:

    // testvar.cpp
    #include <var.h>
    var sum(var x,var y) { return x + y; }
    int main() {
        var v = 23.3;
        cout << sum(v,1.2) << endl;
        v = "hello dolly";
        cout << sum(v," you're so fine");
        return 0;
    }

With dynamic typing, the meaning of operators changes according to the current content of their arguments. operator>> generally extracts values from a stream; by generalizing this, you can treat a string containing a filename as if it were a stream. If the argument were a list, then operator>> would extract the list items. Here is a program that adds all the numbers found in the files specified on the command line (vmain() would be called from a hidden main() function):

    #include <var.h>
    void vmain(var args) {
        var f,x,sum;
        while (args >> f)
            while (f >> x) sum += x;
        cout << sum << endl;
    }

This is a good deal easier than the usual equivalent, and it compares well to the version in other scripting languages. It can be interpreted, as well! My question here is this: Is this a bad C++ program? If so, why? This style would horrify most C++ programmers because they are convinced that strong typing is always the best option. Personally I think strong typing is better for large involved systems, but this dynamically typed style is appropriate for smaller programs. We sacrifice strong type-checking and some efficiency for a simple no-nonsense style that does the job. This style (like AWK) tolerates non-numerical data by regarding it as zero, so this program is much less brittle than the simplest standard equivalent. As with the tail-grep example, this code remains standard C++, and can be deployed as a standalone executable. Of course, this is about as far as you can go with dynamic typing in C++. Classes must at least have well-defined interfaces because C++ resolves member names at compile time; there is no run-time lookup of methods by name.
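The var.h header itself is not reproduced here. As a rough idea of what a tolerant type involves, the following sketch defines a cut-down var that holds either a number or a string and supports only operator+ and stream output; the member names and the conversion rules are assumptions made for this illustration, and the stream-like operator>> used above is left out.

```cpp
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical tolerant type: holds either a number or a string.
class var {
public:
    var() : is_num_(true), num_(0) {}
    var(double d) : is_num_(true), num_(d) {}
    var(int i) : is_num_(true), num_(i) {}
    var(const char* s) : is_num_(false), num_(0), str_(s) {}
    var(const std::string& s) : is_num_(false), num_(0), str_(s) {}

    bool is_number() const { return is_num_; }

    // Text form: numbers are formatted with the stream defaults.
    std::string str() const {
        if (!is_num_) return str_;
        std::ostringstream out;
        out << num_;
        return out.str();
    }

    // Numeric form: text with no leading number counts as 0.
    double num() const {
        return is_num_ ? num_ : std::strtod(str_.c_str(), nullptr);
    }

private:
    bool is_num_;
    double num_;
    std::string str_;
};

// + adds two numbers, and concatenates as soon as either side is a string.
var operator+(const var& a, const var& b) {
    if (a.is_number() && b.is_number()) return var(a.num() + b.num());
    return var(a.str() + b.str());
}

std::ostream& operator<<(std::ostream& os, const var& v) {
    return os << v.str();
}

// The same function as in the JScript example.
var sum(var x, var y) { return x + y; }
```

The implicit constructors are what let sum() accept 23.3, 1.2, and string literals without any casts. That convenience is exactly the type-checking that is being given up.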
Recently, people have been questioning the importance of strong static typing. The traditional position can be expressed like this: Would you rather the compiler or the customer caught type violations? The advocates of extreme programming, such as Robert C. Martin [3], would answer that it is your testing routines that should catch these violations. Strong typing (so the argument goes) has a cost, particularly the need for long build times for large systems because each part of the system needs intimate static knowledge of the rest. With extreme programming, all code is created with the appropriate unit tests that will catch badly behaved arguments. These tests are stronger than the formal guarantees of strong typing, so you may as well put your energy into strong testing, rather than strong typing.
My feeling is that strong typing is a form of intrinsic documentation. With C++ it is obvious (say) that a function is passed a list of type structures; with Python, you would have to make this into a comment. And then it would not be available to a source browser because comments are outside the language. In some applications, it can be useful to strengthen the type system even further (as C.A.R. Hoare has suggested), so that a routine expecting a length in meters can't be passed a length in inches. In any case, Bertrand Meyer points out that the long build times are not a necessary consequence of strong typing, but a solvable technical problem. In my article, "Conversational C++," I give one approach to solving that problem.
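The idea of strengthening the type system so that meters and inches cannot be confused can be sketched in a few lines of C++. This is only one simple way to do it; the Meters and Inches types and the two functions are invented for this illustration.

```cpp
#include <type_traits>

// Distinct wrapper types; explicit constructors stop a bare double
// from being passed where a length with a unit is expected.
struct Meters {
    explicit Meters(double v) : value(v) {}
    double value;
};

struct Inches {
    explicit Inches(double v) : value(v) {}
    double value;
};

// The unit conversion has to be requested by name.
Meters to_meters(Inches in) { return Meters(in.value * 0.0254); }

// A routine that insists on meters.
double area_of_square(Meters side) { return side.value * side.value; }

// Checked by the compiler: neither inches nor plain numbers
// convert silently to meters.
static_assert(!std::is_convertible<Inches, Meters>::value,
              "inches must not silently become meters");
static_assert(!std::is_convertible<double, Meters>::value,
              "a bare number must not silently become meters");
```

A call such as area_of_square(Inches(12.0)) is rejected at compile time, whereas area_of_square(to_meters(Inches(12.0))) is accepted. The unit appears in the function signature, which is the intrinsic documentation described above.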
There are problems with using C++ as a scripting language. For example, the Visualization Toolkit (VTK) is a powerful open-source data visualization library consisting of more than 300 classes [4]. Because it is slow to build C++ VTK applications, the VTK community has traditionally used Tcl/Tk to prototype applications. After the visualization pipeline is understood, it is then worthwhile to translate this into C++. Tcl is not object-oriented, and the semantic gap is particularly wide in this case. The VTK team had to build a custom parser for C++ class declarations in order to wrap up the VTK classes as a set of Tcl commands; objects are passed around as unique strings. UnderC can import the VTK headers and shared library directly, and even override imported virtual methods, although this is currently still in the experimental stage. However, it must parse all the C++ headers involved, which takes a few seconds. The actual program takes milliseconds to compile and run. This header problem is the major obstacle in using standard C++ as a fast batch mode prototyping tool. CINT has a tool called makecint (which can be thought of as an interpreter compiler) to build custom versions of the interpreter involving the imported classes. Another promising solution is shown in the "Conversational C++" article, in which I discuss the use of C++ in an interactive mode.
In this article, I discussed what is meant by scripting, in both its micro (small utilities) and macro (glue for system integration) meanings. C++ is sufficiently flexible and adaptable to serve in both roles, whether interpreted or compiled. The advantage of having the same language for both components and glue is that the boundaries between component and glue are not so fixed. As for the moral for C++ programmers, I would say that informal C++ has its place in the development cycle. We are still anxious when using the language (about both performance and style). Call it local optimization anxiety. But the best designs are optimized globally. Using C++ in a more informal way frees up its expressive power.
------------------------------------
1. The ROOT homepage is http://root.cern.ch/. There is a strong CINT C++ scripting community in high-energy physics. FermiLab has a ROOT tutorial at http://www-pat.fnal.gov/root/.
2. John Ousterhout, "Scripting: Higher Level Programming for the 21st Century." Originally appeared in IEEE Computer, March 1998. Available at http://home.pacbell.net/ouster/scripting.html.
3. Interview on ITWorld. There was an interesting follow-up by Bertrand Meyer: http://www.itworld.com/AppDev/1262/itw-0314-rcmappdevint/.
4. VTK: http://www.kitware.com