Ranged Integers and Saturation Semantics - By Robert C. Seacord

blog January 18th, 2007

Robert C. Seacord of CERT was kind enough to provide us with the following proposal for addressing integral vulnerabilities in future versions of the C standard. It’s an intriguing approach that we hope generates some good discussion. So, without further ado…

Ranged Integers and Saturation Semantics

By Robert C. Seacord

In 2005, I wrote that "Integers represent a growing and underestimated source of vulnerabilities in C and C++ programs" as the lead sentence in the "Integer Security" chapter of Secure Coding in C and C++. This prediction has been borne out in a recent study published by MITRE on the types of errors that lead to publicly reported vulnerabilities. "Integer overflows, barely in the top 10 overall in the past few years, are in the top 3 for OS vendor advisories."

Fortunately, some individual members of the ISO/IEC WG14 international standardization working group for the programming language C are starting to mull over some options for addressing integer issues. One idea is to adopt ranged integer types; a second idea is to implement saturation semantics for at least signed integer types.

Because most integer vulnerabilities are type range errors, type range checking—if properly applied—can eliminate the majority of integer vulnerabilities. Languages such as Pascal and Ada allow range restrictions to be applied to any scalar type to form subtypes. Ada, for example, allows range restrictions to be declared on derived types using the range keyword:

type day is new INTEGER range 1..31;

The range restrictions are then enforced by the language runtime. C and C++, on the other hand, are not nearly as good at enforcing type safety. The languages specify that unsigned integers must have modulo or wrap-around behavior and that signed integers can also behave in this fashion. Alternatively, signed integers can generate exceptions when they exceed the maximum or minimum possible representations for the corresponding type.

Ranged integers have two possible applications. The first is that they can be used as array indices. When matched with a like-sized array, they can guarantee that all fetches and stores of array elements are valid. A second application is to specify ranged integer types at the minimum and maximum ranges of the various signed and unsigned integer types. Using these ranged types will prevent integer overflow by implementing saturation semantics. Saturation semantics are defined in the ISO/IEC TR 18037 "Extensions for the programming language C to support embedded processors."  Saturation semantics are most easily described mathematically. Assume that the mathematical result of the computation is result. The value actually returned to the user is set out in the following table:

 Range of mathematical result   Result returned 
MAX < result MAX
MIN <= result <= MAX result
result < MIN MIN

When you first suggest the notion of ranged integer types, you normally get the reaction "My God, man, have you thought of the performance overhead!" I have, and my feeling is that it may not be that bad. Increasingly, systems are being developed to check for array bounds violations in an attempt to eliminate the dreaded buffer overflow from C and C++. These bounds checking libraries can often incur 200-300% overhead. Many of these runtime checks can be eliminated by using advanced static analysis techniques, but this is not possible in all cases. Take the following code, for example:

int array[10];
int i = atoi(argv[1]);
array[i] = 99; 

Clearly, the value of i can be anything and must be checked before the memory location at array[i] can be safely written to. That would not be the case, however, if the value of i could be guaranteed to be within a range, as in the following example:

int array[10];
int 0..9 i = atoi(argv[1]);
array[i] = 99; 

It is no longer necessary to check that i is a valid index for array, as this is guaranteed by the range specification 0..9. This range specification indicates that the integer i can only assume values between 0 and 9 inclusive. This eliminates the need for bounds checking when dereferencing array because only valid subscripts can be used.

The atoi() function, on the other hand, is clearly capable of returning any value in the range of int, including positive and negative values. So how would this initialization clause handle out-of-range values?

One possibility is that this initialization would result in an error condition. The C programming language, however, lacks a general error handling mechanism that could be used to handle to this error. Another possibility is to implement either modwrap or saturation semantics.

Modwrap semantics is where the integer values "wrap round" (also called modulo arithmetic). That is, adding one to MAX produces MIN. This will always produce a value in the specified range with the effect that as the return value from atoi() increases, i will iteratively range from 0-9.

In this example, saturation semantics simply mean that if the return value from atoi() is greater than the upper bound (9), the upper bound is used, and if the return value is less than the lower bound (0), the lower bound is used.

In many applications, it would be more sensible to use saturation semantics rather than modwrap semantics. For example, in the computation of a size (using unsigned integers), it is often better for the size to stay at the maximum value in the event of overflow rather than suddenly becoming a very small value. For these reasons, it makes sense to implement saturation semantics for ranged integer types.

Similar concepts to ranged integers already exist in the C standard.   For example, bit fields allow small integer sizes consisting of some number of bits as shown below:

struct {
    unsigned int a: 8;
} bits = {255};

However, the syntax for declaring bit fields is somewhat confusing. In this example, is a an eight-bit integer or a 32-bit unsigned integer?  The answer to this question determines how the integer promotion rules are applied to a when it is used as an rvalue.

The syntax for integer ranges seeks to minimize these problems. In the following declaration,

int 0..9 i = atoi(argv[1]);

the int type only signifies that i is an integer type-it does not signify the length of the type or the signedness. All the information describing the length and signedness of the type is provided by the range specification. The actual type used to store the integer should be the smallest type that is portably guaranteed to maintain the range of values. Unsigned types should be used when the range is non-negative. In this example, i should be represented as an unsigned char value, as this type can adequately represent the range of values, and the type should be treated as such when used as an rvalue.

Ranged integers require a change in existing C language programming idioms. Currently, it is quite common to write a simple loop using the following form:

int array[10];
for (i = 0; i <= 9; ++i) {
    array[i] = f(i);
}

The value of i that causes the loop to terminate is 1 greater than the upper bound for the array. The type of the loop variable is assumed to be able to represent this one-too-far value. If i were declared with a range of 0..9 in this example, it would result in an infinite loop. Instead, ranged integer types serve a specific purpose. In the following example, the ranged integer ai is only used as an index for array[]:

int i;
int array[10];
int 0..9 ai = 0;

for (i = 0; i <= 9; ++i) {    array[ai] = f(i);    ++ai; }

The ranged integer paradigm requires a separate integer declaration for each array (with a different bound). Even in this example, the ai index is unconditionally incremented following the array reference, meaning that saturation semantics need to be invoked to keep this value within range. Alternatively, this loop could be written as follows:

for (i = 0; i <= 9; ++i) {
    ai = i;
    array[ai] = f(i);
}

There are some interesting questions for ranged integer types. For example, should we allow the range to be established dynamically at runtime?  This would certainly be necessary to support the definition of indices for variable length arrays. Maybe in this case we should say that ranged integers do not promote and must be cast explicitly.

There are also issues concerning the use of ranged integers and function calls. Each ranged integer declaration is, in effect, the definition of a new type with different semantics. This means that you cannot write a general function using ranged integers that takes an array (that is, a pointer to the first element in the array) and a size. If the function was implemented to use ranged integers, it would be syntactically and semantically different for each array of a particular size and ranged integer type. In C++, this could potentially lead to an explosion in the number of functions that would need to be instantiated.

A second, less invasive modification to the C programming language is to implement saturation semantics for, at least, signed integers. Implementing saturation semantics for signed integers does not violate the current C99 specification, as signed integer overflow is already undefined behavior. Implementing saturation semantics for unsigned integers would require modification of the standard that stipulates that unsigned arithmetic exhibits modulo behavior.

Saturation semantics could also be introduced using an additional keyword. ISO/IEC TR 18037 introduces the _Sat keyword, which can be used to declare signed or unsigned integers with saturation semantics, as in the following example:

_Sat unsigned integer sui = 1;

As long as the language provides saturation at INT_MIN to INT_MAX, you can use ordinary program logic (macros, for example) to saturate reliably to a smaller range.

Similar to ranged integers, this approach does not really help legacy code, which will need to be modified to exhibit the new behavior.

So what do you think?

2 Responses to “Ranged Integers and Saturation Semantics - By Robert C. Seacord”

  1. Liudvikas Bukyson 19 Jan 2007 at 3:22 pm

    Both modwrap and saturation semantics would inspire a whole new generation of “unexpected arithmetic results” vulnerabilities.

    Exception on overflow would be painful but perhaps might be justified as a way to force coders to really understand what code does under any circumstance. It’s painful because there are a number of common expressions where “the right thing” happens despite overflows in internediate results, and it would take a while for people to understand what’s OK in an exception on overflow world.

  2. jmon 22 Jan 2007 at 10:42 am

    Things have been hectic this past week, so sorry I didn’t get a chance to comment earlier. Hopefully, we can get some discussion going, and we’ll follow up with a blog post later on.

    So, there are a few general kinds of things that one would like to address: integer overflows, integer underflows, and the various type conversion related gotchas: signed/unsigned value changing conversions, sign-extension related flaws, and truncation. I might come up with a quick set of examples of integral vulnerabilities to help frame the discussion. One thing we observed when writing the book is that many integer-related vulnerabilities can be analyzed in terms of three components, which we denoted ACC: allocation, check, and copy (or more generally, write access). Generally, when arithmetic expressions can be manipulated such that any of these processes get subverted, or desynchronized from each other, then you end up with a situation where you can write outside of the intended memory boundaries. Naturally, not all integer related vulnerabilities can be conceptualized in this way, but it works pretty well for the mainstream case. Anyway, it would probably be helpful to sketch out a handful of insecure idioms in order to help think about how language modifications could address them. I’ll try to do this later on.

    Thinking out loud, there are probably four goals that modifications to the language could aspire to:

    1. Make it easy for someone concerned with integral flaws to write bulletproof code. e.g. ok, I need to take a size from a packet, do some math on it, allocated some memory, and then parse and populate my data structures. If there are some simple features or code idioms that will prevent me from shooting myself in the foot, I’ll definitely employ them.

    2. Make it easy for someone to modify existing code so that it can be made safe with a minimum of effort. e.g. I have some complicated code here to read in several bitmaps and font definitions from a file, and I can tell it’s going to be vulnerable at least fifteen ways to Sunday with all the multiplications and allocations going on. Can I make a couple of changes to variable types or something equally minor that will just make it safe?

    3. Make existing code safe without modification to that code. Can I set some compiler flag or pragma, or can we change C/C++ so that vulnerable code can be recompiled so that it is safe?

    4. Change the language and/or make simple idioms possible that would herd developers in general towards writing secure code just by using natural constructs. In other words, make changes that non-security aware developers would want to use. Maybe something like bounds-checked arrays?

    Right off the bat, I think #3 would probably be impossible, and even if it was, such changes wouldn’t ever be approved because they’d destroy more code than they would be fix. I think the set of developers that actually understand integral behavior in C is arguably vastly smaller than the set of developers that *think* they understand integral behavior in C. So, I’d argue that #1 and #2 are currently hard in C, for a coder that hasn’t written a compiler or sat down with the standards (or a good book ;>). If the modifications to the language for #1 and #2 were well-done, they might ultimately become idiomatic, realizing #4.

    Anyway, enough rambling. The idea of applying saturation semantics is IMHO quite insightful, as I think it would address many types of integer vulnerabilities. Here are some thoughts:

    The idea of encoding range into type is interesting, but, I do see your point about it ballooning the number of definable types in the C language very quickly. The inability to write a general purpose function sort of reminds me of that one classic article on why Pascal sucks - array sizes are part of the type, thus you can’t write a general purpose function that takes an array and a number of elements.

    The way that current loop idioms would fail with ranged integers is an interesting observation. It almost seems like the use of ranged integers in this context is really shadow-implementing bounds-checked arrays. I obviously haven’t thought this through very much, but if you did have bounds checked pointers, you could use the old idioms. E.g. instead of turning a[b] into (*((a)+(b))), the compiler could apply saturation semantics to b as an offset of a. This would very much change how things probably work, but it’s another possible idea. Not sure how much it would demolish existing compilers or how much of it would be optimizable with static analysis.

    Globally applying saturation semantics applied to signed integers might break existing code. Obviously existing code shouldn’t rely on implementation-defined code, but one thing that springs to mind is the TCP sequence number code. It uses an idiom like:
    #define SEQ_LT(a,b) ((int)((a)-(b)) < 0)
    Haven’t thought it through, but that would probably be messed up.

    Having support for a _Sat keyword would probably kill a lot of exploitable issues, even if it was just bounded to the normal integral boundaries. I kinda like this, but I need to think through it more.

    Liudy’s point is well taken - modifications would likely introduce new subtle semantics and gotchas. I guess it comes down to what you’re hoping to accomplish, but if the newer semantics are more deterministic and straightforward than the older ones, it might be worth doing.

    Unsigned arithmetic is closed under modulus, but maybe it would be useful to have the ability to have overflow and underflow cause an implementation-defined result, including an exception. The machine support would probably be sketchy (or maybe not, most arithmetic isn’t sign-sensitive so it might just require duplicating flags checking code in the compiler). The problem with exceptions is that they are necessarily run-time and not compile-time, and the issues with intermediate results that Liudy mentioned. Definitely an interesting idea, though.

    Anyway, lots to think about. I usually think about how to break things, not fix them. :>

Permanent Link | Trackback URI | Comments RSS

Leave a Reply