2    Lexical Conventions

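This chapter describes lexical conventions associated with the following items:

Blank and tab characters (Section 2.1)
Comments (Section 2.2)
Identifiers (Section 2.3)
Constants (Section 2.4)
Multiple lines per physical line (Section 2.5)
Statements (Section 2.6)
Expressions (Section 2.7)
Address formats (Section 2.8)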


2.1    Blank and Tab Characters

You can use blank and tab characters anywhere between operators, identifiers, and constants. Adjacent identifiers or constants that are not otherwise separated must be separated by a blank or tab.

These characters can also be used within character constants; however, they are not allowed within operators and identifiers.


2.2    Comments

The number sign character (#) introduces a comment. Comments that start with a number sign extend through the end of the line on which they appear. You can also use C language notation (/*...*/) to delimit comments.

Do not start a comment with a number sign in column one; the assembler uses cpp (the C language preprocessor) to preprocess assembler code and cpp interprets number signs in the first column as preprocessor directives.


2.3    Identifiers

An identifier consists of a case-sensitive sequence of alphanumeric characters (A-Z, a-z, 0-9) and the following special characters:

Identifiers can be up to 31 characters long, and the first character cannot be numeric (0-9).

If an undefined identifier is referenced, the assembler assumes that the identifier is an external symbol. The assembler treats the identifier like a name specified by a .globl directive (see Chapter 5).

If the identifier is defined to the assembler and the identifier has not been specified as global, the assembler assumes that the identifier is a local symbol.


2.4    Constants

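The assembler supports the following constants:

Scalar constants (Section 2.4.1)
Floating-point constants (Section 2.4.2)
String constants (Section 2.4.3)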


2.4.1    Scalar Constants

The assembler interprets all scalar constants as twos complement numbers. A scalar constant can contain any of the digits 0123456789abcdefABCDEF; which digits are valid depends on the base of the constant.

Scalar constants can be decimal, hexadecimal, or octal constants:



2.4.2    Floating-Point Constants

Floating-point constants can appear only in floating-point directives (see Chapter 5) and in the floating-point load immediate instructions (see Section 4.2). Floating-point constants have the following format:

±d1[.d2][e|E±d3]

d1
is written as a decimal integer and denotes the integral part of the floating-point value.

d2
is written as a decimal integer and denotes the fractional part of the floating-point value.

d3
is written as a decimal integer and denotes a power of 10.

The "+" symbol (plus sign) is optional.

For example, the number .02173 can be represented as follows:

21.73E-3
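
As a rough illustration of the format above, the following Python sketch checks a string against ±d1[.d2][e|E±d3] and converts it. The regular expression and the name parse_decimal_float are illustrative assumptions, not part of the assembler. The sketch treats both signs as optional and rejects any string that lacks the integral part d1; whether the assembler itself rejects such input is not stated here.

```python
import re

# Documented form: [sign] d1 [.d2] [e|E [sign] d3]
DECIMAL_FLOAT = re.compile(r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?")


def parse_decimal_float(text):
    """Return the value of text if it matches the documented format."""
    if DECIMAL_FLOAT.fullmatch(text) is None:
        raise ValueError(f"not in the form d1[.d2][E d3]: {text!r}")
    return float(text)


# 21.73E-3 and 0.02173 denote the same value.
print(parse_decimal_float("21.73E-3") == 0.02173)
```

Under this reading, the string .02173 does not match the format because it has no integral part, which is one reason to write the value as 21.73E-3 or 0.02173.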

The floating-point directives, such as .float and .double, may optionally use hexadecimal floating-point constants instead of decimal constants. A hexadecimal floating-point constant consists of the following elements:

[+|-]0x[1|0].<hex-digits>h0x<hex-digits>

The assembler places the first set of hexadecimal digits (excluding the 0 or 1 preceding the radix point) in the mantissa field of the floating-point format without attempting to normalize it, and stores the second set of hexadecimal digits in the exponent field without biasing them. If the mantissa appears to be denormalized, the assembler checks whether the exponent is appropriate. Hexadecimal floating-point constants are useful for generating IEEE special symbols and for writing hardware diagnostics.

For example, either of the following directives generates the single-precision number 1.0:

.float 1.0e+0
.float 0x1.0h0x7f
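
To see why 0x1.0h0x7f yields 1.0, the following Python sketch packs the two digit groups into an IEEE single-precision bit pattern. It is a minimal model, not the assembler's implementation: the name hex_float_to_single is hypothetical, the sketch assumes that the fraction digits are left-aligned in the 23-bit mantissa field, and it ignores the leading 0 or 1 and performs none of the denormalization checks.

```python
import re
import struct

# Documented form: [+|-]0x[1|0].<hex-digits>h0x<hex-digits>
HEX_FLOAT = re.compile(
    r"([+-]?)0[xX]([01])\.([0-9a-fA-F]+)[hH]0[xX]([0-9a-fA-F]+)")


def hex_float_to_single(text):
    """Model the single-precision value of a hexadecimal constant."""
    m = HEX_FLOAT.fullmatch(text)
    if m is None:
        raise ValueError(f"not a hexadecimal floating-point constant: {text!r}")
    sign, _leading, fraction, exponent = m.groups()
    # Assumption: fraction digits fill the 23-bit field from the left;
    # digits beyond 23 bits are truncated.
    mantissa = (int(fraction, 16) << 23) >> (4 * len(fraction))
    exp_field = int(exponent, 16)          # stored as written, no bias added
    if exp_field > 0xFF:
        raise ValueError("exponent does not fit in 8 bits")
    bits = ((sign == "-") << 31) | (exp_field << 23) | mantissa
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(hex_float_to_single("0x1.0h0x7f"))   # exponent field 0x7f, mantissa 0
```

Because the exponent is stored without biasing, 0x7f (the single-precision bias) corresponds to a true exponent of zero. An exponent field of 0xff with a zero mantissa produces an infinity, which illustrates how these constants can generate IEEE special values.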

The assembler uses normal (nearest) rounding mode to convert floating-point constants.


2.4.3    String Constants

All characters except the newline character are allowed in string constants. String constants begin and end with double quotation marks (").

The assembler observes most of the backslash conventions used by the C language. Table 2-1 shows the assembler's backslash conventions.

Table 2-1: Backslash Conventions

Convention   Meaning
\a           Alert (0x07)
\b           Backspace (0x08)
\f           Form feed (0x0c)
\n           Newline (0x0a)
\r           Carriage return (0x0d)
\t           Horizontal tab (0x09)
\v           Vertical tab (0x0b)
\\           Backslash (0x5c)
\"           Quotation mark (0x22)
\'           Single quote (0x27)
\nnn         Character whose octal value is nnn (where n is 0-7)
\Xnn         Character whose hexadecimal value is nn (where n is 0-9, a-f, or A-F)


Deviations from C conventions are as follows:

For octal notation, the backslash conventions require three characters when the next character could be confused with the octal number.

For hexadecimal notation, the backslash conventions require two characters when the next character could be confused with the hexadecimal number. Insert a 0 (zero) as the first character of the single-character hexadecimal number when this condition occurs.
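
The effect of these rules can be sketched with a small Python decoder. It is a model under stated assumptions rather than the assembler's code: the name decode_string is hypothetical, the input is the ASCII text between the quotation marks, an octal escape is assumed to consume at most three digits, and a \X escape at most two.

```python
SIMPLE = {"a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
          "t": 0x09, "v": 0x0B, "\\": 0x5C, '"': 0x22, "'": 0x27}
OCTAL_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def take_digits(body, start, allowed, limit):
    """Collect up to limit characters of body, from start, that are in allowed."""
    end = start
    while end < len(body) and end - start < limit and body[end] in allowed:
        end += 1
    return body[start:end], end


def decode_string(body):
    """Decode the text between the quotation marks into bytes."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch != "\\":
            out.append(ord(ch))            # assumes ASCII input
            continue
        if i >= len(body):
            raise ValueError("backslash at end of string")
        esc = body[i]
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
            i += 1
        elif esc in OCTAL_DIGITS:
            digits, i = take_digits(body, i, OCTAL_DIGITS, 3)
            out.append(int(digits, 8) & 0xFF)
        elif esc == "X":
            digits, i = take_digits(body, i + 1, HEX_DIGITS, 2)
            if not digits:
                raise ValueError("\\X needs at least one hexadecimal digit")
            out.append(int(digits, 16))
        else:
            raise ValueError(f"unknown escape: \\{esc}")
    return bytes(out)


# An alert followed by the digit 7 needs all three octal digits (\007),
# and a newline followed by the letter b needs both hex digits (\X0a).
print(decode_string(r"\0077"), decode_string(r"\X0ab"))
```

Writing \77 instead of \0077 would be read as the single character with octal value 77, which is the confusion that the three-character rule avoids.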


2.5    Multiple Lines Per Physical Line

You can include multiple statements on the same line by separating the statements with semicolons. Note, however, that the assembler does not recognize semicolons as separators when they follow comment symbols (# or /*).
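
The following Python sketch models how a physical line might be divided into statements under this rule. It is an approximation with a hypothetical name, split_statements: it stops at a number sign, skips a /*...*/ comment that closes on the same line, and also assumes that semicolons inside string constants are not separators. The mnemonics in the example are placeholders.

```python
def split_statements(line):
    """Split one physical line into statements at top-level semicolons."""
    statements = []
    current = []
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string:
            current.append(ch)
            if ch == "\\" and i + 1 < len(line):
                current.append(line[i + 1])   # keep the escaped character
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            current.append(ch)
        elif ch == "#":
            break                             # rest of the line is a comment
        elif line.startswith("/*", i):
            end = line.find("*/", i + 2)
            if end < 0:
                break                         # comment runs past this line
            current.append(" ")
            i = end + 1                       # position of the closing '/'
        elif ch == ";":
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    statements.append("".join(current).strip())
    return statements


print(split_statements("addq $1, $2, $3; subq $4, $5, $6 # old; code"))
```

In the example, the semicolon after "old" follows the comment symbol, so the line yields two statements rather than three.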


2.6    Statements

The assembler supports the following types of statements:

Null statements (Section 2.6.2)
Keyword statements (Section 2.6.3)

Each keyword statement can include an optional label, an operation code (mnemonic or directive), and zero or more operands (with an optional comment following the last operand on the statement):

[ label: ] opcode operand [ ; opcode operand; ... ] [ # comment ]

Some keyword statements also support relocation operands (see Section 2.6.4).


2.6.1    Labels

Labels can consist of label definitions or numeric values.


2.6.2    Null Statements

A null statement is an empty statement that the assembler ignores. Null statements can have label definitions. For example, the following line has three null statements in it:

label: ; ;


2.6.3    Keyword Statements

A keyword statement contains a predefined keyword. The syntax for the rest of the statement depends on the keyword. Keywords are either assembler instructions (mnemonics) or directives.

Assembler instructions in the main instruction set and the floating-point instruction set are described in Chapter 3 and Chapter 4, respectively. Assembler directives are described in Chapter 5.


2.6.4    Relocation Operands

Relocation operands are generally useful in only two situations:

Some macro instructions (for example, ldgp) require special coordination between the machine-code instructions and the relocation sequences given to the linker. By using the macro instructions, the assembler programmer relies on the assembler to generate the appropriate relocation sequences.

In some instances, the use of macro instructions may be undesirable. For example, a compiler that supports the generation of assembly language files may not want to defer instruction scheduling to the assembler. Such a compiler will want to schedule some or all of the machine-code instructions. To do this, the compiler must have a mechanism for emitting an object file's relocation sequences without using macro instructions. The mechanism for establishing these sequences is the relocation operand.

A relocation operand can be placed after the normal operand on an assembly language statement:

opcode operand relocation_operand

The syntax of the relocation_operand is as follows:

!relocation_type! sequence_number

relocation_type
Any one of the following relocation types can be specified:

literal
lituse_base
lituse_bytoff
lituse_jsr
gpdisp
gprelhigh
gprellow

The relocation types must be enclosed within a pair of exclamation points (!) and are not case sensitive. See Table 7-11 for descriptions of the different types of relocation operations.

sequence_number
The sequence number is a numeric constant with a value range of 1 to 2147483647. The constant can be base 8, 10, or 16. Bases other than 10 require a prefix (see Section 2.4.1).
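
The syntax rules for a relocation operand can be summarized in a short Python sketch. The function names are hypothetical, and the prefix handling is an assumption: the sketch uses C-style prefixes (0x for base 16, a leading 0 for base 8), which should be checked against Section 2.4.1.

```python
import re

RELOCATION_TYPES = {"literal", "lituse_base", "lituse_bytoff", "lituse_jsr",
                    "gpdisp", "gprelhigh", "gprellow"}
OPERAND = re.compile(r"!(\w+)!\s*(\w+)")


def parse_number(text):
    """Parse a base 8, 10, or 16 constant (C-style prefixes assumed)."""
    if text[:2].lower() == "0x":
        return int(text[2:], 16)
    if len(text) > 1 and text[0] == "0":
        return int(text[1:], 8)
    return int(text, 10)


def parse_relocation_operand(text):
    """Return (relocation_type, sequence_number) or raise ValueError."""
    m = OPERAND.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"malformed relocation operand: {text!r}")
    relocation_type = m.group(1).lower()      # types are not case sensitive
    if relocation_type not in RELOCATION_TYPES:
        raise ValueError(f"unknown relocation type: {m.group(1)!r}")
    sequence_number = parse_number(m.group(2))
    if not 1 <= sequence_number <= 2147483647:
        raise ValueError("sequence number out of range")
    return relocation_type, sequence_number


print(parse_relocation_operand("!LITUSE_BASE! 0x10"))
```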

The following examples contain relocation operands in the source code:


2.7    Expressions

An expression is a sequence of symbols that represents a value. Each expression and its result have data types. The assembler does arithmetic in twos complement integers with 64 bits of precision. Expressions follow precedence rules and consist of the following elements:

You can also use a single-character string in place of an integer within an expression. For example, the following two pairs of statements are equivalent:

.byte "a" ; .word "a"+0x19
.byte 0x61 ; .word 0x7a
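
The equivalence follows from the character code of "a". A short Python check of the arithmetic, using a hypothetical helper char_value that assumes ASCII and handles no backslash conventions:

```python
def char_value(text):
    """Return the integer value of a single-character string such as '"a"'."""
    if len(text) != 3 or text[0] != '"' or text[-1] != '"':
        raise ValueError(f"not a single-character string: {text!r}")
    return ord(text[1])


byte_operand = char_value('"a"')            # value used by .byte "a"
word_operand = char_value('"a"') + 0x19     # value used by .word "a"+0x19
print(hex(byte_operand), hex(word_operand)) # prints 0x61 0x7a
```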


2.7.1    Expression Operators

The assembler supports the operators shown in Table 2-2.

Table 2-2: Expression Operators

Operator Meaning
+ Addition
- Subtraction
* Multiplication
/ Division
% Remainder
<< Shift left
>> Shift right (sign is not extended)
^ Bitwise EXCLUSIVE OR
& Bitwise AND
| Bitwise OR
- Minus (unary)
+ Identity (unary)
~ Complement


2.7.2    Expression Operator Precedence Rules

For the order of operator evaluation within expressions, you can rely on the precedence rules or you can group expressions with parentheses. Unless parentheses enforce precedence, the assembler evaluates all operators of the same precedence strictly from left to right. Because parentheses also designate index registers, ambiguity can arise from parentheses in expressions. To resolve this ambiguity, put a unary + in front of parentheses in expressions.

The assembler has three precedence levels. The following table lists the precedence rules from lowest to highest:

Table 2-3: Operator Precedence

Precedence                          Operators
Least binding, lowest precedence    Binary +, -
                                    Binary *, /, %, <<, >>, ^, &, |
Most binding, highest precedence    Unary -, +, ~

Note

The assembler's precedence scheme differs from that of the C language.
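
To make the difference concrete, the following Python sketch evaluates expressions with the three precedence levels of Table 2-3, left to right within a level, in 64-bit twos complement arithmetic. It is a model with assumptions: the name evaluate is hypothetical, only decimal and 0x-prefixed constants are accepted, and division is assumed to truncate toward zero.

```python
import re

MASK = (1 << 64) - 1
TOKEN = re.compile(r"\s*(0[xX][0-9a-fA-F]+|\d+|<<|>>|[-+*/%^&|~()])")


def wrap(value):
    """Reduce value to a signed 64-bit twos complement integer."""
    value &= MASK
    return value - (1 << 64) if value >> 63 else value


def trunc_div(a, b):
    """Integer division truncating toward zero (an assumption)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


MIDDLE = {
    "*": lambda a, b: a * b,
    "/": trunc_div,
    "%": lambda a, b: a - b * trunc_div(a, b),
    "<<": lambda a, b: a << b,
    ">>": lambda a, b: (a & MASK) >> b,    # sign is not extended
    "^": lambda a, b: a ^ b,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
}


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def unary():                            # highest precedence
        tok = take()
        if tok == "-":
            return wrap(-unary())
        if tok == "+":
            return unary()
        if tok == "~":
            return wrap(~unary())
        if tok == "(":
            value = additive()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        return wrap(int(tok, 16) if tok[:2].lower() == "0x" else int(tok))

    def middle():                           # *, /, %, <<, >>, ^, &, |
        value = unary()
        while peek() in MIDDLE:
            op = take()
            value = wrap(MIDDLE[op](value, unary()))
        return value

    def additive():                         # binary + and -, lowest precedence
        value = middle()
        while peek() in ("+", "-"):
            op = take()
            rhs = middle()
            value = wrap(value + rhs if op == "+" else value - rhs)
        return value

    result = additive()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result


print(evaluate("1 + 2 << 3"), evaluate("6 & 3 * 2"))
```

Under these rules, 1 + 2 << 3 is 17 because the shift binds more tightly than the addition, whereas C computes (1 + 2) << 3, or 24. Similarly, 6 & 3 * 2 is evaluated left to right as (6 & 3) * 2, or 4, whereas C computes 6 & (3 * 2), or 6.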


2.7.3    Data Types

Each symbol you reference or define in an assembly program belongs to one of the type categories shown in Table 2-4.

Table 2-4: Data Types

Type Description
undefined Any symbol that is referenced but not defined becomes global undefined. (Declaring such a symbol in a .globl directive merely makes its status clearer.)
absolute A constant defined in an assignment (=) expression.
text Any symbol defined while the .text directive is in effect belongs to the text section. The text section contains the program's instructions, which are not modifiable during execution.
data Any symbol defined while the .data directive is in effect belongs to the data section. The data section contains memory that the linker can initialize to nonzero values before your program begins to execute.
sdata The type sdata is similar to the type data, except that defining a symbol while the .sdata ("small data") directive is in effect causes the linker to place it within the small data section. This increases the chance that the linker will be able to optimize memory references to the item by using gp-relative addressing.
rdata and rconst Any symbol defined while the .rdata or .rconst directives are in effect belongs to this category. The only difference between the types rdata and rconst is that the former is allowed to have dynamic relocations and the latter is not. (The types rdata and rconst are also similar to the type data but, unlike data, cannot be modified during execution.)
bss and sbss Any symbol defined in a .comm or .lcomm directive belongs to these sections, except that a .data, .sdata, .rdata, or .rconst directive can override a .comm directive. The .bss and .sbss sections consist of memory that the kernel loader initializes to zero before your program begins to execute.

If a symbol's size is less than the number of bytes specified by the -G compilation option (which defaults to eight), it belongs to the .sbss section (small bss section), and the linker places it within the small data section. This increases the chance that the linker will be able to optimize memory references to the item by using gp-relative addressing.

Local symbols in the .bss or .sbss sections defined by .lcomm directives are allocated memory by the assembler, global symbols are allocated memory by the linker, and symbols defined by .comm directives are overlaid upon like-named symbols (in the fashion of Fortran COMMON blocks) by the linker.

Symbols in the undefined category are always global; that is, they are visible to the linker and can be shared with other modules of your program. Symbols in the absolute, text, data, sdata, rdata, rconst, bss, and sbss type categories are local unless declared in a .globl directive.


2.7.4    Type Propagation in Expressions

For any expression, the result's type depends on the types of the operands and the operator. The following type propagation rules are used in expressions:


2.8    Address Formats

The assembler accepts addresses expressed in the formats described in Table 2-5.

Table 2-5: Address Formats

Format Address Description
(base-register) Specifies an indexed address, which assumes a zero offset. The base register's contents specify the address.
expression Specifies an absolute address. The assembler generates the most locally efficient code for referencing the value at the specified address.
expression(base-register) Specifies a based address. To get the address, the value of the expression is added to the contents of the base register. The assembler generates the most locally efficient code for referencing the value at the specified address.
relocatable-symbol Specifies a relocatable address. The assembler generates the necessary instructions to address the item and generates relocation information for the linker.
relocatable-symbol±expression Specifies a relocatable address. To get the address, the value of the expression, which has an absolute value, is added to or subtracted from the address of the relocatable symbol. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. If the symbol name does not appear as a label anywhere in the assembly, the assembler assumes that the symbol is external.
relocatable-symbol(index-register) Specifies an indexed relocatable address. To get the address, the index register is added to the relocatable symbol's address. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. If the symbol name does not appear as a label anywhere in the assembly, the assembler assumes that the symbol is external.
relocatable-symbol±expression(index-register) Specifies an indexed relocatable address. To get the address, the assembler adds the expression to, or subtracts it from, the relocatable symbol's address, and then adds the contents of the index register. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. If the symbol name does not appear as a label anywhere in the assembly, the assembler assumes that the symbol is external.
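
As a rough guide to telling the formats in Table 2-5 apart, the following Python sketch classifies an operand string by shape. It rests on several assumptions that are not taken from this chapter: registers are written as a dollar sign followed by a name or number, symbols start with a letter or underscore, and an expression is a single decimal or 0x-prefixed constant. The name classify_address is hypothetical.

```python
import re

REG = r"\$\w+"                          # assumed register spelling
SYM = r"[A-Za-z_]\w*"                   # assumed symbol spelling
EXPR = r"(?:0[xX][0-9a-fA-F]+|\d+)"     # a single integer constant only

FORMATS = [
    ("(base-register)", rf"\({REG}\)"),
    ("expression", EXPR),
    ("expression(base-register)", rf"{EXPR}\({REG}\)"),
    ("relocatable-symbol", SYM),
    ("relocatable-symbol±expression", rf"{SYM}[+-]{EXPR}"),
    ("relocatable-symbol(index-register)", rf"{SYM}\({REG}\)"),
    ("relocatable-symbol±expression(index-register)",
     rf"{SYM}[+-]{EXPR}\({REG}\)"),
]


def classify_address(operand):
    """Return the Table 2-5 format name that matches operand."""
    text = operand.replace(" ", "")
    for name, pattern in FORMATS:
        if re.fullmatch(pattern, text):
            return name
    raise ValueError(f"unrecognized address format: {operand!r}")


print(classify_address("buf + 16($1)"))
```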