1
CS 3304 Comparative Languages
Lecture 11: Composite Data Types 21 February 2012
2
Distinguished Lecture
Code as a Metaphor for Computational Thinking Owen Astrachan, Duke University Location: Torgerson 2150 Date: Friday, February 24, 2012 Time: 11:15am-12:30pm A Meet-the-Speaker session will be held 4:00pm-5:30pm in McBryde 106.
3
Introduction
Supporting composite data types (arrays, strings, sets, pointers, lists, and files) involves additional syntactic, semantic, and pragmatic issues. Pointer-related issues require a more detailed discussion of the value and reference models of variables and of heap management. Input and output mechanisms are important when dealing with files.
4
Records (Structures)
Record types allow related data of heterogeneous types to be stored and manipulated together. Fields are usually laid out contiguously, possibly with holes for alignment reasons. Compilers keep track of the offset of each field within each record type. Smart compilers may rearrange fields to minimize holes (C compilers promise not to). Records containing dynamic arrays cause implementation problems, but we won't be going into that in any detail.
5
Syntax Examples
C:
    struct element {
        char name[2];
        int atomic_number;
        double atomic_weight;
        _Bool metallic;
    };
The ordering of record fields is significant in most languages.
Pascal:
    type two_chars = packed array [1..2] of char;
    type element = record
        name : two_chars;
        atomic_number : integer;
        atomic_weight : real;
        metallic : Boolean
    end;
In ML the ordering is insignificant: tuples are abbreviations for records whose field names are small integers, so the following three values are the same:
    ("Cu", 29)
    {1 = "Cu", 2 = 29}
    {2 = 29, 1 = "Cu"}
Java: classes.
20. What are struct tags in C? How are they related to type names? How did they change in C++?
21. Summarize the distinction between records and tuples in ML. How do these compare to the records of languages like C and Ada?
22. Discuss the significance of “holes” in records. Why do they arise? What problems do they cause?
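As a rough illustration of why holes arise, the sketch below lays out the fields of the `element` record one by one. The sizes and alignments (1-byte char and Boolean, 4-byte int, 8-byte double) are assumptions about a typical 64-bit target, and `layout` is a hypothetical helper, not any real compiler's algorithm.

```python
def layout(fields):
    """Assign offsets to fields in the given order.

    fields: list of (name, size, alignment) tuples (assumed values).
    Returns (offsets, total_size, hole_bytes).
    """
    offsets = {}
    offset = 0
    holes = 0
    max_align = 1
    for name, size, align in fields:
        padding = (-offset) % align   # bytes skipped to reach an aligned address
        holes += padding
        offset += padding
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    tail = (-offset) % max_align      # trailing pad so arrays of records stay aligned
    holes += tail
    return offsets, offset + tail, holes


# Assumed sizes for the element record: char[2], int, double, _Bool.
element = [
    ("name", 2, 1),
    ("atomic_number", 4, 4),
    ("atomic_weight", 8, 8),
    ("metallic", 1, 1),
]

offsets, size, holes = layout(element)
print(offsets, size, holes)

# Sorting by decreasing alignment is one way a compiler might shrink the holes.
reordered = sorted(element, key=lambda f: -f[2])
print(layout(reordered))
```

Under these assumptions the declared order occupies 24 bytes, 9 of them padding, while the reordered record occupies 16 bytes with a single byte of padding.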
6
Nested Records I
Nested definition (in C):
    struct ore {
        char name[30];
        struct {
            char name[2];
            int atomic_number;
            double atomic_weight;
            _Bool metallic;
        } element_yielded;
    };
The “no-nesting” equivalent:
    struct ore {
        char name[30];
        struct element element_yielded;
    };
Fortran 90, Common Lisp: no-nesting only.
Naming for the nested record: Record to field: . in Pascal or C. Field to record: of in Cobol, # in ML.
Models of variables: Value: nested records are naturally embedded in the parent record (large fields, word or double-word alignment). Reference: fields are usually references to data in other locations.
7
Other Features Packed records:
Pascal: optimize for space. Ada, Modula-3, C: more elaborate packing, bits per field. Assignment (most): an entire record in a single operation. Comparison: Ada allows but most languages do not. Copy/comparison: use library routines (e.g., block_copy) but what about the holes (zeros, customized routines)? A trade-off between packing (time) and holes (space). Compilers “re-arrange” the field order: usually not a problem except when dealing with systems programs. Ada, C++: non-standard alignment. with statement for deeply nested records. 23. Why is it easier to implement assignment than comparison for records? 24. What is packing? What are its advantages and disadvantages? 25. Why might a compiler reorder the fields of a record? What problems might this cause?
8
Unions (Variants)
Unions (variant records):
    union {
        int i;
        double d;
    }
If variables are not used at the same time, they can share the same memory space. The size of the space is the size of the largest variable.
Main purpose: System programs. Alternative sets of fields within a record.
Problem for type checking: Lack of a tag means you don't know what is there. The ability to change the tag and then access fields is hardly better: Can make fields “uninitialized” when the tag is changed (requires extensive run-time support). Can require assignment of the entire variant, as in Ada.
26. Briefly describe two purposes for union/variant records.
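The type-checking problem can be sketched with Python's standard `struct` module: the same eight bytes are written as a double and read back as an integer. The 64-bit integer width, little-endian byte order, and the `Variant` class are assumptions made for illustration; they do not model any particular language's variant records.

```python
import struct

# Eight bytes of shared storage, written as a double ...
storage = struct.pack("<d", 1.0)

# ... and read back as a 64-bit integer, with no conversion of the bits.
(as_int,) = struct.unpack("<q", storage)
print(hex(as_int))


class Variant:
    """Hypothetical tagged variant: the tag records which alternative is valid."""

    def __init__(self):
        self.tag = None
        self.storage = bytes(8)

    def set_int(self, value):
        self.tag, self.storage = "int", struct.pack("<q", value)

    def set_double(self, value):
        self.tag, self.storage = "double", struct.pack("<d", value)

    def get_double(self):
        # The run-time check that an untagged union cannot perform.
        if self.tag != "double":
            raise TypeError("variant currently holds " + str(self.tag))
        return struct.unpack("<d", self.storage)[0]
```

Without the tag, nothing stops a program from reading the double's bit pattern as an integer; with it, the mismatch can be caught at run time at the cost of extra checks.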
9
Arrays Arrays are the most common and important composite data types.
Unlike records, which group related fields of disparate types, arrays are usually homogeneous. Semantically, they can be thought of as a mapping from an index type to a component or element type: Index type integer or any discrete type. Element type scalar (Fortran 77) or any type. Associative arrays (nondiscrete index types): Implemented with hash tables or search trees. Supported by the standard libraries of object-oriented languages.
10
Array Syntax and Operations
Array element: referred to by appending a subscript, delimited by parentheses (Fortran, Ada) or square brackets (C, Pascal), to the name of the array. Declaring an array: Appending subscript notation to the syntax used to declare a scalar. Using an array constructor. Slice (section): a rectangular portion of an array (Figure 7.4). 27. What is an array slice? For what purposes are slices useful?
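A small sketch of a rectangular slice, using nested Python lists as a stand-in for a two-dimensional array. The data values are made up for illustration.

```python
matrix = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
]

# Rows 0..1 and columns 1..2: a 2x2 rectangular section.
section = [row[1:3] for row in matrix[0:2]]

# A whole column is also a slice (here column 3).
column = [row[3] for row in matrix]
print(section, column)
```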
11
Arrays Dimensions, Bounds and Allocation
Global lifetime, static shape — If the shape of an array is known at compile time, and if the array can exist throughout the execution of the program, then the compiler can allocate space for the array in static global memory. Local lifetime, static shape — If the shape of the array is known at compile time, but the array should not exist throughout the execution of the program, then space can be allocated in the subroutine’s stack frame at run time. Local lifetime, shape bound at elaboration time. 28. Is there any significant difference between a two-dimensional array and an array of one-dimensional arrays? 29. What is the shape of an array?
12
Descriptors or Dope Vectors
The symbol table maintains dimension and bounds information for every array in the program. When these values are not statically known, the compiler generates code to look them up in a dope vector at run time. The dope vector contains the lower bound of each dimension and the size of each dimension other than the last. It is initialized at elaboration time, or whenever the number or bounds of dimensions change. Assignment (for arrays) might require copying both the array data and the dope vector. Languages with a value model of variables and arrays of dynamic shape may also use dope vectors for dynamic shape records. 30. What is a dope vector? What purpose does it serve?
13
Stack Allocation
Arrays as subroutine parameters:
Early Pascal: required shape to be specified statically. Standard Pascal: bounds are symbolic names rather than constants. Conformant arrays: arrays with these parameters. Very useful in scientific applications (which depend on numeric libraries). Can be passed by reference or by value. Ada and C99: Support conformant arrays and local arrays of dynamic shape. Local array shape is fixed at elaboration time. Stack frame is divided into: Fixed-size: object’s size statically known. Variable-size: object’s size known at elaboration time. 32. What is a conformant array?
14
Heap Allocation
Fully dynamic arrays: can change shape at arbitrary times, so they must be allocated in the heap. If the number of dimensions is statically known, the dope vector and the pointer to the data can be kept in the stack frame of the subroutine in which the array was declared. If the number of dimensions is dynamic, the dope vector must generally be placed at the beginning of the heap block. The compiler has to reclaim the space occupied by fully dynamic arrays. Some languages (Snobol, Icon, scripting languages) allow strings to change size after elaboration time. Java, C#: strings are immutable objects. 31. Under what circumstances can an array declared within a subroutine be allocated in the stack? Under what circumstances must it be allocated in the heap?
15
Array’s Memory Layout Contiguous elements (Figure 7.7):
Column major: used by Fortran. Row major: used by most other languages. Makes array [a..b, c..d] the same as array [a..b] of array [c..d]. 34. Explain the difference between row-major and column-major layout for contiguously allocated arrays. Why does a programmer need to know which layout the compiler uses? Why do most language designers consider row-major layout to be better?
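The difference between the two layouts can be sketched as two offset functions for a two-dimensional array with zero-based subscripts. The function names are hypothetical, and offsets are counted in elements rather than bytes.

```python
def row_major_offset(i, j, rows, cols):
    # Elements of one row are adjacent: the last subscript varies fastest.
    return i * cols + j


def column_major_offset(i, j, rows, cols):
    # Elements of one column are adjacent: the first subscript varies fastest.
    return j * rows + i


rows, cols = 2, 3
indices = [(i, j) for i in range(rows) for j in range(cols)]

# Order in which the elements appear in memory under each layout.
row_order = sorted(indices, key=lambda ij: row_major_offset(ij[0], ij[1], rows, cols))
col_order = sorted(indices, key=lambda ij: column_major_offset(ij[0], ij[1], rows, cols))
print(row_order)
print(col_order)
```

A loop nest that walks the array in the same order as memory touches consecutive addresses, which is why a programmer who cares about cache behavior needs to know the layout.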
16
Array Layout Strategies
Two layout strategies for arrays (Figure 7.8): Contiguous elements. Row pointers. Row pointers: An option in C. Allows rows to be put anywhere - nice for big arrays on machines with segmentation problems. Avoids multiplication. Nice for matrices whose rows are of different lengths: e.g. an array of strings. Requires extra space for the pointers. 33. Discuss the comparative advantages of contiguous and row-pointer layout for arrays.
17
Array Allocations
18
Accessing Array Elements
A: array [L1..U1] of array [L2..U2] of array [L3..U3] of elem:
    D1 = U1-L1+1
    D2 = U2-L2+1
    D3 = U3-L3+1
Let:
    S3 = size of elem
    S2 = D3 * S3
    S1 = D2 * S2
The address of A[i, j, k] is:
    address of A + ((i - L1) * S1) + ((j - L2) * S2) + ((k - L3) * S3)
We could compute all that at run time, but we can make do with fewer subtractions:
    == (i * S1) + (j * S2) + (k * S3) + address of A - [(L1 * S1) + (L2 * S2) + (L3 * S3)]
The expression in square brackets is a compile-time constant that depends only on the type of A.
35. How much of the work of computing the address of an element of an array can be performed at compile time? How much must be performed at run time?
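The saving can be checked with a small sketch that computes an element address both ways. The base address, bounds, and element size are made-up values, and `make_addresser` is a hypothetical helper standing in for the code a compiler would generate.

```python
def make_addresser(base, bounds, elem_size):
    """bounds: [(L1, U1), (L2, U2), (L3, U3)]. Returns (slow, fast) address functions."""
    # Strides: S3 = elem_size, S2 = D3 * S3, S1 = D2 * S2.
    strides = [0] * len(bounds)
    stride = elem_size
    for dim in range(len(bounds) - 1, -1, -1):
        strides[dim] = stride
        lower, upper = bounds[dim]
        stride *= upper - lower + 1

    # The part that depends only on the array's type and base address,
    # computed once rather than on every access.
    constant = base - sum(lower * s for (lower, _), s in zip(bounds, strides))

    def slow(*index):
        # One subtraction per dimension on every access.
        return base + sum((i - lower) * s
                          for i, (lower, _), s in zip(index, bounds, strides))

    def fast(*index):
        # Only multiplications and additions remain at run time.
        return constant + sum(i * s for i, s in zip(index, strides))

    return slow, fast


slow, fast = make_addresser(base=1000, bounds=[(1, 3), (0, 4), (2, 5)], elem_size=8)
print(slow(2, 3, 4), fast(2, 3, 4))
```

With these assumed values S1 = 160, S2 = 32, and S3 = 8, and both functions give address 1272 for A[2, 3, 4].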
19
Strings
In many languages strings are really just arrays of characters. They are often special-cased, to give them flexibility (like polymorphism or dynamic sizing) that is not available for arrays in general (Snobol, Icon, scripting languages). Literal characters, literal strings, escape sequences. The operations available on strings are tied to the implementation: Pascal, Ada: assignment, comparison. C: only assignment of a pointer to a string literal; other operations go through library routines. Dynamic length strings: Fundamental to a large number of applications. It's easier to provide these things for strings than for arrays in general because strings are one-dimensional and non-circular. Built-in type (ML, Lisp) or class (C++, Java, C#): a string variable is a reference to a string. 36. Name three languages that provide particularly extensive support for character strings. 37. Why might a language permit operations on strings that it does not provide for arrays?
20
Sets
A set is an unordered collection of an arbitrary number of distinct values of a common type. Pascal: sets of any discrete type. Icon: only sets of characters. Python: sets of arbitrary type. Ada: set package. C++, Java, C#: standard libraries. Possible implementations: Arrays, hash tables, trees. Bit vectors are what usually get built into programming languages. Operations like intersection, union, and membership can be implemented efficiently with bitwise logical instructions. Some languages place limits on the sizes of sets to make it easier for the implementor. 38. What are the strengths and weaknesses of the bit-vector representation for sets? How else might sets be implemented?
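A minimal sketch of the bit-vector representation, using one Python integer as the vector and character codes as bit positions. The `CharSet` class is hypothetical; a compiled language would typically use a fixed number of machine words instead.

```python
class CharSet:
    """A set of characters stored as one integer: bit n is set if chr(n) is a member."""

    def __init__(self, chars=""):
        self.bits = 0
        for c in chars:
            self.bits |= 1 << ord(c)

    def __contains__(self, c):
        # Membership is a shift and a mask.
        return ((self.bits >> ord(c)) & 1) == 1

    def union(self, other):
        result = CharSet()
        result.bits = self.bits | other.bits    # bitwise OR
        return result

    def intersection(self, other):
        result = CharSet()
        result.bits = self.bits & other.bits    # bitwise AND
        return result

    def members(self):
        return "".join(chr(i) for i in range(self.bits.bit_length())
                       if (self.bits >> i) & 1)


vowels = CharSet("aeiou")
word = CharSet("pointer")
print(vowels.intersection(word).members())
```

The representation is compact and fast when the universe is small, but the vector needs one bit per possible member, which is why it suits sets of characters better than sets of arbitrary integers.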
21
Pointers Pointers serve two purposes:
Efficient (and sometimes intuitive) access to elaborated objects (C). Dynamic creation of linked data structures, in conjunction with a heap storage manager. Pointers are used with a value model of variables: Pointers (high-level concept) are not addresses (low-level concept). They are not needed with a reference model. Several languages (e.g. Pascal) restrict pointers to accessing things in the heap: How and when is storage reclaimed for objects no longer needed? Many languages require the programmer to explicitly reclaim space: Memory leak: failure to reclaim space for objects no longer needed. Dangling reference: a pointer to an object that was reclaimed while still in use. Garbage collection: automatic storage reclamation. 41. What is the difference between a pointer and an address?
22
Pointer Syntax and Operations
Operations include: Allocation/deallocation of objects on the heap. Assignment of one pointer to another. Functional languages: A reference model for names, objects allocated automatically. Imperative languages, for example A := B: Value model (C, Pascal, Ada): if B refers to an object, B is a pointer and A has to be a pointer to refer to that object. Reference model (Clu, Smalltalk): always makes A refer to the same object to which B refers. Mixed approach (Java): Value model: built-in primitive data types. Reference model: user-defined types. Mixed approach (C#): mirrors Java but provides additional, “unsafe” features when pointers are needed.
23
Reference Model
ML (static typing), datatype mechanism:
    datatype chr_tree = empty | node of char * chr_tree * chr_tree;
    node (#"Y", node (#"Z", empty, empty), node (#"W", empty, empty))
Lisp (dynamic typing): Semantically, each list is a pair of references, one to the head and one to the remainder of the list.
    (#\Y (#\Z () ()) (#\W () ()))
In purely functional languages, the data structures created with recursive types turn out to be acyclic: New objects refer to old ones while old objects never change. Circular structures can be defined only using the imperative features.
Mutually recursive types: ML: types declared together in a group. Lisp: trivial since it is dynamically typed.
39. Discuss the tradeoffs between pointers and the recursive types that arise naturally in a language with a reference model of variables.
24
Value Model
Pascal:
    type chr_tree_ptr = ^chr_tree;
         chr_tree = record
             left, right : chr_tree_ptr;
             val : char
         end;
Ada:
    type chr_tree;
    type chr_tree_ptr is access chr_tree;
    type chr_tree is record
        left, right : chr_tree_ptr;
        val : character;
    end record;
C:
    struct chr_tree {
        struct chr_tree *left, *right;
        char val;
    };
Dereferencing: an explicit dereferencing operator (C) and automatic dereferencing (Ada).
40. Summarize the ways in which one dereferences a pointer in various programming languages.
25
Pointers and Arrays
C pointers and arrays:
    int *a == int a[]
    int **a == int *a[]
But equivalences don't always hold: Specifically, a declaration allocates an array if it specifies a size for the first dimension. Otherwise it allocates a pointer:
    int **a, int *a[]    pointer to pointer to int
    int *a[n]            n-element array of row pointers
    int a[n][m]          2D array
The compiler has to be able to tell the size of the things to which you point, so the following aren't valid:
    int a[][]            bad
    int (*a)[]           bad
C declaration rule: read right as far as you can (subject to parentheses), then left, then out a level and repeat:
    int *a[n]            n-element array of pointers to integer
    int (*a)[n]          pointer to n-element array of integers
42. Discuss the advantages and disadvantages of the interoperability of pointers and arrays in C.
43. Under what circumstances must the bounds of a C array be specified in its declaration?
26
Dangling References Problems with dangling pointers are due to:
Explicit deallocation of heap objects: Only in languages that have explicit deallocation. Implicit deallocation of elaborated objects. Two implementation mechanisms to catch dangling pointers: Tombstones: An extra level of indirection on every pointer access. When an object is reclaimed, the tombstone is marked to invalidate future references to the object. Can be used in languages that permit pointers to nonheap objects. Open issue: how to reclaim the tombstones themselves. Locks and keys: Add a word to every pointer and to every object in the heap. These words must match for the pointer to be valid. Simpler but works only for objects in the heap. 44. What are dangling references? How are they created, and why are they a problem?
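The tombstone mechanism can be sketched as follows. All names here (`Tombstone`, `Pointer`, `new`, `copy_pointer`, `delete`) are hypothetical, and the sketch models only the extra indirection and the invalidation step, not storage management.

```python
class Tombstone:
    def __init__(self, obj):
        self.target = obj          # set to None once the object is reclaimed


class Pointer:
    def __init__(self, tombstone):
        self.tombstone = tombstone

    def deref(self):
        # Every access pays for one extra indirection through the tombstone.
        if self.tombstone.target is None:
            raise RuntimeError("dangling reference")
        return self.tombstone.target


def new(obj):
    return Pointer(Tombstone(obj))


def copy_pointer(p):
    # Copies share the tombstone, so all of them are invalidated together.
    return Pointer(p.tombstone)


def delete(p):
    p.tombstone.target = None      # the object is gone; the tombstone stays behind


p = new([1, 2, 3])
q = copy_pointer(p)
delete(p)
# q.deref() now raises RuntimeError instead of silently reading reclaimed storage.
```

The tombstone itself outlives the object, which is the reclamation issue the slide raises.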
27
Garbage Collection
Garbage collection: automatic reclamation of objects that are no longer used (difficult to implement): Essential for functional languages. Popular for imperative languages. A classic trade-off between convenience/safety and performance. Reference counts: each object has a counter for the number of pointers that point to it: Each pointer must be initialized to null at elaboration time. The implementation must identify the location of every pointer: relies on type descriptors generated by the compiler. Useful object: an object may be useless even when references to it exist. Tracing collection: A useful object can be reached by following a chain of valid pointers starting from something that has a name (i.e., outside the heap). 45. What is garbage? How is it created, and why is it a problem? Discuss the comparative advantages of reference counts and tracing collection as a means of solving the problem.
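The point that an object may be useless while references to it still exist is easiest to see with a cycle. The sketch below simulates reference counts by hand; `Obj`, `assign`, and `release` are hypothetical names, and `root` stands in for a named variable outside the heap.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.count = 0       # number of pointers that refer to this object
        self.fields = {}     # pointer fields inside the object


freed = []


def release(obj):
    obj.count -= 1
    if obj.count == 0:
        freed.append(obj.name)
        # Reclaiming an object drops the pointers it contains.
        for child in list(obj.fields.values()):
            if child is not None:
                release(child)


def assign(holder, field, target):
    """Store a pointer to target in holder.fields[field], maintaining counts."""
    old = holder.fields.get(field)
    if target is not None:
        target.count += 1    # increment first, in case old is target
    holder.fields[field] = target
    if old is not None:
        release(old)


root = Obj("root")
a, b = Obj("a"), Obj("b")
assign(root, "p", a)
assign(a, "next", b)
assign(b, "next", a)         # a and b now form a cycle
assign(root, "p", None)      # the cycle is unreachable from root ...
print(freed, a.count, b.count)   # ... but neither count has dropped to zero
```

A tracing collector would reclaim both objects, since neither can be reached from a named variable.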
28
Garbage Collection Mechanisms I
Mark-and-sweep is the classic mechanism: 1. Every block in the heap is marked “useless”. 2. Starting from all pointers outside the heap, recursively explore all linked data structures and mark each newly discovered block as “useful”. 3. Move the blocks that are still “useless” to the free list. Steps 1 and 3: variable-size blocks must start with indicators of their size and of whether they are free. Step 2: must be able to find the pointers within each block. Exploration needs a stack with depth proportional to the heap size: stack and heap grow toward each other, so a full heap means no stack space. Pointer reversal embeds the equivalent of the stack in already existing fields in heap blocks: As the collector explores the path to a given block, it reverses the pointers it follows. Reversed pointers must be marked (usually in another bookkeeping field) to distinguish them from forward pointers. At most one pointer in a block will be reversed at any given time. 47. What is pointer reversal? What problem does it address?
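The three steps can be sketched over a toy heap. `Block` and `mark_and_sweep` are hypothetical names; the explicit stack used here is exactly the auxiliary space that pointer reversal is designed to avoid, so this is the simple variant only.

```python
class Block:
    def __init__(self, name):
        self.name = name
        self.pointers = []     # references to other heap blocks
        self.marked = False


def mark_and_sweep(heap, roots):
    # Step 1: tentatively mark every block "useless".
    for block in heap:
        block.marked = False

    # Step 2: explore from the roots, marking whatever is reachable "useful".
    # The stack can grow in proportion to the size of the heap.
    stack = list(roots)
    while stack:
        block = stack.pop()
        if not block.marked:
            block.marked = True
            stack.extend(block.pointers)

    # Step 3: blocks that are still unmarked go to the free list.
    free_list = [b for b in heap if not b.marked]
    survivors = [b for b in heap if b.marked]
    return survivors, free_list


a, b, c, d = (Block(n) for n in "abcd")
a.pointers = [b]
c.pointers = [d]
d.pointers = [c]               # an unreachable cycle
survivors, free_list = mark_and_sweep([a, b, c, d], roots=[a])
print([x.name for x in survivors], [x.name for x in free_list])
```

Unlike reference counting, the unreachable cycle formed by c and d ends up on the free list.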
29
Garbage Collection Mechanisms II
Stop and Copy: reduces fragmentation by storage compaction: The heap is divided into two halves, with all allocation in the first half. When it is full, all useful objects are moved to the second half. Generational Collection: Most dynamically allocated objects are short lived. The heap is divided into regions based on the “age” of objects. Objects gradually progress to “older” regions (like stop-and-copy): pointers need new values. Write barrier: a hidden list of old-to-new pointers. Conservative Collection: Mark-and-sweep without being able to find pointers. The heap spans a relatively small number of addresses. Small probability that a non-pointer will contain that address pattern. Safe if the programmer does not “hide” pointers. 46. Summarize the differences among mark-and-sweep, stop-and-copy, and generational garbage collection. 48. What is “conservative” garbage collection? How does it work? 49. Do dangling references and garbage ever arise in the same programming language? Why or why not? 50. Why was automatic garbage collection so slow to be adopted by imperative programming languages? 51. What are the advantages and disadvantages of allowing pointers to refer to objects that do not lie in the heap?
30
Lists List is defined recursively:
The empty list. A pair consisting of an object (list or atom) and another (shorter) list. Suited for functional and logic languages but also used in imperative languages. ML lists are homogeneous: a chain of blocks, each of which contains an element and a pointer to the next block. Lisp lists are heterogeneous: a chain of cons cells, each containing two pointers, one to the element and one to the next cons cell. List notation: ML: enclosed in square brackets, with elements separated by commas, [a, b, c, d]. Lisp: enclosed in parentheses, with elements separated by white space, (a b c d). 52. Why are lists so heavily used in functional programming languages?
31
Basic List Operations
Constructing lists and extracting their components (Lisp on the left, ML on the right):
    (cons 'a '(b)) => (a b)                  a :: [b] => [a, b]
    (car '(a b)) => a                        hd [a, b] => a
    (car nil) => ??                          hd [] => ??
    (cdr '(a b c)) => (b c)                  tl [a, b, c] => [b, c]
    (cdr '(a)) => nil                        tl [a] => nil
    (cdr nil) => ??                          tl [] => run-time exc.
    (append '(a b) '(c d)) => (a b c d)      [a, b] @ [c, d] => [a, b, c, d]
List comprehension (Miranda, Haskell, Python, F#): Adopted from traditional mathematical set notation. A common form comprises an expression, an enumerator, and one or more filters.
Example: list of the squares of all odd numbers less than 100:
    Mathematical: {i × i | i ∈ {1,…,100} ∧ i mod 2 = 1}
    Haskell: [i*i | i <- [1..100], i `mod` 2 == 1]
    Python: [i*i for i in range(1, 100) if i % 2 == 1]
    F#: [for i in 1..100 do if i % 2 = 1 then yield i*i]
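The cons-cell view of these operations can be sketched in Python. The `Cons` class and the helper names mirror the Lisp operations but are hypothetical; Python's own lists are arrays, not chains of cells, and `None` plays the role of the empty list.

```python
class Cons:
    def __init__(self, head, tail):
        self.head = head   # pointer to the element
        self.tail = tail   # pointer to the next cons cell, or None for the empty list


def from_items(items):
    result = None
    for item in reversed(items):
        result = Cons(item, result)
    return result


def car(cell):
    return cell.head


def cdr(cell):
    return cell.tail


def to_items(cell):
    items = []
    while cell is not None:
        items.append(car(cell))
        cell = cdr(cell)
    return items


def append(xs, ys):
    # Copies the cells of xs and shares ys, as a functional append typically does.
    if xs is None:
        return ys
    return Cons(car(xs), append(cdr(xs), ys))


joined = append(from_items(["a", "b"]), from_items(["c", "d"]))
print(to_items(joined))

# The list comprehension from the example, as runnable Python.
odd_squares = [i * i for i in range(1, 100) if i % 2 == 1]
print(odd_squares[:5])
```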
32
Lisp: car and cdr car gets the first element from a list.
cdr gets the remainder of a list. The names derive from the original implementation of Lisp on the IBM 704: The machine architecture included 15-bit “address” and “decrement” fields in some of the 36-bit loop control instructions. Additional instructions loaded an index register from, or stored it to, one of these fields within a 36-bit memory word. The Lisp interpreter designers mimicked the internal format of instructions to be able to exploit them. CAR: contents of address of register. CDR: contents of decrement of register. Fortran was also developed on the IBM 704 (three-way IF): First commercial machine to include hardware floating-point and magnetic core memory.
33
Files and Input/Output
Input/output facilities allow a program to communicate with the outside world. Interactive input/output is very platform specific. Files: off-line storage implemented by the operating system. Temporary: exist for the duration of a single program run. Persistent: exist before the program begins and/or after it ends. Input/output is one of the most difficult aspects of a language to design, and one that varies most from language to language. Built-in file data type and special syntactic constructs for I/O: Ability to employ non-subroutine syntax. Ability to perform operations not available to library routines. Library packages providing a file type and a variety of input/output subroutines: Keeps the “clutter” out of the language definition. 79. Explain the differences between interactive and file-based I/O, between temporary and persistent files, and between binary and text files. (Some of this information is in the main text.) 80. What are the comparative advantages of text and binary files? 81. Describe the end-of-line conventions of Unix, Windows, and Macintosh files. 82. What are the advantages and disadvantages of building I/O into a programming language, as opposed to providing it through library routines? 83. Summarize the different approaches to text I/O adopted by Fortran, Ada, C, and C++. 84. Describe some of the weaknesses of C’s scanf mechanism. 85. What are stream manipulators? How are they used in C++?
34
Equality Testing and Assignment
Primitive data types: relatively straightforward. Complex/abstract data types: semantic and implementation issues. For example, when comparing two character strings, are they equal if they: Are aliases for one another? Occupy storage that is bit-wise identical over its full length? Contain the same sequence of characters? Would appear the same if printed? Distinction between l-values and r-values, for references: Shallow comparison: refer to the same object. Deep comparison: refer to equal objects, may need recursive traversal. Imperative languages, a := b assignment: Reference model: shallow (same reference) or deep (copy of the object). Value model: shallow (copies the value but not any objects it refers to). 53. Why is equality testing more subtle than it first appears?
35
Language Implementations
Most programming languages employ both shallow comparisons and shallow assignments. Some provide more than one option for comparison: Scheme has three general-purpose equality-testing functions: (eq? a b) ; refer to the same object (eqv? a b) ; semantically equivalent (equal? a b) ; same recursive structure Deep assignments are relatively rare. User-defined abstractions: no single language-specified mechanism for equality testing/assignment is likely to work: Allow the programmer to define the comparison/assignment operators for each new data type. Allow the programmer to specify that equality testing and/or assignment is not allowed.
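The shallow and deep variants can be tried directly in Python, which uses a reference model for lists. This sketch relies on the standard `copy` module; `is` plays roughly the role of Scheme's eq?, and `==` on lists compares recursive structure, roughly like equal?.

```python
import copy

a = [1, [2, 3]]
alias = a                      # shallow assignment: both names refer to one object
shallow = copy.copy(a)         # new outer list, inner list still shared
deep = copy.deepcopy(a)        # fully independent copy

print(alias is a)                           # same object
print(shallow is a, shallow == a)           # different object, equal contents
print(shallow[1] is a[1], deep[1] is a[1])  # inner list shared only by the shallow copy

a[1].append(4)                 # mutate the shared inner list
print(shallow == a, deep == a)
```

After the mutation the shallow copy still compares equal to `a`, because both see the shared inner list, while the deep copy does not.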
36
Summary
Key issues for records include the syntax and semantics of variant records, whole-record operations, type safety, and related memory layout issues. For recursive data types, much depends on the choice between the value and reference models of variables/names. Recursive types are generally used to create linked data structures. Newer languages have improved semantics at the expense of complexity and cost, such as type-safe variant records (Ada), standard-length numeric types (Java, C#), array slicing (Fortran 90), etc.