Bioinformatics Programming

EE, NCKU Tien-Hao Chang (Darby Chang)

Data Abstraction

Data Abstraction
Data type
- A data type is a collection of objects and a set of operations that act on those objects.
- For example, the data type int consists of the objects {0, +1, -1, +2, -2, …, INT_MAX, INT_MIN} and the operations +, -, *, /, and %.
The data types of C
- basic data types: char, int, float, and double
- group data types: array and struct
- pointer data type
- user-defined types
Abstract data type
- An abstract data type (ADT) is a data type that is organized in such a way that the specification of the objects and the operations on the objects is separated from the representation of the objects and the implementation of the operations.
- We know what it does, but not necessarily how it does it.

Stack

The Stack ADT
A stack is an ordered list in which insertions and deletions are made at one end, called the top.
If we add the elements A, B, C, D, and E to the stack, in that order, then E is the first element we delete from the stack.
A stack is also known as a Last-In-First-Out (LIFO) list.

Implementation with an array
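A minimal sketch of an array-based stack, assuming int elements and a fixed capacity. The names Stack, stack_push, stack_pop and the value of MAX_STACK_SIZE are illustrative choices, not taken from the slides.

```c
#include <stdbool.h>

#define MAX_STACK_SIZE 100   /* assumed capacity */

typedef struct {
    int items[MAX_STACK_SIZE];
    int top;                 /* index of the top element, -1 when empty */
} Stack;

void stack_init(Stack *s) { s->top = -1; }

bool stack_is_empty(const Stack *s) { return s->top == -1; }

bool stack_is_full(const Stack *s) { return s->top == MAX_STACK_SIZE - 1; }

/* Push x onto the stack; returns false (stack unchanged) if it is full. */
bool stack_push(Stack *s, int x) {
    if (stack_is_full(s)) return false;
    s->items[++s->top] = x;
    return true;
}

/* Pop the top element into *out; returns false if the stack is empty. */
bool stack_pop(Stack *s, int *out) {
    if (stack_is_empty(s)) return false;
    *out = s->items[s->top--];
    return true;
}
```

Both operations touch only the top index, so each runs in constant time.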

Why do we need such a data structure?

Stack Evaluation of Expressions
The representation and evaluation of expressions is of great interest to computer scientists.
- (rear+1==front) || (rear==MAX_QUEUE_SIZE-1)   (3.1)
- x=a/b-c+d*e-a*c   (3.2)
If we examine these expressions, we notice that they contain
- operators: ==, +, -, ||, &&, !
- operands: a, b, c, d, e
- parentheses: ( )
Understanding the meaning of expressions
- assume a=4, b=c=2, d=e=3 in statement (3.2)
- interpretation 1: ((4/2)-2)+(3*3)-(4*2) = 0+9-8 = 1
- interpretation 2: (4/(2-2+3))*(3-4)*2 = (4/3)*(-1)*2 = …
The challenge is to efficiently generate the machine instructions corresponding to a given expression while respecting the precedence and associativity rules.

Evaluation of Expressions Postfix Expressions
The standard way of writing expressions is known as infix notation: a binary operator is placed in between its two operands.
Infix notation is not the one used by compilers to evaluate expressions.
Instead, compilers typically use a parenthesis-free notation referred to as postfix notation.
In fact, the Java virtual machine is a stack machine.

Evaluation of Expressions Evaluate Postfix Expressions
Evaluating postfix expressions is much simpler than evaluating infix expressions
- no parentheses
- no precedence
To evaluate an expression we make a single left-to-right scan of it.
We can evaluate an expression easily by using a stack.

Evaluating 62/3-42*+

Evaluation of Expressions Data Representation
We now consider the representation of both the stack and the expression

get_token()
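A minimal sketch of how a tokenizer and the single left-to-right scan could fit together. It assumes single-digit operands and the operators +, -, *, /, %; the signatures of get_token and eval_postfix are illustrative and may differ from the ones on the original slide.

```c
#include <stdbool.h>

#define MAX_STACK_SIZE 100   /* assumed capacity */

typedef enum { PLUS, MINUS, TIMES, DIVIDE, MOD, EOS, OPERAND } token_type;

/* Read the character at expr[*pos] into *symbol, advance *pos,
   and report which kind of token it is. */
token_type get_token(const char *expr, int *pos, char *symbol) {
    *symbol = expr[(*pos)++];
    switch (*symbol) {
        case '+':  return PLUS;
        case '-':  return MINUS;
        case '*':  return TIMES;
        case '/':  return DIVIDE;
        case '%':  return MOD;
        case '\0': return EOS;
        default:   return OPERAND;   /* checked to be a digit by the caller */
    }
}

/* Evaluate a postfix string with one left-to-right scan and a stack.
   Returns false on malformed input, stack overflow, or division by zero. */
bool eval_postfix(const char *expr, int *result) {
    int stack[MAX_STACK_SIZE];
    int top = -1, pos = 0;
    char symbol;
    token_type tok;

    while ((tok = get_token(expr, &pos, &symbol)) != EOS) {
        if (tok == OPERAND) {
            if (symbol < '0' || symbol > '9') return false;
            if (top == MAX_STACK_SIZE - 1) return false;
            stack[++top] = symbol - '0';          /* push the operand */
        } else {
            if (top < 1) return false;            /* need two operands */
            int op2 = stack[top--];
            int op1 = stack[top--];
            if ((tok == DIVIDE || tok == MOD) && op2 == 0) return false;
            switch (tok) {
                case PLUS:   stack[++top] = op1 + op2; break;
                case MINUS:  stack[++top] = op1 - op2; break;
                case TIMES:  stack[++top] = op1 * op2; break;
                case DIVIDE: stack[++top] = op1 / op2; break;
                case MOD:    stack[++top] = op1 % op2; break;
                default:     return false;
            }
        }
    }
    if (top != 0) return false;                   /* exactly one value must remain */
    *result = stack[top];
    return true;
}
```

With this sketch, eval_postfix("62/3-42*+", &r) leaves 8 in r: 6/2 gives 3, 3-3 gives 0, 4*2 gives 8, and 0+8 gives 8.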

A further question: can you write a program to evaluate expressions? If not, what’s missing?

Evaluation of Expressions Infix to Postfix
We can describe an algorithm for producing a postfix expression from an infix one as follows
- fully parenthesize the expression
  a / b - c + d * e - a * c
  ((((a / b) - c) + (d * e)) - (a * c))
- move every operator so that it replaces its corresponding right parenthesis
- delete all parentheses
  a b / c - d e * + a c * -
The order of operands is the same in infix and postfix.

[Figure: incoming precedence (icp) and in-stack precedence (isp) values of the operators during conversion]

Evaluation of Expressions From Infix to Postfix
Assumptions
- operators: (, ), +, -, *, /, %
- operands: single-digit integer or variable of one character
Operands are output immediately.
Operators are taken out of the stack as long as their in-stack precedence (isp) is higher than or equal to the incoming precedence (icp) of the new operator
- if (isp >= icp) pop
- ‘(’ has a low isp and a high icp
A table lists the isp and icp of each op: (, ), +, -, *, /, %, eos.
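The isp/icp rule can be sketched as follows. The numeric precedence values below are an assumption (a common textbook convention in which '(' has isp 0 and icp 20); the function name infix_to_postfix and the use of an output buffer are also illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_STACK_SIZE 100   /* assumed capacity */

typedef enum { LPAREN, RPAREN, PLUS, MINUS, TIMES, DIVIDE, MOD, EOS, OPERAND } precedence;

/* Assumed precedence values, indexed by the enum above. */
static const int  isp[]     = {  0, 19, 12, 12, 13, 13, 13, 0 };
static const int  icp[]     = { 20, 19, 12, 12, 13, 13, 13, 0 };
static const char op_char[] = { '(', ')', '+', '-', '*', '/', '%', '\0' };

static precedence classify(char c) {
    switch (c) {
        case '(': return LPAREN;
        case ')': return RPAREN;
        case '+': return PLUS;
        case '-': return MINUS;
        case '*': return TIMES;
        case '/': return DIVIDE;
        case '%': return MOD;
        default:  return OPERAND;   /* single digit or one-letter variable */
    }
}

/* Convert infix to postfix. out needs room for strlen(infix) + 1 chars.
   Spaces are skipped. Returns false on unbalanced parentheses or overflow. */
bool infix_to_postfix(const char *infix, char *out) {
    precedence stack[MAX_STACK_SIZE];
    int top = 0;
    size_t n = 0;
    stack[0] = EOS;                           /* sentinel with the lowest isp */

    for (const char *p = infix; *p != '\0'; p++) {
        if (*p == ' ') continue;
        precedence tok = classify(*p);
        if (tok == OPERAND) {
            out[n++] = *p;                    /* operands go straight to the output */
        } else if (tok == RPAREN) {
            while (stack[top] != LPAREN) {    /* unstack until the matching '(' */
                if (stack[top] == EOS) return false;
                out[n++] = op_char[stack[top--]];
            }
            top--;                            /* discard the '(' */
        } else {
            while (isp[stack[top]] >= icp[tok])   /* if (isp >= icp) pop */
                out[n++] = op_char[stack[top--]];
            if (top == MAX_STACK_SIZE - 1) return false;
            stack[++top] = tok;
        }
    }
    while (stack[top] != EOS) {               /* flush the remaining operators */
        if (stack[top] == LPAREN) return false;
        out[n++] = op_char[stack[top--]];
    }
    out[n] = '\0';
    return true;
}
```

Because '(' has the highest icp it is always pushed, and because it has the lowest isp it stays on the stack until its matching ')' arrives.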

Such a two-phase strategy (a. infix to postfix, then b. evaluate postfix) is used in practice

Precedence hierarchy and associativity for C

Queue

The Queue ADT
A queue is an ordered list in which all insertions take place at one end, called the rear, and all deletions take place at the opposite end, called the front.
If we insert the elements A, B, C, D, E, in that order, then A is the first element we delete from the queue.
A queue is also known as a First-In-First-Out (FIFO) list.

Implementation with a 1D array and two variables

Space may still be available when IsFullQ is true, in which case the elements must be moved toward the front of the array

Queue Regard Array as Circular
We can obtain a more efficient representation if we regard the array queue[MAX_QUEUE_SIZE] as circular
- front: one position counterclockwise from the first element
- rear: current end
One position is left unused when the queue is full.

addq() and deleteq() are slightly more complicated
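A minimal sketch of the circular representation, assuming int elements and a small MAX_QUEUE_SIZE. The Queue struct and the boolean return values are illustrative choices; the slides' versions of addq() and deleteq() may differ.

```c
#include <stdbool.h>

#define MAX_QUEUE_SIZE 8   /* assumed; the queue holds at most MAX_QUEUE_SIZE - 1 items */

typedef struct {
    int items[MAX_QUEUE_SIZE];
    int front;   /* one position before the first element */
    int rear;    /* position of the last element */
} Queue;

void queue_init(Queue *q) { q->front = q->rear = 0; }

bool queue_is_empty(const Queue *q) { return q->front == q->rear; }

/* Full when advancing rear would make it collide with front. */
bool queue_is_full(const Queue *q) {
    return (q->rear + 1) % MAX_QUEUE_SIZE == q->front;
}

/* Add x at the rear; returns false if the queue is full. */
bool addq(Queue *q, int x) {
    if (queue_is_full(q)) return false;
    q->rear = (q->rear + 1) % MAX_QUEUE_SIZE;   /* wrap around the array end */
    q->items[q->rear] = x;
    return true;
}

/* Remove the front element into *out; returns false if the queue is empty. */
bool deleteq(Queue *q, int *out) {
    if (queue_is_empty(q)) return false;
    q->front = (q->front + 1) % MAX_QUEUE_SIZE;
    *out = q->items[q->front];
    return true;
}
```

Keeping one slot unused is what lets front == rear mean "empty" without ambiguity; no element ever has to be moved.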

Queues are very common in everyday life

A Maze Problem
The most obvious choice is a 2D array, with 0s representing the open paths and 1s the barriers.
Notice that not every position has eight neighbors.
To avoid checking for these border conditions we can surround the maze by a border of ones.
An m × p maze requires an (m+2) × (p+2) array; the maze itself occupies positions [1][1] to [m][p].

Possible moves from maze[row][col]

A Maze Problem Implementation of Move
  typedef struct {
      short int vert;
      short int horiz;
  } offsets;
  offsets move[8]; // array of moves for each direction
If we are at maze[row][col] and we wish to find the position of the next move, maze[next_row][next_col]
  next_row = row + move[dir].vert;
  next_col = col + move[dir].horiz;

A Maze Problem Maze Traversal Algorithm
Maintain a second two-dimensional array, mark, to record the maze positions already checked.
Use a stack to keep the path history.
  typedef struct {
      short int row;
      short int col;
      short int dir;
  } element;
  element stack[MAX_STACK_SIZE];
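A minimal sketch of the traversal using the offsets and element types above. The maze size (ROWS, COLS), the clockwise direction order starting from north, the entrance at [1][1] and the exit at [ROWS][COLS] are assumptions made for illustration; the function only reports whether a path exists.

```c
#include <stdbool.h>

#define ROWS 4                       /* assumed maze size m */
#define COLS 4                       /* assumed maze size p */
#define MAX_STACK_SIZE (ROWS * COLS)

typedef struct { short int vert; short int horiz; } offsets;
typedef struct { short int row; short int col; short int dir; } element;

/* Assumed direction order: N, NE, E, SE, S, SW, W, NW. */
static const offsets move[8] = {
    {-1, 0}, {-1, 1}, {0, 1}, {1, 1}, {1, 0}, {1, -1}, {0, -1}, {-1, -1}
};

/* maze includes the surrounding border of 1s. Returns true if the exit at
   [ROWS][COLS] can be reached from the entrance at [1][1]. */
bool path(short int maze[ROWS + 2][COLS + 2]) {
    short int mark[ROWS + 2][COLS + 2] = {{0}};   /* positions already checked */
    element stack[MAX_STACK_SIZE];                /* path history */
    int top = 0;

    if (maze[1][1] != 0) return false;
    mark[1][1] = 1;
    stack[0] = (element){1, 1, 0};

    while (top > -1) {
        element pos = stack[top--];               /* back up to the last position */
        int row = pos.row, col = pos.col, dir = pos.dir;
        while (dir < 8) {
            int next_row = row + move[dir].vert;
            int next_col = col + move[dir].horiz;
            if (next_row == ROWS && next_col == COLS && maze[next_row][next_col] == 0)
                return true;                      /* reached the exit */
            if (maze[next_row][next_col] == 0 && !mark[next_row][next_col]) {
                mark[next_row][next_col] = 1;
                stack[++top] = (element){row, col, dir + 1};   /* where to resume */
                row = next_row;
                col = next_col;
                dir = 0;
            } else {
                dir++;
            }
        }
    }
    return false;
}
```

Each position is marked at most once and tries at most eight directions, which is consistent with an O(mp) worst case.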

A further question: can we use a queue to solve the maze problem? If yes, what are the differences?

A Maze Problem Analysis of path()
The worst-case computing time of path() is O(mp), where m and p are the number of rows and columns of the maze, respectively.
The choice of add() and delete() decides the search behavior.

List

List Ordered List
Consider the following alphabetized list of three-letter English words: bat, cat, sat, vat
If we store this list in an array
- to add the word mat to this list, we move sat and vat one position to the right before we insert mat
- to remove the word cat from the list, we move sat and vat one position to the left
Problems of a sequential representation (ordered list)
- arbitrary insertion and deletion from arrays can be very time-consuming
- wasted storage
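The cost of the sequential representation can be seen in a minimal sketch. The WordList type, its capacity, and the function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_WORDS 8   /* assumed capacity */
#define WORD_LEN  4   /* three letters plus the terminating '\0' */

typedef struct {
    char words[MAX_WORDS][WORD_LEN];
    int count;
} WordList;

/* Insert word in alphabetical order; every later word is shifted
   one position to the right to make room. */
bool list_insert(WordList *l, const char *word) {
    if (l->count == MAX_WORDS || strlen(word) >= WORD_LEN) return false;
    int i = l->count;
    while (i > 0 && strcmp(l->words[i - 1], word) > 0) {
        strcpy(l->words[i], l->words[i - 1]);
        i--;
    }
    strcpy(l->words[i], word);
    l->count++;
    return true;
}

/* Remove word; every later word is shifted one position to the left. */
bool list_remove(WordList *l, const char *word) {
    for (int i = 0; i < l->count; i++) {
        if (strcmp(l->words[i], word) == 0) {
            for (int j = i; j < l->count - 1; j++)
                strcpy(l->words[j], l->words[j + 1]);
            l->count--;
            return true;
        }
    }
    return false;
}
```

Both operations may copy up to count words, which is the time cost the slide refers to; the fixed MAX_WORDS array is the storage cost.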

An elegant solution for the ordered list
Items may be placed anywhere in memory.
Store the address, or location, of the next element so that elements can be accessed in the correct order.
Associated with each element is a node which contains both a data component and a pointer to the next item.

List Pointers in C
The two most important operators used with the pointer type
- & the address operator
- * the dereferencing (or indirection) operator
Example
- int i, *pi;
  i is an integer variable and pi is a pointer to an integer
- pi = &i;
  &i returns the address of i, which is assigned as the value of pi
- to assign a value to i we can use either
  i = 10;
  *pi = 10;
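The example can be wrapped in a small function to show that both assignments reach the same variable. The function name pointer_demo and the second value 20 are illustrative.

```c
/* Returns the value of i after it has been changed through the pointer. */
int pointer_demo(void) {
    int i, *pi;
    pi = &i;     /* pi now holds the address of i */
    i = 10;      /* direct assignment */
    *pi = 20;    /* indirect assignment: changes i through pi */
    return i;    /* i was overwritten by the assignment through pi */
}
```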

List Dynamically Allocated Storage
When programming, you may not know how much space you will need, nor do you wish to allocate some very large area that may never be required.
C provides a heap for allocating storage at run-time.
You may call a function, malloc, and request the amount of memory you need.
When you no longer need an area of memory, you may free it by calling another function, free, which returns the area of memory to the system.

Dynamically Allocated Storage Example
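A minimal sketch of run-time allocation with malloc and free. The function sum_with_heap_buffer is a hypothetical example, not the one from the original slide; it assumes n is at least 1.

```c
#include <stdlib.h>

/* Allocate room for n ints on the heap, fill it with 0..n-1, and return
   the sum. Returns -1 if the allocation fails. Assumes n >= 1. */
long sum_with_heap_buffer(size_t n) {
    int *buf = malloc(n * sizeof *buf);   /* size is decided at run-time */
    if (buf == NULL) return -1;           /* malloc reports failure with NULL */

    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        buf[i] = (int)i;
        sum += buf[i];
    }

    free(buf);                            /* return the memory to the system */
    return sum;
}
```

Checking the result of malloc before use and pairing every successful malloc with one free are the two habits this sketch is meant to show.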

Linked lists are drawn as an ordered sequence of nodes with links represented as arrows
- the name of the pointer to the first node in the list is the name of the list (the list of Figure 4.1 is called ptr)
- notice that we do not explicitly put in the values of pointers, but simply draw arrows to indicate that they are there

List Insertion
To insert the word mat between cat and sat, we must
- Get a node that is currently unused; let its address be paddr
- Set the data field of this node to mat
- Set paddr’s link field to point to the address found in the link field of the node containing cat
- Set the link field of the node containing cat to point to paddr
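The four steps can be sketched as follows, assuming a node that stores a three-letter word and obtaining the unused node from malloc. The type names list_node and list_pointer and the function insert_after are illustrative.

```c
#include <stdlib.h>
#include <string.h>

typedef struct list_node *list_pointer;
struct list_node {
    char data[4];        /* three letters plus '\0' */
    list_pointer link;   /* pointer to the next node */
};

/* Insert a new node holding word right after the node prev.
   Returns the new node, or NULL if no memory is available. */
list_pointer insert_after(list_pointer prev, const char *word) {
    list_pointer paddr = malloc(sizeof *paddr);            /* get an unused node */
    if (paddr == NULL) return NULL;
    strncpy(paddr->data, word, sizeof paddr->data - 1);    /* set its data field */
    paddr->data[sizeof paddr->data - 1] = '\0';
    paddr->link = prev->link;                              /* point to prev's old successor */
    prev->link = paddr;                                    /* prev now points to the new node */
    return paddr;
}
```

The order of the last two assignments matters: overwriting prev->link first would lose the address of the node containing sat.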

List Deletion
Delete mat from the list
We only need to find the element that immediately precedes mat, which is cat, and set its link field to point to mat’s link (Figure 4.3).
We have not moved any data, and although the link field of mat still points to sat, mat is no longer in the list.
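A minimal sketch of the deletion step. It assumes nodes were allocated with malloc and, as an added step beyond the slide, frees the unlinked node; the names are illustrative.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct list_node *list_pointer;
struct list_node {
    char data[4];
    list_pointer link;
};

/* Unlink and free the node that follows prev.
   Returns false if prev is the last node. */
bool delete_after(list_pointer prev) {
    list_pointer victim = prev->link;
    if (victim == NULL) return false;
    prev->link = victim->link;   /* bypass the node; no data is moved */
    free(victim);                /* assumes the node came from malloc */
    return true;
}
```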

List Implementation
We need the following capabilities to make linked representations possible
- Define a node’s structure, that is, the fields it contains: self-referential structures
- Create new nodes when we need them: malloc() (new in C++)
- Remove nodes that we no longer need: free() (delete in C++)

List Invert
For a list of length ≥ 1 nodes, the while loop is executed length times, so the computing time is linear, or O(length).
Two extra pointers are required.
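A minimal sketch of in-place inversion with two extra pointers, assuming int data. The pointer names lead, middle, and trail are illustrative.

```c
#include <stddef.h>

typedef struct list_node *list_pointer;
struct list_node {
    int data;
    list_pointer link;
};

/* Reverse the list in place and return the new first node.
   middle and trail are the two extra pointers. */
list_pointer invert(list_pointer lead) {
    list_pointer middle = NULL, trail;
    while (lead != NULL) {        /* executed once per node */
        trail = middle;
        middle = lead;
        lead = lead->link;
        middle->link = trail;     /* reverse one link */
    }
    return middle;
}
```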

List More about Lists
Circularly linked lists: the link field of the last node points to the first node in the list
Maintaining an available list: the space of freed nodes can be reused later
Doubly linked lists

List Stacks and Queues
When several stacks and queues coexist, there is no efficient way to represent them sequentially.
The linked solution to the n-stack, m-queue problem is both computationally and conceptually simple.
We no longer need to shift stacks or queues to make space.
Computation can proceed as long as there is memory available.
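A minimal sketch of several coexisting linked stacks, assuming int items and a fixed number of stacks. MAX_STACKS and the push/pop signatures are illustrative; linked queues would follow the same pattern with a front and a rear pointer per queue.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MAX_STACKS 10   /* assumed number of stacks n */

typedef struct stack_node *stack_pointer;
struct stack_node {
    int item;
    stack_pointer link;
};

/* top[i] points to the top node of stack i; NULL means stack i is empty.
   Static storage guarantees that all entries start out as NULL. */
static stack_pointer top[MAX_STACKS];

/* Push item onto stack i; fails only when no memory is available. */
bool push(int i, int item) {
    stack_pointer node = malloc(sizeof *node);
    if (node == NULL) return false;
    node->item = item;
    node->link = top[i];
    top[i] = node;
    return true;
}

/* Pop the top of stack i into *item; returns false if the stack is empty. */
bool pop(int i, int *item) {
    stack_pointer node = top[i];
    if (node == NULL) return false;
    *item = node->item;
    top[i] = node->link;
    free(node);
    return true;
}
```

No stack has a fixed share of an array, so none of them ever has to be shifted to make room for another.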

Longest Common Subsequence
In: two strings
Out: length of the longest common subsequence
Requirement
- dynamic programming
- time/space analyses
- using C would be the best
Bonus
- output a longest common subsequence
- output all longest common subsequences

Dynamic Programming
Like divide-and-conquer, dynamic programming solves a problem P(n) by combining the solutions S1, S2, …, Sk of its sub-problems P(m1), P(m2), …, P(mk) into a solution S.
The main difference is that the divided sub-problems overlap (that is, they are dependent).

Dynamic Programming Matrix Multiplication
Given a sequence of matrices, <A1, A2, …, An>, where the size of Ai is $p_{i-1} \times p_i$, find the multiplication order that minimizes the number of scalar multiplications.
For example, for A1A2A3A4 with given dimensions pi there are 5 possibilities
- (A1(A2(A3A4))) costs 26418
- (A1((A2A3)A4)) costs 4055
- ((A1A2)(A3A4)) costs 54201
- ((A1(A2A3))A4) costs 2856
- (((A1A2)A3)A4) costs 10582
n matrices result in $\binom{2(n-1)}{n-1}/n = \Omega(4^n/n^{3/2})$ orders.

Matrix Multiplication Observation of Sub-problems
Let T be an order for <A1, A2, …, An> that combines T1, an order for <A1, A2, …, Ak>, and T2, an order for <Ak+1, Ak+2, …, An>.
If T is an optimal solution for <A1, A2, …, An>, then T1 and T2 are optimal solutions for <A1, A2, …, Ak> and <Ak+1, Ak+2, …, An>, respectively.
Let m[i,j] be the minimum number of scalar multiplications needed to compute the product Ai…Aj, for $1 \le i \le j \le n$.
If the optimal solution splits the product as Ai…Aj = (Ai…Ak)(Ak+1…Aj), for some k with $i \le k < j$, then
$m[i,j] = m[i,k] + m[k+1,j] + p_{i-1} p_k p_j$
Since the best k is not known in advance, we have
$m[i,j] = \min_{i \le k < j} \{ m[i,k] + m[k+1,j] + p_{i-1} p_k p_j \}$, with $m[i,i] = 0$
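The recurrence for m[i,j] can be filled in bottom-up, by increasing chain length. This is a minimal sketch; MAX_N and the function name matrix_chain_order are assumptions made for illustration.

```c
#include <limits.h>

#define MAX_N 16   /* assumed upper bound on the number of matrices */

/* p has n + 1 entries; matrix A_i has size p[i-1] x p[i].
   Returns m[1][n], the minimum number of scalar multiplications.
   Assumes 1 <= n <= MAX_N. */
long matrix_chain_order(const int p[], int n) {
    long m[MAX_N + 1][MAX_N + 1];

    for (int i = 1; i <= n; i++)
        m[i][i] = 0;                              /* a single matrix costs nothing */

    for (int len = 2; len <= n; len++) {          /* chain length */
        for (int i = 1; i + len - 1 <= n; i++) {
            int j = i + len - 1;
            m[i][j] = LONG_MAX;
            for (int k = i; k < j; k++) {         /* try every split point */
                long cost = m[i][k] + m[k + 1][j]
                          + (long)p[i - 1] * p[k] * p[j];
                if (cost < m[i][j]) m[i][j] = cost;
            }
        }
    }
    return m[1][n];
}
```

As a check, the dimensions p = {13, 5, 89, 3, 34} are an assumption chosen because they reproduce all five costs listed for A1A2A3A4; with them the function returns 2856, the smallest of the five. The three nested loops give O(n^3) time and the table takes O(n^2) space.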

Dynamic Programming Elements
Optimal sub-structure (a problem exhibits optimal sub-structure if an optimal solution to the problem contains within it optimal solutions to sub-problems)
Overlapping sub-problems
Memoization (usually by a table, i.e., a 2D array)
Procedure
- characterize the structure of an optimal solution
- derive a recursive formula for computing the values of optimal solutions (the relation between the problem and its sub-problems)

Dynamic Programming Longest Common Subsequence
Given two sequences X=<x1, x2, …, xm> and Y=<y1, y2, …, yn>, find a maximum-length common subsequence of X and Y.
For example, X is 'ABCBDAB' and Y is 'BDCABA'
- common subsequences: 'AB', 'ABA', 'BCB', 'BCAB', 'BCBA', …
- longest common subsequences: 'BCAB', 'BCBA', … (length = 4)

Longest Common Subsequence The Recursive Formula
Let L[i,j] be the length of an LCS of the prefixes Xi=<x1, x2, …, xi> and Yj=<y1, y2, …, yj>, for $1 \le i \le m$ and $1 \le j \le n$, with $L[i,0] = L[0,j] = 0$.
$L[i,j] = L[i-1,j-1] + 1$ if $x_i = y_j$
$L[i,j] = \max(L[i,j-1],\ L[i-1,j])$ if $x_i \ne y_j$
For X = 'ABCBDAB' and Y = 'BDCABA', an LCS is BCBA.
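The formula translates directly into a table-filling routine. This is a minimal sketch that returns only the length; MAX_LEN and the function name lcs_length are assumptions made for illustration.

```c
#include <string.h>

#define MAX_LEN 100   /* assumed maximum input length */

/* Returns the length of a longest common subsequence of x and y,
   or -1 if either string is longer than MAX_LEN. */
int lcs_length(const char *x, const char *y) {
    static int L[MAX_LEN + 1][MAX_LEN + 1];
    size_t m = strlen(x), n = strlen(y);
    if (m > MAX_LEN || n > MAX_LEN) return -1;

    for (size_t i = 0; i <= m; i++) L[i][0] = 0;   /* empty prefix of Y */
    for (size_t j = 0; j <= n; j++) L[0][j] = 0;   /* empty prefix of X */

    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= n; j++) {
            if (x[i - 1] == y[j - 1]) {            /* x_i = y_j */
                L[i][j] = L[i - 1][j - 1] + 1;
            } else {                               /* x_i != y_j */
                int left = L[i][j - 1], up = L[i - 1][j];
                L[i][j] = left > up ? left : up;
            }
        }
    }
    return L[m][n];
}
```

Each of the (m+1)(n+1) table entries is computed once in constant time, so both time and space are O(mn). For 'ABCBDAB' and 'BDCABA' the function returns 4.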