
1 Computer Systems Lab TJHSST Current Projects 2004-2005, First Period

2 Current Projects, 1st Period
Caroline Bauer: Archival of Articles via RSS and Datamining Performed on Stored Articles
Susan Ditmore: Construction and Application of a Pentium II Beowulf Cluster
Michael Druker: Universal Problem Solving Contest Grader

3 Current Projects, 1st Period
Matt Fifer: The Study of Microevolution Using Agent-based Modeling
Jason Ji: Natural Language Processing: Using Machine Translation in Creation of a German-English Translator
Anthony Kim: A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree
John Livingston: Kernel Debugging User-Space API Library (KDUAL)

4 Current Projects, 1st Period
Jack McKay: Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball
Peden Nichols: An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms
Robert Staubs: Part-of-Speech Tagging with Limited Training Corpora
Alex Volkovitsky: Benchmarking of Cryptographic Algorithms

5 Archival of Articles via RSS and Datamining Performed on Stored Articles RSS (Really Simple Syndication, encompassing Rich Site Summary and RDF Site Summary) is a web syndication protocol used by many blogs and news websites to distribute information; it saves people having to visit several sites repeatedly to check for new content. At this point in time there are many RSS newsfeed aggregators available to the public, but none of them perform any sort of archival of information beyond the RSS metadata. The purpose of this project is to create an RSS aggregator that will archive the text of the actual articles linked to in the RSS feeds in some kind of linkable, searchable database and, if all goes well, implement some sort of datamining capability as well.

6 Archival of Articles via RSS, and Datamining Performed on Stored Articles Caroline Bauer Abstract: RSS (Really Simple Syndication, encompassing Rich Site Summary and RDF Site Summary) is a web syndication protocol used by many blogs and news websites to distribute information; it saves people having to visit several sites repeatedly to check for new content. At this point in time there are many RSS newsfeed aggregators available to the public, but none of them perform any sort of archival of information beyond the RSS metadata. As the articles linked may move or be eliminated at some time in the future, if one wants to be sure one can access them later one has to archive them oneself; furthermore, should one want to link such collected articles, it is far easier to do if one has them archived. The purpose of this project is to create an RSS aggregator that will archive the text of the actual articles linked to in the RSS feeds in some kind of linkable, searchable database and, if all goes well, implement some sort of datamining capability as well.

7 Archival of Articles via RSS, and Datamining Performed on Stored Articles Caroline Bauer
Introduction This paper is intended to be a detailed summary of all of the author's findings regarding the archival of articles in a linkable, searchable database via RSS.
Background
RSS: RSS stands for Really Simple Syndication, a syndication protocol often used by weblogs and news sites. Technically, RSS is an XML-based communication standard that encompasses Rich Site Summary (RSS 0.9x and RSS 2.0) and RDF Site Summary (RSS 0.9 and 1.0). It enables people to gather new information by using an RSS aggregator (or "feed reader") to poll RSS-enabled sites for new information, so the user does not have to check each site manually. RSS aggregators are often extensions of browsers or other programs, or standalone programs; alternately, they can be web-based, so the user can view their "feeds" from any computer with Web access.
Archival Options Available in Existing RSS Aggregators
Data Mining: Data mining is the searching out of information based on patterns present in large amounts of data. //more will be here.

8 Archival of Articles via RSS, and Datamining Performed on Stored Articles Caroline Bauer
Purpose The purpose of this project is to create an RSS aggregator that, in addition to serving as a feed reader, obtains the text of the documents linked in the RSS feeds and places it into a database that is both searchable and linkable. In addition to this, the database is intended to reach an implementation wherein it performs some manner of data mining on the information contained therein; the specifics of this have yet to be determined.
Development Results Conclusions Summary
References 1. "RSS (protocol)." Wikipedia. 2. "Data mining." Wikipedia.

9 Construction and Application of a Pentium II Beowulf Cluster I plan to construct a supercomputing cluster of Pentium II computers with the OpenMosix kernel patch. Once constructed, the cluster could be configured to transparently aid workstations with computationally expensive jobs run in the lab. This project would not only increase the computing power of the lab, but it would also be an experiment in building a low-level, low-cost cluster with a stripped-down version of Linux, useful to any facility with old computers it would otherwise deem outdated.

10 Construction and Application of a Pentium II Beowulf Cluster Susan Ditmore Text version needed (your pdf file won't copy to text)

11 Universal Problem Solving Contest Grader Michael Druker (poster needed)

12 Universal Problem Solving Contest Grader Michael Druker Steps so far:
- Creation of directory structure for the grader, the contests, the users, the users' submissions, and the test cases.
- Starting of the main grading script itself.
- Refinement of the directory structure for the grader.
- Reading of material on the bash scripting language to be able to write the various scripts that will be necessary.

13 Universal Problem Solving Contest Grader Michael Druker Current program:

#!/bin/bash
CONDIR="/afs/csl.tjhsst.edu/user/mdruker/techlab/code/new/"
#syntax is "grade contest user program"
contest=$1
user=$2
program=$3
echo "contest name is " $1
echo "user's name is " $2
echo "program name is " $3

14 Universal Problem Solving Contest Grader Michael Druker Current program continued:

#get the location of the program and the test data
#make sure that the contest, user, program are valid
PROGDIR=${CONDIR}"contests/"${contest}"/users/"${user}
echo "user's directory is" $PROGDIR
if [ -d ${PROGDIR} ]
then
    echo "good input"
else
    echo "bad input, directory doesn't exist"
    exit 1
fi
exit 0

15 Study of Microevolution Using Agent-Based Modeling in C++ The goal of the project is to create a program that uses an agent-environment structure to imitate a very simple natural ecosystem: one that includes a single type of species that can move, reproduce, kill, etc. The "organisms" will contain genomes (libraries of genetic data) that can be passed from parents to offspring in a way similar to that of animal reproduction in nature. As the agents interact with each other, the ones with the characteristics most favorable to survival in the artificial ecosystem will produce more children, and over time, the mean characteristics of the system should start to gravitate towards the traits that would be most beneficial. This process, the optimization of physical traits of a single species through passing on heritable advantageous genes, is known as microevolution.

16 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer Abstract The goal of the project is to create a program that uses an agent-environment structure to imitate a very simple natural ecosystem: one that includes a single type of species that can move, reproduce, kill, etc. The "organisms" will contain genomes (libraries of genetic data) that can be passed from parents to offspring in a way similar to that of animal reproduction in nature. As the agents interact with each other, the ones with the characteristics most favorable to survival in the artificial ecosystem will produce more children, and over time, the mean characteristics of the system should start to gravitate towards the traits that would be most beneficial. This process, the optimization of physical traits of a single species through passing on heritable advantageous genes, is known as microevolution.

17 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer Purpose One of the most controversial topics in science today is the debate of creationism vs. Darwinism. Advocates of creationism believe that the world was created according to the description detailed in the first chapter of the book of Genesis in the Bible: the Earth is approximately 6,000 years old, and it was created by God, followed by the creation of animals and finally the creation of humans, Adam and Eve. Darwin and his followers believe that from the moment the universe was created, all the objects in that universe have been in competition. Everything - from the organisms that make up the global population, to the cells that make up those organisms, to the molecules that make up those cells - has beaten all of its competitors in the struggle for resources commonly known as life.

18 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer This project will attempt to model the day-to-day war between organisms of the same species. Organisms, or agents, that can move, kill, and reproduce will be created and placed in an ecosystem. Each agent will include a genome that codes for its various characteristics. Organisms that are more successful at surviving or more successful at reproducing will pass their genes to their children, making future generations better suited to the environment. The competition will continue, generation after generation, until the simulation terminates. If evolution has occurred, the characteristics of the population at the end of the simulation should be markedly different from those at the beginning.

19 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer Background Two of the main goals of this project are the study of microevolution and the effects of biological mechanisms on this process. Meiosis, the formation of gametes, controls how genes are passed from parents to their offspring. In the first stage of meiosis, prophase I, the strands of DNA floating around the nucleus of the cell are wrapped around histone proteins to form chromosomes. Chromosomes are easier to work with than the strands of chromatin, as they are packaged tightly into an "X" structure (two ">"s connected at the centromere). In the second phase, metaphase I, chromosomes pair up along the equator of the cell, with homologous chromosomes being directly across from each other. (Homologous chromosomes code for the same traits, but come from different parents, and thus code for different versions of the same trait.) The pairs of chromosomes, called tetrads, are connected and exchange genetic material.

20 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer This process, called crossing over, results in both of the chromosomes being a combination of genes from the mother and the father. Whole genes swap places, not individual nucleotides. In the third phase, anaphase I, fibers from within the cell pull the pairs apart. When the pairs are pulled apart, the two chromosomes are put on either side of the cell. Each pair is split randomly, so for each pair, there are two possible outcomes. For instance, the paternal chromosome can either move to the left or right side of the cell, with the maternal chromosome moving to the opposite end. In telophase I, the two sides of the cell split into two individual cells. Thus, for each cell undergoing meiosis, there are 2^n possible gametes, where n is the number of chromosome pairs; in humans, with 23 pairs, that is 2^23, or roughly 8.4 million, combinations before crossing over is even considered. With crossing over, there are almost an infinite number of combinations of genes in the gametes. This large number of combinations is the reason for the genetic biodiversity that exists in the world today, even within a single species. For example, there are 6 billion humans on the planet, and no two of them are exactly alike.

21 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer Procedure This project will be implemented with a matrix of agents. The matrix, initialized with only empty spaces, will be seeded with organisms by an Ecosystem class. Each agent in the matrix will have a genome, which will determine how it interacts with the Ecosystem. During every step of the simulation, an organism will have a choice whether to (1) do nothing, (2) move to an empty adjacent space, (3) kill an organism in a surrounding space, or (4) reproduce with an organism in an adjacent space. The likelihood of the organism performing any of these tasks is determined by the organism's personal variables, which will be coded for by the organism's genome. While the simulation is running, the average characteristics of the population will be measured. In theory, the mean value of each of the traits (speed, agility, strength, etc.) should either increase with time or gravitate towards a particular, optimum value.
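The per-step decision described above amounts to a weighted random choice among the four actions. The following is a minimal sketch of that idea, assuming hypothetical per-organism weights named after the personal variables (laziness, activity, rage, sexdrive) that appear in the Organism class later in this paper; it is an illustration, not the project's actual code.

#include <cstdlib>

// Hypothetical action set matching the four choices described above.
enum Action { IDLE, MOVE, KILL, REPRODUCE };

// Pick one action with probability proportional to the organism's
// personal weights (assumed here to be non-negative integers).
Action chooseAction(int laziness, int activity, int rage, int sexdrive)
{
    int total = laziness + activity + rage + sexdrive;
    if (total <= 0)
        return IDLE;                 // degenerate genome: do nothing

    int roll = rand() % total;       // uniform value in [0, total)
    if (roll < laziness) return IDLE;
    roll -= laziness;
    if (roll < activity) return MOVE;
    roll -= activity;
    if (roll < rage)     return KILL;
    return REPRODUCE;
}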

22 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer At its most basic level, the program written to model microevolution is an agent-environment program. The agents, or members of the Organism class, contain a genome and have abilities that are dependent upon the genome. Here is the declaration of the Organism class:

class Organism
{
public:
    Organism(); //constructors
    Organism(int ident, int row2, int col2);
    Organism(Nucleotide* mDNA, Nucleotide* dDNA, int ident, bool malefemale, int row2, int col2);
    ~Organism(); //destructor
    void printGenome();
    void meiosis(Nucleotide* gamete);
    Organism* reproduce(Organism* mate, int ident, int r, int c);
    int Interact(Organism* neighbors, int nlen);
...

23 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer

    //accessor functions; each assigns a gene a numeric value
    int Laziness();
    int Rage();
    int SexDrive();
    int Activity();
    int DeathRate();
    int ClausIndex();
    int Age();
    int Speed();
    int Row();
    int Col();
    int PIN();
    bool Interacted();
    bool Gender();
    void setPos(int row2, int col2);
    void setInteracted(bool interacted);
private:
    void randSpawn(Nucleotide* DNA, int size); //randomly generates a genome
    Nucleotide *mom, *dad; //genome
    int ID, row, col, laziness, rage, sexdrive, activity, deathrate, clausindex, speed; //personal characteristics
    double age;
    bool male, doneStuff;
...

24 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer The agents are managed by the environment class, known as Ecosystem. The Ecosystem contains a matrix of Organisms. Here is the declaration of the Ecosystem class:

class Ecosystem
{
public:
    Ecosystem(); //constructors
    Ecosystem(double oseed);
    ~Ecosystem(); //destructor
    void Run(int steps); //the simulation
    void printMap();
    void print(int r, int c);
    void surrSpaces(Organism* neighbors, int r, int c, int &friends); //the neighbors of any cell
private:
    Organism ** Population; //the matrix of Organisms
};

25 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer The simulation runs for a predetermined number of steps within the Ecosystem class. During every step of the simulation, the environment class cycles through the matrix of agents, telling each one to interact with its neighbors. To aid in the interaction, the environment sends the agent an array of the neighbors that it can affect. Once the agent has changed (or not changed) the array of neighbors, it sends the array back to the environment, which then updates the matrix of agents. Here is the code for the Organism's function that enables it to interact with its neighbors:

26 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer

int Organism::Interact(Organism* neighbors, int nlen)
//returns 0 if the organism hasn't moved & 1 if it has
{
    fout << row << " " << col << " ";
    if(!ID) //This Organism is not an organism
    {
        fout << "Not an organism, cannot interact!" << endl;
        return 0;
    }
    if(doneStuff) //This Organism has already interacted once this step
    {
        fout << "This organism has already interacted!" << endl;
        return 0;
    }
    doneStuff = true;
    int loop;
    for(loop = 0; loop < GENES * CHROMOSOMES * GENE_LENGTH; loop++)
    {
        if(rand() % RATE_MAX < MUTATION_RATE)
            mom[loop] = (Nucleotide)(rand() % 4);
        if(rand() % RATE_MAX < MUTATION_RATE)
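The interaction above is driven from the Ecosystem's Run loop, which is not reproduced in this paper. As a rough sketch of how that driving cycle might look, based only on the class declarations above (the grid dimensions ROWS and COLS are assumptions), the following illustrates the environment cycling through the matrix and handing each agent its neighborhood:

// Hypothetical grid size; the real dimensions are set elsewhere in the project.
const int ROWS = 50;
const int COLS = 50;

void Ecosystem::Run(int steps)
{
    for (int s = 0; s < steps; s++)
    {
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
            {
                Organism neighbors[8];                    // up to 8 adjacent cells
                int friends = 0;
                surrSpaces(neighbors, r, c, friends);     // gather the neighborhood
                Population[r][c].Interact(neighbors, friends); // let the agent act
                // the environment would then write any changed neighbors
                // back into the matrix before moving on
            }
        // record the population statistics for this step here
    }
}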

27 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer The Organisms, during any simulation step, can either move, kill a neighbor, remain idle, reproduce, or die. The fourth option, reproduction, is the most relevant to the project. As explained before, organisms that are better at reproducing or surviving will pass their genes to future generations. The most critical function in reproduction is the meiosis function, which determines what traits are passed down to offspring. The process is completely random, but an organism with a "good" gene has about a 50% chance of passing that gene on to its child. Here is the meiosis function, which determines what genes each organism sends to its offspring:

void Organism::meiosis(Nucleotide *gamete)
{
    int x, genect, chromct, crossover;
    Nucleotide * chromo = new Nucleotide[GENES * GENE_LENGTH], *chromo2 = new Nucleotide[GENES * GENE_LENGTH];
    Nucleotide * gene = new Nucleotide[GENE_LENGTH], *gene2 = new Nucleotide[GENE_LENGTH];
... (more code)
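Since the rest of the meiosis function is omitted above, here is a minimal, self-contained sketch of the random-assortment idea it implements: for each gene, the gamete receives either the maternal or the paternal copy with equal probability, which is why a "good" gene has about a 50% chance of being passed on. The constants and the omission of crossing over are simplifications for illustration, not the project's exact code.

#include <cstdlib>

// Simplified constants for illustration only.
const int GENES = 10;
const int GENE_LENGTH = 8;

enum Nucleotide { A, C, G, T };

// Build one gamete: for each gene, copy that gene from either the
// maternal (mom) or paternal (dad) strand with a 50% chance each.
void makeGamete(const Nucleotide* mom, const Nucleotide* dad, Nucleotide* gamete)
{
    for (int g = 0; g < GENES; g++)
    {
        const Nucleotide* source = (rand() % 2 == 0) ? mom : dad;
        for (int i = 0; i < GENE_LENGTH; i++)
            gamete[g * GENE_LENGTH + i] = source[g * GENE_LENGTH + i];
    }
}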

28 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer The functions and structures above are the most essential to the running of the program and the actual study of microevolution. At the end of each simulation step, the environment class records the statistics for the agents in the matrix and puts the numbers into a spreadsheet for analysis. The spreadsheet can be used to observe trends in the mean characteristics of the system over time. Using the spreadsheet created by the environment class, I was able to create charts that would help me analyze the evolution of the Organisms over the course of the simulation.

29 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer The first time I ran the simulation, I set the program so that there was no mutation in the agents' genomes. Genes were strictly created at the outset of the program, and those genes were passed down to future generations. If microevolution were to take place, a gene that coded for a beneficial characteristic would have a higher chance of being passed down to a later generation. Without mutation, however, if one organism possessed a characteristic that was far superior to the comparable characteristics of other organisms, that gene should theoretically allow that organism to "dominate" the other organisms and pass its genetic material to many children, in effect exterminating the genes that code for less beneficial characteristics.

30 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer For example, if an organism was created that had a 95% chance of reproducing in a given simulation step, it would quickly pass its genetic material to a lot of offspring, until its gene was the only one left coding for reproductive tendency, or libido.

31 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer As you can see from Figure 1, the average tendency to reproduce increases during the simulation. The tendency to die decreases to almost nonexistence. The tendency to remain still, since it has essentially no effect on anything, stays almost constant. The tendency to move to adjacent spaces, thereby spreading one's genes throughout the ecosystem, increases to be almost as likely as reproduction. The tendency to kill one's neighbor decreases drastically, probably because it does not positively benefit the murdering organism. In Figure 2, we can see that the population seems to stabilize at about the same time as the average characteristics. This would suggest that there was a large amount of competition among the organisms early in the simulation, but the competition quieted down as one dominant set of genes took over the ecosystem.

32 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer Figures 3 and 4 show the results from the second run of the program, when mutation was turned on. As you can see, many of the same trends exist, with reproductive tendency skyrocketing and tendency to kill plummeting. Upon reevaluation, it seems that perhaps the tendencies to move and remain idle do not really affect an agent's ability to survive, and thus their trends are more subject to the fluctuations that occur in the beginning of the simulation. One thing to note about the mutation simulation is the larger degree of fluctuation in both characteristics and population. The population stabilizes at about the same number, but the swings between simulation steps are more pronounced. In Figure 3, the stabilization that had occurred in Figure 1 is largely not present.

33 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer Conclusion The goal of this project at the outset was to create a system that modeled trends and processes from the natural world, using the same mechanisms that occur in that natural world. While this project by no means definitively proves the correctness of Darwin's theory of evolution over the creationist theory, it demonstrates some of the basic principles that Darwin addressed in his book, The Origin of Species. Darwin addresses two distinct processes: natural selection and artificial selection. Artificial selection, or selective breeding, was not present in this project at all; there was no point in the program where the user was allowed to pick which organisms survived. Natural selection, though it is a stretch because "nature" was the inside of a computer, was simulated. Natural selection, described as the "survival of the fittest," is when an organism's characteristics enable it to survive and pass those traits to its offspring.

34 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING Matt Fifer In this program, "nature" was allowed to run its course, and at the end of the simulation, the organisms with the best combination of characteristics had triumphed over their predecessors. "Natural" selection occurred as predicted. *All of the information in this report was either taught last year in A.P. Biology or drawn, to a small degree, from Charles Darwin's The Origin of Species. I created all of the code and all of the charts in this paper. For my next draft, I will be sure to include more outside information that I have found in the course of my research.*

35 Using Machine Translation in a German – English Translator This project attempts to take the beginning steps towards the goal of creating a translator program that operates within the scope of translating between English and German.

36 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji Abstract: The field of machine translation - using computers to provide translations between human languages - has been around for decades. And the dream of an ideal machine providing a perfect translation between languages has been around still longer. This project attempts to take the beginning steps towards that goal, creating a translator program that operates within an extremely limited scope to translate between English and German. There are several different strategies to machine translation, and this project will look into them - but the strategy taken in this project will be the researcher's own, with the general guideline of "thinking as a human."

37 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji For if humans can translate between languages, there must be something to how we do it, and hopefully that something - that thought process, hopefully - can be transferred to the machine and provide quality translations. Background There are several methods of varying difficulty and success for machine translation. The best method to use depends on what sort of system is being created. A bilingual system translates between one pair of languages; a multilingual system translates between more than two languages.

38 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji The easiest translation method to code, yet probably the least successful, is known as the direct approach. The direct approach does what it sounds like it does - it takes the input language (known as the "source language"), performs morphological analysis - whereby words are broken down and analyzed for things such as prefixes and past-tense endings - performs a bilingual dictionary look-up to determine the words' meanings in the target language, performs a local reordering to fit the grammar structure of the target language, and produces the target language output. The problem with this approach is that it is essentially a word-for-word translation with some reordering, resulting often in mistranslations and incorrect grammar structures.
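To make the direct approach concrete, here is a minimal sketch of a word-for-word translator: a dictionary look-up with no real analysis or reordering. The tiny English-to-German dictionary is invented for illustration (and written in C++ rather than the project's Java); the point is only to show why the output of this method tends to come out grammatically wrong.

#include <iostream>
#include <map>
#include <sstream>
#include <string>

// A toy bilingual dictionary (illustrative entries only).
std::map<std::string, std::string> makeDictionary()
{
    std::map<std::string, std::string> d;
    d["i"] = "ich";
    d["ate"] = "ass";     // past tense of "essen" (properly spelled with an eszett)
    d["an"] = "ein";
    d["apple"] = "Apfel";
    return d;
}

// Direct-approach translation: look each word up and emit it in place.
std::string directTranslate(const std::string& sentence)
{
    std::map<std::string, std::string> dict = makeDictionary();
    std::istringstream in(sentence);
    std::string word, out;
    while (in >> word)
    {
        std::map<std::string, std::string>::iterator it = dict.find(word);
        out += (it != dict.end() ? it->second : word) + " ";
    }
    return out;   // "ich ass ein Apfel" - note the missing case agreement (einen)
}

int main()
{
    std::cout << directTranslate("i ate an apple") << std::endl;
    return 0;
}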

39 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji Furthermore, when creating a multilingual system, the direct approach would require several different translation algorithms - one or two for each language pair. The indirect approach involves some sort of intermediate representation of the source language before translating into the target language. In this way, linguistic analysis of the source language can be performed on the intermediate representation. Translating to the intermediary also enables semantic analysis, as the source language input can be examined more carefully to detect idioms, etc., which can be stored in the intermediary and then appropriately used to translate into the target language.

40 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji The transfer method is similar, except that the transfer is language dependent - that is to say, the French-English intermediary transfer would be different from the English-German transfer. An interlingua intermediary can be used for multilingual systems. Theory Humans fluent in two or more languages are at the moment better translators than the best machine translators in the world. Indeed, a person with three years of experience in learning a second language will already be a better translator than the best machine translators in the world.

41 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji Yet for humans and machines alike, translation is a process, a series of steps that must be followed in order to produce a successful translation. It is interesting to note, however, that the various methods of translation for machines - the various processes - become less and less like the process for humans as they become more complicated. It is also interesting that as the method of machine translation becomes more complicated, the results are sometimes less accurate than the results of simpler methods that better model the human rationale for translation.

42 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji Therefore, the theory is, an algorithm that attempts to model the human translation process would be more successful than other, more complicated methods currently in development today. This theory is not entirely plausible for full-scale translators because of the sheer magnitude of data that would be required. Humans are better translators than computers in part because they have the ability to perform semantic analysis, because they have the necessary semantic information to be able to, for example, determine the difference in a word's definition based on its usage in context. Creating a translator with a limited scope of vocabulary would require less data, leaving more room for semantic information to be stored along with definitions.

43 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji A limited-scope translator may seem of little use at first glance, but even humans fluent in a language, including their native language, don't know the entire vocabulary of that language. A language has hundreds of thousands of words, and no human knows even half of them. A computer with a vocabulary of commonly used words that most people know, along with information to avoid semantic problems, would therefore still be useful for nonprofessional work. Development On the most superficial level, a translator is more user-friendly for an average person if it is GUI-based rather than simply text-based. This part of the development is finished. The program presents a GUI for the user.

44 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji A JFrame opens up with two text areas and a translate button. The text areas are labeled "English" and "German". The input text is typed into the English window, the "Translate" button is clicked, and the translator, once finished, outputs the translated text into the German text area. Although typing into the German text area is possible, the text in the German text area does not affect the translator process. The first problem to deal with in creating a machine translator is to be able to recognize the words that are inputted into the system. A sentence or multiple sentences are input into the translator, and a string consisting of that entire sentence (or sentences) is passed to the translate() function.

45 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji The system loops through the string, finding all space (' ') characters and punctuation characters (comma, period, etc.) and recording their positions. (It is important to note the position of each punctuation mark, as well as what kind of punctuation mark it is, because the existence and position of punctuation marks alter the meaning of a sentence.) The number of words in the sentence is determined to be the number of spaces plus one. By recording the position of each space, the string can then be broken up into words. The start position of each word is the position of the previous space plus one, and the end position is the position of the next space.

46 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji This means that punctuation at the end of any given word is placed into the String with that word, but this is not a problem: the location of each punctuation mark is already recorded, and the dictionary look-up of each word will first check to ensure that the last character of the word is a letter; if not, it will simply disregard the last character. The next problem is the biggest problem of all: the problem of actual translation itself. No code has been written for this yet, but development of pseudocode has already begun. As previously mentioned, translation is a process. In order to write a translator program that follows the human translation process, the human process must first be recognized and broken down into programmable steps. This is no easy task.
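The word-splitting step described above is simple enough to sketch directly. The following is a minimal illustration of that scan - recording space and punctuation positions, then cutting the string into words - written in C++ as an assumed example rather than the project's actual Java implementation.

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Split a sentence into words by scanning for spaces, the way the
// translate() step described above works; punctuation positions are
// recorded separately so later stages can make use of them.
void splitWords(const std::string& sentence,
                std::vector<std::string>& words,
                std::vector<std::size_t>& punctuation)
{
    std::vector<std::size_t> spaces;
    for (std::size_t i = 0; i < sentence.size(); i++)
    {
        if (sentence[i] == ' ')
            spaces.push_back(i);
        else if (std::ispunct(static_cast<unsigned char>(sentence[i])))
            punctuation.push_back(i);        // note where each mark sits
    }

    // number of words = number of spaces + 1
    std::size_t start = 0;
    for (std::size_t s = 0; s < spaces.size(); s++)
    {
        words.push_back(sentence.substr(start, spaces[s] - start));
        start = spaces[s] + 1;               // next word begins after the space
    }
    words.push_back(sentence.substr(start)); // last word runs to the end
}

int main()
{
    std::vector<std::string> words;
    std::vector<std::size_t> punct;
    splitWords("I ate an apple.", words, punct);
    for (std::size_t i = 0; i < words.size(); i++)
        std::cout << words[i] << std::endl;  // I / ate / an / apple.
    return 0;
}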

47 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji Humans with five years of experience in learning a language may already translate any given text quickly enough, save the time needed to look up unfamiliar words, that the process goes by too quickly to fully take note of. The basic process is not entirely determined yet, but there is some progress on it. The process to determine the process has been as follows: given a random sentence to translate, the sentence is first translated by a human, and then the process is noted. Each sentence given is of ever-increasing difficulty to translate.

48 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji For example, the sentence "I ate an apple" is translated via the following process:
1) Find the subject and the verb. (I; ate)
2) Determine the tense and form of the verb. (ate = past, Imperfekt form)
2a) Translate the subject and verb. (Ich; ass) (note - "ass" is a real German verb form.)
3) Determine what the verb requires. (ate -> eat; requires a direct object)
4) Find what the verb requires in the sentence. (the direct object comes after the verb and article; apple)
5) Translate the article and the direct object. (ein; Apfel)
6) Consider the gender of the direct object, and change the article if necessary. (der Apfel; ein -> einen)
Result: Ich ass einen Apfel.

49 Natural Language Processing: Using Machine Translation in Creation of a German-English Translator Jason Ji References (I'll put these in proper bibliomumbo jumbographical order later!)
1. (dictionary)
2. "An Introduction To Machine Translation" (available online at TOC.htm)
3. (some info on machine translation)
4.

50 A Study of Balanced Search Trees This project investigates four different balanced search trees for their advantages and disadvantages, and thus ultimately their efficiency. Runtime and memory space management are the two main aspects under study. Statistical analysis is provided to distinguish subtle differences if there are any. A new balanced search tree is suggested and compared with the four balanced search trees.

51 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim Abstract This project investigates four different balanced search trees for their advantages and disadvantages, and thus ultimately their efficiency. Runtime and memory space management are the two main aspects under study. Statistical analysis is provided to distinguish subtle differences if there are any. A new balanced search tree is suggested and compared with the four balanced search trees under study. The balanced search trees are implemented in C++ using pointers and structs extensively.

52 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim Introduction Balanced search trees are important data structures. A normal binary search tree has some disadvantages, specifically its dependence on the order of the incoming data, which significantly affects its tree structure and hence its performance. The height of a search tree is the maximum distance from the root of the tree to a leaf. An optimal search tree is one that tries to minimize its height given some amount of data. To improve the height, and thus the efficiency, balanced search trees have been developed that adjust themselves into optimal tree structures allowing quicker access to the data stored in them. For example, the red-black tree is a balanced binary tree that balances according to a color pattern of nodes (red or black) using rotation functions.

53 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim Rotation functions are a hallmark of nearly all balanced search trees; they rotate, or adjust, subtree heights around a pivot node. Many balanced trees have been suggested and developed: the red-black tree, AVL tree, weight-balanced tree, B-tree, and more.
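As an illustration of what such a rotation does, here is a minimal sketch of a left rotation on a plain binary search tree node; the Node struct is an assumption for illustration and is not taken from the project's code.

// Minimal node type assumed for these illustrations.
struct Node
{
    int key;
    Node* left;
    Node* right;
};

// Left rotation around the pivot x: x's right child y moves up,
// x becomes y's left child, and y's old left subtree becomes x's
// right subtree.  The in-order ordering of the keys is preserved,
// but the height on x's right side drops by one.
Node* rotateLeft(Node* x)
{
    Node* y = x->right;   // y must exist for a left rotation
    x->right = y->left;   // y's left subtree re-hangs under x
    y->left = x;          // x drops below y
    return y;             // y is the new root of this subtree
}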

54 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim Background Information Search Tree Basics This project requires a good understanding of binary trees and general search tree basics. A binary tree has nodes and edges. Nodes are the elements in the tree, and edges represent the relationship between two nodes. Each node in a binary tree is connected by edges to zero, one, or two child nodes. In a general search tree, each node can have more than two children, as in the case of the B-tree. The node is called a parent, and the nodes connected by edges from this parent node are called its children. A node with no child is called a leaf node. An easy visualization of a binary tree is a real tree turned upside down on paper, with the root at the top and the branches at the bottom.

55 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim The topmost ancestor of all nodes in a binary tree is called the root. From the root, the tree branches out to its immediate children and subsequent descendants. Each node's children are designated the left child and the right child. One property of a binary search tree is that the value stored in the left child is less than or equal to the value stored in the parent. The right child's value is, on the other hand, greater than the parent's. (Left <= Parent, Parent < Right)

56 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim 3.2 Search Tree Functions There are several main functions that go along with binary trees and general search trees: insertion, deletion, search, and traversal. In insertion, a data value is entered into the search tree and compared with the root. If the value is less than or equal to the root's, then the insertion function proceeds to the left child of the root and compares again. Otherwise the function proceeds to the right child and compares the value with that node's. When the function reaches the end of the tree, for example if the last node the value was compared with was a leaf node, a new node is created at that position with the newly inserted value.
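The insertion walk just described can be sketched in a few lines. This reuses the illustrative Node struct from the rotation sketch above and is an assumed example, not code from the project.

// Recursive binary-search-tree insertion as described above: go left
// when the value is less than or equal to the node's key, go right
// otherwise, and create a new node when an empty spot is reached.
Node* insert(Node* root, int value)
{
    if (root == 0)                    // reached the end of the tree
    {
        Node* n = new Node;
        n->key = value;
        n->left = n->right = 0;
        return n;
    }
    if (value <= root->key)
        root->left = insert(root->left, value);
    else
        root->right = insert(root->right, value);
    return root;
}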

57 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim The deletion function works similarly: it finds the node with the value of interest (by going left and right accordingly), then deletes that node and fixes the tree (changing parent-child relationships, etc.) to keep the property of the binary tree or that of the general search tree. The search function, or basically data retrieval, is also similar. After traversing down the tree (starting from the root), two cases are possible. If the value of interest is encountered on the traversal, then the function reports that the data is in the tree. If the traversal ends at a leaf node without encountering the value, then the function simply reports otherwise. There are three kinds of traversal functions to show the structure of a tree: preorder, inorder, and postorder.

58 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim They are recursive functions that print the data in a special order. For example, in preorder traversal, as the prefix "pre" suggests, first the value of the node is printed, then the recursion repeats on the left subtree and then on the right subtree. Similarly, in inorder traversal, as the prefix "in" suggests, first the left subtree is output, then the node's value, then the right subtree. (Thus the node's value is output in the middle of the function.) The same pattern applies to the postorder traversal.
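For concreteness, here is a short sketch of the three traversals just described, again using the illustrative Node struct from the rotation sketch rather than the project's own code.

#include <iostream>

// Preorder: node, then left subtree, then right subtree.
void preorder(const Node* n)
{
    if (n == 0) return;
    std::cout << n->key << " ";
    preorder(n->left);
    preorder(n->right);
}

// Inorder: left subtree, node, right subtree (prints keys in sorted order).
void inorder(const Node* n)
{
    if (n == 0) return;
    inorder(n->left);
    std::cout << n->key << " ";
    inorder(n->right);
}

// Postorder: left subtree, right subtree, then the node itself.
void postorder(const Node* n)
{
    if (n == 0) return;
    postorder(n->left);
    postorder(n->right);
    std::cout << n->key << " ";
}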

59 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim 3.3 The Problem It is not hard to see from the structure of a binary search tree (or general search tree) that the order of data input is important. In an optimal binary tree, the data are input so that insertion occurs just right and the tree stays balanced: the size of the left subtree is approximately equal to the size of the right subtree at each node in the tree. In an optimal binary tree, the insertion, deletion, and search functions run in O(log N), with N the number of data items in the tree. This follows because each comparison and subsequent traversal (to the left or to the right) cuts the set of possible positions in half. However, that is only when the input is nicely ordered and the search tree is balanced.

60 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim It is also possible that the data are input so that only right nodes are added (Root -> right -> right -> right...). It is obvious that the search tree now looks like just a linear array. And it is. This gives O(N) insertion, deletion, and search operations, which is not efficient. Thus balanced search trees were developed to perform these functions efficiently regardless of the data input. 4 Balanced Search Trees Four major balanced search trees are investigated. Three of them, namely the red-black tree, the height-balanced tree, and the weight-balanced tree, are binary search trees. The fourth, the B-tree, is a search tree whose nodes can have multiple (> 2) children.

61 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim 4.1 Red-black tree The red-black search tree is a special binary tree with a color scheme: each node is either black or red. There are four properties that make a binary tree a red-black tree.
(1) The root of the tree is colored black.
(2) All paths from the root to the leaves agree on the number of black nodes.
(3) No path from the root to a leaf may contain two consecutive nodes colored red.
(4) Every path from a node to a leaf (of the descendants) has the same number of black nodes.
The performance of a balanced search tree is directly related to its height. For a binary tree, lg(number of nodes) is usually the optimal height. A red-black tree with n nodes has height at most 2 lg(n + 1). The proof is noteworthy, but difficult to understand.

62 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim In order to prove the assertion that a red-black tree's height is at most 2 lg(n + 1), we should first define bh(x). bh(x) is defined to be the number of black nodes on any path from, but not including, a node x down to a leaf. Notice that black height (bh) is well defined under property 2 of the red-black tree. It is easy to see that the black height of a tree is the black height of its root. First we shall prove that the subtree rooted at any given node x contains at least 2^bh(x) - 1 internal nodes. We can prove this by induction on the height of a node x. The base case is bh(x) = 0, which means that x must be a leaf (NIL); then it follows that the subtree rooted at x contains 2^0 - 1 = 0 internal nodes. The following is the inductive step. Say node x has positive height and has two children.

63 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim Note that each child has a black height of either bh(x), if it is a red node, or bh(x) - 1, if it is a black node. It follows that the subtree rooted at x contains at least 2(2^(bh(x) - 1) - 1) + 1 = 2^bh(x) - 1 nodes. The first term is the minimum contribution of the left and right subtrees combined, and the second term (the 1) is the node x itself; doing some algebra leads to the right side of the equation. Having proved this, bounding the maximum height of a red-black tree is fairly straightforward. Let h be the height of the tree. By property 3 of the red-black tree, at least half of the nodes on any simple path from the root to a leaf must be black, so the black height of the root must be at least h/2. Then:
n >= 2^bh(root) - 1 >= 2^(h/2) - 1
n + 1 >= 2^(h/2)
lg(n + 1) >= lg(2^(h/2)) = h/2
h <= 2 lg(n + 1)

64 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim Therefore we have just proved that a red-black tree with n nodes has height at most 2 lg(n + 1). 4.2 Height Balanced Tree The height-balanced tree is a different approach to bounding the maximum height of a binary search tree. For each node, the heights of the left subtree and the right subtree are stored. The key idea is to balance the tree by rotating around any node whose left and right subtrees differ in height by more than a threshold. It all boils down to the following property: (1) At each node, the difference between the height of the left subtree and the height of the right subtree is less than a threshold value. A height-balanced tree should yield a height of about lg(n), depending on the threshold value.
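Here is a minimal sketch of the balance check just described, using the illustrative Node struct from earlier. The threshold parameter and the recomputation of heights on every call are simplifications for clarity, not the project's implementation (a real height-balanced tree stores the heights in the nodes, as the text says).

#include <algorithm>

// Height of a subtree; an empty subtree has height 0 here.
int height(const Node* n)
{
    if (n == 0) return 0;
    return 1 + std::max(height(n->left), height(n->right));
}

// The height-balanced property at a single node: the left and right
// subtree heights differ by less than the chosen threshold.
bool balancedAt(const Node* n, int threshold)
{
    if (n == 0) return true;
    int diff = height(n->left) - height(n->right);
    if (diff < 0) diff = -diff;
    return diff < threshold;   // rotate around n when this check fails
}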

65 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim An intuitive, less rigorous and yet valid proof is provided. Imagine a simple binary tree in the worst-case scenario: a line of nodes. If the simple binary tree were to be transformed into a height-balanced tree, the following process should do it.
(1) Pick some node near the middle of a given strand of nodes so that the threshold property is satisfied (absolute value(leftH() - rightH()) less than the threshold).
(2) Define this node as a parent and the resulting two strands (nearly equal in length) as its left subtree and right subtree appropriately.
(3) Repeat steps (1) and (2) on the left subtree and the right subtree.
First note that this process will terminate. This is because at each step the given strand is split into two halves smaller than the original, so the number of nodes in a given strand decreases.

66 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim This will eventually reach a terminal number of nodes determined by the threshold height difference. If a given strand is impossible to divide so that the threshold height difference holds, then that is the end for that recursive subroutine. Splitting into two halves recursively is analogous to dividing a mass into two halves each time, and dividing by 2 repeatedly leads to lg(n). So it follows that the height of a height-balanced tree should be lg(n), or something around that magnitude. It is interesting to note that a height-balanced tree is roughly a complete binary tree. This is because height balancing forces nodes to gather near the top. There is probably a decent proof of this observation, but simple intuition is enough to see it.

67 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim 4.3 Weight Balanced Tree The weight-balanced tree is very similar to the height-balanced tree. It is essentially the same idea with a different nuance, and the overall data structure is also similar. Instead of the heights of the left and right subtrees, the weights of the left and right subtrees are kept. The weight of a tree is defined as the number of nodes in that tree. The key idea is to balance the tree by rotating around any node whose left and right subtrees differ in weight by more than a threshold. Rotating around a node shifts the weight balance to a more favorable one, specifically the one with the smaller difference between the weights of the left and right subtrees. The weight-balanced tree has the following main property: (1) At each node, the difference between the weight of the left subtree and the weight of the right subtree is less than the threshold value.

68 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim An approach similar to the one used for the height-balanced tree shows the lg(n) height of the weight-balanced tree. The proof uses a mostly intuitive argument built on recursion and induction. Transforming a line of nodes, the worst-case scenario in a simple binary tree, into a weight-balanced tree can be done by the following steps.
(1) Pick some node near the middle of a given strand of nodes so that the threshold property is satisfied (absolute value(leftW() - rightW()) less than the threshold).
(2) Define this node as a parent and the resulting two strands (nearly equal in length) as its left subtree and right subtree appropriately.
(3) Repeat steps (1) and (2) on the left subtree and the right subtree.
It is easy to confuse the first step of the height-balanced tree with that of the weight-balanced tree, but picking the middle node surely satisfies both the height-balanced property and the weight-balanced property.

69 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim If anything, the weight-balanced property is the better satisfied of the two, since the middle node presumably has the same number of nodes before and after its position. This process will terminate, because at each step the given strand is split into two halves smaller than the original strand, so the number of nodes in a given strand decreases. This will eventually reach a terminal number of nodes determined by the threshold weight difference. Splitting into two halves recursively is analogous to dividing a mass into two halves each time, and dividing by 2 repeatedly leads to lg(n). So it follows that the height of a weight-balanced tree should be lg(n), or something around that magnitude. Like the height-balanced tree, the weight-balanced tree is roughly a complete binary tree.

70 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim A New Balanced Search Tree(?) A new balanced search tree has been developed. The tree has no theoretical value to computer science, but it probably has practical value. The new balanced search tree will be referred to as the median-weight-mix tree; each node will have a key, zero to two children, and some sort of weight. 5.1 Background The median-weight-mix tree probably serves no theoretical purpose because it is not perfect: it has no well-defined behavior that obeys a set of properties. Rather, it serves a practical purpose, most likely in statistics.

71 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim The median-weight-mix tree is based on the following assumptions about data processing: (1) Given the lower bound and upper bound of the total data input, random behavior is assumed, meaning data points will be distributed evenly around the interval. (2) Multiple bell curves are assumed to be present in the interval. The first assumption is not hard to understand. It is based on the idea that nature is random: the data points will be scattered about, but evenly, since randomness means each data value has an equal chance of being present in the input set. A physical example of this model would be rain. In a rainstorm, raindrops fall randomly onto the ground. In fact, one can estimate the amount of rainfall by sampling a small area.

72 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim The amount of rain is measured in the small sampling area, and then the total rainfall can be calculated by numerical projection, a ratio, or whatever method: the total rainfall would be rainfall-in-small-area * area-of-total-area / area-of-small-area. The second assumption is based upon a less apparent observation. Nature is not completely random, which means some numbers will occur more often than others. When the data values and the frequency of those data values are plotted on a 2D plane, a wave is expected: there are more hits in some ranges of data values (the crests) than in other ranges (the troughs). A practical example would be height. One might expect a single well-defined bell-shaped curve around the average height. (People tend to be around 5 foot 10 inches.) But this is not true on a global scale, because there are isolated populations around the world. The average height of Americans is not necessarily the average height of Chinese. So this wave-shaped curve is assumed.

73 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim 5.2 Algorithm Each node will have a key (data number), an interval (with the lower and upper bounds of its assigned interval), and the weights of its left and right subtrees. The weight of each subtree is calculated based on constants R and S. Constant R represents the importance of focusing on frequency-heavy data points. Constant S represents the importance of focusing on frequency-weak data points. The ratio R/S consequently represents the relative importance of frequency-heavy vs. frequency-weak data points. The tree will then be balanced to maintain a favorable R/S ratio at each node by means of rotations, both left and right.

74 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim Methodology Idea Evaluating binary search trees can be done in various ways because they can serve a number of purposes. For this project, a binary search tree was developed to take some advantage of the random nature of statistics under some assumptions. Therefore it is reasonable to do the evaluation on this basis. With this overall purpose, several behaviors of the balanced search trees will be examined: (1) the time it takes to process a data set, (2) the average retrieval time for data, and (3) the height of the binary tree. The above properties are the major ones that outline the analysis. Speed is important, and each binary tree is timed to check how long it takes to process the input data. But the average retrieval time is also important because it is the best indication of the efficiency of the data structure. What is the use of inputting a number quickly if you retrieve it slowly? Lastly, the height of the binary tree is checked to see how the theoretical ideas work out in practical situations.

75 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim 6.2 Detail It is worthwhile to note how each behavior is measured in C++. For measuring the time it takes to process a data set, the starting time and the ending time will be recorded with the clock() function from the time.h library. The duration is then (EndTime - StartTime) / CLOCKS_PER_SEC. The average retrieval time will be calculated by first summing the time it takes to check each data point in the tree and then dividing this sum by the number of data points in the binary tree. The height of the binary tree, the third behavior under study, is calculated by tree traversal (pre-, in-, or post-order), simply taking the maximum depth visited as each node is scanned.
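A minimal sketch of the timing measurement described above follows; insertData is a hypothetical stand-in for whichever tree-building routine is being benchmarked, not a function from the project.

#include <cstdlib>
#include <ctime>
#include <iostream>

// Hypothetical stand-in for building one of the trees from a data set.
long insertData(const int* data, int n)
{
    long checksum = 0;
    for (int i = 0; i < n; i++)
        checksum += data[i] % 97;         // placeholder work
    return checksum;
}

int main()
{
    const int N = 100000;
    int* data = new int[N];
    for (int i = 0; i < N; i++)
        data[i] = rand();                 // randomly generated test data

    clock_t start = clock();              // starting time
    insertData(data, N);                  // process the data set
    clock_t end = clock();                // ending time

    double seconds = double(end - start) / CLOCKS_PER_SEC;
    std::cout << "processing took " << seconds << " s" << std::endl;

    delete[] data;
    return 0;
}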

76 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree Anthony Kim There will be several (identical) test cases to check the red-black tree, height-balanced tree, weight-balanced tree, and median-weight-mix tree. The first category of test runs will be test cases with a gradually increasing number of randomly generated data points. The second category of test runs will be hand-manipulated: data points will still be randomly generated, but under some statistical behaviors, such as a "wave," a single bell curve, etc. The third category of test runs will be real-life data points such as heights, ages, and others. Due to the immense amount of data, some proportional scaling might be used to accommodate the memory capacity of the balanced binary trees.
7 Result Analysis C++ code of the balanced search trees will be provided, along with testing of the balanced search trees for their efficiency. Graphs and tables will be provided. Under construction.
8 Conclusion Under construction.
9 References Under construction.
Appendix A: Other Balanced Search Trees
Appendix B: Code

77 Linux Kernel Debugging API The purpose of this project is to create an implementation of much of the kernel API that functions in user space, the normal environment that processes run in. The issue with testing kernel code is that the live kernel runs in kernel space, a separate area that deals with hardware interaction and management of all the other processes. Kernel space debuggers are unreliable and very limited in scope; a kernel failure can hardly dump useful error information because there's no operating system left to write that information to disk.

78 Kernel Debugging User-Space API Library (KDUAL) John Livingston Abstract: The purpose of this project is to create an implementation of much of the kernel API that functions in user space, the normal environment that processes run in. The issue with testing kernel code is that the live kernel runs in kernel space, a separate area that deals with hardware interaction and management of all the other processes. Kernel space debuggers are unreliable and very limited in scope; a kernel failure can hardly dump useful error information because there's no operating system left to write that information to disk. Kernel development is quite likely the most important active project in the Linux community.

79 Kernel Debugging User-Space API Library (KDUAL) John Livingston Any aids to the development process would be appreciated by the entire kernel development team, allowing them to do their work faster and pass changes along to the end user quicker. This program will make a direct contribution to kernel developers, but an indirect contribution to every future user of Linux. Introduction and Background The Linux kernel is arguably the most complex piece of software ever crafted. It must be held to the most stringent standards of performance, as any malfunction, or worse, security flaw, could be potentially fatal for a critical application. However, because of the nature of the kernel and its close interaction with hardware, it's extremely difficult to debug kernel code.

80 Kernel Debugging User-Space API Library (KDUAL) John Livingston The goal of this project is to create a C library that provides the kernel API but operates in ordinary user space, without actual interaction with the underlying system. Kernel code currently being tested can then be compiled against this library for testing without the risks and confusion of testing it on a live system. Process The design of this API has an extremely simple development process: research, code, debug. Sub-tasks are somewhat difficult to define, as the library cannot do very much of use until complete. However, the rapidly growing source code, along with small demonstrations of sections of the library, is sufficient for progress reporting purposes. Development thus far has been simple; no special tools have been needed beyond the vim editor, the GNU C compiler and linker, and a very large amount of work time.

81 Kernel Debugging User-Space API Library (KDUAL) John Livingston Testing of the library with simple functions will be trivial; the eventual goal of this project is to construct a small patch to the kernel using this library, both as a demonstration of the library's effectiveness and to solve an existing problem. This patch would allow seamless use of the Andrew File System (AFS) with the 2.6.x kernel, greatly benefiting the lab's workstations by allowing an immediate migration to 2.6, which offers large improvements. On a more detailed level, I have been implementing sections of the Linux VFS, as well as math processing. VFS is necessary to handle "file interaction" in the virtual kernel, while most of the mathematical work has been to optimize basic functions (add, subtract, compare, etc.) using x86 assembly.

82 Kernel Debugging User-Space API Library (KDUAL) John Livingston Because this library attempts to simulate a program that uses hardware directly for computation, its own internal simulation of that computation must be as fast as possible. It will never reach anywhere near the speed of the actual kernel, but replacing the original C syntax for addition with its equivalent in inline assembly yields roughly a tenfold speedup. These two sections of the kernel library will be my primary contribution to this project. The current code for the two spans several thousand lines. The majority of the codebase is written; however, minor changes, fixes, and improvements will still require significant effort. This project's success depends on efficiency as much as on simple functionality.
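For concreteness, here is a minimal sketch (mine, not the project's code) of the mechanism being described: GCC-style x86 inline assembly for integer addition next to the plain C form. For a case this trivial a modern compiler will usually emit the same instruction either way, so the snippet shows the syntax rather than the claimed speedup, and it compiles only on x86 with GCC or Clang:

#include <cstdio>

static inline int add_c(int a, int b) {
    return a + b;                           // ordinary C/C++ addition
}

static inline int add_asm(int a, int b) {
    // "addl" adds b into a; "+r" marks a as both input and output register.
    asm volatile("addl %1, %0" : "+r"(a) : "r"(b));
    return a;
}

int main() {
    std::printf("%d %d\n", add_c(2, 3), add_asm(2, 3));   // both print 5
    return 0;
}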

83 Kernel Debugging User-Space API Library (KDUAL) John Livingston References: The Linux Kernel Archives; the Debian distribution of Linux (the best one, not to start a flame war or anything); the archive of the Linux Kernel Mailing List, the primary method of communication for kernel developers; and an open-source implementation of the Andrew File System.

84 An Analysis of Sabermetric Statistics in Baseball For years, baseball theorists have pondered the most basic question of baseball statistics: which statistic most accurately predicts which team will win a baseball game? With this information, baseball teams can rely on technological, statistics-based scouting organizations. The book Moneyball addresses the advent of sabermetric statistics in the 1980s and 1990s and shows how radical baseball thinkers instituted a new era of baseball scouting and player analysis. This project analyzes which baseball statistic is the single most important. It has been found that new formulas, such as OBP, OPS, and Runs Created, correlate better with the number of runs a team scores than traditional statistics such as batting average.

85 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Abstract: For years, baseball theorists have pondered the most basic question of baseball statistics: which statistic most accurately predicts which team will win a baseball game? With this information, baseball teams can rely on technological, statistics-based scouting organizations. The book Moneyball addresses the advent of sabermetric statistics in the 1980s and 1990s and shows how radical baseball thinkers instituted a new era of baseball scouting and player analysis. This project analyzes which baseball statistic is the single most important. It has been found that new formulas, such as OBP, OPS, and Runs Created, correlate better with the number of runs a team scores than traditional statistics such as batting average.

86 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Introduction: For some time, a baseball debate has been brewing. Newcomers and sabermetricians (the “Statistics Community”) feel that baseball can be analyzed as a scientific entity. The Sabermetric Manifesto by Bill James serves as the Constitution for these numbers-oriented people. Also, Moneyball by Michael Lewis serves as the successful model of practical application of their theories. Traditional scouts (the “Scouting Community”) contend that baseball statistics should not be over-analyzed and stress the importance of intangibles and the need for scouts. The debate can also be interpreted in terms of statistics. Baseball lifers feel that stats such as batting average are the most important.

87 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Meanwhile, the Statistics Community feels that complex, formulaic stats can better predict a player’s contributions to a team. The discussion continues in the offices of baseball teams around the country: are computer algorithms better than human senses? From a statistical standpoint, baseball is an ideal sport. Plate appearances are discrete events with a few distinct results: broadly, a hit, a walk, or an out. Outcomes can also be expressed more specifically: single, double, triple, home run, walk, strike-out, fly-out, etc. Most importantly, the outcomes of past plate appearances can accurately predict the outcomes of future plate appearances. Baseball statisticians continue to desire more information in their field in order to become better at analyzing the past and predicting the future.

88 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Definition of Terms: BA – Batting Average; OBP – On-Base Percentage; OPS – On-Base Percentage plus Slugging Percentage; OPS Adjusted – (On-Base Percentage * 1.2) plus Slugging Percentage; Runs Created – On-Base Percentage * Slugging Percentage
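A small sketch (my own illustration, not the project's code) of the formulas as defined on this slide; the exact OBP denominator used here is the standard one and is an assumption, since the slide does not spell it out:

#include <cstdio>

struct HitterLine {
    int hits, walks, at_bats, total_bases, hit_by_pitch, sac_flies;
};

double batting_average(const HitterLine &h) {
    return static_cast<double>(h.hits) / h.at_bats;
}

double on_base_pct(const HitterLine &h) {
    // Standard OBP: times on base over the plate appearances counted here.
    return static_cast<double>(h.hits + h.walks + h.hit_by_pitch) /
           (h.at_bats + h.walks + h.hit_by_pitch + h.sac_flies);
}

double slugging(const HitterLine &h) {
    return static_cast<double>(h.total_bases) / h.at_bats;
}

double ops(const HitterLine &h)          { return on_base_pct(h) + slugging(h); }
double ops_adjusted(const HitterLine &h) { return 1.2 * on_base_pct(h) + slugging(h); }
double runs_created(const HitterLine &h) { return on_base_pct(h) * slugging(h); }

int main() {
    HitterLine h = {180, 60, 550, 290, 5, 5};   // hypothetical season line
    std::printf("BA %.3f OBP %.3f OPS %.3f OPSadj %.3f RC %.3f\n",
                batting_average(h), on_base_pct(h), ops(h),
                ops_adjusted(h), runs_created(h));
    return 0;
}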

89 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Theory: Sabermetric Teachings (1) Best Pitching Statistic DIPS is the Defense Independent Pitching Statistic - "Looking mainly at a pitcher's strikeouts, walks and home runs allowed per inning does a better job of predicting ERA than even ERA does. It's very counterintuitive to see that singles and doubles allowed don't matter a whole lot moving forward." (Across the Great Divide) The Defense Independent Pitching Statistic was invented by sabermetricians as an alternative statistic to ERA. Sabermetricians think that ERA does a poor job of future prediction because it is greatly altered by stadium characteristics, the opposing team, luck, and defense. Hence, the invention of DIPS.

90 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay (2) Best Hitting Statistic Sabermetricians put forward OBP and OPS as more indicative hitting statistics than the current default, Batting Average. OBP is on-base percentage, which is essentially a measure of batting average and plate discipline. OPS is Slugging Percentage plus OBP, which gives a measure of power and plate discipline. In fact, my research has shown that OPS does the best job of any conventional statistic in correlating with wins. Old-school scouts say that Batting Average is a better predictor of a player's potential, because plate discipline can be learned. The ultimate example is that one team could hit three solo home runs to get three runs, while another team could have two walks followed by a home run to get three runs. In this case, the walks were just as important!

91 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay (3) The Lack of Need for Scouts “The scouts have only a limited idea of what the guy's gonna do. He might do this, he might do that, he might be somewhere in the middle. What you're trying to do is you're trying to take the guys who you think have the best chance. I fully admit that you can't tell the future via stats. My point is that scouting has that equal amount of unpredictability. You can only know so much. You're scouts, you're not fortune tellers.” It is many sabermetricians' view that traditional scouts are not essential: nowadays, a “scout” can operate from a laptop, looking at a baseball player's stats.

92 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay (4) Draft College Players “A player who is 21 is simply closer to his peak abilities than a player who's 18.” Therefore, it is extremely risky to draft younger, usually high-school, players. How they play in high school may be far removed from how they play in the pros. Meanwhile, the best college players can sometimes be plugged into a Major League Baseball team's rotation a season or two after being drafted. Simply put, college players are more proven.

93 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay (5) Use Minor League Statistics to predict Major League numbers Guys who have higher on-base percentages in Triple-A tend to have higher on-base percentages in the major leagues. However, this area is very underdeveloped, and no studies have been conducted in the field.

94 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Method My own analysis has two parts. First, I obtained statistical data about teams from the past ten years and entered it into a Microsoft Excel spreadsheet. I calculated the correlation between certain team statistics and the number of runs those teams scored that season. The second part of my research consists of a computer program in C++. Right now, I have the “engine” of my program working. The program plays a game between two teams, then outputs a full box score displaying many hitter statistics. With this framework in place, I can tell the program to play the game many times and store the statistics in variables that I can output to a different file.
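As a sketch of the first part of the method (my own illustration; the calculation was actually done in Excel, whose CORREL function computes the same Pearson correlation), the correlation between a team statistic and runs scored could be computed as follows, with the sample values below being hypothetical:

#include <cmath>
#include <cstdio>
#include <vector>

// Pearson correlation between a team statistic (e.g. OBP) and runs scored.
double correlation(const std::vector<double> &x, const std::vector<double> &y) {
    const double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    const double cov = sxy - sx * sy / n;
    return cov / std::sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
}

int main() {
    // Hypothetical team OBPs and runs scored, for illustration only.
    std::vector<double> obp  = {0.330, 0.345, 0.312, 0.360, 0.325};
    std::vector<double> runs = {750, 810, 690, 850, 730};
    std::printf("r = %.3f\n", correlation(obp, runs));
    return 0;
}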

95 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Ways in which I will produce graphs from my C++ program in the future: a graph of the percentage of games won versus the number of games played, which should even out at the correct percentage when the sample size is large enough; a bar graph of the correlation between OBP and games won, SLUG and games won, and more (a bar graph because each correlation is just a number from -1 to 1); and a bar graph of the effect on runs scored of artificially changing OBP and SLUG: which has the bigger effect?

96 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Findings: Correlation Data Correlation = 0.824

97 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Findings: Correlation Data Correlation = (better)

98 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay Findings: Correlation Data Correlation = (best)

99 Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball Jack McKay References “Across the Great Divide.” Lewis, Michael. Moneyball. W.W. Norton, New York. “Sabermetrics.” “Sabermetric Revolution Sweeping the Game.”

100 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms There is an immense amount of genetic data generated by government efforts such as the Human Genome Project and by organizational efforts such as The Institute for Genomic Research (TIGR). Meanwhile, there exist large amounts of unused processing power in schools and labs across the country. Harnessing some of this power is a useful problem, and not just for the specific application in bioinformatics of DNA sequence pattern matching.

101 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols Abstract The BLAST (Basic Local Alignment Search Tool) algorithm of genetic comparison is the main tool used in the Bioinformatics community for interpreting genetic data. Existing implementations of this algorithm (in the form of programs or web interfaces) are widely available and free. Therefore, the most significant limiting factor in BLAST implementations is not accessibility but computing power. My project deals with possible methods of alleviating this limiting factor by harnessing computer resources which go unused in long periods of idle time.

102 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols The main methods used are grid computing, dynamic load balancing, and backgrounding. Background There is an immense amount of genetic data generated by government efforts such as the Human Genome Project and by organizational efforts such as The Institute for Genomic Research (TIGR). The task of extracting useful information from this data requires such processing power that it overwhelms current computational resources. However, there exist large amounts of unused processing power in schools and labs across the country; no computer is in use all of the time, and even when computers are in use their processors are rarely anywhere near 100% load.

103 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols Harnessing some of this unused power is a useful problem not just for the specific application in Bioinformatics of DNA sequence pattern matching, but for many computationally intensive problems which could be solved more accurately and faster with increased resources. Procedure The first step in harnessing unused processor power is to clearly establish and document the existence and magnitude of that unused power. Accomplishing this task

104 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols requires that we establish some metrics for describing computer load and develop a way to keep a record of those metrics over time. Perl is an ideal language with which to write a program to perform this task because of its text-manipulation capabilities and high speed. The program "cpuload" runs the Linux "uptime" command every second, parses the output, and writes the results to a file which is then plotted using gnuplot. The graph shows the results over one execution of the BLAST algorithm comparing two strains of E. coli bacteria.
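A rough C++ equivalent of what the cpuload script does (a sketch of my own; the actual program is written in Perl and parses uptime output rather than reading /proc/loadavg directly):

#include <chrono>
#include <fstream>
#include <thread>

int main() {
    std::ofstream out("loadlog.dat");                  // gnuplot-friendly output file
    for (int t = 0; t < 60; ++t) {                     // sample once per second for a minute
        std::ifstream in("/proc/loadavg");
        double one_min = 0.0;
        in >> one_min;                                 // first field: 1-minute load average
        out << t << ' ' << one_min << '\n';
        out.flush();
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return 0;
}

The resulting loadlog.dat can then be plotted with gnuplot, just as the Perl version's output is.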

105 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols Remote machine tests follow this procedure: (1) ssh to the target processor; (2) record the test number, processor name, and any users; (3) ask any users to note performance changes; (4) run ~/web-docs/techlab/BLAST/formatdb -iEcK12.FA -pT -oT -nK12-Prot; (5) run ~/techdocs/cpuload for 5 data points; (6) record the start time; (7) run ~/web-docs/techlab/BLAST/blastall -pblastp -dK12-Prot -iEcSak.FA -ok12vssak -e.001; (8) record the end time; (9) allow cpuload to run for approximately 5 more data points

106 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols (cont.) (10) vim runstats; (11) :w tests/testX; (12) record any user-reported performance changes. The use of grid computing to optimize BLAST implementations is not an original idea; a program called mpiblast has already been written and made available to the public. However, implementing mpiblast in any given environment is not a trivial task. For example, our systems lab, although it has MPI installed on several computers, has not maintained a list of which computers are available to run parallel programs. My next task was to compile this list, essentially by trial and error, running a test MPI program, mpihello.c.

107 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols See the poster for pictures of the old, obsolete lamhosts list and the updated working version. Here are the results for the single-remote-machine tests, including selected graphs of cpuload output. Test 1: tess, no users, start 9:09, end 9:16. Test 2: beowulf, Jack McKay, start 8:57, end 9:04; user report: "I experienced no slow down or loss of performance. But if I had a loss of performance that persisted for over thirty six hours, rest assured, I would have contacted my doctor." (oedipus: no route to host.) Test 3: antigone, no users, start 8:43, end 8:51. Test 4: agammemnon, Jason Ji, start 9:53, end 10:01; user report: Did you experience any slow down at all? "No".

108 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols Test 5: loman, Michael Druker, start 8:44, end 8:51; user report: "I'm not noticing anything, but I'm not doing anything computationally intensive, so..." Test 6: lordjim, Robert Staubs, start 8:57, end 9:04; user report: "I wasn't really using the computer during that time." Test 7: faustus, Caroline Bauer, start 9:25, end 9:34; user report: "I haven't noticed anything, so..." Test 8: okonokwo, Alex Volkovitsky, start 10:10, end 10:19; user report: Test 9: joad, no users, start 9:15, end 9:23. Analysis The tests I run on single remote machines generate two dependent variables: running time and CPU load over the test's duration. So far nine tests have been run, six with users on the target machine and three without.

109 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols As is visible from the graphs above, the tests have similar results with similar durations, indicating that performance for grid computing in the systems lab is indeed predictable and repeatable. Furthermore, the user testimonials so far unanimously agree that no change in performance was noticed. Further Testing Plans In future tests of multiple machines running simultaneously, I could look at how effectively each test used its resources by creating an "efficiency" metric. A formula for this metric could perhaps be E = 1/(t*n), i.e., efficiency = 1 / ((running time) * (number of machines)). Because of the transfer time involved in MPI programming, one machine will probably be the most efficient.
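As a purely hypothetical illustration of this metric (the numbers here are invented, not measured): if one machine finishes a run in 8 minutes, E = 1/(8*1) ≈ 0.125, while two machines finishing the same run in 5 minutes give E = 1/(5*2) = 0.10, so the single machine would count as more efficient even though the two-machine run is faster.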

110 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols The interesting question I will address, though, is how much more efficient is one machine than two? Or three? How many machines can you utilize before seeing a huge drop in efficiency? In general, there is also an optimum balance between transfer time and processing power for any given algorithm to run in the shortest time. Beyond that point, adding more processors actually slows down the program because the increase in transfer time outweighs the added processing power. The ideal number of processors is generally higher for more complex algorithms; adding two numbers together is clearly fastest when run on only one computer, while BLAST algorithms can benefit from more processors.

111 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols It will be interesting to see whether or not I can surpass this "optimal number" for BLAST algorithms with the number of processors available in the Systems Lab. A third dependent variable my tests could possibly generate would be the accuracy of the output. If I could develop a method of measuring this variable, it would probably be the most interesting of all to investigate. For now, however, I will leave it as a possibility while I focus on the other tests.

112 An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms Peden Nichols Backgrounding is a potential application of grid computing to the implementation of BLAST algorithms. The idea is to distribute implementations of BLAST on personal or institutional computers and run those implementations during down time or even in the background, while the computers are being used. To justify such a program to users, it is necessary to demonstrate that it will not interfere with use of the computer or slow down the computer's performance in any noticeable way. References - The National Center for Biotechnology Information's website, where I obtained several implementations of BLAST. - The Institute for Genomic Research's website, which contains helpful background information on genetic algorithms. - The primary site for mpiblast.

113 Part-of-Speech Tagging with Limited Training Corpora The aim of this project is to create and analyze various methods of part-of-speech tagging. The corpora used are of extremely limited size, thus offering less occasion to rely entirely upon tagging patterns gleaned from predigested data. Methods used to analyze the data and resolve tagging ambiguities include Hidden Markov Models and Bayesian Networks. Results are analyzed by comparing the system-tagged corpus with a professionally tagged one.

114 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs Abstract The aim of this project is to create and analyze various methods of part-of-speech tagging. The corpus used, the Susanne Corpus, is of extremely limited size, thus offering less occasion to rely entirely upon tagging patterns gleaned from predigested data. Methods used focus on the comparison of tagging correctness between general and genre-specific training with limited training corpora. Results are analyzed by comparing the system-tagged corpus with a professionally tagged one.

115 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs 1.1 Introduction Problem Part-of-speech (POS) tagging is a subfield within corpus linguistics and computational linguistics. POS taggers are designed with the aim of analyzing texts of sample language use (corpora) to determine the syntactic categories of the words or phrases used in the text. POS tagging serves as an underpinning to two fields above others: natural language processing and corpus linguistics. It is useful to natural language processing (the interpretation or generation of human language by machines) in that it provides a way of preparing processed texts to be interpreted syntactically. It is also useful in the academic field of corpus

116 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs linguistics in the statistical analysis of how humans use their language. The intention of this project is to use and compare various methods of POS tagging using a small amount of statistical training. 1.2 Scope This project aims to achieve the best results possible with a limited training corpus. Training data is limited to 53 of the 57 documents making up the Susanne Corpus, the other 4 being reserved for testing purposes. Each of the four testing segments will be used both with general training (with all 53 of the others) and with genre-specific training (with only one-fourth of those).

117 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs 1.3 Background Many different methods of POS tagging have been advanced in the past, but no attempts give hope of "perfect" tagging at the current stage. Accuracy of over 90% on ambiguous words is typical for most methods in current use (1), often well exceeding that. POS taggers cannot at the current time mimic human methods for distinguishing part of speech in language use. Work to get taggers to approach the problem from all the expected human methods (semantic prediction, syntactic prediction, lexical frequency, and syntactical category frequency being the most prominent) has not yet reached full fruition.

118 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs Hidden Markov Model The Hidden Markov Model (HMM) method of POS tagging is probably the most traditional. It generally requires extensive training on one or more pre-tagged corpora. Decisions on the part of speech of words or semantic units are made by analyzing the probability that one tag would follow another and the probability that a certain word or unit has a certain tag: P_transition(i, j) = f(i, j) / f(i) and P_lexical(i, w) = f(i, w) / f(i), where f(i, j) represents the number of transitions from tag i to tag j in the corpora, f(i, w) represents the total number of words w with tag i, f(i) represents the frequency of tag i, and f(w) represents the frequency of word w. Transitions and tags not seen are given a small but non-zero probability (2). The HMM method converges on its maximum accuracy, as opposed to some methods (most not usable in this situation) which
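A minimal sketch (my own illustration, not the project's tagger) of estimating these two probabilities from raw counts, with a small floor value standing in for the non-zero probability given to unseen transitions and words:

#include <cstdio>
#include <map>
#include <string>
#include <utility>

struct HmmCounts {
    std::map<std::string, int> tag_freq;                        // f(i)
    std::map<std::pair<std::string, std::string>, int> trans;   // f(i, j)
    std::map<std::pair<std::string, std::string>, int> lex;     // f(i, w)

    double p_transition(const std::string &i, const std::string &j) const {
        auto it = trans.find({i, j});
        int fij = (it == trans.end()) ? 0 : it->second;
        int fi  = tag_freq.count(i) ? tag_freq.at(i) : 1;
        return fij > 0 ? double(fij) / fi : 1e-6;   // unseen: small non-zero value
    }

    double p_lexical(const std::string &i, const std::string &w) const {
        auto it = lex.find({i, w});
        int fiw = (it == lex.end()) ? 0 : it->second;
        int fi  = tag_freq.count(i) ? tag_freq.at(i) : 1;
        return fiw > 0 ? double(fiw) / fi : 1e-6;
    }
};

int main() {
    HmmCounts c;
    c.tag_freq["DT"] = 2; c.tag_freq["NN"] = 2;      // toy counts
    c.trans[{"DT", "NN"}] = 2;
    c.lex[{"NN", "dog"}] = 1;
    std::printf("P(NN|DT)=%.2f  P(dog|NN)=%.2f\n",
                c.p_transition("DT", "NN"), c.p_lexical("NN", "dog"));
    return 0;
}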

119 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs converge to an accuracy level smaller than one attained earlier. HMMs have a close affinity with neural network methods. 2.1 Procedure Training Training data consists of the tags represented in the corpus, the words represented in the corpus, the transitions represented in the corpus, and the frequency of each. Words and tags are read in from the corpus and stored alphabetically or in parallel in a series of arrays and matrices. These data form the basis for the statistical information extracted by taggers for making decisions on a unit's tag.

120 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs Training Implementation Corpus data is stored in text files with each subsequent word on a separate line. The word, base word form, tag, etc. are stored on each line, tab-delimited. A data extractor was created using the C++ programming language. The extractor stores each encountered word in an ordered array of structs. If a struct for that word already exists, an internal variable representing word frequency is incremented. The tag associated with that word is added to an internal array or, if the tag is already stored there, its frequency is incremented. A similar process is followed for encountered tags. Each tag is added to an associated struct in an ordered array. Its frequency is incremented if it is encountered more than once. The tag that occurred before the added one is added to an array of preceding tags or, if that tag is already present, its frequency is incremented.
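A hedged sketch of the insert-or-increment logic described above; the struct fields, function name, and toy data are my own guesses rather than the project's actual code:

#include <cstdio>
#include <string>
#include <vector>

struct WordEntry {
    std::string word;                   // word form as read from the corpus
    int frequency = 0;                  // times the word has been seen
    std::vector<std::string> tags;      // tags observed with this word
    std::vector<int> tag_freqs;         // parallel frequencies for those tags
};

void add_observation(std::vector<WordEntry> &words,
                     const std::string &word, const std::string &tag) {
    for (WordEntry &e : words) {
        if (e.word == word) {           // word already known: bump its counts
            ++e.frequency;
            for (size_t i = 0; i < e.tags.size(); ++i)
                if (e.tags[i] == tag) { ++e.tag_freqs[i]; return; }
            e.tags.push_back(tag);      // first time this tag appears with the word
            e.tag_freqs.push_back(1);
            return;
        }
    }
    words.push_back({word, 1, {tag}, {1}});   // first time this word is seen
}

int main() {
    std::vector<WordEntry> words;
    add_observation(words, "run", "VB");
    add_observation(words, "run", "NN");
    add_observation(words, "run", "VB");
    std::printf("'run' seen %d times with %zu distinct tags\n",
                words[0].frequency, words[0].tags.size());
    return 0;
}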

121 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs References 1. D. Elworthy (10/1994). Automatic Error Detection in Part of Speech Tagging. 2. D. Elworthy (10/1994). Does Baum-Welch Re-estimation Help Taggers? 3. M. Maragoudakis, et al. Towards a Bayesian Stochastic Part-of-Speech and Case Tagger of Natural Language Corpora. 4. M. Marcus, et al. Building a large annotated corpus of English: the Penn Treebank. 5. G. Marton and B. Katz. Exploring the Role of Part of Speech in the Lexicon.

122 Part-of-Speech Tagging with Limited Training Corpora Robert Staubs References (cont.) 6. T. Nakagawa, et al. (2001). Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines. 7. V. Savova and L. Pashkin. Part-of-Speech Tagging with Minimal Lexicalization. 8. K. Toutanova, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network.

123 Benchmarking of Cryptographic Algorithms The author intends to validate theoretical numbers by constructing empirical sets of data on cryptographic algorithms. This data will then be used to give factual predictions on the security and efficiency of cryptography as it applies to modern-day applications.

124 Benchmarking of Cryptographic Algorithms Alex Volkovitsky Abstract The author intends to validate theoretical numbers by constructing empirical sets of data on cryptographic algorithms. This data will then be used to give factual predictions on the security and efficiency of cryptography as it applies to modern day applications. 1 Introduction Following is a description of the project and background information as researched by the author.

125 Benchmarking of Cryptographic Algorithms Alex Volkovitsky Background Origins of Cryptography Cryptography is a field of study concerned with how two parties can exchange valuable information over an insecure channel. Historically speaking, the first use of encryption is attributed to Julius Caesar, who used a ROT(3) algorithm to transfer military orders within his empire. The algorithm was based on the simple premise of 'rotating' letters (hence the abbreviation ROT) by 3 characters, such that 'a' became 'd', 'b' became 'e', etc. Decryption was the reverse of this process, in that the receiving party needed merely to "un"-rotate the letters.
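A minimal sketch of the ROT(3) scheme just described (lowercase Latin letters only; everything else is passed through unchanged):

#include <cstdio>
#include <string>

std::string rot(const std::string &text, int shift) {
    std::string out = text;
    for (char &c : out)
        if (c >= 'a' && c <= 'z')
            c = static_cast<char>('a' + (c - 'a' + shift + 26) % 26);
    return out;
}

int main() {
    std::string order  = "attack at dawn";
    std::string secret = rot(order, 3);     // encrypt: 'a' -> 'd', 'b' -> 'e', ...
    std::string plain  = rot(secret, -3);   // decrypt: "un"-rotate by 3
    std::printf("%s\n%s\n", secret.c_str(), plain.c_str());
    return 0;
}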

126 Benchmarking of Cryptographic Algorithms Alex Volkovitsky Basic Terminology and Concepts At its core, cryptography assumes that two parties must communicate over some insecure channel. The sender (generally referred to as Alice) agrees on some encryption algorithm E(p,k) with the receiver on the other end (Bob). E(p,k) is generally some function of two or more variables, often mathematical in nature (as are all computer algorithms), but not necessarily (as was the case with the notorious Enigma machine used in World War II). The two variables in question are 'p', the plaintext or data that must get across without being read by any third party, and 'k', the key, some shared secret which both Alice and Bob have agreed to over a previously established secure connection.

127 Benchmarking of Cryptographic Algorithms Alex Volkovitsky On the receiving end, Bob must possess a decryption function such that p = D(E(p,k), k), meaning that if Bob knows the secret 'k', he can retrieve the original message 'p'. The most important aspect of cryptography is the existence of the key, which is able to transform seemingly random gibberish into valuable information. Symmetric Algorithms Symmetric key algorithms are the algorithms most often used to transfer large amounts of data. Symmetric key algorithms use the same key 'k' to encrypt and decrypt, and are generally based on relatively quick mathematical operations such as XOR. The downside of symmetric algorithms is that since both parties must know the exact same key, that key needs to have been transferred securely in the past.
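As a toy illustration of the identity p = D(E(p,k), k) for a symmetric scheme built on XOR (not a secure cipher, just the structure): XORing with the same key twice returns the plaintext.

#include <cstdio>
#include <string>

std::string xor_crypt(const std::string &data, const std::string &key) {
    std::string out = data;
    for (size_t i = 0; i < out.size(); ++i)
        out[i] = static_cast<char>(out[i] ^ key[i % key.size()]);  // repeat the key
    return out;
}

int main() {
    std::string p = "meet me at noon";
    std::string k = "sharedsecret";
    std::string c = xor_crypt(p, k);        // E(p, k)
    std::string d = xor_crypt(c, k);        // D(E(p, k), k) == p
    std::printf("recovered: %s\n", d.c_str());
    return 0;
}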

128 Benchmarking of Cryptographic Algorithms Alex Volkovitsky This means that for Alice and Bob to communicate using a symmetric key algorithm, they must first either meet in person to exchange slips of paper with the key, or alternatively (as is done over the Internet) exchange a symmetric key over an established public/private-key connection. The most common modern symmetric algorithm is DES (Data Encryption Standard). Private/Public-Key Algorithms Public-key cryptography is based on the concept that Alice and Bob do not share the same key. Generally, Alice would generate both the private key and the public key on her computer, save the private key, and distribute the public key.

129 Benchmarking of Cryptographic Algorithms Alex Volkovitsky If Bob would like to send a message to Alice, he first encrypts it with her public key, making her the only person able to decrypt the message. He sends the encrypted message (which even he himself can no longer decrypt) and Alice is able to read it using her private key. If Alice wishes to respond, she uses Bob's public key and follows a similar procedure. Alternatively, if Bob wishes to verify that it is Alice speaking and no one else, she can sign her messages. Signing means using one's own private key to encrypt a message, such that anyone else may decrypt it and know that you were the only person who could have encrypted it.

130 Benchmarking of Cryptographic Algorithms Alex Volkovitsky She would encrypt her message with her private key, then encrypt it with Bob's public key. Upon receiving the message, he would be the only person able to decrypt it (being the only person knowing Bob's private key), and then he would verify Alice's signature by decrypting the actual message with her public key. The most common modern-day public-key algorithm is RSA, developed in 1977 by Ron Rivest, Adi Shamir and Len Adleman (hence the abbreviation Rivest-Shamir-Adleman, or RSA), which is based on the difficulty of factoring large numbers. 1.2 Purpose of the Project Whereas much research has been done into theoretical cryptography, very little has been done to validate the theoretical numbers empirically or to look into the speeds at which various algorithms operate.

131 Benchmarking of Cryptographic Algorithms Alex Volkovitsky My project seeks to observe several modern-day algorithms and to compute empirical data on the time it takes to encrypt and/or decrypt different amounts of data, and on how different algorithms perform with varied key lengths, modes of operation, and data sizes. Ideally my program could be run on different types of machines to identify whether certain architectures give an advantage to the repeated mathematical computations required by cryptographic algorithms. My project also seeks to try to break several algorithms using unrealistically small key lengths (real key lengths such as 64 bits could take years to break using brute-force methods); this way I could extrapolate my data and give predictions on the security afforded by actual key lengths.
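A hedged sketch of the benchmarking harness idea (my own illustration, reusing the toy XOR routine from the earlier sketch; it merely stands in for a real algorithm such as one invoked through the mcrypt library, and the timing loop is the point):

#include <chrono>
#include <cstdio>
#include <string>

std::string xor_crypt(const std::string &data, const std::string &key) {
    std::string out = data;
    for (size_t i = 0; i < out.size(); ++i)
        out[i] = static_cast<char>(out[i] ^ key[i % key.size()]);
    return out;
}

int main() {
    const std::string key = "0123456789abcdef";        // stand-in 128-bit key
    for (size_t bytes = 1 << 10; bytes <= (1 << 20); bytes <<= 2) {
        std::string data(bytes, 'x');                  // input buffer of the given size
        auto start = std::chrono::steady_clock::now();
        volatile char sink = xor_crypt(data, key)[0];  // keep the call from being optimized away
        auto stop  = std::chrono::steady_clock::now();
        (void)sink;
        double us = std::chrono::duration<double, std::micro>(stop - start).count();
        std::printf("%8zu bytes: %10.1f microseconds\n", bytes, us);
    }
    return 0;
}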

132 Benchmarking of Cryptographic Algorithms Alex Volkovitsky 1.3 Scope Development Results Conclusions Summary References 1. "Handbook of Applied Cryptography." University of Waterloo. Accessed 27 Jan. 2. "MCrypt." Sourceforge. Accessed 27 Jan. 3. "Cryptography FAQ." sci.crypt newsgroup. Accessed 27 June.

