Introduction to Data Structures Vamshi Ambati



1 Introduction to Data Structures Vamshi Ambati vamshi@andrew.cmu.edu

2 Overview
• Java you need for the Project
• Search Engine and Data Structures
• THIS Code Structure
• On the Data Structure front
  - Dictionaries (Dictionary Structures)
  - Java Collections
  - Linked List
  - Queue
[c] Vamshi Ambati

3 Java you will need for the Project
• Core Programming + I/O and Files
• OOP
  - Inheritance
  - Packages
  - Encapsulation
• Java API
  - Collections

4 What is a Search Engine?
• A sophisticated tool for finding information on the web
• An index for the World Wide Web
  - Analogous to the index of a textbook
  - Just imagine a world without search engines!

5 Why Index in the first place?
• Which list is easier to search?
  - sow fox pig eel yak hen ant cat dog hog
  - ant cat dog eel fox hen hog pig sow yak
• A sorted list always helps
  - It permits binary search: about log2(n) probes into the list
  - log2(1 billion) ≈ 30
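
The probe count on this slide can be checked with a small sketch. The class name BinarySearchDemo and the probes counter are illustrative and not part of the project code; the array is the sorted list from the slide. For ten words a lookup needs at most 4 probes, and for a billion entries about 30, since log2(1,000,000,000) is roughly 29.9.

```java
class BinarySearchDemo {
    // Number of probes made by the most recent search (for illustration only).
    static int probes;

    // Returns the index of target in the sorted array, or -1 if it is absent.
    static int search(String[] sorted, String target) {
        probes = 0;
        int lo = 0;
        int hi = sorted.length - 1;
        while (lo <= hi) {
            probes++;
            int mid = lo + (hi - lo) / 2;
            int cmp = target.compareTo(sorted[mid]);
            if (cmp == 0) {
                return mid;
            }
            if (cmp < 0) {
                hi = mid - 1;   // target sorts before the middle element
            } else {
                lo = mid + 1;   // target sorts after the middle element
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The sorted list from the slide.
        String[] words = {"ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"};
        int index = search(words, "hen");
        System.out.println("hen is at index " + index + " after " + probes + " probes");
    }
}
```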

6 How search engines work
• Search engines maintain data about web sites in their databases.
• They use programs (often referred to as "spiders" or "robots") to collect information.
• The information is then indexed by the search engine.
• The index allows users to look up words, or combinations of words, found in it.

7

8 Inverted Files
FILE (POS 1 10 20 30 36):
A file is a list of words and this file contains words at various positions. Each entry of the word is associated with a position.
INVERTED FILE:
a (1, 4, 24…)
entry (17…)
file (2, 10)
contains (11…)
position (25…)
positions (15…)
word (20…)
words (7, 12…)
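
A minimal sketch of building such an inverted file, assuming positions are 1-based word positions and that words are lower-cased and split on anything that is not a letter or digit. The class name InvertedFileDemo is illustrative, and the sketch uses java.util maps and lists only for brevity; the project itself asks you to code the underlying structures yourself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedFileDemo {
    // Maps each lower-cased word to the 1-based word positions where it occurs.
    static Map<String, List<Integer>> build(String text) {
        Map<String, List<Integer>> index = new TreeMap<String, List<Integer>>();
        String[] tokens = text.toLowerCase().split("[^a-z0-9]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            position++;
            List<Integer> positions = index.get(token);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                index.put(token, positions);
            }
            positions.add(position);
        }
        return index;
    }

    public static void main(String[] args) {
        String file = "A file is a list of words and this file contains words "
                + "at various positions. Each entry of the word is associated with a position.";
        // Prints each word with its list of positions, in alphabetical order.
        System.out.println(build(file));
    }
}
```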

9 Inverted Files for Multiple Documents
[Figure: a LEXICON pointing into a WORD INDEX whose entries hold DOCID, OCCUR, POS 1, POS 2, ...]
“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56...
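
One way to picture the lexicon and word index is sketched below, assuming each word maps to a list of postings and each posting holds a document id and the word positions in that document (so the occurrence count is just the number of positions). The names MultiDocIndexDemo, Posting, and addDocument are hypothetical, and java.util collections stand in for the hand-written structures the project requires.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MultiDocIndexDemo {
    // One posting: a document id plus the positions of the word in that document.
    static class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<Integer>();

        Posting(int docId) {
            this.docId = docId;
        }

        int occurrences() {
            return positions.size();
        }
    }

    // Lexicon: word -> postings, in the order the documents were added.
    final Map<String, List<Posting>> lexicon = new LinkedHashMap<String, List<Posting>>();

    void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            position++;
            List<Posting> postings = lexicon.get(token);
            if (postings == null) {
                postings = new ArrayList<Posting>();
                lexicon.put(token, postings);
            }
            // Documents are added one at a time, so the posting for the current
            // document, if it exists, is always the last one in the list.
            Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
            if (last == null || last.docId != docId) {
                last = new Posting(docId);
                postings.add(last);
            }
            last.positions.add(position);
        }
    }

    public static void main(String[] args) {
        MultiDocIndexDemo index = new MultiDocIndexDemo();
        index.addDocument(34, "Jezebel said jezebel");
        index.addDocument(44, "no jezebel here");
        for (Posting p : index.lexicon.get("jezebel")) {
            System.out.println("doc " + p.docId + ": " + p.occurrences() + " occurrence(s) at " + p.positions);
        }
    }
}
```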

10 A comprehensive form of Inverted Index
SOURCE: http://www.searchtools.com/slides/bestsearch/bls-24.html

11 THIS
• Search engine for the website http://www.hinduonnet.com/
• Website of the newspaper The Hindu
• Not for the entire web
• Results are confined to this one web site

12 Index Structure for our Project (THIS)
Lexicon words: India, ManMohan, Cricket, Bollywo, Sharukh, Sachin, …
Entries shown beside the words, each of the form "URL :: number":
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
…
http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
…
http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2
http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1
…

13 Search Engines

14 Search Engine Differences
• Coverage (What part of the web do they really cover?)
• Crawling algorithms
  - Frequency of crawl
  - Depth of visits
    http://www.msitprogram.net/ (depth 0)
    http://www.msitprogram.net/admissions.html/ (depth 1)
• Indexing policies
  - Data Structures
  - Representation
• Search interfaces
• Ranking
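
The idea of crawl depth can be sketched as a breadth-first traversal that stops following links once a depth limit is reached. This is an illustration under stated assumptions, not the project's spider: the web is replaced by an in-memory map from a URL to the URLs it links to, the class name DepthLimitedCrawlDemo is hypothetical, and the third URL in main is a made-up placeholder.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class DepthLimitedCrawlDemo {
    // Breadth-first crawl over an in-memory link graph (a stand-in for the web).
    // Returns the URLs visited, in visiting order, at most maxDepth links from the seed.
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        List<String> visited = new ArrayList<String>();
        Set<String> seen = new HashSet<String>();
        Map<String, Integer> depth = new HashMap<String, Integer>();
        Queue<String> frontier = new ArrayDeque<String>();
        frontier.add(seed);
        seen.add(seed);
        depth.put(seed, 0);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            visited.add(url);
            int d = depth.get(url);
            if (d == maxDepth) {
                continue; // do not follow links past the depth limit
            }
            List<String> outgoing = links.get(url);
            if (outgoing == null) {
                continue; // page has no known links
            }
            for (String next : outgoing) {
                if (seen.add(next)) { // add() returns false for URLs already queued
                    depth.put(next, d + 1);
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<String, List<String>>();
        links.put("http://www.msitprogram.net/",
                Arrays.asList("http://www.msitprogram.net/admissions.html/"));
        links.put("http://www.msitprogram.net/admissions.html/",
                Arrays.asList("http://example.invalid/deeper")); // hypothetical depth-2 page
        System.out.println(crawl(links, "http://www.msitprogram.net/", 1));
    }
}
```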

15 Search Engine

16 [Figure labels: Crawl, Index, Search]

17 [Figure: search engine architecture. Components: TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage. Operations: crawl, parse, getNextUrl, addUrls, addPage, store, retrieve, Sort by Rank, makePage.]

18 [Figure: the same architecture diagram as slide 17, annotated with data structures and algorithms.]
Where do our data structures and algorithms lie? Queue, Priority Queue, Hashtable, BinaryTree, LinkedList, MergeSort & InsertionSort
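
As one illustration of where a sorting algorithm fits, the "Sort by Rank" step could be done with insertion sort. The sketch below assumes, purely for illustration, that a result's rank is a single integer score and that higher scores come first; the slides do not say how THIS computes rank. The class name RankSortDemo and the parallel-array layout are hypothetical.

```java
class RankSortDemo {
    // Insertion sort on two parallel arrays, in place, so that scores end up
    // in descending order and each URL stays paired with its score.
    static void sortByScoreDescending(String[] urls, int[] scores) {
        for (int i = 1; i < scores.length; i++) {
            String url = urls[i];
            int score = scores[i];
            int j = i - 1;
            // Shift lower-scored entries one slot to the right.
            while (j >= 0 && scores[j] < score) {
                urls[j + 1] = urls[j];
                scores[j + 1] = scores[j];
                j--;
            }
            urls[j + 1] = url;
            scores[j + 1] = score;
        }
    }

    public static void main(String[] args) {
        String[] urls = {"page-a", "page-b", "page-c", "page-d"}; // placeholder names
        int[] scores = {4, 7, 23, 3};
        sortByScoreDescending(urls, scores);
        for (int i = 0; i < urls.length; i++) {
            System.out.println(urls[i] + " :: " + scores[i]);
        }
    }
}
```

Insertion sort is a reasonable fit when result sets are small or nearly sorted; for large result sets MergeSort, also named on the slide, scales better.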

19 Code Structure (THIS)
[Figure: class diagram. Classes: PageElement, PageImg, PageHref, PageWord, PageLexer, HttpTokenizer, URLTextReader, Spider, WebSpider, Queue, CrawlerDriver, SearchDriver, DictionaryDriver, DictionaryInterface, ListDictionary, TreeDictionary, HashDictionary, Indexer, Index, Query. Operations: Crawl, Parse, addPage, Index, Save, Restore. Legend: Inheritance, Uses, Calls.]

20 Dictionary Structures (Lexicon)
• A Dictionary is an unordered container that holds key-element pairs
  - An Ordered Dictionary keeps its elements in sorted order
• Keys are unique, but the values can be anything

21 Dictionary ADT
• size(): Return the number of items in D. Output: Integer
• isEmpty(): Test whether D is empty. Output: Boolean
• elements(): Return the elements stored in D. Output: iterator of elements (objects)
• keys(): Return the keys stored in D. Output: iterator of keys (objects)
• findElement(k): If D contains an item with key == k, return the element of that item, else return NO_SUCH_KEY. Output: Object
• findAllElements(k): Output: iterator of elements with key k
• insertItem(k, e): Insert an item with element e and key k into D.
• removeElement(k): Remove an item with key == k and return its element. If there is no such item, return NO_SUCH_KEY. Output: Object (element)
• removeAllElements(k): Remove from D the items with key == k. Output: iterator of elements
Also see the Java Standard API for Dictionary: http://java.sun.com/j2se/1.4.2/docs/api/java/util/Dictionary.html
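
A minimal sketch of part of this ADT, backed by a hand-written singly linked list of items. The class name ListDictionaryDemo is illustrative and only size, isEmpty, insertItem, findElement, and removeElement are shown. It assumes duplicate keys are allowed on insert, since the ADT includes findAllElements, and that findElement returns the most recently inserted match.

```java
class ListDictionaryDemo {
    // Sentinel returned when a key is absent, as in the ADT listing.
    static final Object NO_SUCH_KEY = new Object();

    // One key-element pair in the list.
    private static class Item {
        final Object key;
        final Object element;
        Item next;

        Item(Object key, Object element, Item next) {
            this.key = key;
            this.element = element;
            this.next = next;
        }
    }

    private Item head;
    private int size;

    int size() {
        return size;
    }

    boolean isEmpty() {
        return size == 0;
    }

    // Inserts at the front of the list in constant time.
    void insertItem(Object k, Object e) {
        head = new Item(k, e, head);
        size++;
    }

    // Linear scan; returns the first matching element or NO_SUCH_KEY.
    Object findElement(Object k) {
        for (Item it = head; it != null; it = it.next) {
            if (it.key.equals(k)) {
                return it.element;
            }
        }
        return NO_SUCH_KEY;
    }

    // Unlinks the first item with key k and returns its element, or NO_SUCH_KEY.
    Object removeElement(Object k) {
        Item prev = null;
        for (Item it = head; it != null; prev = it, it = it.next) {
            if (it.key.equals(k)) {
                if (prev == null) {
                    head = it.next;
                } else {
                    prev.next = it.next;
                }
                size--;
                return it.element;
            }
        }
        return NO_SUCH_KEY;
    }

    public static void main(String[] args) {
        ListDictionaryDemo d = new ListDictionaryDemo();
        d.insertItem("india", 4);
        System.out.println(d.findElement("india"));
        System.out.println(d.findElement("cricket") == NO_SUCH_KEY);
    }
}
```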

22 Dictionary ADT in THIS Project
• size(): Return the number of items in D. Output: Integer
• isEmpty(): Test whether D is empty. Output: Boolean
• getKeys(): Return all the keys of the elements stored in D. Output: String array (Ideally it should be Vector!!)
• getValue(k): If D contains an item with key == k, return the element of that item, else return NULL. Output: Object
• insertItem(k, e): Insert an item with element e and key k into D.
• remove(k): Remove an item with key k from D.
• We have customized the Dictionary a bit as we would be inserting only elements of the type !!
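
The project's variant of the ADT could be met by a hash table with separate chaining, sketched below. This is not the project's HashDictionary: the class name HashDictionaryDemo, the fixed bucket count of 64, the String key type, and the choice to overwrite the value when a key is inserted twice are all assumptions made for the example.

```java
class HashDictionaryDemo {
    // One key-value pair in a bucket's chain.
    private static class Entry {
        final String key;
        Object value;
        Entry next;

        Entry(String key, Object value, Entry next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Entry[] buckets = new Entry[64]; // fixed size: this sketch never resizes
    private int size;

    // Maps a key to a bucket index; the mask clears the sign bit of the hash code.
    private int indexFor(String key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    int size() {
        return size;
    }

    boolean isEmpty() {
        return size == 0;
    }

    // Inserts a new pair, or replaces the value if the key already exists.
    void insertItem(String key, Object value) {
        int i = indexFor(key);
        for (Entry e = buckets[i]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                e.value = value;
                return;
            }
        }
        buckets[i] = new Entry(key, value, buckets[i]);
        size++;
    }

    // Returns the value stored under key, or null if the key is absent.
    Object getValue(String key) {
        for (Entry e = buckets[indexFor(key)]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    // Removes the pair with the given key, if present.
    void remove(String key) {
        int i = indexFor(key);
        Entry prev = null;
        for (Entry e = buckets[i]; e != null; prev = e, e = e.next) {
            if (e.key.equals(key)) {
                if (prev == null) {
                    buckets[i] = e.next;
                } else {
                    prev.next = e.next;
                }
                size--;
                return;
            }
        }
    }

    // Returns all keys as a String array, in no particular order.
    String[] getKeys() {
        String[] keys = new String[size];
        int n = 0;
        for (Entry bucket : buckets) {
            for (Entry e = bucket; e != null; e = e.next) {
                keys[n++] = e.key;
            }
        }
        return keys;
    }
}
```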

23 Java Collections
• java.util.* (a quite helpful library)
  - Has implementations of most of the data structures
  - They make life really easy
  - You cannot use the built-in data structures unless specified (e.g. Task1 Tasklet-A)
• Use them for non-data-structural purposes
  - Collections, e.g. Arrays, Vectors, Iterators, Lists, Sets etc.
  - You will definitely be using "Iterator" at least, as you will be dealing with many objects at a time: http://java.sun.com/j2se/1.4.2/docs/api/java/util/Iterator.html
See: http://java.sun.com/docs/books/tutorial/collections/
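
A short sketch of typical Iterator use: walking a list of objects and removing some of them during the traversal with Iterator.remove(). The class and method names (IteratorDemo, keepOnlySite) and the second URL in main are made up for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class IteratorDemo {
    // Removes every URL that does not start with the given prefix.
    // Iterator.remove() is the safe way to delete while traversing a list.
    static void keepOnlySite(List<String> urls, String prefix) {
        Iterator<String> it = urls.iterator();
        while (it.hasNext()) {
            String url = it.next();
            if (!url.startsWith(prefix)) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<String>();
        urls.add("http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm");
        urls.add("http://example.invalid/elsewhere"); // hypothetical off-site link
        keepOnlySite(urls, "http://www.hinduonnet.com/");
        System.out.println(urls);
    }
}
```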

24 Other Data structures
• Queue
• LinkedList
  - Beware! There are no pointers in Java
  - However, there are "references"
• Learn more about references in Java
• Do not use the java.util package for data structures or sorting algorithms! You are expected to code them
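
A minimal sketch of a queue built on a hand-written linked list, showing how Java references play the role that pointers play in other languages. The class name LinkedQueueDemo and its method names are illustrative and need not match the project's Queue class.

```java
class LinkedQueueDemo {
    // A list node: 'next' is a reference to the following node, or null at the tail.
    private static class Node {
        final Object value;
        Node next;

        Node(Object value) {
            this.value = value;
        }
    }

    private Node head; // oldest element, removed first
    private Node tail; // newest element, added last
    private int size;

    boolean isEmpty() {
        return size == 0;
    }

    int size() {
        return size;
    }

    // Adds a value at the tail in constant time.
    void enqueue(Object value) {
        Node node = new Node(value);
        if (tail == null) {
            head = node;      // queue was empty
        } else {
            tail.next = node; // link the old tail to the new node
        }
        tail = node;
        size++;
    }

    // Removes and returns the value at the head in constant time.
    Object dequeue() {
        if (head == null) {
            throw new IllegalStateException("queue is empty");
        }
        Object value = head.value;
        head = head.next;
        if (head == null) {
            tail = null;      // queue became empty
        }
        size--;
        return value;
    }

    public static void main(String[] args) {
        LinkedQueueDemo queue = new LinkedQueueDemo();
        queue.enqueue("first");
        queue.enqueue("second");
        System.out.println(queue.dequeue()); // first in, first out
    }
}
```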

25 Summary
• Learn data structures by implementing THIS
• A mini version of a real search engine
• The framework is provided
• More details in the next video

26 THANK YOU

