
1 ICE0534 – Web-based Software Development / ICE1338 – Programming for WWW
Lecture #3
In-Young Ko (iko.AT.icu.ac.kr)
Information and Communications University (ICU) - Summer

Summer ICE 0534/ICE1338 – WWW © In-Young Ko, Information and Communications University

2 Announcements
- Check the course schedule for the presentation-topic assignment
- Your Web server accounts will be created soon
- A bulletin board has been created
- FTP site for homework submission
  - URL: ftp://webeng.icu.ac.kr
  - ID: wwwstudent
  - Password: ice0534
- Please name your file properly
  - Format:
  - Example: hw1_inyoungko_ pdf
- Project teams?

3 Last Lecture
- WWW Concepts
- Internet & HTTP
- Client-side Information Presentation

4 This Lecture
- Cascading Style Sheets (CSS)
- Basic UNIX Commands
- Concepts and Examples of Web-based Information Integration
- Web Wrappers

5 Publishing Web Pages on the Server
- Copy your files to the ‘public_html’ directory under your home directory on the server
- Use FTP to copy your files from a local directory to the server directory

    ftp vega.icu.ac.kr    (login with your user ID)
    cd public_html
    lcd d:\myweb
    put index.html        (mput *.html)
    quit

- Your homepage is now accessible from http://vega.icu.ac.kr/~yourid

6 Web-based Information Integration
- Retrieve document collections from various Web resources (e.g. search engines, news video archives)
- Analyze the document collections using various document analysis services (characterize, sort, partition, filter, etc.)
- Visualize analysis results using various visualization services to help users make sense of them
- Impose structure on the resulting document collection to define a customized, task-oriented information space

7 Daily News Analysis
ISI’s GeoTopics

8 Large-scale Web-based Information Integration Example
GeoTopics: Daily News Analysis Portal Generator (www.isi.edu/geoworlds/geotopics/)
Diagram labels: News Sources, Extracted Articles, News Compilation Results
Document Analyses:
- Document filtering
- Topic and place name extraction
- Topic- and place-based document classification
- Topic ranking and sorting
- Cross-product between topics and places
- Geographical mapping of the articles
Requirements:
- Requires 92 component services
- Needs to generate portals customized for different news sources and regions

9 Travel Planner
ISI’s Heracles Project

10 Intelligent WorldInfo Assistant
ISI’s Heracles Project

11 Heracles’ Information Sources
- Schedule: Outlook Calendar
- Address Info: Outlook Contact
- Weather: Yahoo Weather (weather.yahoo.com)
- Geocodes: MapBlast (www.mapblast.com)
- Driving Map: MapQuest (www.mapquest.com)
- Map (airports): YahooMap (maps.yahoo.com)
- Flight Info: ITA Software (www.itasoftware.com)
- Airport Info: Travelocity (www.travelocity.com)
- Airport Parking Info (www.airwise.com)
- Taxi Fare Info: Washington Post (www.washingtonpost.com)
- Hotel: ITN Hotel (www.itn.com)
- Car Rental: ITN Car (www.itn.com)
- Flight Tracking: ITN flight tracking system (www.itn.com)

12 CIA – The World Factbook

13 Search Engines
Open Directory:

14 Online Bookstores

15 Problems in Integrating Heterogeneous Information
- Heterogeneity in information formats (e.g., search engines, catalog pages)
- Heterogeneity in data types (e.g., a temperature value as an integer or a floating-point number)
- Heterogeneity in underlying units (e.g., price information in US Dollars or in Korean Won)
- Heterogeneity in semantics (e.g., date information as a creation date or a last-updated date)

16 A Solution to Integrate Heterogeneous Information
- Information Mediation
  - Linking information sources and application programs
  - Providing value-added services of accessing, abstracting and integrating information
Gio Wiederhold, Mediators in the Architecture of Future Information Systems, Computer, March 1992.

17 Information Mediators
- Provide an intermediate layer between information sources and users/applications
- Queries to a mediator are in a uniform language
- Determine which data sources to use, how to obtain the desired information, and how to manipulate the information
- e.g., ISI’s SIMS, Stanford’s TSIMMIS
Knoblock & Minton, IEEE Intelligence, Sep/Oct 1998
Diagram labels: Source, Wrapper, Mediator, Queries, Users, Applications

18 Mediation Services
- Transformation and subsetting of databases to reorganize base data into new configurations appropriate to specific users and applications
- Gathering an appropriate amount of data by specializing or generalizing the search
- Accessing and merging data from multiple databases
- Abstraction of data to bring them to a higher level
- Maintaining derived data for efficiency

19 Information Wrappers
- Accept queries from the mediator
- Translate the query into the appropriate query for the individual source
- Perform any additional processing if necessary
- Return the results to the mediator
- Web Wrappers: make Web sources look like databases that can be queried through the mediator
Ashish & Knoblock, COOPIS, 1997
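The division of labor between a mediator and its wrappers can be sketched in a few lines of Java. This is only an illustration of the roles described on the slides, not the design of SIMS or TSIMMIS; the names `SourceWrapper`, `SimpleMediator`, `register`, and `ask` are hypothetical, and queries and results are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout: a sketch of the mediator/wrapper roles only.
interface SourceWrapper {
    // Translate the uniform query into the source's own query and return its results.
    List<String> query(String uniformQuery);
}

class SimpleMediator {
    private final List<SourceWrapper> wrappers = new ArrayList<>();

    // Make one more information source available through its wrapper.
    void register(SourceWrapper wrapper) {
        wrappers.add(wrapper);
    }

    // Send the same uniform query to every registered wrapper and merge the answers
    // in registration order. A real mediator would also choose among sources.
    List<String> ask(String uniformQuery) {
        List<String> merged = new ArrayList<>();
        for (SourceWrapper wrapper : wrappers) {
            merged.addAll(wrapper.query(uniformQuery));
        }
        return merged;
    }
}
```

Because `SourceWrapper` has a single method, a wrapper for a new source can be plugged in as a lambda or as a full class without changing the mediator.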

20 Web Wrapper for Naver.com

21 Web Wrapper Generation Steps
1. Analyze the local querying mechanism
   e.g., a search query to Naver.com:
   where=webkr&query=www&xc=&qt=df&f=all&r=&st=s&fd=1&start=101&display=10&domain=&dftf=&qf=1&qvt=0
   Callout labels on the URL: Host Address, Local Path, Information Category (where), Search Query (query), Result start index (start), Result page size (display)
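One way to carry out this analysis step is to split a captured query string into its parameters and inspect them. The following is a minimal sketch; the class name `QueryStringParser` is hypothetical, and the input is the Naver query string quoted on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringParser {
    // Split "a=1&b=&c=3" into an ordered map of parameter names to (possibly empty) values.
    static Map<String, String> parse(String queryString) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : queryString.split("&")) {
            if (pair.isEmpty()) {
                continue; // ignore empty segments such as a trailing '&'
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, ""); // parameter without '='
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = parse(
            "where=webkr&query=www&xc=&qt=df&f=all&r=&st=s&fd=1"
            + "&start=101&display=10&domain=&dftf=&qf=1&qvt=0");
        // Print every parameter so the meaningful ones can be identified by eye.
        for (Map.Entry<String, String> e : p.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Varying one input on the site at a time and re-running the parser shows which parameters carry the query, the start index, and the page size.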

22 Web Wrapper Generation Steps
2. Analyze the result page structure
   Callout labels on the result page: URL, Summary, Title

23 Web Wrapper Generation Steps
3. Develop a mechanism to translate a user query into a local query
4. Develop a result parser to extract information blocks from result pages
5. Integrate the information blocks retrieved from the result pages
6. Convert the integrated information into the format that a mediator or a client can accept

24 Connections Between Web Clients and Servers
- A Web server is a daemon process that executes in the background waiting for some event to occur
Diagram: A Web Browser (Connect, Write, Read) <-> A Web Server (Listen 80, Accept, Process, Return)

25 Sockets
- A socket is an end point for communication between two machines
- A socket is an association of a protocol, address and process to an end point of communication
Diagram: A Web Browser (Connect, Write, Read) <-> Sockets <-> A Web Server (Listen 80, Accept, Process, Return)
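The Listen/Accept/Process/Return and Connect/Write/Read steps in the diagram can be tried on one machine with Java's standard socket classes. This is a toy sketch, not a Web server: the class `TinyServerDemo` and its methods are hypothetical, the "protocol" is a single line of text in each direction, and port 0 is used so the operating system picks a free port instead of port 80.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class TinyServerDemo {
    // Server side: Accept one connection, Read one line, Process it, Return one line.
    static void serveOnce(ServerSocket server) throws IOException {
        try (Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String request = in.readLine();
            out.println("You sent: " + request);
        }
    }

    // Client side: Connect, Write one line, Read the reply.
    static String requestOnce(int port, String line) throws IOException {
        try (Socket sk = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter out = new PrintWriter(sk.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(sk.getInputStream()))) {
            out.println(line);
            return in.readLine();
        }
    }

    // Run the server in a background thread and perform one request against it.
    static String roundTrip(String line) {
        try (ServerSocket server = new ServerSocket(0)) { // Listen on any free port
            Thread t = new Thread(() -> {
                try {
                    serveOnce(server);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            t.start();
            String reply = requestOnce(server.getLocalPort(), line);
            t.join();
            return reply;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("GET /index.html"));
    }
}
```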

26 Accessing Web Contents from Java Programs via Sockets

    import java.net.*;
    import java.io.*;
    …
    // Socket creation
    Socket sk = new Socket("www.icu.ac.kr", 80);

    // Write request
    OutputStream os = sk.getOutputStream();
    PrintWriter pw = new PrintWriter(os);
    pw.println("GET /index.html");
    pw.println();
    pw.flush();

    // Read results
    InputStream is = sk.getInputStream();
    InputStreamReader ips = new InputStreamReader(is);
    BufferedReader in = new BufferedReader(ips);
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }

27 Accessing Web Contents from Java Programs via URL Connections

    import java.net.*;
    import java.io.*;
    …
    // URL object creation
    URL url = new URL("http://www.icu.ac.kr");

    // URL connection creation
    URLConnection urlc = url.openConnection();

    // Read results
    InputStream is = urlc.getInputStream();
    InputStreamReader ips = new InputStreamReader(is);
    BufferedReader in = new BufferedReader(ips);
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }

28 Java String Manipulation Methods for Result Parsing
- int indexOf(String str, int fromIndex)
- int lastIndexOf(String str, int fromIndex)
- boolean startsWith(String prefix)
- boolean endsWith(String suffix)
- boolean matches(String regex)
- String[] split(String regex)
- String substring(int beginIndex, int endIndex)
- String toLowerCase()
- String toUpperCase()
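As an illustration of how these methods combine when parsing a result page, the sketch below uses `indexOf` and `substring` to cut out the text between two markers. The class and method name `Between.between` is hypothetical.

```java
class Between {
    // Return the text between the first `start` marker found at or after fromIndex
    // and the next `end` marker, or null if either marker is missing.
    static String between(String text, String start, String end, int fromIndex) {
        int s = text.indexOf(start, fromIndex);
        if (s < 0) {
            return null;
        }
        s += start.length();            // skip past the start marker itself
        int e = text.indexOf(end, s);   // look for the end marker after it
        if (e < 0) {
            return null;
        }
        return text.substring(s, e);
    }

    public static void main(String[] args) {
        String page = "<html><head><title>ICU Home</title></head></html>";
        System.out.println(between(page, "<title>", "</title>", 0));
    }
}
```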

29 Result Parsing Strategies
- Structure-based Parsing
  - Analyzes Web pages based on tag hierarchies
  - Cannot be used for ill-formed HTML documents
- Pattern-based Parsing
  - Searches for a unique string pattern to locate a result item
  - Needs to identify such unique string patterns first

30 Structure-based Result Parsing
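A minimal sketch of structure-based parsing, assuming the page is well-formed XHTML so that the JDK's standard XML DOM parser can build the tag hierarchy. The class `StructureParser` and its method are hypothetical; real-world HTML is often ill-formed, in which case this parser rejects the page, which is the limitation of the structure-based strategy.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StructureParser {
    // Build a DOM tree from well-formed markup and collect the href of every <a> element.
    static List<String> linkTargets(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            // Ill-formed markup (e.g., an unclosed tag) ends up here.
            throw new IllegalArgumentException("document is not well-formed", e);
        }
    }
}
```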

31 Pattern-based Result Parsing
1. Find a unique pattern to locate a result item, e.g., “
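Pattern-based parsing can be sketched with `java.util.regex`. The pattern below is an assumption made up for illustration (a result item marked as `<a class="result" href="URL">TITLE</a>`); it is not the pattern from the slide, and for a real site the pattern must be found by inspecting that site's result pages. The class name `PatternParser` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternParser {
    // Hypothetical result markup: <a class="result" href="URL">TITLE</a>
    private static final Pattern ITEM =
        Pattern.compile("<a class=\"result\" href=\"([^\"]*)\">([^<]*)</a>");

    // Return one {url, title} pair per result item found in the page text.
    static List<String[]> parse(String html) {
        List<String[]> items = new ArrayList<>();
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            items.add(new String[] { m.group(1), m.group(2) });
        }
        return items;
    }
}
```

Unlike the structure-based approach, this keeps working on ill-formed pages, but it breaks as soon as the site changes the markup around its result items.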

32 Java Implementation of Web Wrapper

    public void WebWrapper(String host, String path, String query, int startIndex, int pageSize) {
        try {
            // Query translation
            // Note: "1" is appended as a digit, so startIndex 10 gives start=101
            String address = "http://" + host + path + "?where=webkr" + "&query=" + query +
                             "&start=" + startIndex + "1" + "&display=" + pageSize;
            URL url = new URL(address);
            URLConnection urlc = url.openConnection();
            urlc.setRequestProperty("Accept", "*/*");
            urlc.setRequestProperty("User-Agent", "Mozilla/4.0");
            InputStream is = urlc.getInputStream();
            InputStreamReader ips = new InputStreamReader(is);
            BufferedReader in = new BufferedReader(ips);
            String line;
            // Parsing results
            while ((line = in.readLine()) != null) {
                // System.out.println(line);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
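The query-translation step concatenates the user's query into the URL unchanged, which fails for queries that contain spaces, `&`, or non-ASCII text. A hedged variant that URL-encodes the query first is sketched below; the class `QueryTranslator`, the method `toLocalQuery`, and the choice of UTF-8 are assumptions (the encoding a given site expects has to be checked), and the start index is passed through as given.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryTranslator {
    // Build the local query URL, encoding the user's query so that characters
    // such as ' ' and '&' cannot be mistaken for URL syntax.
    static String toLocalQuery(String host, String path, String query,
                               int start, int pageSize) {
        try {
            return "http://" + host + path
                 + "?where=webkr"
                 + "&query=" + URLEncoder.encode(query, "UTF-8")
                 + "&start=" + start
                 + "&display=" + pageSize;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```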

33 Web Robots
- A Web robot is a program (agent) that collects information while following all the links on a Web page
- Web Robots = Crawlers = Spiders
- Web search engines use Web robots to collect and index Web documents
- A tag to tell Web robots not to index a page:
- Crawling methods:
  - Breadth-first crawling
  - Depth-first crawling
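The difference between the two crawling methods is only the order in which discovered links are visited. The sketch below shows both orders on an in-memory link graph instead of the live Web; the class `CrawlOrder` and the map-based graph representation are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlOrder {
    // Breadth-first: visit every page linked from the seed before going one level deeper.
    static List<String> breadthFirst(Map<String, List<String>> links, String seed) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.addLast(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String page = queue.removeFirst();
            visited.add(page);
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                if (seen.add(next)) {      // true only the first time a page is seen
                    queue.addLast(next);
                }
            }
        }
        return visited;
    }

    // Depth-first: follow the first link as far as possible before backtracking.
    static List<String> depthFirst(Map<String, List<String>> links, String seed) {
        List<String> visited = new ArrayList<>();
        visit(links, seed, new HashSet<>(), visited);
        return visited;
    }

    private static void visit(Map<String, List<String>> links, String page,
                              Set<String> seen, List<String> visited) {
        if (!seen.add(page)) {
            return;                        // already crawled; avoids loops
        }
        visited.add(page);
        for (String next : links.getOrDefault(page, Collections.emptyList())) {
            visit(links, next, seen, visited);
        }
    }
}
```

In a real robot the map lookup would be replaced by fetching the page and extracting its links; the `seen` set is what keeps the crawler from revisiting pages that link back to each other.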

34 Breadth First Crawlers

35 Depth First Crawlers

36 Web-based Information Management Applications (Example Scenario)
Identify Recurring Disaster Areas in China, e.g. Locations of Floods
- Input: a Web document collection about ‘China disasters’
- For each map layer displayed, get the set of place names and classify the documents based on the place names
- Classify documents based on the disaster types mentioned
- Cross-product between place names and the disaster-type categories
- Plot the document clusters on the map to figure out the major flooding areas
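The cross-product step in this scenario can be sketched as counting, for every (place name, disaster type) pair, the documents that mention both. The class `CrossProduct`, the simple case-insensitive substring matching, and the key format are assumptions for illustration; the actual services used in the scenario are not described on the slide.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CrossProduct {
    // For each (place, type) pair, count the documents whose text mentions both.
    // Pairs with no matching document do not appear in the result.
    static Map<String, Integer> count(List<String> docs, List<String> places,
                                      List<String> types) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            String text = doc.toLowerCase();
            for (String place : places) {
                if (!text.contains(place.toLowerCase())) {
                    continue;
                }
                for (String type : types) {
                    if (text.contains(type.toLowerCase())) {
                        counts.merge(place + " x " + type, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }
}
```

The resulting counts per pair are what would then be plotted on the map as document clusters.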

37 Web-based Information Management Applications (Example App. Design)
Diagram labels: Keyword Editor, Keyword Extractor, Search Engines, Place Name Generator, Place Name Extractor, Product Categories Mapping Clusters
Diagram legend: Pipelined components; Sequential connection; Pipelined connection
Diagram annotation: Generate multiple sets of place names

38 Problems in Composing Large-scale Information Management Applications
- Time-consuming to explore and test a large number of options
- Hard to choose appropriate services for collections
- Hard to quickly substitute and test a service within a sequence of steps
- Difficulties of capturing and reusing shared patterns of information management steps
- Difficult to record and recurrently perform information management steps
- Necessity of extracting abstract patterns of information management steps and reusing them
- Hard to cope with dynamic aspects of Web resources

39 Characteristics of Large-scale Information Management Tasks
- Incremental development of information management steps for an abstract task goal
- Recurrent executions of the steps
- Evolving requirements of users
- Shared patterns of management steps
- Collection-based information processing
- Dynamic aspects of information sources and services
- Large and growing number of component services

40 Improvement Goals
- Significantly reduce construction time, keeping costs low
- Enable very rapid construction/adaptation of new applications
- Provide static and run-time diagnostic tools, facilitating debugging and performance tuning tasks
Rapid Composition and Reconfiguration of Large-scale Custom Applications

41 Programming Homework #2
- Due date: July 12, 2005
- Create a Web wrapper
  - Pick a Web site (e.g., a search engine, an on-line bookstore) that accepts a query and returns text-based results
  - Develop a Web wrapper program that
    - Accepts a user’s input via a different user interface
    - Generates a query based on the input
    - Connects to the Web site and sends the query
    - Receives and parses the result documents
    - Stores the parsed results into a formatted data file

42 Project Proposal Assignment
- Due Date: July 14, 2005
- Develop a Web-based information integration scenario that includes the following:
  - Which Web sources to access
  - Which information will be collected from the sources
  - How the information will be integrated and presented
- Submit a short (less than 5 pages) proposal document that includes the following contents:
  - Objectives of the project
  - Web-based information integration scenario
  - Development schedule

