Unicode Normalize Engine Submitted by: Jose Yallouz Shlomi Ben-Shabat Supervisor: Maxim Gurevich.

Agenda
- Project Goals
- Background
- Preliminary Examination
- Unicode Normalize Design
- Application Analyses
- Summary and Conclusion

Project Goals
- Recognize the encoding of web pages.
- Translate each web page to UTF-8.
- Normalize the web into a single encoding standard, UTF-8.

Background - Definitions
- Character set: the collection of characters that can be represented.
- Character encoding: the bit representation of a character set.
- Unicode: a character set that includes the characters of most of the world's writing systems.
- UTF-8: a character encoding of Unicode used on the web.

Recognizing Encodings
- HTML meta tag
- HTTP protocol: Content-Type: text/html; charset=windows-1255
- BOM (byte order mark) tag: EF BB BF (shown as "ï»¿" when the bytes are read as Latin-1)
- Auto detection: based on Firefox
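The first three recognition methods lend themselves to a short Java sketch. The class name `EncodingHints`, its method names, the regular expressions, and the 4096-byte scan window are all assumptions made for illustration; the slides do not show the project's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects encoding hints from the BOM, the HTTP header, and the meta tag.
final class EncodingHints {
    // Matches charset=NAME, tolerating spaces around '=' and an optional quote.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // Same idea, but only inside a <meta ...> tag.
    private static final Pattern META = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // True when the page starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] page) {
        return page.length >= 3
            && (page[0] & 0xFF) == 0xEF
            && (page[1] & 0xFF) == 0xBB
            && (page[2] & 0xFF) == 0xBF;
    }

    // Extracts the charset parameter from a Content-Type header value, if present.
    static Optional<String> fromContentType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Scans the start of the page for a meta tag that declares a charset.
    static Optional<String> fromMetaTag(byte[] page) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives for any page encoding
        // that is ASCII-compatible (an assumption of this sketch).
        int window = Math.min(page.length, 4096);
        String head = new String(page, 0, window, StandardCharsets.ISO_8859_1);
        Matcher m = META.matcher(head);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset =windows-1255"));
    }
}
```

Auto detection is left out on purpose: the slides only say it is based on Firefox, and no detector API is named.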

Preliminary Examination System
- The first 100 results of a Google search
- All languages supported by Google
- Goals:
  - Success rate of each recognition method
  - Contradiction cases
  - Encodings supported by Java
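Whether a declared encoding is supported by Java can be checked with the standard `java.nio.charset.Charset` API. The sketch below is only an illustration; the class name `JavaCharsetSupport` and the method `isUsable` are hypothetical and not taken from the project.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper: tells whether the running JVM can decode a declared encoding name.
final class JavaCharsetSupport {
    static boolean isUsable(String declaredName) {
        if (declaredName == null || declaredName.isEmpty()) {
            return false;
        }
        try {
            // Charset.isSupported returns false for unknown names
            // and throws for syntactically illegal ones.
            return Charset.isSupported(declaredName.trim());
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsable("UTF-8"));
    }
}
```

Pages whose declared encoding fails this check would need one of the other recognition methods.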

Examination Results
- The BOM tag is very reliable.
- When the HTTP header and the meta tag contradict each other, HTTP is mostly correct.
- Auto detection is very reliable when recognizing UTF-8.
- For encodings other than UTF-8, auto detection is reliable only when a language indication is given.
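These findings suggest a priority order among the recognition methods. The Java sketch below is one possible reading, not the project's actual decision tree: the class `EncodingDecision`, the `decide` signature, the exact ordering (in particular, trusting a UTF-8 auto-detection result before the HTTP header), and returning `null` when nothing is known are all assumptions.

```java
// Hypothetical priority order derived from the examination results.
final class EncodingDecision {
    /**
     * @param hasUtf8Bom true if the page starts with EF BB BF
     * @param http       charset from the HTTP Content-Type header, or null
     * @param meta       charset from the HTML meta tag, or null
     * @param auto       charset guessed by auto detection, or null
     * @return the chosen encoding name, or null if no method produced one
     */
    static String decide(boolean hasUtf8Bom, String http, String meta, String auto) {
        if (hasUtf8Bom) {
            return "UTF-8";              // the BOM tag is very reliable
        }
        if ("UTF-8".equalsIgnoreCase(auto)) {
            return "UTF-8";              // auto detection is very reliable for UTF-8
        }
        if (http != null) {
            return http;                 // HTTP is mostly correct when it contradicts meta
        }
        if (meta != null) {
            return meta;
        }
        return auto;                     // least trusted without a language indication
    }

    public static void main(String[] args) {
        System.out.println(decide(false, "windows-1255", "ISO-8859-1", null));
    }
}
```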

Translation Decision (diagram). Labels: URL, HTTP Header, HTML, BOM tag, META, HTTP, Auto Detection, Unicode Output

Unicode Normalize Design
- Recognition System
  - The four methods mentioned above
  - Heuristic decision tree
- Translation System
  - Translates a web page into UTF-8 using Java's translation mechanism.
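The translation step can be approximated with Java's standard charset support: decode the page bytes with the recognized encoding, then re-encode the text as UTF-8. This is a minimal sketch under that assumption; the class name `Utf8Translator` and the method `toUtf8` are hypothetical, and the slides do not say which Java classes the project uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical translation step: source encoding in, UTF-8 bytes out.
final class Utf8Translator {
    static byte[] toUtf8(byte[] page, String sourceEncoding) {
        // Throws an unchecked exception if the name is illegal or unsupported.
        Charset source = Charset.forName(sourceEncoding);
        // Malformed or unmappable bytes are replaced with U+FFFD by this constructor.
        String text = new String(page, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};               // e with acute accent in ISO-8859-1
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        System.out.println(utf8.length);             // the same character takes two bytes in UTF-8
    }
}
```

A stricter variant could use a `CharsetDecoder` configured to report malformed input instead of replacing it, which would expose wrong recognition decisions.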

Class Diagram

Recognition System

Decision heuristic

Class Diagram

Translation System

Problems and Solutions
- Left to right: the ISO (Hebrew visual) encoding specification defines that Hebrew characters are written in inverted order.
- Solution: the system checks for the ISO encoding and, when it is detected, inverts the order of the Hebrew characters.
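Inverting the Hebrew characters could look roughly like the Java sketch below. It is a simplification, not the project's method: it assumes that reversing each maximal run of characters from the Unicode Hebrew block (U+0590 to U+05FF) is sufficient, whereas a full visual-to-logical conversion would also have to handle spaces, digits, and punctuation through the Unicode bidirectional algorithm. The class and method names are hypothetical.

```java
// Hypothetical fix-up for visually ordered Hebrew text after decoding.
final class HebrewVisualFix {
    // Assumption: only the basic Hebrew block counts as Hebrew.
    static boolean isHebrew(char c) {
        return c >= 0x0590 && c <= 0x05FF;
    }

    // Reverses every maximal run of Hebrew characters; all other characters keep their position.
    static String reverseHebrewRuns(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (isHebrew(s.charAt(i))) {
                int j = i;
                while (j < s.length() && isHebrew(s.charAt(j))) {
                    j++;
                }
                out.append(new StringBuilder(s.substring(i, j)).reverse());
                i = j;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseHebrewRuns("abc"));
    }
}
```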

Translation Example: before, after

Application Analyses
Two kinds of analyses were performed on our application:
- Google analysis
  - This analysis checks the first 100 Google results in each language Google supports.
  - This analysis checked about web pages.
  - The average detection rate across all languages is about 97 percent.

Application Analyses (cont.)
- ODP analysis
  - The Open Directory Project (ODP) is a widely distributed database of Web content classified by humans.
  - This analysis checks about random pages of the ODP database.
  - The average detection rate across all languages is about percent.

Google analysis

ODP analysis

Application Usage
- Client usage: a client browser can use this system to display different web pages in a single encoding format, UTF-8.
- Server usage: a web server can use this system to translate its stored pages into UTF-8.
- Processing usage: web page processing systems, such as search engines, can use our system to convert pages into the standard Unicode encoding.

Future Project Proposals
- Implementation of the application in the Firefox browser
- Implementation of the application on the Apache server
- Design of a new auto-detection method (based on an encoding dictionary)

Summary and Conclusion
- We built an efficient system that translates a page to UTF-8 encoding.
- Analyses show a 93 percent success rate.
- Implementing the application will improve the web surfing experience for millions of users all over the world.

Questions? Thank you!