Unicode Normalize Engine. Submitted by: Jose Yallouz, Shlomi Ben-Shabat. Supervisor: Maxim Gurevich.


1 Unicode Normalize Engine. Submitted by: Jose Yallouz, Shlomi Ben-Shabat. Supervisor: Maxim Gurevich

2 Agenda: Project Goals; Background; Preliminary Examination; Unicode Normalize Design; Application Analyses; Summary and Conclusion

3 Project Goals: Recognize the encoding of web pages. Translate web pages to UTF-8. Normalize the web into a single encoding standard: UTF-8.

4 Background - Definitions. Character Set: the collection of characters that can be represented. Character Encoding: the bit representation of a character set. Unicode: a character set that includes the characters of most of the world's writing systems. UTF-8: a character encoding of Unicode that is widely used on the web.
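To make the character set versus character encoding distinction concrete, the following minimal sketch prints the bytes that one Unicode character produces under different encodings. The class and method names (`EncodingDemo`, `hexBytes`) are illustrative and do not come from the project.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper (not part of the original project): shows that one
// character of the Unicode character set maps to different byte sequences
// depending on the character encoding chosen.
final class EncodingDemo {
    /** Encodes the text with the given charset and returns the bytes as hex. */
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(hexBytes(eAcute, StandardCharsets.ISO_8859_1)); // E9
        System.out.println(hexBytes(eAcute, StandardCharsets.UTF_8));      // C3 A9
        System.out.println(hexBytes(eAcute, StandardCharsets.UTF_16BE));   // 00 E9
    }
}
```

The same character is one byte in ISO-8859-1 and two bytes in UTF-8, which is why a page's bytes cannot be interpreted correctly without knowing its encoding.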

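5 Recognizing Encodings: (1) HTML meta tag. (2) HTTP protocol header, e.g. Content-Type: text/html; charset=windows-1255. (3) BOM (byte order mark) tag: the bytes EF BB BF at the start of a UTF-8 file (rendered as "ï»¿" when misread as a single-byte Western encoding). (4) Auto detection, based on Firefox.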
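Two of these signals are simple enough to sketch directly: checking for the UTF-8 BOM and pulling the charset parameter out of a Content-Type value (the same parsing applies to the content attribute of an HTML meta tag). This is a minimal sketch under assumed names (`EncodingHints`, `hasUtf8Bom`, `charsetFromContentType`), not the project's actual classes.

```java
import java.util.Optional;

// Illustrative sketch (hypothetical names): two of the four recognition signals.
final class EncodingHints {
    /** Returns true if the data begins with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=windows-1255". Tolerates spaces around '='.
     */
    static Optional<String> charsetFromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue; // not a name=value parameter
            }
            String name = part.substring(0, eq).trim();
            String value = part.substring(eq + 1).trim().replace("\"", "");
            if (name.equalsIgnoreCase("charset") && !value.isEmpty()) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

For example, `EncodingHints.charsetFromContentType("text/html; charset =windows-1255")` yields `windows-1255`, while a value without a charset parameter yields an empty result.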

6 Preliminary Examination. System: the first 100 Google search results, for all languages supported by Google. Goals: measure the success rate of each recognition method, identify contradiction cases, and check which encodings are supported by Java.

7 Examination Results: The BOM tag is very reliable. When the HTTP header and the meta tag contradict each other, the HTTP header is mostly correct. Auto detection is very reliable when recognizing UTF-8. For encodings other than UTF-8, auto detection is reliable only when a language indication is given.

8 Translation Decision (diagram; labels: HTML, HTTP Header, URL, BOM tag, Auto Detection, META, HTTP, Unicode Output)
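The decision diagram itself did not survive in this transcript. As a rough sketch, one priority order that is consistent with the examination results on slide 7 would be: BOM first, then the HTTP header, then the meta tag, then auto detection (trusted for UTF-8, or for other encodings only when the language is known). The ordering, the class name `EncodingDecision`, and the `decide` signature are all assumptions, not the authors' actual heuristic tree.

```java
import java.util.Optional;

// Assumed priority order, inferred from the examination results only;
// the project's real decision tree may differ.
final class EncodingDecision {
    static Optional<String> decide(boolean hasUtf8Bom,
                                   Optional<String> httpCharset,
                                   Optional<String> metaCharset,
                                   Optional<String> autoDetected,
                                   boolean languageKnown) {
        // The BOM was found to be very reliable, so it wins outright.
        if (hasUtf8Bom) {
            return Optional.of("UTF-8");
        }
        // On contradiction, HTTP was mostly correct, so it precedes the meta tag.
        if (httpCharset.isPresent()) {
            return httpCharset;
        }
        if (metaCharset.isPresent()) {
            return metaCharset;
        }
        // Auto detection: trusted for UTF-8, otherwise only with a language hint.
        if (autoDetected.isPresent()
                && (autoDetected.get().equalsIgnoreCase("UTF-8") || languageKnown)) {
            return autoDetected;
        }
        return Optional.empty(); // no trustworthy signal
    }
}
```

Returning an empty result rather than a default encoding leaves the fallback policy to the caller, since the slides do not say what the system does when every signal fails.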

9 Unicode Normalize Design. Recognition System: the four methods mentioned above, combined by a heuristic decision tree. Translation System: translates a web page into UTF-8, using Java's translation mechanism.
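The slides do not show which Java API the translation step uses. A minimal sketch with the standard `java.nio.charset` classes, assuming the source encoding has already been recognized, could look like this (`Utf8Translator` and `toUtf8` are hypothetical names):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: decode with the recognized charset, re-encode as UTF-8.
final class Utf8Translator {
    /**
     * Converts raw page bytes from the named source encoding to UTF-8.
     * Charset.forName throws an unchecked exception if the name is illegal
     * or the charset is not supported by the running JVM.
     */
    static byte[] toUtf8(byte[] raw, String sourceCharsetName) {
        Charset source = Charset.forName(sourceCharsetName);
        // Malformed input is replaced with the charset's replacement character.
        String text = new String(raw, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

Which charset names are accepted depends on the Java runtime, which is presumably why the preliminary examination checked the encodings supported by Java.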

10 Class Diagram

11 Recognition System

12 Decision heuristic

13 Class Diagram

14 Translation System

15 Problems and solutions. Problem (left to right): by convention, pages labeled ISO-8859-8 (visual Hebrew) store Hebrew text in visual order, so the characters appear in the reverse of their logical reading order. Solution: the system checks for the ISO-8859-8 encoding, and when it is detected it reverses the order of the Hebrew characters.
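The slide does not describe the reversal in detail. A simplified sketch, assuming it operates on decoded text one line at a time, reverses each run of Hebrew letters together with the spaces between them and leaves everything else in place. The names `VisualHebrew` and `visualToLogical` are hypothetical, and a full visual-to-logical conversion would also need to handle digits, punctuation, and mirrored brackets.

```java
// Simplified sketch (hypothetical names): reverse runs of Hebrew letters,
// including spaces between Hebrew words, within a single line of text.
final class VisualHebrew {
    /** True for the Hebrew letters alef through tav (U+05D0..U+05EA). */
    static boolean isHebrewLetter(char c) {
        return c >= 0x05D0 && c <= 0x05EA;
    }

    static String visualToLogical(String line) {
        StringBuilder out = new StringBuilder(line.length());
        int i = 0;
        while (i < line.length()) {
            if (!isHebrewLetter(line.charAt(i))) {
                out.append(line.charAt(i));
                i++;
                continue;
            }
            // Extend the run across Hebrew letters and the spaces between them,
            // remembering the position of the last Hebrew letter seen.
            int lastHebrew = i;
            int j = i + 1;
            while (j < line.length()
                    && (isHebrewLetter(line.charAt(j)) || line.charAt(j) == ' ')) {
                if (isHebrewLetter(line.charAt(j))) {
                    lastHebrew = j;
                }
                j++;
            }
            String run = line.substring(i, lastHebrew + 1);
            out.append(new StringBuilder(run).reverse());
            i = lastHebrew + 1;
        }
        return out.toString();
    }
}
```

Reversing whole runs rather than individual words matters because visual order flips the word order as well as the letters within each word.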

16 Translate Example: before, after

17 Application Analyses. Two kinds of analyses were performed with our application. Google analysis: this analysis checks the first 100 Google results in each language Google supports, about 10,000 web pages in total. The average detection rate across all languages is about 97 percent.

18 Application Analyses (cont.) ODP analysis: the Open Directory Project (ODP) is a widely distributed database of Web content classified by humans. This analysis checks about 150,000 random pages from the ODP database. The average detection rate across all languages is about 92.6 percent.

19 Google analysis

20 ODP analysis

21 Application Usage. Client usage: a client browser can use this system to show web pages of different encodings in one encoding format, UTF-8. Server usage: a web server can use this system to translate its stored pages into UTF-8. Processing usage: web page processing systems, such as search engines, can use our system to convert pages into the standard Unicode encoding.

22 Future Project Proposals: Implementation of the application in the Firefox browser. Implementation of the application in the Apache server. Design of a new auto-detection method (based on an encoding dictionary).

23 Summary and Conclusion: We built an efficient system that translates a page to the UTF-8 encoding. Analyses show a success rate of about 93 percent. Implementing the application will improve the web surfing experience for millions of users all over the world.

24 Questions THANK YOU!

