Tokeniser Francisco Miguel Pérez Romero University of Sevilla.


1 Tokeniser Francisco Miguel Pérez Romero University of Sevilla

2 Roadmap Introduction Class Diagram Libraries Conclusions

3 Roadmap Introduction Class Diagram Libraries Conclusions

4 Web Wrapping Information retrieval Verifier Ontologiser Extractor Query Navigator Form Filler

5 Tokeniser Tokenisation Rules Configuration File Web Page Parser

6 Tokeniser Usage Web Page Classification Information Extraction Learners Information Extraction

7 Example Config File Token List Web Page Tokeniser XML File Token List
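The flow on this slide (a configuration file of tokenisation rules plus a web page go in, a token list comes out) can be sketched in Java with java.util.regex. This is only an illustration under stated assumptions: the class TokeniserSketch, the Token type, the rule names TAG, NUMBER and WORD and their patterns are hypothetical and do not come from the presentation, and an in-memory ordered map stands in for the configuration file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokeniserSketch {

    /** A token: the name of the rule that matched plus the matched text. */
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    /**
     * Hypothetical stand-in for the configuration file: an ordered map from
     * token type to regular expression. Earlier rules take priority.
     */
    static Map<String, String> defaultRules() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("TAG", "</?[A-Za-z][^>]*>");
        rules.put("NUMBER", "\\d+(?:\\.\\d+)?");
        rules.put("WORD", "\\p{L}+");
        return rules;
    }

    /** Scans the page left to right and emits one token per rule match. */
    static List<Token> tokenise(String page, Map<String, String> rules) {
        // Build one alternation with a named group per rule, e.g. (?<TAG>...)|(?<NUMBER>...)
        StringBuilder combined = new StringBuilder();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (combined.length() > 0) {
                combined.append('|');
            }
            combined.append("(?<").append(rule.getKey()).append('>')
                    .append(rule.getValue()).append(')');
        }
        Matcher m = Pattern.compile(combined.toString()).matcher(page);

        List<Token> tokens = new ArrayList<>();
        while (m.find()) {
            // The named group that is non-null tells us which rule fired.
            for (String type : rules.keySet()) {
                if (m.group(type) != null) {
                    tokens.add(new Token(type, m.group()));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("<b>Price: 12.50</b>", defaultRules()));
    }
}
```

Text that no rule matches (whitespace and punctuation here) is skipped silently. A real tokeniser would probably need a policy for such text, which the slides do not describe.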

8 Concepts Configuration File Token Tokenisation types

9 Roadmap Introduction Class Diagram Libraries Conclusions

10 Example

11 Class Diagram: Tokenisation

12 Tokenisation Example

13 Class Diagram: Tokeniser

14 Roadmap Introduction Class Diagram Libraries Conclusions

15 Comparison Features 1: Javadoc documentation? Unicode UTF-8 support? Unicode UTF-16 support? Named groups? Indexable groups > 9? Negative groups? Nested groups? Lazy quantifiers?
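To make some of these features concrete, the following sketch shows named groups, a group index above 9 and lazy versus greedy quantifiers using java.util.regex (named groups require Java 7 or later). The class RegexFeatureDemo, its method names and the sample patterns are hypothetical and chosen only for illustration; the other libraries in the comparison may expose these features differently or not at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFeatureDemo {

    /** Named group: returns the href value of the first anchor tag, or null. */
    static String firstHref(String html) {
        Matcher m = Pattern.compile("<a\\s+href=\"(?<url>[^\"]*)\"").matcher(html);
        return m.find() ? m.group("url") : null;
    }

    /** Lazy quantifier: .+? stops at the first closing angle bracket. */
    static String firstTagLazy(String html) {
        Matcher m = Pattern.compile("<.+?>").matcher(html);
        return m.find() ? m.group() : null;
    }

    /** Greedy quantifier: .+ runs to the last closing angle bracket. */
    static String firstTagGreedy(String html) {
        Matcher m = Pattern.compile("<.+>").matcher(html);
        return m.find() ? m.group() : null;
    }

    /** Indexable groups above 9: reads capture group number 11. */
    static String eleventhGroup(String s) {
        Matcher m = Pattern.compile("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)").matcher(s);
        return m.find() ? m.group(11) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstHref("<a href=\"http://example.org\">x</a>"));
        System.out.println(firstTagLazy("<b>bold</b>"));
        System.out.println(firstTagGreedy("<b>bold</b>"));
        System.out.println(eleventhGroup("abcdefghijk"));
    }
}
```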

16 Comparison Features 2: Fuzzy matching? POSIX support? Ignore-case support? New-line option support? Uses a state machine? Accent support?
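As a rough illustration of the ignore-case and new-line options, the sketch below uses the corresponding flags of java.util.regex.Pattern. It assumes that "new line option" refers to something like MULTILINE or DOTALL and that accent handling relates to Unicode-aware case folding; the slides do not say so explicitly. The class RegexFlagDemo and its methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFlagDemo {

    /** Ignore case, including non-ASCII letters such as accented vowels. */
    static boolean matchesIgnoringCase(String regex, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    /** MULTILINE: ^ matches at the start of every line, not only at the start of input. */
    static int countLinesStartingWith(String literal, String text) {
        Pattern p = Pattern.compile("^" + Pattern.quote(literal), Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** DOTALL: the dot also matches line terminators. */
    static boolean dotMatchesNewline(String input) {
        return Pattern.compile("a.b", Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesIgnoringCase("p\u00e9rez", "P\u00c9REZ"));
        System.out.println(countLinesStartingWith("token", "token one\ntoken two\nno token"));
        System.out.println(dotMatchesNewline("a\nb"));
    }
}
```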

17 Libraries Tabla 1

18 Libraries Tabla 2

19 Libraries Tabla 3

20 Benchmark 1 Regular Expression List String List Matching all one another Time in ms

21-23 Benchmark 1: time in ms for 10000, 20000 and 50000 iterations

Library                        10000   20000   50000
org.apache                      7078   11796   28656
com.stevesoft                  19782   26641   63297
kmy.regex                        781     906    1781
java.util                       1266    1891    4281
jregex.Pattern                  1000    1422    3219
org.apache.oro                  2156    3375    7641
dk.brics.automaton               265     312     531
com.karneim.util.collection      407     610    1312

24 Diagram

25 Benchmark 2 Source Code Matching tags

26-28 Benchmark 2: time in ms for the Amazon, Marca and Ebay pages

Library                       Amazon   Marca    Ebay
org.apache                       218      62      31
com.stevesoft                     63      47     125
kmy.regex                         94      93     266
java.util                          0       0       0
jregex.Pattern                    93      94     156
org.apache.oro                    32      16      47
dk.brics.automaton                 0       0       0
com.karneim.util.collection       47      62     172

29 Diagram

30 To sum up: dk.brics.automaton is the fastest. dk.brics.automaton and com.karneim fail with URLs. That leaves kmy.regex or java.util

31 Roadmap Introduction Class Diagram Libraries Conclusions

32 Tokenisation test Searching information A real project Experience

33 Thanks!

