A Text Filtering Method For Digital Libraries
Mustafa Zafer BOLAT, Hayri SEVER
Baskent University, Computer Engineering


1 Baskent University Text Filtering
A Text Filtering Method For Digital Libraries
Mustafa Zafer BOLAT, Hayri SEVER
Baskent University Computer Engineering Department

2 Introduction
Information filtering (IF)
- Incoming non-relevant documents are filtered out.
Information retrieval (IR)
- Provides a list of documents ordered by their similarity to the user query.

3 Introduction (continued)
- Linear separation: partitions relevant and non-relevant documents into distinct blocks.
- Optimal queries: rank all relevant documents ahead of non-relevant ones.
- Steepest Descent Algorithm (SDA)

4 Preliminaries
An information retrieval system S can be defined as a 5-tuple S = (T, D, Q, V, f):
- T: set of ordered index terms
- D: set of documents
- Q: set of queries
- V: set of real numbers
- $f: D \times Q \to V$: retrieval function

5 Preliminaries (continued)
Vector Space Model
- Transformation of raw text into more computationally useful forms
- Documents and queries are represented as vectors of weighted terms:
  $d = (t_1, w_{d1};\ t_2, w_{d2};\ \ldots;\ t_n, w_{dn})$, with $t_i \in T \cap d$
  $q = (q_1, w_{q1};\ q_2, w_{q2};\ \ldots;\ q_m, w_{qm})$, with $q_i \in T \cap q$
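
The weighted-term representation above maps naturally onto sparse dictionaries. The following is a minimal sketch, assuming the retrieval function f is the inner product of the two vectors; the slides do not state which similarity was used, and the name `retrieval_value` and the query weights are illustrative only.

```python
# Sparse term-weight vectors: term -> weight.
# ASSUMPTION: f(d, q) is the inner product; the slides do not specify f.

def retrieval_value(doc, query):
    """Return the inner product of two sparse term-weight vectors."""
    # Iterate over the shorter vector to keep the loop small.
    if len(doc) > len(query):
        doc, query = query, doc
    return sum(weight * query.get(term, 0.0) for term, weight in doc.items())


# Hypothetical document and query vectors.
document = {"meat": 0.278, "group": 0.127, "standard": 0.012}
query = {"meat": 1.0, "trade": 0.5}
score = retrieval_value(document, query)
```

Only terms shared by the document and the query contribute, so here the score depends on "meat" alone.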

6 Preliminaries (continued)
Rnorm value for effectiveness
- It measures how relevant documents are distributed relative to non-relevant ones, so rank matters.

7 Preliminaries (continued)
Rnorm value for effectiveness
- It measures how relevant documents are distributed relative to non-relevant ones, so rank matters.
- S+: number of document pairs where the preferred document is ranked higher
- S-: number of document pairs where the non-preferred document is ranked higher
- S+max: maximal possible value of S+
Example: for the ranking (rnrn | rnnnnn), S+ = 10, S- = 2, S+max = 21
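
The example counts can be checked with a short sketch. It rests on two assumptions that the slide text does not state: the "|" separates rank levels whose members are tied, so pairs inside one level count toward neither S+ nor S-; and Rnorm is computed with the commonly cited formula Rnorm = 1/2 (1 + (S+ - S-) / S+max). The function name `rnorm` is illustrative.

```python
def rnorm(levels):
    """Compute (S+, S-, S+max, Rnorm) for a ranking given as rank levels.

    `levels` is a list of strings over 'r' (relevant) and 'n' (non-relevant),
    best level first. ASSUMPTION: documents within one level are tied.
    """
    total_r = sum(level.count("r") for level in levels)
    total_n = sum(level.count("n") for level in levels)
    s_max = total_r * total_n
    if s_max == 0:
        raise ValueError("need at least one relevant and one non-relevant document")

    s_plus = s_minus = 0
    r_above = n_above = 0  # documents seen in strictly higher levels
    for level in levels:
        r, n = level.count("r"), level.count("n")
        s_plus += r_above * n    # relevant ranked above non-relevant
        s_minus += n_above * r   # non-relevant ranked above relevant
        r_above += r
        n_above += n

    # ASSUMPTION: standard Rnorm formula, not shown on the slide.
    value = 0.5 * (1 + (s_plus - s_minus) / s_max)
    return s_plus, s_minus, s_max, value


s_plus, s_minus, s_max, value = rnorm(["rnrn", "rnnnnn"])
print(s_plus, s_minus, s_max, f"{value:.3f}")  # 10 2 21 0.690
```

Under these assumptions the ranking from the slide reproduces S+ = 10, S- = 2 and S+max = 21, and a ranking with every relevant document ahead of every non-relevant one gives the maximum value of 1.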

8 Preliminaries (continued)
Contingency table

                          actual relevant   actual non-relevant
predicted relevant               a                   b
predicted non-relevant           c                   d

Precision = a / (a + b)
Recall = a / (a + c)
Breakeven point: where precision and recall are equal
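
As a small worked example of the two ratios, the sketch below computes precision and recall from the cells a, b and c of the contingency table. The function names and the sample counts are illustrative, and returning 0.0 for an empty denominator is a convention chosen here, not something taken from the slides.

```python
def precision(a, b):
    """a: predicted relevant and actually relevant; b: predicted relevant but not relevant."""
    return a / (a + b) if (a + b) else 0.0


def recall(a, c):
    """a: predicted relevant and actually relevant; c: relevant but predicted non-relevant."""
    return a / (a + c) if (a + c) else 0.0


# Hypothetical counts: 8 correct hits, 2 false alarms, 8 misses.
p = precision(8, 2)
r = recall(8, 8)
```

With these counts precision is 0.8 and recall is 0.5, so this operating point is not the breakeven point.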

9 Overview of experiment
[Flow diagram: Reuters-21578 data set -> train / test -> Preprocessing (Parsing, Removing stop words, Stemming, Transform to Vectors, Reducing, Normalizing) -> Training with SDA -> Optimal query -> Effectiveness measures; Topics]

10 Overview of experiment (Reuters-21578 data set)
- Consists of 21578 economic news stories that originally appeared on the Reuters newswire in 1987.
- Each story has been manually assigned one or more indexing labels from a fixed list.
- There are 135 TOPIC labels for classification.
- To use the text corpus for machine learning research, it is split into sets of training and testing examples.

11 Overview of experiment (Preprocessing)
Sample Reuters-21578 document:
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="9944" NEWID="5031">
13-MAR-1987 15:45:35.38
livestock carcass
usa
ec
U.S. MEAT GROUP TO FILE TRADE COMPLAINTS
WASHINGTON, March 13 - The American Meat Institute, AME,said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards.
Reuter

12 Overview of experiment (Preprocessing: Parsing)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute, AME,said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

13 Overview of experiment (Preprocessing: After Parsing)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U S MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute AME said it intended to ask the U S government to retaliate against a European Community meat inspection requirement AME President C Manly Molpus also said the industry would file a petition challenging Korea's ban of U S meat products Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section of the General Agreement on Tariffs and Trade against an EC directive that effective April will require U S meat processing plants to comply fully with EC standards
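
Comparing the body text before and after parsing, punctuation and numbers disappear while an apostrophe inside a word ("Korea's") survives. A minimal sketch of a tokenizer with that behavior follows; the function name `parse_tokens` and the regular expression are assumptions made for illustration, since the slides do not show the actual parser.

```python
import re


def parse_tokens(text):
    """Split text into word tokens, dropping punctuation and digits."""
    # Replace every run of characters that is neither a letter nor an
    # apostrophe with a single space ("U.S." becomes "U S", "301" vanishes).
    cleaned = re.sub(r"[^A-Za-z']+", " ", text)
    # Trim stray leading/trailing apostrophes and drop tokens left empty.
    tokens = (token.strip("'") for token in cleaned.split())
    return [token for token in tokens if token]


tokens = parse_tokens("file a petition under Section 301 of the General Agreement")
```

Applied to "Korea's ban of U.S. meat products." the same function keeps "Korea's" as one token and splits "U.S." into "U" and "S", matching the slide.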

14 Overview of experiment (Preprocessing: Removing Stop Words)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U.S. MEAT GROUP FILE TRADE COMPLAINTS The American Meat Institute, AME,said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

15 Overview of experiment (Preprocessing: After Removing Stop Words)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body:. MEAT GROUP FILE TRADE COMPLAINTS American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challenging Korea's ban U.S. meat products Molpus Senate Agriculture subcommittee AME livestock farm groups intended file petition Section General Agreement Tariffs Trade EC directive effective April require meat processing plants comply fully EC standards
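
Stop-word removal amounts to a set lookup per token. The sketch below assumes a tiny hand-picked stop list and also drops single-letter tokens (as happens to "C" and most "U S" fragments on the slide); the real stop list used in the experiment is not given, so `STOP_WORDS` is purely hypothetical.

```python
# HYPOTHETICAL stop list, just large enough for the sample sentence.
STOP_WORDS = {
    "a", "an", "and", "against", "also", "it", "of", "other", "said",
    "that", "the", "to", "told", "under", "will", "with", "would",
}


def remove_stop_words(tokens):
    """Drop stop words (case-insensitive) and single-letter tokens."""
    return [t for t in tokens if len(t) > 1 and t.lower() not in STOP_WORDS]


sample = ["The", "American", "Meat", "Institute", "AME", "said", "it",
          "intended", "to", "ask", "the", "U", "S", "government"]
kept = remove_stop_words(sample)
```

For the sample tokens this keeps "American Meat Institute AME intended ask government", which agrees with the corresponding fragment of the slide.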

16 Overview of experiment (Preprocessing: Stemming)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: MEAT GROUP FILE TRADE COMPLAINT American Meat Institute AME intend ask government retaliate European Community meat inspection require. AME President Manly Molpus industry file petition challeng Korea ban meat product Molpus Senate Agriculture subcommittee AME livestock farm group intended file petition Section General Agreement Tariff Trade EC direct effect April require meat process plant compli fulli EC standard

17 Overview of experiment (Preprocessing: Transform To Vectors)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Term vector: meat 5, group 1, ... Molpus 1, ... standard 1
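
Turning the stemmed tokens into a term vector is a counting step. The sketch below uses `collections.Counter` and assumes case-insensitive counting (so "MEAT", "Meat" and "meat" fall together); whether the original system folded case this way is not stated, and `to_term_vector` is an illustrative name.

```python
from collections import Counter


def to_term_vector(stems):
    """Map a list of stemmed tokens to a term -> frequency vector."""
    # ASSUMPTION: terms are compared case-insensitively.
    return Counter(stem.lower() for stem in stems)


# First few stems of the sample document.
vector = to_term_vector(
    ["MEAT", "GROUP", "FILE", "TRADE", "COMPLAINT", "American", "Meat", "Institute"]
)
```

A `Counter` returns 0 for terms that never occurred, which is convenient when vectors are later compared term by term.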

18 Overview of experiment (Preprocessing: Create Dictionary, only in training)
Dictionary: approv 1236, chairman 1225, ... ptd 5

19 Overview of experiment (Preprocessing: Reducing)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Term vector: ... group 1, meat 5, Molpus, ... standard 1 ...

20 Overview of experiment (Preprocessing: After Reducing)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Term vector: ... group 1, meat 5, ... standard 1 ...
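
The slides show a dictionary built from the training set and a reduced vector from which "Molpus" has disappeared, but not the rule that links the two. The sketch below is one plausible reading, stated as an assumption: the dictionary counts in how many training documents each term occurs, and reduction drops terms whose count is below a hypothetical threshold `min_count`.

```python
from collections import Counter


def build_dictionary(train_vectors):
    """Count, for each term, the number of training documents containing it."""
    dictionary = Counter()
    for vector in train_vectors:
        dictionary.update(vector.keys())  # each document counts a term once
    return dictionary


def reduce_vector(vector, dictionary, min_count):
    """Keep only terms that occur in at least `min_count` training documents."""
    # ASSUMPTION: the reduction criterion is a document-count threshold.
    return {t: f for t, f in vector.items() if dictionary.get(t, 0) >= min_count}


# Hypothetical miniature training set of term-frequency vectors.
train = [
    {"meat": 5, "group": 1, "molpus": 1},
    {"meat": 2, "group": 3},
    {"group": 1, "standard": 1},
    {"standard": 2, "meat": 1},
]
dictionary = build_dictionary(train)
reduced = reduce_vector(train[0], dictionary, min_count=2)
```

In this toy data "molpus" appears in a single document and is removed, while "meat" and "group" are kept with their frequencies unchanged.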

21 Overview of experiment (Preprocessing: Normalizing)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Term vector: ... group 1, meat 5, ... standard 1 ...
$w_k = t_k \times \log(N_D / n_k)$
- $t_k$: term frequency
- $N_D$: number of documents in the collection
- $n_k$: number of documents containing term $k$
[Normalization formula: gives the normalized weight of term k from the unnormalized weight of term k]

22 Overview of experiment (Preprocessing: After Normalizing)
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Term vector: ... group 0.127, meat 0.278, ... standard 0.012 ...
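
The weighting step w_k = t_k x log(N_D / n_k) followed by normalization can be sketched as below. Two points are assumptions: the logarithm base (natural log here) and the normalization itself, taken to be division by the Euclidean length of the vector, since the normalization formula is not legible in the transcript. The function names and sample counts are illustrative.

```python
import math


def tfidf_weights(term_freqs, doc_freqs, n_docs):
    """Unnormalized weights: w_k = t_k * log(N_D / n_k)."""
    # ASSUMPTION: natural logarithm; the slides do not give the base.
    return {
        term: tf * math.log(n_docs / doc_freqs[term])
        for term, tf in term_freqs.items()
    }


def normalize(weights):
    """Scale a weight vector to unit Euclidean length (ASSUMED normalization)."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return dict(weights)
    return {term: w / length for term, w in weights.items()}


# Hypothetical counts: 1000 documents, "meat" in 10 of them, "group" in 100.
raw = tfidf_weights({"meat": 5, "group": 1}, {"meat": 10, "group": 100}, 1000)
unit = normalize(raw)
```

After this normalization the squared weights of a document sum to 1, so long and short stories become comparable under an inner-product score.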

23 Overview of experiment (Training with SDA)
1. Choose a starting query vector $Q_0$; let $k = 0$.
2. Let $Q_k$ be the query vector at the start of the $(k+1)$th iteration; identify the following set of difference vectors:
   $\Gamma(Q_k) = \{\, b = d - d' : d \succ d' \text{ and } f(Q_k, b) \le 0 \,\}$.
   If $\Gamma(Q_k) = \emptyset$, then $Q_{opt} = Q_k$ is a solution; exit. Otherwise,
3. Let $Q_{k+1} = Q_k + \sum_{b \in \Gamma(Q_k)} b$.
4. Let $k = k + 1$; go back to Step 2.
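
The four steps can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: f is taken to be the inner product, every relevant document is assumed to be preferred over every non-relevant one, and the update in step 3 is taken to add the sum of all misordered difference vectors (the usual form of this step; the slide text shows the update only partially). Names such as `steepest_descent` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(u, v))


def steepest_descent(relevant, non_relevant, q0, max_iter=150):
    """Return (query, converged) after at most `max_iter` SDA iterations."""
    # Difference vectors b = d - d' for every preferred pair (d over d').
    diffs = [
        [x - y for x, y in zip(d, d_prime)]
        for d in relevant
        for d_prime in non_relevant
    ]
    query = list(q0)
    for _ in range(max_iter):
        # Step 2: pairs the current query fails to order correctly.
        misordered = [b for b in diffs if dot(query, b) <= 0]
        if not misordered:
            return query, True
        # Step 3: add the sum of the misordered difference vectors.
        for b in misordered:
            query = [q + x for q, x in zip(query, b)]
    # Iteration budget used up; report whether the last query is a solution.
    return query, all(dot(query, b) > 0 for b in diffs)


# Hypothetical two-term example.
rel = [[1.0, 0.0], [0.8, 0.2]]
non = [[0.0, 1.0], [0.2, 0.9]]
q_opt, converged = steepest_descent(rel, non, [0.0, 0.0])
```

When the loop exits with an empty misordered set, the returned query scores every relevant document above every non-relevant one. The default of 150 iterations mirrors the limit mentioned in the training setup; keeping the best-Rnorm query when that limit is hit is not implemented here.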

24 Overview of experiment (Training)
- All of the category's examples are used as positive examples.
- A random 60% of the examples from other topics are used as negative examples.
- If the maximum Rnorm value (1) is not reached within the maximum of 150 iterations, the optimal query is set to the query that produced the highest Rnorm value.
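
The selection of training examples for one topic can be sketched as follows. The document layout (a dict with a "topics" list), the function name and the fixed seed are assumptions made for illustration; the slides only state the 60% figure.

```python
import random


def build_training_set(docs, topic, negative_fraction=0.6, seed=0):
    """Split docs into positives for `topic` and a random sample of negatives."""
    positives = [d for d in docs if topic in d["topics"]]
    others = [d for d in docs if topic not in d["topics"]]
    # A seeded generator keeps the sample reproducible (ASSUMPTION).
    rng = random.Random(seed)
    sample_size = round(negative_fraction * len(others))
    negatives = rng.sample(others, sample_size)
    return positives, negatives


# Hypothetical corpus: 2 "grain" stories and 10 stories on other topics.
corpus = [{"id": i, "topics": ["grain"]} for i in range(2)]
corpus += [{"id": 100 + i, "topics": ["earn"]} for i in range(10)]
positives, negatives = build_training_set(corpus, "grain")
```

With 10 stories from other topics, 6 are drawn as negative examples and both "grain" stories are kept as positives.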

25 Overview of experiment (Topics)
There are 135 topics. Number of positive examples (# of +) for the ten largest:

Topic      # of + (train)   # of + (test)
earn            2877             1087
acq             1650              719
moneyfx          538              179
grain            433              149
crude            389              189
trade            369              118
interest         347              131
wheat            212               71
ship             197               89
corn             182               56

26 Overview of experiment (Effectiveness measures)
- Create contingency tables
- Find breakeven points
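
One simple way to locate a breakeven point from ranked test documents: if exactly as many documents are retrieved as there are relevant ones, then a + b = a + c in the contingency table, so precision and recall coincide. The sketch below uses that cutoff. It is an assumption that this matches the procedure used in the experiment (interpolation between nearby operating points is another common choice), and ties between scores are broken arbitrarily.

```python
def breakeven_point(scores, labels):
    """Precision (= recall) when the top-R documents are retrieved.

    `scores` are retrieval values, `labels` are 1 for relevant and 0 for
    non-relevant; R is the number of relevant documents.
    """
    n_relevant = sum(labels)
    if n_relevant == 0:
        raise ValueError("no relevant documents")
    # Rank documents by decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # a = relevant documents among the top R retrieved.
    hits = sum(label for _, label in ranked[:n_relevant])
    return hits / n_relevant


# Hypothetical scores for five test documents, three of them relevant.
value = breakeven_point([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0])
```

Here two of the top three documents are relevant, so precision and recall are both 2/3 at the cutoff.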

27 Results
Breakeven points by topic and method:

Topic        Findsim   Nbayes   SDA     Bnets   Trees   SVM
earn          92.9      95.9    96.32   95.8    97.8    98.0
acq           64.7      87.8    85.26   88.3    89.7    93.6
money-fx      46.7      56.6    68.72   58.8    66.2    74.5
grain         67.5      78.8    71.81   81.4    85.0    94.6
crude         70.1      79.5    82.54   79.6    85.0    88.9
trade         65.1      63.5    65.25   69.0    72.5    75.9
interest      63.4      64.9    61.07   71.3    67.1    77.7
wheat         68.9      69.7    76.06   82.7    92.5    91.9
ship          49.2      85.4    65.17   84.4    74.2    85.6
corn          48.2      65.3    75.00   76.4    91.8    90.3
Avg. Top 10   64.6      81.5    84.54   85.0    88.4    92.0
Avg. All      61.7      75.2    76.37   80.0    N/A     87.0

28 Thank you!

