

1 DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul, Turkey (Visiting Professor at TALP Research Center, UPC)

2 OUTLINE INTRODUCTION LITERATURE SURVEY ▫Search Engines and Query Types ▫Automatic Analysis of Documents ▫Automatic Summarization OVERVIEW OF METHODOLOGY ▫System Architecture ▫Implementation ▫Data Collection STRUCTURAL PROCESSING ▫Rule-based Approach ▫Machine Learning Approach SUMMARY EXTRACTION DISCUSSION FUTURE RESEARCH


4 Introduction Rapid growth of information sources ▫World Wide Web ▫“information overload” 50% of documents viewed in search engine results ▫not relevant (Jansen and Spink, 2005) Users are interested in different types of search ▫rather than queries with commonplace answers  e.g. capital city of Sweden ▫specific and complex queries  e.g. best countries for retirement ▫tasks such as background search  e.g. literature survey on Mexican air pollution

5 Introduction (cont.) Available search engines ▫results in response to a user query ▫each presented with a short ‘summary’  2-3 line extracts  document fragments containing query words  fail to reveal their context within the whole document The users ▫scroll down the results ▫click those that seem relevant to their real information need ▫inadequate summaries  missing relevant documents  spending time with irrelevant documents  not feasible to open each link

6 Example Output of Google

7 Introduction (cont.) Automatic summarization ▫as successful as humans  long-term research direction (Sparck Jones, 1999) ▫improve effectiveness of other tasks  e.g. information retrieval Traditionally, automatic summarization research: ▫general-purpose summaries  e.g. the “abstract page” of a report  But, need to bias towards user queries  in an information retrieval paradigm ▫a document is seen as a flat sequence of sentences  ignoring the inherent structure  But, Web documents  complex organization of content  sections and subsections with different topics and formatting

8 Research Goals a novel summarization approach for Web search ▫combining these two aspects  Document structure  Query-biased techniques ▫not investigated together in previous studies Intuition ▫providing the context of searched terms ▫preserving the structure of the document  Sectional hierarchy and heading structure ▫may help the users to determine the relevancy of results better Two-stage approach ▫Structural processing ▫Summary extraction

9 Research Goals (cont.) Web documents ▫no domain restriction ▫typically heterogeneous  images, text in different formats, forms, menus, etc. ▫diverse content  with sections on different topics, advertisements, etc. Structural and semantic analysis of Web documents ▫Heading-based sectional hierarchy Use of this structural and semantic information ▫during summarization process ▫in the output summaries ▫query-biased techniques

10 Part of an Example Web Document


12 Search Engines Information retrieval (IR) ▫storage, retrieval and maintenance of information differences on the Web ▫distributed architecture ▫the heterogeneity of the available information ▫its size and growth rate, etc. Search engine ▫allows the user to enter search terms (queries)  run against a database ▫retrieves Web pages that match the search terms

13 Query Types Boolean search ▫keywords separated by (implicit or explicit) Boolean operators Phrase search ▫a set of contiguous words Proximity search Range searching Field searching Natural language search ▫Thesaurus search ▫Fuzzy search

14 Information Needs of Users Categorization (Ingwersen & Järvelin, 2005) ▫intentionality or goal of the searcher ▫the kind of knowledge currently known by the searcher ▫the quality of what is known ▫well-defined knowledge of the user  specific information sources are searched ▫in ill-defined (muddled) cases  the search process is exploratory Types of information need in Web search (White et al., 2003) ▫search for a fact ▫search for a number of items ▫decision search ▫background search

15 General Document Analysis Physical components ▫paragraphs, words, figures, etc. Logical components ▫titles, authors, sections, etc. As a syntactic analysis problem ▫physical and logical components of a document as an ordered tree Approaches ▫transformation-based learning ▫generalized n-gram model ▫probabilistic grammars ▫incremental parsing  syntactic parsing (Collins and Roark, 2004)  generating a table-of-contents for a long document (Branavan et al., 2007)

16 Web Document Analysis Web documents ▫HTML (Hypertext Markup Language)  presentation of content ▫semi-structured documents Motivations ▫to filter important content ▫to convert HTML documents into semantically-rich XML documents ▫to obtain a hierarchical structure for the documents ▫to display content on small-screen devices such as PDAs ▫more intelligent retrieval of information, summarization, etc. Approaches ▫HTML tags and DOM tree ▫rule-based or machine learning-based ▫certain domain or domain-independent

17 Web Document Analysis (cont.) Different from most previous work ▫section and subsection headings HTML ▫markup tags, attributes and attribute values Two types of HTML tags ▫container tags (e.g. <div>, <table>, <p>)  contain other HTML tags or text ▫format tags (e.g. <b>, <i>, <em>, <font>)  usually concerned with the formatting of text DOM (Document Object Model) ▫provides an interface to the document as a tree

18 Automatic Summarization Process of distilling the most important information ▫from a source (or sources) to produce a shortened version ▫for particular users and tasks Uses ▫as an aid for browsing  single large documents or sets of documents ▫in sifting process  to locate useful documents in a large collection ▫as an aid for report writers  by providing abstracts related to and influenced by ▫information retrieval ▫information extraction ▫text mining

19 Automatic Summarization (cont.) Types of summaries ▫“Extract” vs “abstract” ▫“Generic” vs “query-relevant” ▫“Single-document” vs “multi-document” ▫“Indicative” vs “informative” Phases of summarization ▫Analysis of input text ▫Transformation into a summary representation ▫Synthesis of output summary

20 Automatic Summarization (cont.) Approaches ▫Surface-level approaches  use shallow features to identify important information in the text  thematic features, location, background, cue words and phrases, etc. ▫Entity-level approaches  build an internal representation of the text  by modeling text entities and their relationships  e.g. using graph topology ▫Discourse-level approaches  global structure of the text and its relation to communicative goals ▫Hybrid approaches Evaluation ▫intrinsic  the summary itself is evaluated ▫extrinsic  i.e. task-based evaluation

21 Recent Work on Summarization Mostly generic summaries ▫based on sentence weighting Tombros & Sanderson, 1998 ▫query-biased summaries in information retrieval  e.g. Google, Altavista White et al., 2003 ▫longer query-biased summaries ▫summary window Alam et al., 2003 ▫structured and generic summaries  "table of content"-like hierarchy of sections and subsections

22 Recent Work on Summarization (cont.) Yang & Wang, 2008 ▫fractal summarization ▫hierarchical structure of document  levels, chapters, sections, subsections, paragraphs, sentences and terms ▫generic summaries Varadarajan & Hristidis, 2005 ▫adding structure  document is divided into fragments (paragraphs)  connecting related fragments as a graph (implicit structure) ▫query-biased In this research, combining ▫explicit document structure and query-biased techniques


24 System Architecture

25 Structural Processing Rule-based and machine learning-based approaches Input ▫a Web document in HTML format Output ▫a tree representing the sectional hierarchy of the document  intermediate nodes: headings and subheadings,  leaves: other text units

26 Summarization Using the output of structural processing ▫document tree indicative summaries ▫extractive approach longer summaries ▫in a separate frame

27 Implementation GATE (A General Architecture for Text Engineering) ▫open source project using component-based technology in Java ▫commonly used natural language functionalities  Tokeniser, Sentence Splitter, Stemmer, etc. Cobra Java HTML Renderer and Parser ▫open source project ▫supports HTML 4, Javascript and Cascading Style Sheets (CSS) Implemented modules ▫Structural analysis of HTML documents ▫Summarization engine

28 Data Collection

English queries:
1. Hubble telescope achievements
2. best retirement country
3. literary/journalistic plagiarism
4. Mexican air pollution
5. antibiotics bacteria disease
6. abuses of e-mail
7. declining birth rates
8. human genetic code
9. mental illness drugs
10. literacy rates
11. robotic technology
12. creativity
13. tourism, increase
14. newspapers electronic media
15. wildlife extinction
16. R&D drug prices
17. Amazon rain forest
18. Osteoporosis
19. alternative medicine
20. health and computer terminals

Turkish queries:
1. Tsunami (tsunami)
2. ekonomik kriz (economic crisis)
3. Türkiye'de meydana gelen depremler (earthquakes in Turkey)
4. sanat ödülleri (art awards)
5. bilişim eğitimi ve projeleri (IT education and projects)

Users ▫mostly Boolean queries with 2-3 words Current search interests ▫various domains Collections ▫English Collection ▫Turkish Collection ▫Extended English Collection


30 The Method A heuristic approach based on DOM processing ▫heading-based sectional hierarchy Identification is a nontrivial task ▫heterogeneity of Web documents ▫the underlying HTML format Three steps ▫DOM tree processing ▫Heading identification ▫Hierarchy restructuring

31 Step 1: DOM Tree Processing Semantically related parts ▫same or neighboring container tags Traverse the DOM tree in a breadth-first way ▫Sentence boundaries ▫Format tags (e.g. <b>) are passed as features ▫Output: a simplified version of the original tree
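The simplification step can be sketched roughly as follows. This is a hypothetical minimal model of a DOM node, not the actual GATE/Cobra implementation: format tags are folded into features of the text they wrap, while container tags keep the tree structure.

```python
from collections import deque

class Node:
    """Hypothetical minimal DOM node: tag name, text, children."""
    def __init__(self, tag, text="", children=None):
        self.tag, self.text = tag, text
        self.children = children or []
        self.features = []

FORMAT_TAGS = {"b", "strong", "em", "i", "u", "font", "a"}

def simplify(root):
    """Breadth-first pass: format tags disappear from the structure but
    are kept as features of the text they contained."""
    out = Node(root.tag)
    queue = deque([(root, out, [])])
    while queue:
        node, target, feats = queue.popleft()
        for child in node.children:
            if child.tag in FORMAT_TAGS:
                if child.text:
                    # the text survives as a leaf carrying the format feature
                    leaf = Node("text", child.text)
                    leaf.features = feats + [child.tag]
                    target.children.append(leaf)
                queue.append((child, target, feats + [child.tag]))
            else:
                new = Node(child.tag, child.text)
                new.features = list(feats)
                target.children.append(new)
                queue.append((child, new, []))
    return out
```

For example, `<div><b>Overview</b><p>Some content.</p></div>` becomes a `div` node with a bold-flagged text leaf and a `p` child.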

32 DOM Tree of an Example Document

33 Example Output of DOM Tree Processing

34 Step 2: Heading Identification Heading tags in HTML ▫<h1> through <h6> ▫rarely used for this purpose Headings ▫formed by formatting them differently from the surrounding text ▫more emphasized than the following content Heuristics ▫if-then rules
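A rule of this kind might look like the sketch below. These are hypothetical if-then rules in the spirit of the slide, not the actual heuristics of the thesis; all feature names are illustrative.

```python
def is_heading(unit):
    """unit: dict of formatting features for one text unit."""
    # Rule 1: explicit <h1>..<h6> tags are headings (rarely used in practice).
    if unit.get("tag") in {"h1", "h2", "h3", "h4", "h5", "h6"}:
        return True
    # Rule 2: text emphasized more than the following content
    # (bold, all-uppercase, or a larger font) that is short and has no
    # sentence-final period is taken as a heading.
    emphasized = unit.get("bold") or unit.get("all_upper")
    larger = unit.get("font_size", 0) > unit.get("following_font_size", 0)
    short = unit.get("n_chars", 0) < 80 and not unit.get("ends_with_period")
    return bool((emphasized or larger) and short)
```

A long paragraph ending in a period is rejected even if lightly formatted, while a short bold line before body text is accepted.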

35 Features for Identifying Text Format

Feature       Description                                     Data Type
h1            level-1 heading                                 Boolean
h2            level-2 heading                                 Boolean
h3            level-3 heading                                 Boolean
h4            level-4 heading                                 Boolean
h5            level-5 heading                                 Boolean
h6            level-6 heading                                 Boolean
b             bold                                            Boolean
strong        strong emphasis                                 Boolean
em            emphasis                                        Boolean
a             hyperlink                                       Boolean
u             underlined                                      Boolean
i             italic                                          Boolean
f_size        font size                                       Integer
f_color       font color                                      String
f_face        font face                                       String
allUpperCase  all the letters of the words are in uppercase   Boolean
cssId         CSS id attribute if used                        String
cssClass      CSS class attribute if used                     String
alignment     align attribute                                 String
li            different levels of list elements               Integer

36 Step 3: Hierarchy Restructuring Headings + feature set ▫to differentiate different levels of headings Restructure the document tree ▫bottom-up approach
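Once heading levels are known, the restructuring can be approximated with a simple stack-based pass over the unit sequence. This is only a sketch (the thesis describes a bottom-up procedure, and levels come from the feature set); the node dictionaries are hypothetical.

```python
def build_hierarchy(units):
    """units: list of ("heading", level, text) or ("text", None, text).
    Returns a nested tree: headings become intermediate nodes, other
    text units become leaves under the nearest preceding heading."""
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]  # open sections, outermost first
    for kind, level, text in units:
        if kind == "heading":
            # close sections at the same or a deeper level
            while stack[-1]["level"] >= level:
                stack.pop()
            node = {"title": text, "level": level, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["children"].append(
                {"title": text, "level": None, "children": []})
    return root
```

Feeding it a level-1 heading, a paragraph, a level-2 heading with its paragraph, then another level-1 heading yields the expected two-section tree.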

37 Step 3: Hierarchy Restructuring (cont.)

38 Performance Measures

Heading extraction:
                                Golden Standard
                                Heading    Non-heading
Proposed Method   Heading         TP          FP
                  Non-heading     FN          TN

Hierarchy extraction ▫parent-child relationships in the document tree  heading-subheading  heading-underlying text
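The derived measures follow directly from the confusion-matrix counts; a minimal helper:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For instance, 90 true positives with 10 false positives and 30 false negatives give precision 0.90, recall 0.75 and an F-measure of about 0.82.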

39 English Collection

Heading extraction:
Document   Actual    Proposed Sys.   Proposed Sys.   Proposed Sys.   Baseline
Set        Number    Recall          Precision       F-measure       Recall
1          6.50      0.94            0.60            0.69            0.51
2          11.30     0.80            0.65            0.67            0.34
3          8.20      0.91            0.56            0.66            0.68
4          3.60      0.89            0.64            0.73            0.38
5          9.30      0.89            0.58            0.66            0.57
6          18.10     0.82            0.70            0.73            0.39
7          5.40      0.84            0.59            0.67            0.27
8          6.90      0.98            0.57            0.68            0.56
9          12.70     0.93            0.76            0.82            0.38
10         6.20      0.84            0.75            0.77            0.24
Average    8.82      0.88            0.64            0.71            0.43

Baseline uses only heading tags <h1> through <h6>. Heading recall is high; precision is lower, due to the cluttered organization of Web documents.

40 English Collection (cont.)

Tree depths:
Document   DOM     Proposed Sys.   Baseline    Actual
Set        Tree    Hierarchy       Hierarchy   Hierarchy
1          15.80   5.50            3.40        3.70
2          20.80   8.20            3.10        4.20
3          12.10   7.30            3.90        4.10
4          13.90   4.90            3.40        3.90
5          13.20   6.10            3.70        4.00
6          13.00   7.00            3.60        4.40
7          19.-    -               -           -
8          12.80   6.10            3.70        4.20
9          17.50   7.10            3.30        4.00
10         13.80   7.00            2.90        4.80
Average    15.21   6.54            3.41        4.11

Hierarchy accuracy:
Document   Baseline        Proposed
Set        (only h tags)   System
1          0.57            0.58
2          0.52            0.81
3          0.64            0.74
4          0.40            0.66
5          0.51            0.66
6          0.40            0.65
7          0.54            0.74
8          0.55            0.69
9          0.48            0.77
10         0.36            0.78
Average    0.50            0.71

Hierarchy extraction: a significant improvement in accuracy compared to the baseline.

41 Turkish Collection

Heading extraction:
Document   Number of   Recall   Precision   F-measure
Set        Headings
1          7.60        0.81     0.56        0.64
2          5.40        0.67     0.63        0.61
3          5.10        0.84     0.49        0.66
4          4.90        0.89     0.54        0.68
5          9.20        0.89     0.68        0.73
Average    5.40        0.79     0.57        0.65

Hierarchy extraction:
Document   DOM Tree   Hierarchy   Hierarchy
Set        Depth      Depth       Accuracy
3          20.4       7.5         0.78
4          18.8       5.6         0.80
5          19.2       5.1         0.81
Average    17.2       6.1         0.70

Baseline method failed ▫no <h> tags used in the documents Additional analysis ▫50 documents on domain ▫71% accuracy


43 The Approach Machine learning ▫can be more flexible ▫by combining several features using a training corpus  rather than predefined rules Extraction of the sectional hierarchy of a Web document ▫a tree-based learning approach is needed  as in syntactic parsing ▫exponential search space Incremental algorithm ▫making a sequence of locally optimal choices ▫to approximate a globally optimal solution Document ▫as a sequence of text units

44 Example HTML document

45 Heading Extraction Model Binary classification ▫As a sequence of text units ▫Headings: positive examples ▫Non-headings: negative examples

46 Hierarchy Extraction Model Learn a mapping from X (a set of documents) to Y (a set of possible sectional hierarchies of documents) ▫Training examples (x_i, y_i) for i = 1…n ▫A function GEN(x) enumerating a set of possible outputs for an input x ▫A representation Φ mapping each (x_i, y_i) to a feature vector Φ(x_i, y_i) ▫A parameter vector α ▫Estimate α such that it gives the highest scores to correct outputs: F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · α
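The ingredients above form a standard linear structured model. A minimal sketch with sparse feature dictionaries (the perceptron-style update shown here is one common way to estimate α; it stands in for, not reproduces, the training used in the thesis):

```python
def score(alpha, phi):
    """Linear score alpha . Phi(x, y); phi is a sparse feature dict."""
    return sum(alpha.get(f, 0.0) * v for f, v in phi.items())

def predict(alpha, candidates):
    """candidates: (output, phi) pairs enumerated by GEN(x);
    return the highest-scoring output."""
    return max(candidates, key=lambda c: score(alpha, c[1]))[0]

def perceptron_update(alpha, phi_gold, phi_pred, lr=1.0):
    """Structured-perceptron step: promote gold features,
    demote the features of the wrongly predicted output."""
    for f, v in phi_gold.items():
        alpha[f] = alpha.get(f, 0.0) + lr * v
    for f, v in phi_pred.items():
        alpha[f] = alpha.get(f, 0.0) - lr * v
    return alpha
```

After an update, features seen in the correct hierarchy gain weight and features of the mistaken one lose it, so the correct output scores higher next time.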

47 Features Unit features ▫Formatting features  e.g. font size, boldness, color, etc. ▫DOM tree features  e.g. DOM address, DOM path, etc. ▫Content features  e.g. cue words/phrases, number of characters, punctuation marks, etc. ▫Other features  visual position in the rendered Web document Contextual features ▫composite features of two units in context  distance and difference between features  u_ij: the unit i levels above a unit u, and j units to its left Global features ▫e.g. the depth of the sectional hierarchy

48 Incremental Learning Approach Document graph ▫left to right based on the order of appearance ▫Positive and negative examples  parent-child relationships (based on the golden standard hierarchy) ▫Two constraints  Document order  Projectivity rule  "When searching for the parent of a unit u_j, consider only the previous unit (u_j-1), the parent of u_j-1, that unit's parent, and so on to the root of the tree."
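The projectivity rule restricts candidate parents to the chain from the previous unit up to the root; a small illustration (the `parent_of` mapping is a hypothetical representation of the partial tree):

```python
def candidate_parents(parent_of, prev_unit):
    """Projectivity rule: candidate parents of the current unit are the
    previous unit, its parent, that parent's parent, ... up to the root.
    parent_of maps each unit to its parent (the root maps to None)."""
    chain = []
    node = prev_unit
    while node is not None:
        chain.append(node)
        node = parent_of.get(node)
    return chain
```

This keeps the search space linear in tree depth instead of considering every earlier unit.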

49 Incremental Learning Approach (cont.) Training set ▫Web documents and corresponding golden standard hierarchies Algorithm ▫works on the units sequentially

50 Testing Approach Beam search ▫Set of partial trees ▫Beam width ▫Two operations  ADV (i.e. Advance)  potential attachments of current unit to partial trees  FILTER  to prevent exponential growth of the set
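The ADV/FILTER loop can be sketched as follows. This is a hypothetical simplification: a partial tree is reduced to a list of attachment decisions plus an accumulated score, and the toy `attach_options` stands in for the model's real scoring of candidate attachments.

```python
def beam_search(units, attach_options, beam_width=10):
    """Process units left to right, keeping at most beam_width
    partial trees at each step."""
    beam = [([], 0.0)]
    for u in units:
        # ADV: extend each partial tree with every allowed attachment
        expanded = []
        for decisions, s in beam:
            for parent, delta in attach_options(u, decisions):
                expanded.append((decisions + [(u, parent)], s + delta))
        # FILTER: keep only the top beam_width trees
        expanded.sort(key=lambda t: t[1], reverse=True)
        beam = expanded[:beam_width]
    return beam[0]

def attach_options(unit, decisions):
    """Toy scorer: attaching to the root is always preferred."""
    return [("root", 1.0), ("prev", 0.5)]
```

With the toy scorer, the best hypothesis attaches every unit to the root.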

51 Testing Approach (cont.) Variations ▫M1  probability value ▫M2  run the algorithm in two levels ▫M3  integer ranks  the number of times a tree obtains rank '1' ▫M4  integer ranks  sum of the ranks obtained at each step

52 Testing Approach (cont.) Implementation ▫Support Vector Machines  SVM-light (Joachims, 1999) ▫Perceptron (figure: process a unit, then update α)

53 Evaluation 5-fold cross-validation

Statistics for Extended English Collection:
Number of documents         500
Avg. number of text units   110.7
Avg. hierarchy depth        4.1
Avg. number of headings     10.6

Heading extraction feature sets:
Feature Set   Features                                        Number of Features
Φ1            F_n, F_n(n+1)                                   58
Φ2            F_n, F_n(n+1), F_n(n-1)                         86
Φ3            F_n, F_n(n+1), F_n(n+2)                         82
Φ4            F_n, F_n(n+1), F_n(n+2), F_n(n-1)               110
Φ5            F_n, F_n(n+1), F_n(n+2), F_n(n-1), F_n(n-2)     134

Heading extraction results:
Method             Feature Set   Recall   Precision   F-measure
SVM – Linear       Φ1            0.85     0.78        0.81
                   Φ2            0.83     0.78        0.80
                   Φ3            0.81     0.77        0.79
                   Φ4            0.83     0.78        0.80
                   Φ5            0.83     0.78        0.80
SVM – Polynomial   Φ1            0.87     0.80        0.83
                   Φ2            0.85     0.80        0.82
                   Φ3            0.87     0.82        0.84
                   Φ4            0.85     0.80        0.82
                   Φ5            0.87     0.84        0.85
SVM – RBF          Φ1            0.84     0.76        0.80
                   Φ2            0.84     0.79        0.81
                   Φ3            0.87     0.81        0.84
                   Φ4            0.88     0.83        0.85
                   Φ5            0.87     0.83        0.85
Perceptron         Φ1            0.71     0.77        0.74
                   Φ2            0.70     0.78        0.74
                   Φ3            0.71     0.84        0.77
                   Φ4            0.78     0.82        0.80
                   Φ5            0.77     0.81        0.79

54 Evaluation (cont.) Comparing with related work ▫Xue et al., 2007  extraction of the main title (i.e. a single heading) from HTML documents  SVM, CRF  a maximum f-measure of 0.80 This work addresses a more general and challenging problem ▫extraction of all the headings in a given HTML document ▫obtained an f-measure of 0.85

Method                Recall   Precision   F-measure
SVM                   0.87     0.84        0.85
Perceptron            0.78     0.82        0.80
Rule-based Approach   0.72     0.64        0.68

55 Evaluation (cont.) Hierarchy extraction

Feature sets:
Feature Set   Features                 Number of Features
Φ1            F_10                     17
Φ2            F_10, F_01               40
Φ3            F_10, F_01, F_20         57
Φ4            F_10, F_01, F_20, F_02   73

Accuracy by feature set:
Learning Algorithm   Φ1     Φ2     Φ3     Φ4
SVM – Linear         0.42   0.61   -      -
SVM – Polynomial     0.57   0.63   0.65   -
SVM – RBF            0.58   0.66   0.67   -
Perceptron           0.51   0.46   -      -

Accuracy by beam width:
Learning Algorithm   1      10     20     50     100
SVM – Polynomial     0.64   0.65   -      -      -
SVM – RBF            0.66   -      0.67   -      -

56 Evaluation (cont.) Error analysis ▫heading extraction  false negatives  false positives ▫heuristic-based incremental approach ▫cluttered Web documents with complex layouts ▫errors made by Web document authors Acceptable results for a fully automatic approach

Hierarchy accuracy with extracted vs manual headings:
Method                Model 1 headings   Manual headings
Rule-based Approach   0.61               0.81
Perceptron            0.51               0.82
SVM                   0.68               0.79

Accuracy by testing-method variation:
Learning Algorithm   M0     M1     M2     M3     M4
SVM – Polynomial     0.65   0.67   0.59   0.64   0.68
SVM – RBF            0.67   0.67   0.59   0.67   0.66


58 Summarization Method Structural information ▫to determine important sentences and sections ▫preserved in the output summaries Two levels of scoring ▫Sentence scoring  to determine important sentences  adapted to utilize the output of structural processing  Heading method  Location method  Term frequency method  Query method ▫Section scoring  to determine important sections  sum of the scores of the sentences in that section

s_sentence = s_heading × w_heading + s_location × w_location + s_tf × w_tf + s_query × w_query
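The two scoring levels map directly onto a pair of small functions; the component scores themselves (heading, location, term frequency, query overlap) are assumed to be computed elsewhere:

```python
def sentence_score(s, w):
    """s_sentence = s_heading*w_heading + s_location*w_location
                    + s_tf*w_tf + s_query*w_query
    s and w: dicts of component scores and their weights."""
    return sum(s[k] * w[k] for k in ("heading", "location", "tf", "query"))

def section_score(sentence_scores):
    """A section's score is the sum of its sentences' scores."""
    return sum(sentence_scores)
```

With the weights from the example slide (w_heading = w_location = w_tf = 1, w_query = 3), a sentence with a query-method score of 2 contributes 6 from the query component alone, so query overlap dominates the ranking.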

59 Unstructured vs Structured Document

60 Example Sentence Score Calculation Query: antibiotics bacteria disease Sentence: "These are the bacteria that are usually involved with bacterial disease such as ulcers, fin rot, acute septicaemia and bacterial gill disease." Weights: w_heading = w_location = w_tf = 1 and w_query = 3

61 Summarization Experiment

62 Task-based evaluation ▫information retrieval tasks  according to usefulness in a search engine ▫queries and documents used in structural processing experiments Four types of summaries ▫Google – Query-biased extracts provided by Google ▫Unstructured – Query-biased summaries without use of structural information ▫Structured1 – Structure-preserving and query-biased summaries  using output of structural processing step ▫Structured2 – Structure-preserving and query-biased summaries  using manually identified structure The summaries are about the same size ▫except Google ▫to make them comparable

63 Example TREC Query

64 Example Summary of Proposed System for the query “Antibiotics Bacteria Disease”

65 Experimental Methodology Within-subjects (i.e. repeated measures) design ▫to minimize the effects of differences among subjects ▫summary type and documents were presented in a random order  to reduce carryover effects ▫original full-text document is not displayed  until all the summaries for that document are displayed ▫4-10 subjects Using a web-based interface ▫Decision times of users recorded automatically User poll ▫Helpfulness of summaries ▫Likert scale (1: not helpful, 5: very helpful)

66 Performance Measures Relevance prediction (Hobson et al., 2007) ▫compare the subject's judgment on a summary with his or her own judgment on the original full-text document ▫more suitable for a real-world scenario

                                Original document judgment
                                Relevant    Irrelevant
Summary judgment   Relevant       TP           FP
                   Irrelevant     FN           TN

67 Experiment Results English Collection

System         TP    FP   FN   TN    A      P      R      F
Google         107   38   60   95    0.67   0.73   0.62   0.63
Unstructured   131   28   36   105   0.79   0.82   0.76   0.77
Structured1    137   25   30   108   0.82   0.85   0.80   -
Structured2    138   23   29   110   0.83   0.85   0.83   0.82

System         FNR    FPR
Google         0.36   0.29
Unstructured   0.22   0.21
Structured1    0.18   0.19
Structured2    0.17   -

Improvement of proposed system over other methods:
System         A         P         R         F         FNR       FPR
Google         +22.39%   +16.44%   +29.03%   +26.98%   -50%      -34.48%
Unstructured   +3.80%    +3.66%    +5.26%    +3.90%    -18.18%   -9.52%

System         Time (seconds)   Size (words)
Google         14.58            41
Unstructured   27.24            278
Structured1    27.60            264
Structured2    28.58            253
Original       41.43            1566

Repeated measures ANOVA: p < 0.001 for f-measure

68 Experiment Results (cont.) Turkish Collection

System         TP   FP   FN   TN   A      P      R      F
Google         45   20   10   75   0.80   0.69   0.82   0.75
Unstructured   43   13   12   82   0.83   0.77   0.78   0.77
Structured1    49   8    6    87   0.91   0.86   0.89   0.88
Structured2    47   10   8    85   0.88   0.82   0.85   0.84

System         FNR    FPR
Google         0.18   0.21
Unstructured   0.22   0.14
Structured1    0.11   0.08
Structured2    0.15   0.11

Improvement of proposed system over other methods:
System         A         P         R         F         FNR       FPR
Google         +13.75%   +24.64%   +8.54%    +17.33%   -38.89%   -61.90%
Unstructured   +9.64%    +11.69%   +14.10%   +14.29%   -50%      -42.86%

System         Time (seconds)   Size (words)
Google         11.04            30
Unstructured   19.96            216
Structured1    19.96            230
Structured2    19.71            235
Original       24.53            900

Repeated measures ANOVA: p < 0.05 for f-measure

69 Experiment Results (cont.) Extended English Collection

System          TP    FP   FN    TN    A      P      R      F
Google          118   36   120   126   0.57   0.72   0.47   0.52
Unstructured1   179   54   59    108   0.72   0.77   0.75   0.73
Unstructured2   176   53   62    109   0.72   0.77   0.73   0.72
Structured1     185   50   53    112   0.74   0.78   0.77   0.76
Structured2     183   40   55    122   0.75   0.82   0.76   0.77

System          FNR    FPR
Google          0.50   0.23
Unstructured1   0.23   0.32
Unstructured2   0.24   0.30
Structured1     0.20   0.30
Structured2     0.22   0.24

Improvement of proposed system over other methods:
System          A         P        R         F         FNR       FPR
Google          +30.68%   +9.66%   +63.88%   +44.97%   -59.65%   +29.80%
Unstructured1   +3.60%    +1.31%   +2.98%    +3.35%    -9.90%    -4.91%
Unstructured2   +3.14%    +1.79%   +5.42%    +4.90%    -16.31%   -0.30%

System          Time (seconds)   Size (words)   Rating
Google          10.20            30             2.60
Unstructured1   17.70            298            2.77
Unstructured2   18.44            306            2.77
Structured1     17.51            277            3.03
Structured2     17.02            274            3.12
Original        23.59            1340           3.10

Repeated measures ANOVA: p < 0.05 for f-measure


71 Discussion Longer summaries ▫significant performance improvement ▫compared to Google Structured summaries ▫increased performance ▫compared to unstructured summaries ▫by providing an overview of the document Summary size ▫15-25% of the document on the average ▫75-90% correct relevance judgments Proposed system summaries (Structured1) ▫a fully automatic approach ▫can be incorporated into a search engine

72 Discussion (cont.) Summaries are 6-9 times longer than Google extracts ▫with less than a two-times increase in response times Tradeoff ▫to balance the time spent and the accuracy ▫Time Overhead = Number of Results Viewed × T_summary + FP × (T_page_load + T_document) Commonplace queries ▫answered by viewing a few of the top results Complex queries and background search ▫the accuracy becomes more important ▫Proposed system  reduced number of missed items (false negative rates)  users usually spend less time viewing irrelevant results (false positive rates)
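The tradeoff formula is easy to make concrete; the example values below are hypothetical, loosely inspired by the reported averages (summary reading around 17.5 s, full document around 23.6 s), and the page-load time is an assumption:

```python
def time_overhead(n_results_viewed, t_summary, false_positives,
                  t_page_load, t_document):
    """Time Overhead = Number of Results Viewed * T_summary
                       + FP * (T_page_load + T_document)
    Longer summaries raise T_summary but lower FP, so the two terms
    pull in opposite directions."""
    return (n_results_viewed * t_summary
            + false_positives * (t_page_load + t_document))
```

Viewing 10 results at 17.5 s each with 2 false positives (3 s load + 23.6 s reading per wrongly opened document) costs about 228 seconds; reducing false positives directly shrinks the second term.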

73 Discussion (cont.) High user ratings Analysis of time complexity ▫Structural processing stage  performed once beforehand, similar to the indexing phase of search engines ▫Summary extraction stage  linear time complexity


75 Future Research Related to the research goals ▫Automatic analysis of domain-independent Web documents  to obtain a hierarchy of sections and subsections together with the headings  rule-based approach  machine learning approaches ▫A novel summarization approach  based on document structure and query-biased techniques

76 Future Research (cont.) Extending structural processing ▫Identify some document components  e.g. menus, references and advertisements  using machine learning techniques Summarization engine ▫linguistic and semantic processing  expanding the queries using WordNet  ontology-driven search (e.g. Cyc ontology) ▫more sophisticated query-biased methods ▫different types of search tasks  e.g. searching for a particular fact or searching for background information about a subject ▫different document types (i.e. genre) and formats (e.g. XML) ▫automatic evaluation

77 Future Research (cont.) Search engine integration ▫Automatic display of hierarchical summaries  summary of each search result in a separate window  indexing mechanism  development of a user interface Adapting to other languages (e.g. Spanish) ▫using NLP resources of different languages ▫generating new knowledge sources for these languages  e.g. semantic knowledge base, ontology


79 Alam, H., A. Kumar, M. Nakamura, A. F. R. Rahman, Y. Tarnikova and C. Wilcox, “Structured and Unstructured Document Summarization: Design of a Commercial Summarizer Using Lexical Chains”, Proceedings of the Seventh International Conference on Document Analysis and Recognition, pp. 1147-1150, 2003. Branavan, S. R. K., P. Deshpande and R. Barzilay, “Generating a Table-of-Contents”, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 2007. Collins, M. and B. Roark, “Incremental Parsing with the Perceptron Algorithm”, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 2004. Hobson, S. P., B. J. Dorr, C. Monz and R. Schwartz, “Task-Based Evaluation of Text Summarization Using Relevance Prediction”, Information Processing and Management, Vol. 43, No. 6, pp.1482-1499, 2007. Ingwersen, P. and K. Järvelin, The Turn: Integration of Information Seeking and Retrieval in Context, Springer, Dordrecht, 2005. Jansen, B. J. and A. Spink, “An Analysis of Web Searching by European Users”, Information Processing and Management, Vol. 41, No. 2, pp. 361-381, 2005. Joachims, T.,“Making Large-Scale SVM Learning Practical”, in B. Schölkopf, C. Burges and A. Smola (eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999. Pembe, F. C. and T. Güngör, “A Tree Learning Approach to Web Document Sectional Hierarchy Extraction”, 2nd International Conference on Agents and Artificial Intelligence (ICAART 2010), Valencia, January 2010. References

80 References (cont.) Pembe, F. C. and T. Güngör, “Structure-Preserving and Query-Biased Document Summarization for Web Search”, Online Information Review, Vol. 33, No. 4, pp. 696-719, 2009. Sparck Jones, K., “Automatic Summarizing: Factors and Directions”, in I. Mani and M. T. Maybury (eds.), Advances in Automatic Text Summarization, pp. 1-12, MIT Press, Cambridge, 1999. Tombros, A. and M. Sanderson, “Advantages of Query Biased Summaries in Information Retrieval”, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, pp. 2-10, 1998. Varadarajan, R. and V. Hristidis, “Structure-Based Query-Specific Document Summarization”, Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 2005. White, R. W., J. M. Jose and I. Ruthven, “A Task-oriented Study on the Influencing Effects of Query-biased Summarization in Web Searching”, Information Processing and Management, Vol. 39, No. 5, pp. 707-733, 2003. Xue, Y., Y. Hu, G. Xin, R. Song, S. Shi, Y. Cao, C. Y. Lin and H. Li, “Web Page Title Extraction and Its Application”, Information Processing and Management, Vol. 43, No. 5, pp. 1332-1347, 2007. Yang, C. C. and F. L. Wang, “Hierarchical Summarization of Large Documents”, Journal of the American Society for Information Science and Technology, Vol. 59, No. 6, pp. 887-902, 2008.

81 Thank you
