
1 Introducing Natural Language Program Analysis Lori Pollock, K. Vijay-Shanker, David Shepherd, Emily Hill, Zachary P. Fry, Kishen Maloor

2 NLPA Research Team Leaders: Lori Pollock ("Team Captain") and K. Vijay-Shanker ("The Umpire"), University of Delaware

3 Problem: Modern software is large and complex (object-oriented, class hierarchies). Software development tools are needed.

4 Successes in Software Development Tools: good with local tasks; good with traditional program structure (object-oriented class hierarchy)

5 Issues in Software Development Tools: scattered tasks are difficult; programmers use more than traditional program structure (object-oriented class hierarchy)

6 Observations in Software Development Tools
[Figure: an object-oriented system where user actions ("activate tool", "save drawing", "update drawing", "undo action") map to code such as public interface Storable{... and public void Circle.save() //Store the fields in a file...]
Key Insight: Programmers leave natural language clues that can benefit software development tools

7 Studies on Choosing Identifiers
Impact of human cognition on names [Liblit et al. PPIG 06]: metaphors, morphology, scope, part-of-speech hints; hints for understanding code
Analysis of function identifiers [Caprile and Tonella WCRE 99]: lexical, syntactic, semantic; use for software tools: metrics, traceability, program understanding
Carla, the compiler writer: "I don't care about names. So, I could use x, y, z." Pete, the programmer: "But, no one will understand my code."

8 Our Research Path
Motivated usefulness of exploiting natural language (NL) clues in tools [MACS 05, LATE 05]
Developed extraction process and an NL-based program representation [AOSD 06]
Created and evaluated a concern location tool and an aspect miner with NL-based analysis [ASE 05, AOSD 07, PASTE 07]

9 [Photo] Name: David C. Shepherd. Nickname: Leadoff Hitter. Current Position: PhD, May 30, 2007. Future Position: Postdoc, Gail Murphy.
Stats: Year | coffees/day | red marks/paper draft
2002 | 0.1 | 500
2007 | 2.2 | 100

10 Applying NL Clues for Aspect Mining
Aspect-oriented programming; aspect mining task: locate refactoring candidates
Molly, the Maintainer: "How can I fix Paul's atrocious code?"

11 Timna: An Aspect Mining Framework [ASE 05]
Uses program analysis clues for mining; combines clues using machine learning
Evaluated vs. Fan-in on Precision (quality) and Recall (completeness):
Fan-In: P 37, R 2 | Timna: P 62, R 60

12 Integrating NL Clues into Timna: iTimna (Timna with NL)
Integrates natural language clues; example: opposite verbs (open and close)
Fan-In: P 37, R 2 | Timna: P 62, R 60 | iTimna: P 81, R 73
Natural language information increases the effectiveness of Timna [Come back Thurs 10:05am]

13 Applying NL Clues for Concern Location: Motivation
60-90% of software costs are spent on reading and navigating code for maintenance (fixing bugs, adding features, etc.)*
*[Erlikh] Leveraging Legacy System Dollars for E-Business

14 Key Challenge: Concern Location Find, collect, and understand all source code related to a particular concept Concerns are often crosscutting

15 State of the Art for Concern Location
Mining dynamic information [Wilde ICSM 00]
Program structure navigation [Robillard FSE 05, FEAT, Schaefer ICSM 05]
Search-based approaches: RegEx [grep, Aspect Mining Tool 00], LSA-based [Marcus 04], word-frequency based [GES 06]
(Trade-offs noted on the slide: reduced to a similar problem, slow, fast, fragile, sensitive, no semantics)

16 Limitations of Search Techniques
1. Return large result sets
2. Return irrelevant results
3. Return hard-to-interpret result sets

17 The Find-Concept Approach
[Figure: the user's concept is expressed as a concrete query; Find-Concept uses natural language information and an NL-based code representation of the source code (methods a-e) to produce recommendations as a result graph]
1. More effective search 2. Improved search terms 3. Understandable results

18 Underlying Program Analysis
Action-Oriented Identifier Graph (AOIG) [AOSD 06]: provides access to NL information; provides an interface between NL and traditional program analysis
Word Recommendation Algorithm
NL-based: stemmed/rooted (complete, completing); synonyms (finish, complete)
Combining NL and traditional: co-location (completeWord())
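To make the word recommendation idea concrete, here is a minimal Java sketch (not the authors' algorithm; the crude stemmer, hand-made synonym table, and identifier word lists are simplified assumptions) that expands a query word with synonyms of its root form and with words co-located in the same identifiers:

import java.util.*;

// Hypothetical sketch: expand a query word using (1) a crude stemmer,
// (2) a hand-made synonym table, and (3) co-location with other words
// appearing inside the same identifier.
class WordRecommender {
    private final Map<String, Set<String>> synonyms = new HashMap<>();
    private final List<List<String>> identifierWords = new ArrayList<>();

    void addSynonym(String a, String b) {
        synonyms.computeIfAbsent(stem(a), k -> new HashSet<>()).add(b);
        synonyms.computeIfAbsent(stem(b), k -> new HashSet<>()).add(a);
    }

    // Record the words of one identifier, e.g. completeWord() -> [complete, word]
    void addIdentifier(List<String> words) { identifierWords.add(words); }

    // Very crude stemming: strip one common suffix (assumption, not a real stemmer)
    static String stem(String w) {
        w = w.toLowerCase();
        for (String suf : new String[]{"ing", "ed", "e", "s"})
            if (w.endsWith(suf) && w.length() > suf.length() + 2)
                return w.substring(0, w.length() - suf.length());
        return w;
    }

    Set<String> recommend(String query) {
        String root = stem(query);
        Set<String> out = new TreeSet<>();
        out.addAll(synonyms.getOrDefault(root, Collections.emptySet()));  // NL-based: synonyms
        for (List<String> words : identifierWords)                        // traditional: co-location
            if (words.stream().anyMatch(w -> stem(w).equals(root)))
                out.addAll(words);
        out.remove(query);
        return out;
    }
}

For example, after addSynonym("complete", "finish") and addIdentifier(List.of("complete", "word")), a query for "completing" would suggest "finish", "complete", and "word".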

19 Experimental Evaluation
Research questions: Which search tool is most effective at forming and executing a query for concern location? Which search tool requires the least human effort to form an effective query?
Methodology: 18 developers complete nine concern location tasks on medium-sized (>20 KLOC) programs using Find-Concept, GES, and ELex
Measures: Precision (quality), Recall (completeness), F-Measure (combination of both P & R)

20 Overall Results (across all tasks)
Effectiveness: FC > ELex with statistical significance; FC >= GES on 7/9 tasks; FC is more consistent than GES
Effort: FC = ELex = GES
FC is more consistent and more effective in the experimental study without requiring more effort

21 Natural Language Extraction from Source Code
Key challenges: decode name usage; develop an automatic extraction process; create an NL-based program representation
Molly, the Maintainer: "What was Pete thinking when he wrote this code?"

22 Natural Language: Which Clues to Use?
Software maintenance (driven by maintenance requests) is typically focused on actions; objects are well-modularized

23 Natural Language: Which Clues to Use?
Software maintenance is typically focused on actions; objects are well-modularized
Focus on actions: actions correspond to verbs; verbs need a Direct Object (DO) → extract verb-DO pairs

24 Extracting Verb-DO Pairs
Two types of extraction: extraction from comments, and extraction from method signatures

class Player{
  /**
   * Play a specified file with specified time interval
   */
  public static boolean play(final File file, final float fPosition, final long length) {
    fCurrent = file;
    try {
      playerImpl = null;
      //make sure to stop non-fading players
      stop(false);
      //Choose the player
      Class cPlayer = file.getTrack().getType().getPlayerImpl();
      …
  }

25 Extracting Clues from Signatures
1. POS tag the method name
2. Chunk the method name
3. Identify the Verb and Direct Object (DO)
Example: getUserListFromFile → POS tag and chunk: get | User List | From File

public UserList getUserListFromFile( String path ) throws IOException {
  try {
    File tmpFile = new File( path );
    return parseFile(tmpFile);
  } catch( java.io.IOException e ) {
    throw new IOException( "UserList format issue " + path + " file " + e );
  }
}
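A minimal Java sketch of the three steps above, under simplifying assumptions: camel-case splitting stands in for tokenization, and a tiny hard-coded verb list stands in for a real part-of-speech tagger.

import java.util.*;

// Hypothetical illustration of: (1) split the method name into words,
// (2) tag each word, (3) pick the verb and the direct object (DO).
class SignatureClueExtractor {
    // Toy verb list standing in for a POS tagger (assumption)
    static final Set<String> VERBS = Set.of("get", "set", "parse", "save", "play", "add", "remove");

    // Step 1: getUserListFromFile -> [get, user, list, from, file]
    static List<String> splitCamelCase(String name) {
        List<String> words = new ArrayList<>();
        for (String w : name.split("(?<=[a-z0-9])(?=[A-Z])"))
            words.add(w.toLowerCase());
        return words;
    }

    // Steps 2-3: the first verb-tagged word is the verb; the noun chunk after it is the DO
    static String[] extractVerbDO(String methodName) {
        List<String> words = splitCamelCase(methodName);
        String verb = null;
        StringBuilder directObject = new StringBuilder();
        for (String w : words) {
            if (verb == null && VERBS.contains(w)) { verb = w; continue; }
            if (verb != null) {
                if (w.equals("from") || w.equals("to") || w.equals("by")) break; // stop at a preposition
                if (directObject.length() > 0) directObject.append(' ');
                directObject.append(w);
            }
        }
        return new String[]{verb, directObject.toString()};
    }

    public static void main(String[] args) {
        // Prints: verb=get, DO=user list
        String[] pair = extractVerbDO("getUserListFromFile");
        System.out.println("verb=" + pair[0] + ", DO=" + pair[1]);
    }
}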

26 [Photo] Name: Zak Fry. Nickname: The Rookie. Current Position: Upcoming senior. Future Position: Graduate school.
Stats: Year | diet cokes/day | lab days/week
2006 | 1 | 2
2007 | 6 | 8

27 Developing Rules for Extraction
For many methods: identify the relevant verb (V) and direct object (DO) in the method signature; classify the pattern of V and DO locations; if it is a new pattern, create a new extraction rule

28 Our Current Extraction Rules
4 general rules with subcategories:
Left Verb: URL parseURL() → verb "parse", DO "URL"
Right Verb: void mouseDragged() → verb "dragged", DO "mouse"
Generic Verb: void Host.onSaved() → verb "saved", DO "host"
Unidentified Verb: void message() → verb -, DO "message"
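As a rough illustration of the four top-level categories, here is a hedged Java sketch (the toy verb list and camel-case splitting are simplifying assumptions; the real rules have many subcategories):

import java.util.*;

// Hypothetical sketch of the four top-level rule categories on this slide.
class ExtractionRuleClassifier {
    enum Rule { LEFT_VERB, RIGHT_VERB, GENERIC_VERB, UNIDENTIFIED_VERB }

    // Toy verb list (assumption)
    static final Set<String> VERBS = Set.of("parse", "drag", "dragged", "save", "saved", "get", "set");

    static Rule classify(String methodName) {
        List<String> words = new ArrayList<>();
        for (String w : methodName.split("(?<=[a-z0-9])(?=[A-Z])")) words.add(w.toLowerCase());
        if (words.get(0).equals("on") && words.size() > 1)
            return Rule.GENERIC_VERB;                  // e.g. Host.onSaved() -> verb "saved", DO from the class name
        if (VERBS.contains(words.get(0)))
            return Rule.LEFT_VERB;                     // e.g. parseURL() -> verb "parse", DO "URL"
        if (VERBS.contains(words.get(words.size() - 1)))
            return Rule.RIGHT_VERB;                    // e.g. mouseDragged() -> verb "dragged", DO "mouse"
        return Rule.UNIDENTIFIED_VERB;                 // e.g. message() -> DO "message", no verb
    }

    public static void main(String[] args) {
        for (String m : new String[]{"parseURL", "mouseDragged", "onSaved", "message"})
            System.out.println(m + " -> " + classify(m));
    }
}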

29 Example: Sub-Categories for the Left-Verb General Rule
Look beyond the method name: parameters, return type, declaring class name, type hierarchy
Result: a verb-DO pair (Left Verb)

30 Representing Verb-DO Pairs: Action-Oriented Identifier Graph (AOIG)
[Figure: verb nodes (verb1, verb2, verb3) and DO nodes (DO1, DO2, DO3) point to verb-DO pair nodes ((verb1, DO1), (verb1, DO2), (verb3, DO2), (verb2, DO3)), which have "use" edges to the source code files where each pair occurs]

31 Representing Verb-DO Pairs: Action-Oriented Identifier Graph (AOIG)
[Figure: verb nodes (play, add, remove) and DO nodes (file, playlist, listener) point to pair nodes (play, file), (play, playlist), (remove, playlist), (add, listener), which have "use" edges to source code files]
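A minimal Java sketch of the AOIG shape shown in the figure (the real representation from [AOSD 06] carries more information; the names and methods here are illustrative):

import java.util.*;

// Sketch of an AOIG: verb nodes and DO nodes point into verb-DO pair nodes,
// and each pair node has "use" edges to the source files where it occurs.
class ActionOrientedIdentifierGraph {
    record Pair(String verb, String directObject) {}

    private final Map<String, Set<Pair>> byVerb = new HashMap<>();
    private final Map<String, Set<Pair>> byDO = new HashMap<>();
    private final Map<Pair, Set<String>> uses = new HashMap<>();

    void addUse(String verb, String directObject, String sourceFile) {
        Pair p = new Pair(verb, directObject);
        byVerb.computeIfAbsent(verb, k -> new HashSet<>()).add(p);
        byDO.computeIfAbsent(directObject, k -> new HashSet<>()).add(p);
        uses.computeIfAbsent(p, k -> new HashSet<>()).add(sourceFile);
    }

    Set<Pair> pairsForVerb(String verb) { return byVerb.getOrDefault(verb, Set.of()); }

    Set<String> usesOf(String verb, String directObject) {
        return uses.getOrDefault(new Pair(verb, directObject), Set.of());
    }

    public static void main(String[] args) {
        ActionOrientedIdentifierGraph aoig = new ActionOrientedIdentifierGraph();
        aoig.addUse("play", "file", "Player.java");        // pairs taken from the example slide
        aoig.addUse("play", "playlist", "Playlist.java");
        aoig.addUse("remove", "playlist", "Playlist.java");
        System.out.println(aoig.pairsForVerb("play"));      // both (play, file) and (play, playlist)
    }
}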

32 Evaluation of Extraction Process
Compare automatic vs. ideal (human) extraction: 300 methods from 6 medium open-source programs, annotated by 3 Java developers
Promising results: Precision 57%, Recall 64%
Context of results: did not analyze trivial methods; on average, at least the verb OR the direct object was obtained

33 [Photo] Name: Emily Gibson Hill. Nickname: Batter on Deck. Current Position: 2nd-year PhD student. Future Position: PhD candidate.
Stats: Year | cokes/day | meetings/week
2003 | 0.2 | 1
2007 | 2 | 5

34 Ongoing Work: Program Exploration
Purpose: expedite software maintenance and program comprehension
Key Insight: automated tools can use program structure and identifier names to save the developer time and effort

35 Dora the Program Explorer*
*Dora comes from exploradora, the Spanish word for a female explorer.
[Figure: Dora takes a natural language query (from the maintenance request, expert knowledge, and query expansion), a seed starting point, and a program structure representation (currently a call graph), and produces a relevant neighborhood: the subgraph relevant to the query]

36 State of the Art in Exploration
Structural (dependence, inheritance): slicing, Suade [Robillard 2005]
Lexical (identifier names, comments): regular expressions (grep, Eclipse search); information retrieval (FindConcept, Google Eclipse Search [Poshyvanyk 2006])

37 Example Scenario: Motivating the Need for Structural and Lexical Information
Program: JBidWatcher, an eBay auction sniping program
Bug: user-triggered add auction event has no effect
Task: locate code related to the 'add auction' trigger
Seed: DoAction() method, from prior knowledge

38 Using Only Structural Information (looking for: 'add auction' trigger)
DoAction() has 38 callees; only 2/38 are relevant (DoAdd(), DoPasteFromClipboard()), while methods such as DoNada() are irrelevant
And what if you wanted to explore more than one edge away?
→ Locates locally relevant items, but many irrelevant ones

39 Using Only Lexical Information (looking for: 'add auction' trigger)
50/1812 methods contain matches to the 'add*auction' regular expression query; only 2/50 are relevant
→ Locates globally relevant items, but many irrelevant ones

40 Combining Structural & Lexical Information (looking for: 'add auction' trigger)
Structural: guides exploration from the seed
Lexical: prunes irrelevant edges
[Figure: from DoAction(), the relevant neighborhood keeps DoAdd() and DoPasteFromClipboard() and prunes methods like DoNada()]

41 The Dora Approach
Determine method relevance to the query: calculate a lexical-based relevance score; low-scored methods are pruned from the neighborhood
Recursively explore: prune irrelevant structural edges starting from the seed (see the sketch below)
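A Java sketch of that loop, under stated assumptions: the call-graph interface, the threshold, and the score function are placeholders for illustration, not Dora's actual code.

import java.util.*;
import java.util.function.*;

// Hypothetical sketch of Dora-style exploration: starting from a seed method,
// follow structural (call graph) edges, keep only callees whose lexical
// relevance score passes a threshold, and recurse into the kept methods.
class DoraExplorer {
    private final Function<String, Set<String>> callees;      // structural edges (assumed interface)
    private final BiFunction<String, String, Double> score;   // lexical relevance of a method to the query
    private final double threshold;

    DoraExplorer(Function<String, Set<String>> callees,
                 BiFunction<String, String, Double> score, double threshold) {
        this.callees = callees;
        this.score = score;
        this.threshold = threshold;
    }

    // Returns the relevant neighborhood of the seed for the given query.
    Set<String> explore(String seed, String query) {
        Set<String> neighborhood = new LinkedHashSet<>();
        explore(seed, query, neighborhood);
        return neighborhood;
    }

    private void explore(String method, String query, Set<String> neighborhood) {
        for (String callee : callees.apply(method)) {
            if (neighborhood.contains(callee)) continue;            // avoid revisiting / cycles
            if (score.apply(callee, query) < threshold) continue;   // prune low-scored methods
            neighborhood.add(callee);
            explore(callee, query, neighborhood);                   // recursively explore
        }
    }
}

With the 'add auction' query from the example scenario, a scoring function like the one sketched after slide 43 would keep callees such as DoAdd() and DoPasteFromClipboard() and prune ones like DoNada().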

42 Calculating Relevance Score: Term Frequency (query: 'add auction')
Score based on the query term frequency of the method
[Figure: one method has 6 query term occurrences, another has only 2]

43 Calculating Relevance Score: Location Weights (query: 'add auction')
Weigh term frequency based on location: the method name is more important than the body; method body statements are normalized by length
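A Java sketch of such a score; the weights (0.8 for the name, 0.2 for the body) and the exact combination are assumptions for illustration, not the published Dora formula.

import java.util.*;

// Hypothetical location-weighted relevance score: query-term frequency in the
// method name counts more than in the body, and body counts are normalized by
// the number of body terms so long methods are not favored automatically.
class RelevanceScorer {
    static final double NAME_WEIGHT = 0.8;   // assumed weight
    static final double BODY_WEIGHT = 0.2;   // assumed weight

    static double score(List<String> nameTerms, List<String> bodyTerms, Set<String> queryTerms) {
        double nameHits = countHits(nameTerms, queryTerms);
        double bodyHits = countHits(bodyTerms, queryTerms);
        double nameScore = nameTerms.isEmpty() ? 0 : nameHits / nameTerms.size();
        double bodyScore = bodyTerms.isEmpty() ? 0 : bodyHits / bodyTerms.size();  // normalized by length
        return NAME_WEIGHT * nameScore + BODY_WEIGHT * bodyScore;
    }

    private static long countHits(List<String> terms, Set<String> queryTerms) {
        return terms.stream().filter(t -> queryTerms.contains(t.toLowerCase())).count();
    }

    public static void main(String[] args) {
        // Query 'add auction' against a method named doAdd with a few body terms
        Set<String> query = Set.of("add", "auction");
        double s = score(List.of("do", "add"),
                         List.of("auction", "entry", "add", "auction", "server"), query);
        System.out.println(s);   // name matches dominate, as the slide suggests
    }
}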

44 Dora Explores the 'add auction' Trigger
From the DoAction() seed, at a 0.5 threshold Dora correctly identified DoAdd() (0.93) and DoPasteFromClipboard() (0.60), with only one false positive: DoSave() (0.52)

45 Summary
NL technology used: synonyms, collocations, morphology, word frequencies, part-of-speech tagging, AOIG
Evaluation indicates: natural language information shows promise for improving software development tools
Key to success: accurate extraction of NL clues

46 Our Current and Future Work
Basic NL-based tools for software: abbreviation expander, program synonyms
Determining relative importance of words
Integrating information retrieval techniques

47 Posed Questions for Discussion
What open problems faced by software tool developers can be mitigated by NLPA?
Under what circumstances is NLPA not useful?

