AutoCog: Measuring the Description-to-permission Fidelity in Android Applications

Presentation transcript:

AutoCog: Measuring the Description-to-permission Fidelity in Android Applications
Zhengyang Qu1, Vaibhav Rastogi1, Xinyi Zhang1,2, Yan Chen1, Tiantian Zhu3, and Zhong Chen4
1 Northwestern University, IL, US; 2 Fudan University, Shanghai, China; 3 Zhejiang University, Hangzhou, China; 4 Wind Mobile, Toronto, Canada
Footnote: She worked on this project mainly during her summer internship at Northwestern University.

Outline: Problem Statement; Approach & Design; Evaluation; Conclusions


Motivations
Android permission system: access control is enforced through permissions, but few users can understand the security implications of the permissions an application requests.
User expectation vs. application behavior: users form their expectations from the application description, while permissions define what the application can actually do. The goal is to assess how well the permissions align with the description.
Notes: Android is the most popular mobile operating system, and its open nature bolsters its market share. However, that openness also introduces security issues. Access to private user data is controlled by the permission system, but few users are careful enough, or have the professional knowledge, to understand the security implications of the requested permissions. Even from the description, users do not know what to expect; this is a usability problem. AutoCog outputs the questionable permissions, together with the description sentences that explain the use of the other permissions, to help users match the description to the permissions.

Desired Systems
Users: application developers and end users.
Requirements: rich semantic information; independence from external resources; automation.
Notes: Application developers use the tool to receive early, automatic feedback on how well their descriptions state the use of permissions. End users use it to understand whether an application is over-privileged and risky to use. Rich semantic information: natural language is highly diverse, so the system should cover it well. Independence from external resources: an external resource, such as the API documentation, may be unavailable or incomplete; AutoCog requires only the information from Google Play. Automation: a process that needs manual effort is error-prone and inefficient.

Challenges & Contributions
Challenge 1: inferring description semantics. Similar meaning may be conveyed in a vast diversity of natural language text, e.g. “friends”, “contact list”, and “address book” in the domain of the smartphone contact book.
Challenge 2: correlating description semantics with permission semantics. A number of described functionalities may map to the same permission, e.g. “enable navigation”, “display map”, “find restaurant nearby”.
Contribution 1: leverage state-of-the-art NLP techniques.
Contribution 2: design a learning-based algorithm.

System Prototype Available on Google Play https://play.google.com/store/apps/details?id=com.version1.autocog

Outline: Problem Statement; Approach & Design (Description Semantics (DS) Model, Description-to-Permission Relatedness (DPR) Model); Evaluation; Conclusions

System Overview


Ontology modeling
Logical dependency between verb phrase and noun phrase: <“scan”, “barcode”> for CAMERA, <“record”, “voice”> for RECORD_AUDIO
Logical dependency between noun phrases: <“scanner”, “barcode”>, <“note”, “voice”>
Noun phrase with possessive: <“your”, “camera”>, <“own”, “voice”>

Description Semantics Model (Contribution 1)
Goal: extract abstract semantics.
Explicit Semantic Analysis (ESA) computes the semantic relatedness of texts. It leverages a large document corpus (Wikipedia) as the knowledge base and constructs a vector representation of each text.
Advantages: rich semantic information; quantitative representation of semantics.
Notes: The model captures how the different words and phrases in a vocabulary relate to each other. Wikipedia, as the knowledge base of ESA, is much richer than hand-built dictionaries or databases. Examples: Google Map, social networking. ESA also offers a much more detailed and quantitative representation of semantics: it maps the meaning of a word or phrase to a weighted combination of concepts, whereas mapping a word in WordNet amounts to a simple lookup without any weight.
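To make the "weighted combination of concepts" idea concrete, here is a minimal sketch of ESA-style relatedness as cosine similarity between sparse concept vectors. The concept names and weights in `CONCEPT_VECTORS` are invented for illustration; a real ESA index derives them from Wikipedia articles, and nothing here is taken from the AutoCog implementation.

```python
import math

# Toy concept vectors standing in for a Wikipedia-derived ESA index.
# Every concept name and weight below is a made-up illustration value.
CONCEPT_VECTORS = {
    "map": {"Cartography": 0.9, "Navigation": 0.6, "GPS": 0.4},
    "interactive map": {"Cartography": 0.8, "Navigation": 0.5, "Web mapping": 0.5},
    "voice": {"Speech": 0.9, "Sound recording": 0.6},
}


def relatedness(phrase_a, phrase_b):
    """Cosine similarity between the sparse concept vectors of two phrases."""
    vec_a = CONCEPT_VECTORS[phrase_a]
    vec_b = CONCEPT_VECTORS[phrase_b]
    # Only concepts shared by both phrases contribute to the dot product.
    dot = sum(weight * vec_b.get(concept, 0.0) for concept, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(relatedness("map", "interactive map"))
    print(relatedness("map", "voice"))
```

With these toy vectors, "map" and "interactive map" share concepts and score high, while "map" and "voice" share none and score zero. A WordNet-style lookup would not give such a graded score.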

Description-to-Permission Relatedness (DPR) Model (Contribution 2)
Learning-based method.
Input: application permissions and application descriptions.
Output: <np-counterpart, noun phrase> pairs correlated with each sensitive permission.
Notes: The DPR model measures how closely a pair of noun phrase and np-counterpart is related to a permission. The learning-based algorithm that constructs it is composed of three steps.

Samples in DPR Model (permission: semantic patterns)
WRITE_EXTERNAL_STORAGE: <delete, audio file>, <convert, file format>
ACCESS_FINE_LOCATION: <display, map>, <find, branch atm>, <your location>
ACCESS_COARSE_LOCATION: <set, gps navigation>, <remember, location>
GET_ACCOUNTS: <manage, account>, <integrate, facebook>
RECEIVE_BOOT_COMPLETED: <change, hd paper>, <display, notification>
CAMERA: <deposit, check>, <scanner, barcode>, <snap, photo>
READ_CONTACTS: <block, text message>, <beat, facebook friend>
RECORD_AUDIO: <send, voice message>, <note, voice>
WRITE_SETTINGS: <set, ringtone>, <enable, flight mode>
WRITE_CONTACTS: <wipe, contact list>, <secure, text message>
READ_CALENDAR: <optimize, time>, <synchronize, calendar>

Learning Algorithm for DPR
S1: Grouping noun phrases. Create a semantic relatedness score matrix, e.g. <“map”, [(“map”, 1.00), (“map view”, 0.96), (“interactive map”, 0.89), …]>
S2: Selecting noun phrases correlated with permissions. The selection should not be biased toward frequently occurring noun phrases, so it jointly considers the conditional probabilities P(perm | np) and P(np | perm).
Notes: The semantic relatedness score matrix is created for high-frequency noun phrases only. A noun phrase is high-frequency if the number of application descriptions containing it is above a threshold, because a small number of samples cannot provide enough confidence in a frequency-based measurement. If a low-frequency phrase is similar to a high-frequency phrase, the decision process is not affected, because the decision module employs the DS model.
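As an illustration of step S2, the sketch below estimates P(perm | np) and P(np | perm) from a toy corpus and combines them. Multiplying the two probabilities is an assumption made here for illustration; the slides only say the two are considered jointly, and the function name and data layout are hypothetical.

```python
from collections import Counter


def score_noun_phrases(apps, permission):
    """Score each noun phrase against one permission.

    apps: list of (noun_phrases, permissions) pairs, one per application,
          where both elements are sets of strings.
    Returns {noun_phrase: P(perm | np) * P(np | perm)}.
    The product is an illustrative way to combine the two probabilities.
    """
    np_count = Counter()      # number of apps whose description contains np
    joint_count = Counter()   # number of apps that contain np AND request perm
    perm_count = 0            # number of apps that request perm

    for noun_phrases, permissions in apps:
        has_perm = permission in permissions
        if has_perm:
            perm_count += 1
        for np in noun_phrases:
            np_count[np] += 1
            if has_perm:
                joint_count[np] += 1

    scores = {}
    for np, count in np_count.items():
        p_perm_given_np = joint_count[np] / count
        p_np_given_perm = joint_count[np] / perm_count if perm_count else 0.0
        scores[np] = p_perm_given_np * p_np_given_perm
    return scores


if __name__ == "__main__":
    toy_apps = [
        ({"map", "photo"}, {"LOCATION"}),
        ({"map"}, {"LOCATION"}),
        ({"photo"}, {"CAMERA"}),
        ({"app"}, {"LOCATION"}),
        ({"app"}, {"CAMERA"}),
        ({"app"}, set()),
    ]
    for np, score in sorted(score_noun_phrases(toy_apps, "LOCATION").items()):
        print(np, round(score, 3))
```

In the toy corpus, "app" is the most frequent phrase but ranks lowest for LOCATION, because it appears just as often in applications that do not request that permission. This is the frequency bias that considering both probabilities is meant to avoid.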

Learning Algorithm for DPR (cont’d)
S3: Pairing np-counterparts with noun phrases, to explore the context and semantic dependencies.
Example: “Retrieve Running Apps permission is required because, if the user is not looking at the widget actively (for e.g. he might using another app like Google Maps)”
Notes: Read the whole sentence here.

Outline: Problem Statement; Approach & Design; Evaluation; Conclusions

Evaluation
Training set: 36,060 applications.
Validation set: 1,785 applications (150-200 for each permission), covering 11 sensitive permissions.
Notes: We ask human readers to annotate the description sentences. A result is a true positive when AutoCog and the human readers both decide that the description indicates a permission; false positives, false negatives, and true negatives are defined similarly.

Closely Related Work
Whyper, Pandita et al., USENIX Security 2013. Leverages API documentation to generate a semantics model; APIs are mapped to permissions using PScout.
Limitations:
Limited semantic information: “Blow into the mic to extinguish the flame…” for the RECORD_AUDIO permission is not in the API documentation.
Lack of associated APIs: RECEIVE_BOOT_COMPLETED has no associated APIs.
Lack of automation.
Notes: Whyper uses the API documentation to pick up resource names and actions to construct its semantic engine. Another example of limited semantic information: a mobile banking application that requests the CAMERA permission supports depositing checks by taking a photograph of the check. Whyper's extraction of patterns from API documents involved manual selection to preserve the quality of patterns; what policies could be used to automate this process in a systematic manner is an open question.

Accuracy Comparison
System    Precision (%)   Recall (%)   F-score (%)   Accuracy (%)
AutoCog   92.6            92.0         92.3          93.2
Whyper    85.5            66.5         74.8          79.9
Notes: We run AutoCog and Whyper in parallel and evaluate how well they align with the human readers' ground truth. AutoCog's precision is about 7 percentage points higher than Whyper's, and its recall is more than 20 percentage points higher.
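For reference, the metrics in this comparison follow the standard definitions. The short sketch below computes precision and recall from confusion counts and checks that the reported F-scores are consistent with the reported precision and recall. The function names are illustrative, and the raw TP/FP/FN counts are not given in the slides, so the counts in the example are made up.

```python
def precision(tp, fp):
    """Fraction of flagged items that the human annotators also flagged."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of annotator-flagged items that the system also flagged."""
    return tp / (tp + fn)


def f_score(p, r):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r)


if __name__ == "__main__":
    # Made-up confusion counts, only to show the count-based functions.
    print(precision(tp=8, fp=2), recall(tp=8, fn=2))

    # Precision and recall as reported on the slide, in percent.
    print(round(f_score(92.6, 92.0), 1))  # AutoCog
    print(round(f_score(85.5, 66.5), 1))  # Whyper
```

Rounded to one decimal place, the two F-score calls reproduce the 92.3 and 74.8 shown in the table.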

Results
Latency: 4.5 s to check an application.
Case studies:
AutoCog TP / Whyper FN: “Filter by contact, in/out SMS”, “5 calendar views”
AutoCog TN / Whyper FP: “Saving event attendance status now works on Android 4.0”
AutoCog FN / Whyper TP: “Ability to navigate to a Contact if that Contact has address”
AutoCog FP / Whyper TN: “Set recording as ringtone”
Notes: Two factors explain the difference in performance: (1) the fundamental method used to find semantic patterns related to permissions, and (2) AutoCog's inclusion of the logical dependency between noun phrases as extra ontology. Whyper is limited by its use of a fixed and limited set of vocabularies derived from the Android API documents and their synonyms. AutoCog's correlation of permissions with pairs of noun phrase and np-counterpart is based on clustering results from a large application dataset, which is much richer than what can be extracted from API documents. One major reason for the difference in detection is that Whyper cannot accurately capture the meaning of a noun phrase with multiple words. On the other hand, some patterns may not be included in the training set and cannot be related in semantics by ESA. Some patterns that are dominant in the statistics but do not make sense to humans could not be fully excluded by AutoCog, whereas Whyper would not pick them up from the API documents.

Conclusions
AutoCog is a system that measures description-to-permission fidelity.
A learning-based algorithm generates the DPR model, giving better accuracy and the ability to extend to other permissions.
Ongoing work: optimize the training algorithm to improve scalability; simplify the semantics models.

AutoCog App

Thank you! Questions? http://list.cs.northwestern.edu/mobile/

NLP Module
Sentence boundary disambiguation (SBD): the description is split into sentences for subsequent sentence structure analysis (Stanford Parser).
Grammatical structure analysis: the Stanford Parser outputs typed dependencies and the part-of-speech (PoS) tag of each word.
Pair extraction: extract pairs of noun phrase and np-counterpart; remove stopwords and named entities; normalize by lowercasing and lemmatization.
Notes: Characters such as the period and comma, and others such as the star that may start bullet points, are treated as sentence separators. Regular expressions are used to annotate email addresses, URLs, IP addresses, phone numbers, decimal numbers, abbreviations, and ellipses, which would otherwise interfere with SBD because they contain sentence separator characters.
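The SBD step described above can be sketched as "mask, split, restore": tokens that contain separator characters are first replaced by placeholders, the text is split on the separators, and the tokens are put back. The regular expressions below are simplified assumptions covering only URLs, email addresses, dotted numbers, two abbreviations, and ellipses; they are not the expressions used in AutoCog, and `split_sentences` is a hypothetical name.

```python
import re

# Tokens that contain separator characters but must not be split.
# These patterns are deliberately simple illustrations.
PROTECTED = re.compile(
    r"https?://[^\s,]+[^\s.,]"                      # URLs (no trailing '.' or ',')
    r"|[\w+-]+(?:\.[\w+-]+)*@[\w-]+(?:\.[\w-]+)+"   # email addresses
    r"|\d+(?:\.\d+)+"                               # decimals and IP addresses
    r"|\b(?:e\.g|i\.e)\."                           # a few abbreviations
    r"|\.{3}"                                       # ellipses
)

# Period, comma, and star are treated as sentence separators.
SEPARATORS = re.compile(r"[.,*]+")


def split_sentences(description):
    """Split an app description into sentences, keeping protected tokens intact."""
    protected = []

    def mask(match):
        protected.append(match.group(0))
        return f"\x00{len(protected) - 1}\x00"

    def unmask(match):
        return protected[int(match.group(1))]

    masked = PROTECTED.sub(mask, description)
    sentences = []
    for chunk in SEPARATORS.split(masked):
        restored = re.sub(r"\x00(\d+)\x00", unmask, chunk).strip()
        if restored:
            sentences.append(restored)
    return sentences


if __name__ == "__main__":
    text = ("Works on Android 4.0. Mail us at help@example.com, "
            "or visit http://example.com/faq. * Scan barcodes")
    for sentence in split_sentences(text):
        print(sentence)
```

The placeholder uses NUL characters on the assumption that they never occur in a real description. A production version would also need patterns for phone numbers and a fuller abbreviation list.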


Decision
Extract all pairs of noun phrase and np-counterpart.
Condition:
Notes: Here, gamma and theta are the thresholds of the semantic relatedness score for np-counterparts and noun phrases, respectively. The sentences that indicate permissions are annotated. In addition, AutoCog finds all the questionable permissions, which are those not warranted by the description.
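The condition itself is not preserved in this transcript. One plausible reading, sketched below as an assumption, is that a pair extracted from the description supports a permission when its np-counterpart is related to a DPR pattern's np-counterpart with a score of at least gamma and its noun phrase is related to the pattern's noun phrase with a score of at least theta. The function names, the `>=` comparison, and the default threshold values are all hypothetical.

```python
def pair_matches(pair, pattern, relatedness, gamma, theta):
    """Check one extracted <np-counterpart, noun phrase> pair against one DPR pattern."""
    counterpart, noun_phrase = pair
    pattern_counterpart, pattern_noun_phrase = pattern
    return (relatedness(counterpart, pattern_counterpart) >= gamma
            and relatedness(noun_phrase, pattern_noun_phrase) >= theta)


def questionable_permissions(requested, pairs, dpr_model, relatedness,
                             gamma=0.7, theta=0.7):
    """Return the requested permissions that no extracted pair supports.

    requested:   iterable of permission names requested by the application
    pairs:       (np-counterpart, noun phrase) tuples from the description
    dpr_model:   {permission: [(np-counterpart, noun phrase), ...]}
    relatedness: callable returning a semantic relatedness score in [0, 1]
    gamma/theta: illustrative threshold values, not taken from the slides
    """
    questionable = set()
    for permission in requested:
        patterns = dpr_model.get(permission, [])
        supported = any(
            pair_matches(pair, pattern, relatedness, gamma, theta)
            for pair in pairs
            for pattern in patterns
        )
        if not supported:
            questionable.add(permission)
    return questionable


if __name__ == "__main__":
    # Toy relatedness function standing in for the DS model.
    related = {frozenset(("photo", "picture")): 0.9}

    def toy_relatedness(a, b):
        return 1.0 if a == b else related.get(frozenset((a, b)), 0.0)

    model = {"CAMERA": [("snap", "photo")], "RECORD_AUDIO": [("record", "voice")]}
    print(questionable_permissions(["CAMERA", "RECORD_AUDIO"],
                                   [("snap", "picture")], model, toy_relatedness))
```

In this toy run, the description pair <snap, picture> supports CAMERA through its relatedness to <snap, photo>, so only RECORD_AUDIO is reported as questionable.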

Deployment Blue color

DPR Model (cont’d)
Pairing np-counterparts with noun phrases, to explore the context and semantic dependencies.
SP: the total number of descriptions in which the pair <nc, np’> is detected; the number of applications requesting the permission is

Measurement Results
Dataset: another 45,811 applications, analyzed with the DPR model trained for the accuracy evaluation.
Negative correlation between the number of questionable permissions in an application by a specific developer and the total number of applications published by that developer: r = -0.405, p < 0.001.
Only 9.1% of applications are clear of questionable permissions.

Backup
