Data-Intensive Systems, Department of Computer Science

1 Data-Intensive Systems, Department of Computer Science

2 Outlier Detection by Active Learning
Outlier detection: identifying deviating, “odd” data records.
Study outlier detection as a semi-supervised learning problem:
  learn from past examples of outliers (training data, the supervised part)
  identify new types of outliers (no training examples, the unsupervised part)
Work with an existing proposal for semi-supervised outlier detection.
Apply active learning to it: the algorithm determines “interesting” data records and asks the user for labels, to maximize learning while minimizing user effort.
Related work:
  B. Micenková, B. McWilliams, I. Assent: Learning representations for outlier detection on a budget - arXiv preprint arXiv: , 2015
  Burr Settles: Active Learning.
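The query-selection step can be illustrated with uncertainty sampling, one of the strategies covered in Settles' survey. This is a minimal sketch, not the method of the cited proposal: it assumes a detector that already outputs an outlier probability per record, and the function name `most_uncertain` and the toy scores are hypothetical.

```python
def most_uncertain(scores, labeled, batch_size=1):
    """Pick the unlabeled records whose outlier probability is closest to 0.5.

    scores     -- list of outlier probabilities in [0, 1], one per record (assumed input)
    labeled    -- set of record indices the user has already labeled
    batch_size -- how many records to ask the user about
    """
    candidates = [i for i in range(len(scores)) if i not in labeled]
    # Smallest distance to 0.5 means the detector is least sure; ties break by index.
    candidates.sort(key=lambda i: (abs(scores[i] - 0.5), i))
    return candidates[:batch_size]


# Toy scores, invented for illustration only.
scores = [0.02, 0.48, 0.91, 0.55, 0.10]
labeled = {1}  # record 1 was labeled in an earlier round
picked = most_uncertain(scores, labeled, batch_size=2)
print(picked)
```

In a full loop the user's labels for the picked records would be added to the training data and the detector retrained before the next round.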

3 Word Embeddings For Natural Language Processing
Represent words in vector spaces: word2vec (Google) and GloVe (Stanford).
Project: study word embeddings for different NLP tasks
  treat out-of-vocabulary words
  consider which features to extract from text
Related work:
  Y. Pinter, R. Guthrie, J. Eisenstein: Mimicking Word Embeddings using Subword RNNs.
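To make the vector-space idea and the out-of-vocabulary problem concrete, here is a small sketch. The three-dimensional vectors are invented for illustration and do not come from word2vec or GloVe, and the mean-vector fallback in `lookup` is only a naive baseline, not the subword approach of Pinter et al.

```python
import math

# Toy embedding table; the values are made up for this example.
EMB = {
    "king": [0.8, 0.6, 0.0],
    "queen": [0.6, 0.8, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lookup(word, table=EMB):
    """Return the word's vector, or the mean of all vectors if the word is unknown."""
    if word in table:
        return table[word]
    dim = len(next(iter(table.values())))
    return [sum(vec[d] for vec in table.values()) / len(table) for d in range(dim)]


print(cosine(EMB["king"], EMB["queen"]))  # related words: high similarity
print(cosine(EMB["king"], EMB["apple"]))  # unrelated words: low similarity
print(lookup("kingdom"))                  # out-of-vocabulary word gets the fallback
```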

4 Content-Based Influence Maximization by Linear Threshold Model
Social networks facilitate the spread of influence.
Influence maximization problem: choose the best initial adopters for a given item.
  Objective: reach as many nodes as possible.
Another problem: choose the best content features for given initial adopters.
  Same objective.
Solve the problem under the linear threshold model of information diffusion.
Related work:
  Goyal et al.: SIMPATH: An Efficient Algorithm for Influence Maximization under the Linear Threshold Model. ICDM 2011.
  Ivanov et al.: Content Recommendation for Viral Social Influence. SIGIR 2017.
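The linear threshold model itself is short to state in code: a node becomes active once the summed weight of its active in-neighbours reaches its threshold. The sketch below assumes fixed thresholds for reproducibility (the standard model draws them at random), and the graph, weights, and function name are hypothetical.

```python
def linear_threshold(in_weights, thresholds, seeds):
    """Simulate linear threshold diffusion and return the final set of active nodes.

    in_weights -- dict: node -> {in-neighbour: edge weight}
    thresholds -- dict: node -> activation threshold in [0, 1]
    seeds      -- iterable of initially active nodes (the initial adopters)
    """
    active = set(seeds)
    changed = True
    while changed:  # repeat until no further node can be activated
        changed = False
        for node, incoming in in_weights.items():
            if node in active:
                continue
            influence = sum(w for src, w in incoming.items() if src in active)
            if influence >= thresholds[node]:
                active.add(node)
                changed = True
    return active


# Toy network, invented for illustration.
in_weights = {
    "a": {},
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {"c": 0.4},
}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
reached = linear_threshold(in_weights, thresholds, {"a"})
print(sorted(reached))
```

Influence maximization then amounts to searching for the seed set that makes the returned set as large as possible.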

5 Similarity Search with Dynamic Time Warping
Time series databases: finance, energy, health care, space.
Similarity search problem: given a query record, find the most similar ones.
  How to store data for efficient retrieval?
  How to process a query?
How is similarity defined?
  Euclidean distance
  Dynamic Time Warping (DTW)
Develop a solution for the DTW distance.
Related work:
  Assent et al.: Anticipatory DTW for Efficient Similarity Search in Time Series Databases. PVLDB 2(1) 2009.
  Kashyap & Karras: Scalable kNN search on vertically stored time series. KDD 2011.
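For reference, the textbook dynamic-programming formulation of the DTW distance looks as follows. This is a plain quadratic-time sketch with absolute difference as the local cost and no warping-window constraint; it is not the indexed or anticipatory search from the cited papers, and the function name is an assumption.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = DTW distance between the first i points of a and the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The second series repeats one value; DTW aligns the repeat at no cost,
# whereas Euclidean distance is not even defined for unequal lengths.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([0, 0], [1, 1]))
```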

6 How Do We Walk on Knowledge?
Knowledge graph data: Wikidata, Freebase, YAGO
Preliminaries
  Random walk model: start at a page and click on a random link; repeat until convergence.
  Random walk with restart: with some probability, jump back to the initial page.
  The random walk model explains a number of phenomena in networks.
Question
  Does the random walk model apply seamlessly to knowledge graphs?
  Can we analyse the behavior of users on knowledge graphs and compare the laws?
Task
  Analyse Wikipedia logs and DBpedia log files and fit a random walk model.
  Modify the random walk model to fit knowledge graphs.
Related work:
  Bonifati, A., Martens, W. and Timm, T. An analytical study of large SPARQL query logs. PVLDB, 11(2), pp
  Page, L., Brin, S., Motwani, R. and Winograd, T. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
  Faloutsos, M., Faloutsos, P. and Faloutsos, C. On power-law relationships of the internet topology. SIGCOMM 1999.
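The random walk with restart described in the preliminaries can be computed by repeatedly pushing probability mass along links. The sketch below makes several assumptions not stated on the slide: a restart probability of 0.15, a fixed iteration count instead of a convergence test, and dead-end pages that send the walker back to the start. The graph and function name are hypothetical.

```python
def random_walk_with_restart(out_links, start, restart=0.15, iters=100):
    """Approximate the visiting probabilities of a random walk with restart.

    out_links -- dict: node -> list of nodes it links to (every target must be a key)
    start     -- the page the walker starts from and jumps back to
    restart   -- probability of jumping back to `start` at each step (assumed value)
    """
    p = {n: 0.0 for n in out_links}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in out_links}
        for node, links in out_links.items():
            follow = (1.0 - restart) * p[node]
            if links:
                for target in links:  # click a uniformly random link
                    nxt[target] += follow / len(links)
            else:
                nxt[start] += follow  # dead end: assume the walker returns to start
        nxt[start] += restart * sum(p.values())  # restart jump
        p = nxt
    return p


# Toy three-page cycle, invented for illustration.
graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
probs = random_walk_with_restart(graph, "a")
print(probs)
```

Fitting the model to log data would mean comparing such probabilities with observed visit frequencies, which this sketch does not attempt.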

7 Distributed Workload-Aware Knowledge Graph Summaries
Knowledge graph data: Wikidata, Freebase, YAGO
Preliminaries
  A graph summary is a compressed representation of a graph.
  Graph summaries can answer queries (e.g., who is the Queen of Denmark).
  Query-aware summaries summarise a graph based on some query.
Question
  Is it possible to learn a summary on multiple machines representing many users?
  Can we balance these summaries?
Task
  Use query-aware summaries to find a first summary of the graph.
  Generate queries from multiple users to be distributed across n machines.
Related work:
  Fan, W., Li, J., Wang, X. and Wu, Y. Query preserving graph compression. SIGMOD 2012
  Papailiou N. et al. Graph-Aware, Workload-Adaptive SPARQL Query Caching. SIGMOD Conference 2015:

8 Authenticating Geometric Queries
[Figure: Owner, DB, Server (untrusted), Client; arrows labeled Query and Results. Caption: The Truth? The Whole Truth? Nothing But The Truth?]
Results are provided by untrusted parties.
Query authentication question: are the results that we got genuine?
Desirable characteristics:
  Authentic: every entry originated from the owner
  Complete: no result entry is omitted from the answer
  Precise: minimum information leakage
  Secure: computationally infeasible to cheat
  Efficient: polynomial proof
  Fresh: the result is based on the most current version
Develop a solution for geometric queries.
Related work:
  Papadopoulos et al.: Taking Authenticated Range Queries to Arbitrary Dimensions. CCS 2014.
  Chen et al.: Authenticating Top-k Queries in Location-based Services with Confidentiality. PVLDB 7(1), 2013.
  Jadhav and Mukhopadhyay: Computing a Centerpoint of a Finite Planar Set of Points in Linear Time. SoCG 1993.


