Amazon Web Services – Plagiarism Application
Danijel Novaković
January 31st, 2012
Supervisor: Prof. Amin Anjomshoaa


Outline
• Scenario
• Used Tools
  – The AWS Toolkit for Eclipse
  – Karmasphere Studio For Amazon
  – Apache PDFBox
• Realization of Scenario Steps
• Final Conclusions & Personal Opinion

Scenario
• The sample PDF files are stored in an S3 bucket under the following endpoint: http://exercise2.ws2011.s3-website-eu-west-1.amazonaws.com/ (a user must first be authenticated in order to access these files).
• The files are read and stored in an Amazon queue for further processing.
• An Amazon EC2 instance processes the queued items and extracts the paragraphs from them as text. The result should be stored in a second Amazon S3 bucket.
• As the next step, Elastic MapReduce should be applied to the data resulting from the previous step. The MapReduce process should simply perform a word count and, for each paragraph, calculate the ten most frequent words. The result should then be stored in SimpleDB.
• Finally, some sample queries should be provided that receive keywords and return the list of paragraphs that best match those keywords.
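The word-count step of the scenario can be tried out locally before any Hadoop job is involved. The following is a minimal plain-Java sketch, not the code used in the presentation: the class name ParagraphTopWords, the tokenization rule (lower-case, split on anything that is not a letter or digit) and the alphabetical tie-break are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ParagraphTopWords {
    /** Counts case-insensitive word frequencies in one paragraph. */
    static Map<String, Integer> countWords(String paragraph) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed tokenization: split on every run of non-letter, non-digit characters.
        for (String token : paragraph.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns up to limit entries, highest count first; ties are ordered alphabetically. */
    static List<Map.Entry<String, Integer>> topWords(String paragraph, int limit) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(countWords(paragraph).entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }
}
```

Calling ParagraphTopWords.topWords(paragraph, 10) yields at most ten (word, count) pairs for one paragraph, which is the shape of data the later SimpleDB step stores per item.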

Scenario

Used Tools I
• The AWS Toolkit for Eclipse
  – An open source plug-in for the Eclipse Java IDE that makes it easier for developers to develop, debug, and deploy Java applications using Amazon Web Services.
  – With the AWS Toolkit for Eclipse, you’ll be able to get started faster and be more productive when building AWS applications.
  – The AWS Toolkit for Eclipse features:
    • AWS SDK for Java
    • AWS Explorer
    • AWS Elastic Beanstalk Deployment and Debugging
    • Support for multiple AWS Accounts
  – http://aws.amazon.com/eclipse/

Used Tools II
• Karmasphere Studio For Amazon
  – Graphical environment that supports the complete lifecycle of developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop jobs.
  – By simplifying development, Karmasphere Studio increases the productivity of developers, saving time and effort.
  – Comes in versions compatible with Eclipse.
  – Two different licensing models:
    • License Included (the Karmasphere software has been licensed by AWS)
    • Bring-Your-Own (designed for customers who prefer to use existing Karmasphere)
  – http://aws.amazon.com/elasticmapreduce/karmasphere/
  – http://karmasphere.com/ksc/karmasphere-studio-for-amazon.html

Used Tools III
• Apache PDFBox
  – Java PDF library
  – Open source Java tool for working with PDF documents
  – Used for PDF-to-text extraction
  – http://pdfbox.apache.org/

Scenario – part I

import java.io.File;
import java.util.List;
import java.util.UUID;

import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.CreateQueueRequest;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.DeleteQueueRequest;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import com.amazonaws.services.sqs.model.SendMessageRequest;

// S3 and SQS clients, both configured from the same credentials file
AmazonS3 s3 = new AmazonS3Client(new PropertiesCredentials(
        MainClass.class.getResourceAsStream("AwsCredentials.properties")));
AmazonSQS sqs = new AmazonSQSClient(new PropertiesCredentials(
        MainClass.class.getResourceAsStream("AwsCredentials.properties")));

String inputBucketName = "exercise2.ws2011";
String mainBucketName = "introduction.to.cloud.computing";
String vFolderWithParagrapfsName = "pdf.extracted.paragraph";
// a random suffix keeps the queue name unique between runs
String queueName = "myQueue01" + UUID.randomUUID();
int numberOfSentMessages = 0;

CreateQueueRequest createQueueRequest = new CreateQueueRequest(queueName);
String myQueueUrl = sqs.createQueue(createQueueRequest).getQueueUrl();

// send one message per PDF file; the message body is the S3 key of the file
// (a single listObjects call returns only the first page of keys)
ObjectListing objectListing = s3.listObjects(
        new ListObjectsRequest().withBucketName(inputBucketName));
for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
    String fileName = objectSummary.getKey();
    sqs.sendMessage(new SendMessageRequest(myQueueUrl, fileName));
    numberOfSentMessages++;
}
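The counter numberOfSentMessages is what the consumer later uses to decide when the queue has been drained. Because a standard SQS queue delivers messages at least once, a loop that stops only when the received count exactly equals the sent count may never terminate if a message arrives twice or not at all. The sketch below illustrates a more defensive stop condition using an in-memory java.util.Deque as a stand-in for the queue; the class QueueDrainSketch, its methods and the maxEmptyReceives parameter are hypothetical, and unlike real SQS the stand-in removes a message as soon as it is received.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class QueueDrainSketch {
    /** Hypothetical stand-in for one receive call: removes and returns at most maxBatch bodies. */
    static List<String> receive(Deque<String> queue, int maxBatch) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxBatch && !queue.isEmpty()) {
            batch.add(queue.poll());
        }
        return batch;
    }

    /**
     * Receives until numberOfSentMessages bodies were handled, or until
     * maxEmptyReceives consecutive receive calls came back empty.
     */
    static List<String> drain(Deque<String> queue, int numberOfSentMessages,
                              int maxBatch, int maxEmptyReceives) {
        List<String> processed = new ArrayList<>();
        int emptyReceives = 0;
        // "<" instead of "!=": an unexpected extra message cannot cause an endless loop.
        while (processed.size() < numberOfSentMessages && emptyReceives < maxEmptyReceives) {
            List<String> batch = receive(queue, maxBatch);
            if (batch.isEmpty()) {
                emptyReceives++;
                continue;
            }
            emptyReceives = 0;
            processed.addAll(batch); // real code would download and parse each PDF here
        }
        return processed;
    }
}
```

With the real service, the same idea would apply to the loop around sqs.receiveMessage, ideally deleting each message only after it has been processed successfully.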

Scenario – part II

// create bucket
s3.createBucket(mainBucketName);

// create virtual folder in created bucket:
// an empty object whose key ends with "/" acts as the folder
String tmpFileName = "tmpFile.txt";
File tmpFile = new File(tmpFileName);
boolean successfullCreated = tmpFile.createNewFile();
s3.putObject(new PutObjectRequest(mainBucketName, vFolderWithParagrapfsName + "/", tmpFile));
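S3 has a flat key namespace, so the "virtual folder" created above is only a naming convention: an empty object whose key ends in "/" plus other objects that share that prefix. The following sketch shows the string handling this implies, without calling S3 at all. The class VirtualFolderKeys and its method names are hypothetical helpers, not part of the AWS SDK.

```java
import java.util.ArrayList;
import java.util.List;

final class VirtualFolderKeys {
    /** Key of the empty marker object that stands for the "folder". */
    static String folderMarkerKey(String folderName) {
        return folderName + "/";
    }

    /** Key of a file stored "inside" the folder. */
    static String objectKey(String folderName, String fileName) {
        return folderMarkerKey(folderName) + fileName;
    }

    /** File names of all keys under the folder prefix, skipping the marker object itself. */
    static List<String> fileNamesUnder(String folderName, List<String> keys) {
        String prefix = folderMarkerKey(folderName);
        List<String> names = new ArrayList<>();
        for (String key : keys) {
            if (key.startsWith(prefix) && key.length() > prefix.length()) {
                names.add(key.substring(prefix.length()));
            }
        }
        return names;
    }
}
```

Filtering by prefix in this way does not depend on the marker object being listed before the files, which makes it a little more robust than remembering the folder name while iterating.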

ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl);
int totalNumberOfReceivedMessages = 0;
int numberOfReceivedMessages = 0;
while (numberOfSentMessages != totalNumberOfReceivedMessages) {
    List<Message> messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
    numberOfReceivedMessages = messages.size();
    for (Message message : messages) {
        // the message body carries the S3 key of one PDF file
        String fileName = message.getBody();
        String messageRecieptHandle = message.getReceiptHandle();
        sqs.deleteMessage(new DeleteMessageRequest(myQueueUrl, messageRecieptHandle));

        // download the PDF through a pre-signed URL and extract its text
        String sURL = s3.generatePresignedUrl(inputBucketName, fileName, null).toString();
        downloadFromUrl(sURL, pdfDir + "/" + fileName);
        PDFTextParser pdfTextParserObj = new PDFTextParser();
        String pdfToText = pdfTextParserObj.pdftoText(pdfDir + "/" + fileName);
        pdfTextParserObj.writeTexttoFile(pdfToText, pdfDir + "/" + fileName2);
        // the loop body continues on the next slide

Scenario – part III

while (numberOfSentMessages != totalNumberOfReceivedMessages) {
    List<Message> messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
    numberOfReceivedMessages = messages.size();
    for (Message message : messages) {
        String fileName = message.getBody();
        String messageRecieptHandle = message.getReceiptHandle();
        sqs.deleteMessage(new DeleteMessageRequest(myQueueUrl, messageRecieptHandle));

        String sURL = s3.generatePresignedUrl(inputBucketName, fileName, null).toString();
        downloadFromUrl(sURL, pdfDir + "/" + fileName);
        PDFTextParser pdfTextParserObj = new PDFTextParser();
        String pdfToText = pdfTextParserObj.pdftoText(pdfDir + "/" + fileName);
        pdfTextParserObj.writeTexttoFile(pdfToText, pdfDir + "/" + fileName2);
        ...
        // for each extracted paragraph: upload it into the virtual folder
        s3.putObject(new PutObjectRequest(mainBucketName,
                vFolderWithParagrapfsName + "/" + fileName3, paragrafContent));
    }
    totalNumberOfReceivedMessages += numberOfReceivedMessages;
}
// all files are processed, so the queue is no longer needed
sqs.deleteQueue(new DeleteQueueRequest(myQueueUrl));
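The slides elide how the extracted text is cut into paragraphs before each one is uploaded. One simple possibility, shown here purely as an assumption, is to treat one or more blank lines as the paragraph separator. The class ParagraphSplitter is hypothetical, and text extracted from real PDFs may need a different heuristic.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSplitter {
    /** Splits text on blank lines and collapses the whitespace inside each paragraph. */
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        // Normalize Windows and old Mac line endings to "\n" first.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        // A newline, optional whitespace, and another newline mark a paragraph break.
        for (String block : normalized.split("\\n\\s*\\n")) {
            String paragraph = block.trim().replaceAll("\\s+", " ");
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }
}
```

Each returned string could then be written to its own file and uploaded in the per-paragraph step of the loop.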

Scenario – part III (Results)

Scenario – part IV

import java.io.File;
import java.util.UUID;

import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.Attribute;
import com.amazonaws.services.simpledb.model.BatchPutAttributesRequest;
import com.amazonaws.services.simpledb.model.CreateDomainRequest;
import com.amazonaws.services.simpledb.model.DeleteAttributesRequest;
import com.amazonaws.services.simpledb.model.DeleteDomainRequest;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.ReplaceableItem;
import com.amazonaws.services.simpledb.model.SelectRequest;

AmazonS3 s3 = new AmazonS3Client(new PropertiesCredentials(
        ExecuteJobs.class.getResourceAsStream("AwsCredentials.properties")));
AmazonSimpleDB sdb = new AmazonSimpleDBClient(new PropertiesCredentials(
        ExecuteJobs.class.getResourceAsStream("AwsCredentials.properties")));

// domainName in Amazon SimpleDB
String domainName = "IntroductionToCloudComputing";
String mainBucketName = "introduction.to.cloud.computing";
String vFolderWithParagrapfsName = null;
String pdfDir = "pdfTemp" + UUID.randomUUID();

sdb.createDomain(new CreateDomainRequest(domainName));

ObjectListing objectListing = s3.listObjects(
        new ListObjectsRequest().withBucketName(mainBucketName));
HadoopJob hj = new HadoopJob();
File tmpFile = null;
for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
    if (objectSummary.getSize() > 0) {
        // it is a file read from Amazon S3, not a folder
        // code is on the next slide
    } else {
        // an empty object whose key ends with "/" is the virtual folder;
        // keys are listed in lexicographic order, so it precedes its files
        vFolderWithParagrapfsName = objectSummary.getKey()
                .substring(0, objectSummary.getKey().length() - 1);
    }
}

if (objectSummary.getSize() > 0) { // it is a file read from Amazon S3, not a folder
    String fileName = objectSummary.getKey();
    String sURL = s3.generatePresignedUrl(mainBucketName, fileName, null).toString();
    // strip the "<folder>/" prefix from the key
    fileName = fileName.substring(vFolderWithParagrapfsName.length() + 1);
    String dTmpFilePath = pdfDir + "/" + fileName;
    downloadFromUrl(sURL, dTmpFilePath);
    // the paragraphs are stored in pdfDir + "/" + fileName
    …
    // for each paragraph (n is the paragraph number):
    {
        String baseName = fileName.substring(0, fileName.indexOf(".txt"));
        // run the word count job on the current paragraph
        hj.doMyJob(pdfDir + "/" + "temp.txt",
                pdfDir + "/output" + "/" + baseName + "/" + baseName + "_" + n);
        int numberOfWords = 10;
        MyArray array = getTopWords(hadoopOutputFilePath, numberOfWords);
        // store the paragraph and its top words as one SimpleDB item
        sdb.batchPutAttributes(new BatchPutAttributesRequest(domainName,
                createSampleData(fileName.substring(0, fileName.indexOf("_Paragraphs.txt")),
                        shorterStmp, n, numberOfWords, array)));
    }
}
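The slides do not show getTopWords. With TextOutputFormat, each line of the job output holds a key and a value separated by a tab, so selecting the most frequent words amounts to parsing those lines and sorting by count. The sketch below works on lines already read into memory; the class WordCountOutput, its nested WordCount type and the alphabetical tie-break are assumptions, not the author's MyArray implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class WordCountOutput {
    /** One parsed output line: a word and how often it occurred. */
    static final class WordCount {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    /** Parses "word<TAB>count" lines; lines without a tab are skipped. */
    static List<WordCount> parse(List<String> lines) {
        List<WordCount> result = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab <= 0) {
                continue;
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            result.add(new WordCount(word, count));
        }
        return result;
    }

    /** Returns up to n entries, highest count first; ties are ordered alphabetically. */
    static List<WordCount> top(List<WordCount> all, int n) {
        List<WordCount> sorted = new ArrayList<>(all);
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.count, a.count);
            return byCount != 0 ? byCount : a.word.compareTo(b.word);
        });
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```

In the pipeline above, the input lines would come from the part files that the job writes into its output folder.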

Scenario – part IV (Mapper)
public class HadoopMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // emit (word, 1) for every whitespace-separated token of the input line
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Scenario – part IV (Reducer)
public class HadoopReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // sum up all the counts emitted for one word
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Scenario – part IV (Driver)
public static void initJob(Job job) {
    // setJobName is a method of Job, not of Configuration
    job.setJobName("wordcount");
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
    job.setMapperClass(HadoopMapper.class);
    job.setMapOutputKeyClass(org.apache.hadoop.io.Text.class);
    job.setMapOutputValueClass(org.apache.hadoop.io.IntWritable.class);
    job.setReducerClass(HadoopReducer.class);
    // declare the reducer's output key type as well
    job.setOutputKeyClass(org.apache.hadoop.io.Text.class);
    job.setOutputValueClass(org.apache.hadoop.io.IntWritable.class);
    job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);
}

public void doMyJob(String inputFileName, String outputFolderName) throws Exception {
    Job job = new Job();
    initJob(job);
    /* Tell Task Tracker this is the main */
    job.setJarByClass(HadoopJob.class);
    /* This is an example of how to set input and output. */
    FileInputFormat.setInputPaths(job, inputFileName);
    Path p = new Path(outputFolderName);
    FileOutputFormat.setOutputPath(job, p);
    /* And finally, we submit the job; waitForCompletion submits it if needed and blocks until it finishes. */
    job.waitForCompletion(true);
}

Scenario – part V

private static List<ReplaceableItem> createSampleData(String fileName, String paragraphContent,
        int paragraphNumber, int numberOfWords, MyArray array) throws IOException {
    List<ReplaceableItem> sampleData = new ArrayList<ReplaceableItem>();
    // one item per paragraph; each top word becomes an attribute whose value is its count
    sampleData.add(new ReplaceableItem(fileName + "_Paragraf_" + paragraphNumber).withAttributes(
            new ReplaceableAttribute("Paper", fileName + ".pdf", true),
            new ReplaceableAttribute("Paragraph_Content", paragraphContent, true),
            new ReplaceableAttribute(array.getKey(0), String.valueOf(array.getNumberOfAppearances(0)), true),
            new ReplaceableAttribute(array.getKey(1), String.valueOf(array.getNumberOfAppearances(1)), true),
            new ReplaceableAttribute(array.getKey(2), String.valueOf(array.getNumberOfAppearances(2)), true),
            ….
    ));
    return sampleData;
}
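SimpleDB stores every attribute value as a string and compares values lexicographically, so a count stored with String.valueOf sorts "10" before "2". A common workaround is to zero-pad numbers to a fixed width before storing and querying them. The sketch below shows that idea together with the quoting of a select expression. It is an illustration under assumptions, not what the slides do: the class SimpleDbValues, its methods and the chosen width are hypothetical.

```java
import java.util.Locale;

final class SimpleDbValues {
    /** Zero-pads a non-negative count so that string order equals numeric order. */
    static String padCount(int count, int width) {
        if (count < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", count);
    }

    /** Quotes an attribute value with single quotes, doubling embedded quotes. */
    static String quoteValue(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Quotes a domain or attribute name with backticks, doubling embedded backticks. */
    static String quoteName(String name) {
        return "`" + name.replace("`", "``") + "`";
    }

    /** Builds a select expression for items whose count for word is at least minCount. */
    static String selectAtLeast(String domain, String word, int minCount, int width) {
        return "select * from " + quoteName(domain)
                + " where " + quoteName(word) + " >= " + quoteValue(padCount(minCount, width));
    }
}
```

The resulting string could be passed to a SelectRequest, provided the counts were written with the same padding width.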

Scenario – part V (Final Results)
[Screenshots: "Query" (conditions Web>'1', as='2') and "Query results"]
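How "best match" is decided is not spelled out in the slides. One plausible rule, assumed here only for illustration, is to score each paragraph by the sum of the stored counts of the requested keywords and to list the paragraphs with a positive score in descending order. The class KeywordRanking and the in-memory map standing in for SimpleDB items are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class KeywordRanking {
    /** Sums the stored counts of the given keywords for one paragraph. */
    static int score(Map<String, Integer> topWordCounts, List<String> keywords) {
        int total = 0;
        for (String keyword : keywords) {
            total += topWordCounts.getOrDefault(keyword, 0);
        }
        return total;
    }

    /** Returns the item names with a positive score, best match first; ties by item name. */
    static List<String> bestMatches(Map<String, Map<String, Integer>> items,
                                    List<String> keywords) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> item : items.entrySet()) {
            if (score(item.getValue(), keywords) > 0) {
                names.add(item.getKey());
            }
        }
        names.sort((a, b) -> {
            int byScore = Integer.compare(score(items.get(b), keywords),
                                          score(items.get(a), keywords));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return names;
    }
}
```

Here each map key plays the role of a SimpleDB item name such as a paper name followed by a paragraph number, and each inner map holds the top words stored as attributes of that item.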

Final Conclusions & Personal Opinion
• Amazon Web Services (AWS) is a collection of remote computing services that together make up a cloud computing platform.
• The importance and advantages of using cloud computing technology are proven in everyday practice.
• Amazon Simple Storage Service (S3)
  – Folder structure in buckets is not completely supported.
• Amazon Simple Queue Service (SQS)
  – Better suited to systems with a large number of sent messages.
• Amazon SimpleDB
  – Service limits (http://thecloudtutorial.com/amazonsimpledb.html).

Thank you!
