Pro Quo Books Book Cover Scanning System

Team Haplocyonopsis: Jeremy Leakakos, George Hodgson, Elizabeth Leib, Douglas Krofcheck, David McClelland. Presenters by slide: Jeremy 1-5, Liz 6-10, George 11-14, Dave 15-16, Doug 22-24, Jeremy 25-26.

Synopsis Who is Pro Quo Books? What is the problem we need to solve?
You'll have a list of 30 million books. You'll be given an image of a book, and you need to find out which of the 30 million it is... in 9 seconds. What are we doing about it? Why is this a difficult problem?
Pro Quo Books is an online book distributor that resells on e-commerce sites such as Amazon and Half.com. They build their inventory by purchasing books from other distributors and other sources. Before being added to the inventory, books must be identified so that they can be priced and sold correctly. Our system will let Pro Quo Books more easily process books without identifiable bar codes. Any such book will be weighed and scanned, then passed on to our system, which will take the scanned cover and the weight and try to determine which book it is.
This is a difficult problem because of the variance that can occur between pictures. Even if we already have a picture of a book in the database, there is no guarantee that the picture will be exactly the same when we see the book again. Even small differences, such as slightly different lighting or the book sitting in a different part of the frame, can make identification much harder.

Deliverables Book cover preprocessing library
Book cover identification library Test application
The high-level deliverables of our product are the preprocessing library, the identification library, and the test application.
Preprocessing library: used to pre-populate a database with extra information about the books in it (grayscale images, extracted features, etc.).
Identification library: used to actually identify a given book, using a combination of database filters (to make the dataset smaller) and identification algorithms (to pick out the book).
Test application: used to demo the libraries and calculate the metrics we need.

Requirements Elicitation
Initial sponsor meeting Conference calls Application prototype Performance requirements: 9 second time limit to identify a book; 20% identification rate with 90% confidence
Most of the requirements for this project are quite straightforward, so we did not have to spend as much time on requirements elicitation as we expected. We were able to hammer out the majority of the library requirements in the first few sponsor meetings. For the test application, we built an effective prototype that was close to what the sponsor wanted, and we got it done in a couple of weeks. (mention stable)

Technologies C# / Visual Studio MySQL AForge.NET
Open source image processing library
We need to interface with PQB’s C# and MySQL systems. AForge.NET is free (LGPL v3) and was the best option we found.

Technical Process Evolutionary delivery Incremental development
2-3 week increments Test app increment Prototype increment Product increment
Evolutionary delivery supports delivering portions of the software at different times. The core functionality should be known at the outset, while other details, such as the UI and specific algorithms, remain uncertain. We know we’re supposed to preprocess and find books, but we don’t know the complete internals.
2-3 week increments:
Test app: first pass at the demo application; basic UI created, semi-functional.
Prototype: at least 1 filter algorithm and 1 identification algorithm each increment.
Product: refinement of filters and algorithms.

Evolutionary Delivery
Concept – Pro Quo Books identifies a need for a software system to improve their book sorting processes.
Requirements Elicitation and Analysis – We communicate with the sponsors to learn about their needs. Possible problem areas are identified and discussed.
Core System and Architecture Design – The core library and testing application systems are designed.
Development – Based on the current designs and requirements, a new version of the system is created. Each new version should perform better and meet Pro Quo Books' needs better than the previous versions.
Delivery – The latest version is given to Pro Quo Books for review.
Feature Expansion – Based upon the initial requirements, new or more advanced features/filters are added for implementation in the next development phase.
Customer Feedback – We discuss the last delivery and the newly proposed features with the sponsors.
Redesign – Based on the customer feedback, the system is redesigned.
Final Version – When all requirements are met, the system is done.

Metrics Effort: Time tracking
Team and individual Progress: Average time taken to identify a book Progress: Number of true positives vs. number of false positives
Time tracking is required by the department. We track average identification time because of the required 9 second limit. (Discuss differences / why false positive is really bad)
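The two progress metrics named on this slide could be computed from test-run records roughly as follows. This is a minimal Python sketch for illustration, not the team's C# code; the `RunRecord` fields, the `summarize` function, and the idea of recording a declined identification as `None` are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunRecord:
    """One identification attempt (hypothetical record layout)."""
    true_id: str                 # the book that was actually scanned
    predicted_id: Optional[str]  # None means the library declined to answer
    seconds: float               # wall-clock time taken for this attempt


def summarize(records: List[RunRecord], time_limit: float = 9.0) -> dict:
    """Compute average time plus true/false positive counts."""
    total = len(records)
    answered = [r for r in records if r.predicted_id is not None]
    true_pos = sum(1 for r in answered if r.predicted_id == r.true_id)
    false_pos = len(answered) - true_pos
    return {
        "average_seconds": sum(r.seconds for r in records) / total if total else 0.0,
        "true_positives": true_pos,
        "false_positives": false_pos,
        # Share of all scans that were correctly identified.
        "identification_rate": true_pos / total if total else 0.0,
        # Share of the answers given that were correct.
        "precision": true_pos / len(answered) if answered else 0.0,
        "over_time_limit": sum(1 for r in records if r.seconds > time_limit),
    }
```

Keeping declined answers separate from wrong answers matters here because a false positive means a book gets priced and sold as the wrong title, while a declined answer only sends the book back for manual handling.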

Testing Strategy Test application Simulate production environment
Deliverable Testing platform Simulate production environment Input rate Time threshold Database size
One goal, once we have a larger database, is to keep a small test set for each filter that exercises that filter. We can’t test using a single image, because comparing an image to itself always matches.

Test App Metrics Graph Note super-small sample size (n=3)

Schedule This chart shows the hours that the team has spent per week, in blue. Pink shows the estimated hours per week. This shows that we’ve consistently over-estimated the amount of time required. Looking at week 8 you’ll see that we were closer that week than any other week. As we proceed we can see if week 8 was an anomaly or if our estimation is actually getting better.

Schedule Of the hours spent this quarter, we can see here the amount contributed by each of the five team members. A perfectly even distribution of work would require each member to contribute 20% of the total. Here we can see that we are very close to achieving this, with just 3% deviation in the worst cases. We must be careful to monitor this closely as we finish up winter quarter and head into spring quarter. While the small deviation is acceptable, we must make sure that the differences do not grow larger as the project continues.

Schedule Each line on this chart shows the actual number of hours reported by each team member so far. Ignoring Pink’s exemplary week in Week 3, the team members all seem to be contributing about the same number of hours each week. Also worth noting is that the number of hours spent per week has been trending upwards. This is attributed to the increase in momentum of the project since we’ve started receiving production data from our sponsor. (*maybe?*) The gentle prodding by Professor Kuehl helps too.

Schedule Here, again, we see the SE deliverables at the left, but this time with both the Pro Quo deliverables and our team increments to the right. Pro Quo deliverables represent functionality that will be delivered/presented to Pro Quo. Increments will be maintained in SVN, and while they will be discussed with Pro Quo, a formal delivery is not scheduled. Balancing the level of learning required for this project with the strict time constraints, we decided to schedule ourselves with several short increments. These increments come in three flavors: Demo Application, Prototype, and Product. The demo application increment consisted of our initial design and implementation of the driver program that will be used by Pro Quo to evaluate the DLL. The processing library itself remained stubbed out at this stage. Following that we have 3 prototype increments, the first of which we are in right now. Prototype increments consist of teammates implementing different image processing algorithms. While metrics will be collected regarding the accuracy and speed of each algorithm, these phases aren’t meant to deliver implementations that meet the performance goals outlined by Pro Quo. Having teammates work alone or in pairs allows the team as a whole to evaluate many different algorithms. At the conclusion of the third prototype increment we will analyze the results and select the algorithms that will be combined to form the final product. The product increments involve improving the implementation of the selected algorithms to address the quality goals specified by Pro Quo. This culminates in our delivery of the finished product to Pro Quo at the end of week 8 of Spring Quarter.

Risks Unfamiliar domain Inadequate testing material Miscommunication
Image processing Inadequate testing material Test database not large enough Miscommunication Misunderstanding of terminology
Most team members are completely new to the realm of image processing. A big risk of our project is not having a large enough database to run realistic performance tests on. Since we are uncertain about the details of some aspects of the system, our schedule may be poorly planned out (too much time spent on certain aspects, too little on others).

Design Pipe and filter architecture
Start with filters to whittle down the dataset Finish with identification algorithms
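The pipe and filter idea on this slide can be sketched as a chain of functions, each of which narrows the candidate list before a final identification step. The sketch below is illustrative Python rather than the team's C# library; `run_pipeline`, the two example filters, their tolerances, and the dictionary keys are all hypothetical.

```python
from typing import Callable, List, Optional

Book = dict  # hypothetical record with "id", "weight", and "aspect" keys
Filter = Callable[[dict, List[Book]], List[Book]]
Identifier = Callable[[dict, List[Book]], Optional[str]]


def run_pipeline(scan: dict, candidates: List[Book],
                 filters: List[Filter], identifier: Identifier) -> Optional[str]:
    """Apply each filter in turn, then hand the survivors to the identifier."""
    remaining = list(candidates)
    for narrow in filters:
        remaining = narrow(scan, remaining)
        if not remaining:
            return None  # nothing left to identify
    return identifier(scan, remaining)


def weight_filter(scan: dict, books: List[Book]) -> List[Book]:
    # Assumed tolerance of 20 weight units, chosen only for the example.
    return [b for b in books if abs(b["weight"] - scan["weight"]) <= 20]


def aspect_filter(scan: dict, books: List[Book]) -> List[Book]:
    # Assumed tolerance of 0.05 on the width/height ratio.
    return [b for b in books if abs(b["aspect"] - scan["aspect"]) <= 0.05]


def only_survivor(scan: dict, books: List[Book]) -> Optional[str]:
    # Placeholder identifier: answer only when exactly one candidate is left.
    return books[0]["id"] if len(books) == 1 else None
```

Because every filter shares one signature, filters can be added, removed, or reordered without touching the identifier, which suits increments that each contribute a new filter or algorithm.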

Sequence - Scanning Before our software can do anything, we must have a ScanBundle. ScanBundles are created by Pro Quo Books' conveyor belt system. Unidentified books are placed on a conveyor belt containing an imaging system. The book is weighed and its spine and cover are photographed. This data is put into a new ScanBundle, which is sent to our library for book identification. At most one new ScanBundle is created each second, and each book has nine seconds to be identified before reaching the end of the conveyor belt.

Sequence - Preprocessing
Fly Fishing In order to identify a book, one must do calculations on its images and compare them to known results. In the preprocessing phase, a ScanBundle's images are transformed into the data used by the image comparison filters. Such transformations include gray-scaling the image, edge detection, blurring, OCR, feature extraction, and calculating book dimensions. (be sure to mention filters / identifiers used)
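Two of the transformations listed above, gray-scaling and calculating dimensions, are simple enough to sketch directly. The Python below is illustrative only: the project itself uses C# with AForge.NET, and the nested-list image layout, the function names, and the choice of the common 0.299/0.587/0.114 luma weights are assumptions.

```python
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]  # rows of (r, g, b), each 0..255
GrayImage = List[List[int]]                  # rows of intensities, 0..255


def to_grayscale(pixels: RGBImage) -> GrayImage:
    """Convert each RGB pixel to one intensity using common luma weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


def aspect_ratio(pixels: List[list]) -> float:
    """Width divided by height of a non-empty rectangular image."""
    height = len(pixels)
    width = len(pixels[0])
    return width / height
```

Storing the grayscale image and the aspect ratio alongside each known book means this work is done once at preprocessing time rather than inside the 9 second identification window.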

Sequence – Database Once the ScanBundle's images have been preprocessed, the quantifiable values (such as weight and dimensions) are used to reduce the 30 million book dataset to a much more manageable size. (filters remove books from consideration, identifier selects 1 from the dataset)
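A reduction step of this kind could look like the following minimal Python sketch, which keeps only the books whose stored weight and aspect ratio fall within a relative tolerance of the scanned values. The function name, the dictionary keys, and the 5% and 3% tolerances are assumptions for illustration; in production this narrowing would more likely be expressed as a range query against the MySQL database.

```python
from typing import List


def filter_candidates(books: List[dict], scan_weight: float, scan_aspect: float,
                      weight_tol: float = 0.05, aspect_tol: float = 0.03) -> List[dict]:
    """Keep books whose weight and aspect ratio are close to the scan's values."""
    kept = []
    for book in books:
        # Tolerances are relative to the scanned measurement.
        weight_ok = abs(book["weight"] - scan_weight) <= weight_tol * scan_weight
        aspect_ok = abs(book["aspect"] - scan_aspect) <= aspect_tol * scan_aspect
        if weight_ok and aspect_ok:
            kept.append(book)
    return kept
```

Tighter tolerances shrink the candidate set faster but risk discarding the correct book when a scale or camera measurement is slightly off, so the values would need tuning against real production data.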

Sequence – Imaging 0.093 chance of matching
Iterating over the reduced data set, image comparisons are made between the preprocessed ScanBundle images and the known books' images. Books with a low matching probability are quickly thrown out while books with a high matching probability are queued for further comparisons.
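One simple way to score such a comparison is a pixel difference, which a later slide names as the identification method built so far. The Python sketch below is an illustration under assumptions, not the team's implementation: it works on equal-sized grayscale images stored as nested lists, maps the mean absolute difference to a score between 0 and 1, and uses a hypothetical `discard_below` threshold to decide which books stay in the queue.

```python
from typing import Dict, List, Tuple

GrayImage = List[List[int]]  # rows of intensities, 0..255


def pixel_similarity(a: GrayImage, b: GrayImage) -> float:
    """Return 1.0 for identical images, falling toward 0.0 as they differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return 1.0 - total / (255.0 * count)


def triage(scan: GrayImage, known: Dict[str, GrayImage],
           discard_below: float = 0.5) -> List[Tuple[float, str]]:
    """Drop low-scoring books and return the rest, best score first."""
    queue = []
    for book_id, image in known.items():
        score = pixel_similarity(scan, image)
        if score >= discard_below:
            queue.append((score, book_id))
    queue.sort(reverse=True)
    return queue
```

A raw pixel difference is sensitive to exactly the lighting and placement changes described in the synopsis, which is one reason it works better as a cheap first pass than as the only comparison.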

Sequence - Result Image comparisons are repeatedly made until a perfect match is found, the time limit is reached, or there is only one book with a possible match. When one of those conditions is reached, the book with the highest matching probability has its probability and book ID returned to the conveyor belt system. If two or more books are tied for the highest matching probability, then the library returns that it can't determine which book the ScanBundle represents. It is important to not make false positive identifications.
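The stopping conditions and the tie rule described on this slide can be written down as two small functions. This is a hedged Python sketch, not the library's actual interface: the names `should_stop` and `decide`, the dictionary of per-book probabilities, and the treatment of a probability of 1.0 as a perfect match are assumptions.

```python
from typing import Dict, Optional, Tuple


def should_stop(scores: Dict[str, float], elapsed_seconds: float,
                time_limit: float = 9.0) -> bool:
    """Stop on a perfect match, on timeout, or when at most one book remains."""
    perfect = any(p >= 1.0 for p in scores.values())
    return perfect or elapsed_seconds >= time_limit or len(scores) <= 1


def decide(scores: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (book_id, probability) for a unique best match, else None."""
    if not scores:
        return None
    best = max(scores.values())
    winners = [book_id for book_id, p in scores.items() if p == best]
    if len(winners) != 1:
        return None  # tie for the top score: decline rather than guess
    return winners[0], best
```

Returning `None` on a tie reflects the slide's point that declining to answer is preferable to a false positive. A production version would probably also want a minimum probability below which even a unique best match is rejected, but the slide does not state such a threshold.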

Progress so far… Test application
Multiple functional filters and identification algorithms Preprocessing algorithms Solid plan for spring quarter
Test application: showcases the libraries developed for the project and serves as a testing platform for the dev team. Functional filters: weight and aspect ratio. Identification: pixel difference. Preprocessing: sepia, grayscale, cropping, pixel difference. Good plan to continue into spring quarter.

Reflections The good stuff Communication Test application
Working filters and identification algorithms Well-defined scope
Communication between our faculty coaches, sponsors, and team members has been outstanding. With a weekly team meeting in addition to our weekly conference call, everyone knows where the project stands at all times. The test application provides a platform on which to test algorithm prototypes and gather metrics, and it demonstrates the library’s functionality. Although the focus of the prototype increments lay on developing proof-of-concept algorithms, the algorithms developed so far have succeeded beyond our original expectations.

Reflections The not so good stuff Time estimation Production problems
Our team has consistently predicted our weekly project times poorly. This can be attributed to the lack of a basis for estimating times. Entering spring quarter, the team can use its newfound knowledge of the domain to estimate time spent more accurately. The Pro Quo development team has run into delays getting a production camera set up on their assembly line. As a result, our team has combined existing images with images scraped from various online retail sites in an effort to create a dataset to test against. As this dataset is nowhere near the estimated final size of 30 million entries, the efficiency metrics gathered so far do not reflect real production values. (If PQB gets a camera before the presentation, mention that. If not, mention their current estimate)

Spring Quarter Prototype increments Product increments
Finish development

Questions?
