What's in the News? Web Scraping Technology as a Cost-Effective Solution for News Alerting. David A. Breiner and Raul Rodriguez-Esteban, Boehringer Ingelheim.

1 What's in the News? Web Scraping Technology as a Cost-Effective Solution for News Alerting
David A. Breiner and Raul Rodriguez-Esteban
Boehringer Ingelheim Pharmaceuticals, Inc., Ridgefield, CT, USA
SLA PHT 2013

2 Question #1: Who here is involved with news alerting activities in their jobs?

3 Topics
- About Us
- Project Background & Critical Issues
- Organizational Drivers
- Technical & Process Overviews
- Demonstration (pre-recorded)
- Continuing Challenges
- Lessons Learned
- User Feedback
- Next Steps
- Q&A / Discussion

4 About Us
Scientific Knowledge Discovery (SKD) is made up of Computational Biology professionals and Knowledge Management experts who support BI* Pharmaceutical Research and Corporate areas in the US by supplying relevant information and analysis. We focus our work on:
- Delivering data and information in a short timeframe
- Streamlining information gathering and processing through computational methods, including Text Mining
- Turning information into knowledge that drives impact
* BI will refer to Boehringer Ingelheim throughout this presentation

5 Project Background
The BI US Library has been involved with news alerting for ~20 years:
- 1990s: Library staff generated daily & weekly electronic newsletters on various therapeutic area & business topics.
- Early 2000s: Executives & Competitive Intelligence (CI) requested a more systematic, early-morning alerting of significant news; Code Red Alert was developed & managed by 1 Info Scientist (~1-1.5 hours per day manual curation time).
- Late 2000s: The service evolved to include many sources (fee & free) but was no longer as time-critical; executives were alerted by other routes; CI was no longer part of the workflow; the distribution list was broadened to include the Public Affairs & Communications group; work was distributed among 3 Library staff for various weekdays after the lead Info Scientist retired (~1-1.5 hours per day manual curation time); the service was renamed Daily News Brief in 2010.
A very valuable service…but extremely time-consuming

6 Vendor Products: Critical Issues
The search for a tool to assist in newsletter generation had been ongoing for many years; various vendor products* were tested & used, but none met all requirements for success:
- Duplication: Similar stories from various sources
- Timeliness: Sometimes a 24-hour delay was experienced
- Cost: Some aggregators required fees for each recipient in addition to the base annual subscription
- Other issues:
  - Some subscription sources had limited user access
  - Some products lacked focus on particular areas of interest to BI
  - Implementation was always more challenging than anticipated
  - Technical issues usually required much interaction with vendors
No significant time savings realized (~1+ hour per day curation)
* No names will be disclosed

7 2010: First In-House Tool Developed
- Strengths: Fast & free to build; simple to maintain (HTML page with links); customizable; comprehensive coverage
- Weaknesses: No newsletter-generating tool; much manual scanning of many websites; required much manual curation (i.e. copying/pasting/formatting into template); duplication among sources
Source categories: Global News (Fee & Free), Press Releases, Blogs, Local News, BI News
After 2 years, still no significant time savings (~1+ hour per day curation)

8 2011: Organizational Drivers
- Major departmental reorganization in Q
- Limited staff to support news monitoring; needed to significantly reduce time spent on Daily News Brief
- Unsuccessful paid trial with vendor product
- New management prefers automated computational methods over manual processes
- Clients desire human filtering due to their lack of time
A perfect storm

9 Q1 2012: Daily News Brief Re-launched
- Provide a daily morning snapshot of BI and pharmaceutical industry headlines with a US focus
- Minimize curation time to under 30 minutes per day
- Leverage internal expertise in Web Scraping
- Utilize cost-effective news sources whenever possible:
  - BI Press Releases (US & DE)
  - Google News
  - Yahoo News
  - Elsevier Business Intelligence *global subscription
  - FirstWord
  - FiercePharma / FierceBiotech
  - Reuters
  - Bloomberg
  - Medical Marketing & Media

10 Question #2: Who here knows what Web Scraping is?

11 Typical Content
- BI press releases & major news on all BI marketed Pharma products
- BI & subsidiaries (Vetmedica, Roxane Labs, Ben Venue Labs, Bedford Labs) in major & local news sources
- Competitor products: Phase 3 trial announcements, major published trial studies, approvals, launches
- Major Competitor, FDA, & Conference announcements
- Pharma & Healthcare industry trends
GOAL: Select & distribute ~12 relevant news items each business day before 8:00 am ET

12 Technical Overview
Inputs: RSS Feeds, News Websites, Newsletters
Workflow:
1) Web crawling agent (cURL)
2) Parse news items & components
3) Filter (Relevancy & Minimum date)
4) Standardized display for selection
5) Manual selection / curation
6) Output presentation (HTML)
Features:
- Real Time: Sources gathered on the fly
- Multiple input formats: Manages RSS feeds, news websites, online newsletters
- Handles password-protected sites: Automatic login
- Uses lightweight code: Adaptable script language (Perl)
- Copyright compliant: Only scraping/extracting content that is free or globally licensed by BI
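The parse and filter stages of this workflow can be illustrated with a minimal sketch. It is written in Python rather than the Perl and cURL the presenters used, and it is not their code: the sample feed, the keyword list, and the function names `parse_items` and `filter_items` are all hypothetical. It assumes an RSS 2.0 input and treats "relevancy" as a simple keyword match on the headline, which may well be cruder than the real filter.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; the real tool gathered live sources on the fly.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example Pharma News</title>
<item><title>FDA approves new anticoagulant</title>
<link>https://example.com/a</link>
<pubDate>Tue, 04 Jun 2013 06:30:00 GMT</pubDate></item>
<item><title>Local bakery wins award</title>
<link>https://example.com/b</link>
<pubDate>Tue, 04 Jun 2013 07:00:00 GMT</pubDate></item>
<item><title>Phase 3 trial results announced</title>
<link>https://example.com/c</link>
<pubDate>Sat, 01 Jun 2013 09:00:00 GMT</pubDate></item>
</channel></rss>"""

# Hypothetical relevancy terms, matched case-insensitively against titles.
KEYWORDS = ("fda", "phase 3", "approval", "launch")


def parse_items(rss_text):
    """Parse each <item> into a dict of title, link, and publication date."""
    root = ET.fromstring(rss_text)
    items = []
    for node in root.iter("item"):
        items.append({
            "title": node.findtext("title", default="").strip(),
            "link": node.findtext("link", default="").strip(),
            "published": parsedate_to_datetime(node.findtext("pubDate")),
        })
    return items


def filter_items(items, now, max_age_days, keywords=KEYWORDS):
    """Keep items no older than max_age_days whose title mentions a keyword."""
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for item in items:
        title = item["title"].lower()
        if item["published"] >= cutoff and any(k in title for k in keywords):
            kept.append(item)
    return kept


# Fixed "now" so the example is reproducible; a live run would use the current time.
now = datetime(2013, 6, 4, 12, 0, tzinfo=timezone.utc)
recent = filter_items(parse_items(SAMPLE_RSS), now, max_age_days=1)
for item in recent:
    print(item["published"].date(), item["title"])
```

The `max_age_days` argument plays the role of the "number of days to review" that the curator enters; everything that passes the filter would then go to the standardized display for manual selection.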

13 Technical Overview: Perl Scripting

14 Process Overview: Curation
1) Log in to the DNB interface on the internal BI server; enter the number of days to review
2) Select categories for news items to include using drop-down menus
3) Select SUBMIT to publish all selected news items to the HTML output file

15 Process Overview: Publishing & Distribution
4) Copy HTML output
5) Paste into , edit, & distribute
~15 minutes from start to finish!
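The step that publishes selected items to an HTML output file could look roughly like the following. This is a Python sketch under stated assumptions, not the presenters' Perl script: the `selected` data, the category names, and the `render_brief` function are hypothetical, and the real output template is not shown in the slides.

```python
from html import escape

# Hypothetical curated selections: (category, title, link).
selected = [
    ("BI News", "BI announces new research site", "https://example.com/1"),
    ("Competitors", "Rival <Co> launches product & service", "https://example.com/2"),
    ("BI News", "BI press release on trial data", "https://example.com/3"),
]


def render_brief(items, heading="Daily News Brief"):
    """Group items by category (first-seen order) and return an HTML fragment."""
    by_category = {}
    for category, title, link in items:
        by_category.setdefault(category, []).append((title, link))

    parts = [f"<h1>{escape(heading)}</h1>"]
    for category, entries in by_category.items():
        parts.append(f"<h2>{escape(category)}</h2>")
        parts.append("<ul>")
        for title, link in entries:
            # Escape both the URL and the title so scraped text cannot break the markup.
            parts.append(f'  <li><a href="{escape(link)}">{escape(title)}</a></li>')
        parts.append("</ul>")
    return "\n".join(parts)


html_output = render_brief(selected)
print(html_output)
```

Escaping matters here because headlines come from scraped pages and may contain characters such as `<` or `&` that would otherwise corrupt the newsletter when it is pasted and distributed.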

16 Demonstration: BI DAILY NEWS BRIEF (2 minutes)

17 Continuing Challenges
- Duplication among sources, especially between Google News & Yahoo News
- Some sources don't always scrape properly, requiring minor edits before distribution
- Technical changes on source websites can affect results
- BI still running IE7; migrating to IE9 in 2013
- Keeping it simple for us & our clients, i.e. Daily News BRIEF, not Daily News OVERLOAD
Stay tuned…
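One possible way to reduce duplication among sources is to compare normalized headlines and keep only the first of each near-identical group. The sketch below is an illustration in Python, not something the presenters describe having built: the `normalize` and `dedupe` functions, the 0.9 similarity threshold, and the sample headlines are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Drop a trailing ' - Source' or ' | Source' suffix, lowercase, strip punctuation."""
    title = re.sub(r"\s+[-|]\s+[^-|]+$", "", title)  # e.g. " - Reuters"
    title = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(title.split())


def dedupe(titles, threshold=0.9):
    """Keep the first headline of any group whose normalized text is near-identical."""
    kept, kept_norm = [], []
    for title in titles:
        norm = normalize(title)
        # Compare against every headline already kept; skip if one is similar enough.
        if any(SequenceMatcher(None, norm, seen).ratio() >= threshold
               for seen in kept_norm):
            continue
        kept.append(title)
        kept_norm.append(norm)
    return kept


# Hypothetical headlines as two aggregators might report the same story.
headlines = [
    "FDA approves new anticoagulant - Reuters",
    "FDA Approves New Anticoagulant | Yahoo News",
    "Phase 3 trial results announced",
]
unique = dedupe(headlines)
for title in unique:
    print(title)
```

The threshold is a trade-off: a lower value removes more rewritten versions of the same story but risks merging distinct items, so a human check before distribution would still be useful. The suffix rule is also a heuristic and would truncate a genuine headline that ends in " - something".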

18 Lessons Learned
- Have a focused objective (i.e. a snapshot instead of all news for everyone)
- Look within your organization first for expertise before looking externally
- Change is inevitable; accept it as an opportunity
- Regularly seek out user feedback (see next slide)
To eat an elephant, you must take one bite at a time

19 User Feedback
- "The Daily News Brief has become my primary source of competitive & marketplace information. Outstanding!" (from an Executive Director in Marketing)
- "I read the DNB every morning. I prefer the current format to the previous one; it's succinct & provides a good overview of top industry stories that I can view on my Blackberry." (from a Director in Public Affairs & Communications)
- "I really like the new simplified look of the Daily News Brief, especially the clean lines and simplicity! Nice work!" (from an Associate Director in Public Affairs & Communications)
- "I really enjoy reading the Daily News Brief. It helps me to prepare for my day." (from an Associate Director in Business Intelligence)

20 Next Steps
Currently underway:
- Use underlying code to develop news interfaces for monitoring other domains of interest (e.g. Therapeutic Areas, BI Products)
- Expand distribution list to include more senior-level management in US (currently ~125 recipients)
- Develop RSS feed for internal portals (recently completed)
- Attempt to remove duplication among sources wherever possible
- Explore options for delivery to mobile platforms

21 Acknowledgements
- Dr. Raul Rodriguez-Esteban
- Dr. Will Loging
- Amy Shortlidge-Cox
- Yirong Wang

22 Thank You
David A. Breiner, MS LinkedIn:
Raul Rodriguez-Esteban, PhD LinkedIn: Now at Roche in Basel
Boehringer Ingelheim Pharmaceuticals, Inc., Scientific Knowledge Discovery, Ridgefield, Connecticut, USA

23 Q&A / Discussion: What are your companies doing for News Alerting? Please share!

24 What's in the News? Web Scraping Technology as a Cost-Effective Solution for News Alerting
David A. Breiner and Raul Rodriguez-Esteban
Boehringer Ingelheim Pharmaceuticals, Inc., Ridgefield, CT, USA
SLA PHT 2013
