PDF Association Technical Conference, June 18-19, 2013
PDF and Microsoft SharePoint: Hurdles to Overcome
Neil Pitman, Aquaforest Limited
Version 1.120613
Objective
PDF as a SharePoint “First Class Citizen”
Agenda
Objectives
SharePoint Overview
PDF Capture
PDF Search
iFilters
Handling Image and Mixed-Mode PDFs
PDF Metadata: Dictionary, XMP and Entity Extraction
Configuration: SharePoint 2010, 2013
Summary
SharePoint Overview
Microsoft SharePoint Server: 125 million licenses sold
SharePoint to be a natural target for PDF storage
What is SharePoint? On-Premise and Cloud-based Collaboration & Document Management Platform
Origin: 2001
Usage: Focus on MS Office Documents, Typically distributed capture
SharePoint Overview
SharePoint Editions (2010, 2013):
Foundation
Standard
Enterprise
Office 365 / SharePoint Online
Ecosystem: Partner Products, Office / SharePoint Marketplace
SharePoint Architecture Overview
MS Web-based (IIS)
MS Office Integration
SQL Server Storage
List or library data in a site collection is stored in a SQL Server database table, which uses queries, indexes and locks to maintain overall performance, sharing, and accuracy.
Filtered views with column indexes (and other operations) create database queries that identify a subset of columns and rows and return this subset to your computer.
Thresholds and limits help throttle operations and balance resources for many simultaneous users.
Privileged developers can use object model overrides to temporarily increase thresholds and limits for custom applications.
Administrators can specify dedicated time windows for all users to do unlimited operations during off-peak hours.
Information workers can use appropriate views, styles, and page limits to speed up the display of data on the page.
Microsoft Technology Stack
Windows Server 2008/2012
Internet Information Services (IIS)
.NET Framework
SQL Server
MS Office
PDF Capture for SharePoint
Options:
SharePoint UI
Acrobat XI
Load Tools
Custom Code
Workflow & Event Receivers
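// Upload a file to a SharePoint library URL with an HTTP PUT.
// destUrl (target document URL) and fileBytes (file content) are supplied by the caller.
WebRequest request = WebRequest.Create(destUrl);
request.Credentials = CredentialCache.DefaultCredentials;
request.Method = "PUT";
byte[] buffer = new byte[1024];
using (Stream stream = request.GetRequestStream())
using (MemoryStream ms = new MemoryStream(fileBytes))
{
    // Copy the file content to the request stream in 1 KB blocks.
    for (int i = ms.Read(buffer, 0, buffer.Length); i > 0; i = ms.Read(buffer, 0, buffer.Length))
    {
        stream.Write(buffer, 0, i);
    }
}
// Sending the request completes the upload; a failure surfaces as an exception here.
WebResponse response = request.GetResponse();
response.Close();
Logging.Log("Upload successful");
Acrobat XI SharePoint Integration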
http://www.adobe.com/uk/products/acrobat/pdf-version-control-sharepoint-integration.html
PDF Search in SharePoint Overview
iFilters scan documents for text and attributes – primarily in support of Microsoft Search technologies.
iFilter Architecture
iFilter Configuration
PDF Search in SharePoint: iFilters
Architecture
Code Sample
Suppliers
Issues
iFilter Explorer
https://gist.github.com/jimschubert/1473904
Using iFilters directly in Code
StringBuilder Buffer = new StringBuilder();
string PDFFile = @"C:\dev\PDF Conference\s.pdf";
FilterCode f = new FilterCode();
f.GetTextFromDocument(PDFFile, ref Buffer);
Console.WriteLine(Buffer);

[DllImport("query.dll", SetLastError = true, CharSet = CharSet.Unicode)]
static extern int LoadIFilter(string pwcsPath, [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter, ref IFilter ppIUnk);

public void GetTextFromDocument(string Path, ref StringBuilder Buffer)
{
    IFilter filter = null;
    int hresult;
    IFilterReturnCodes rtn;

    // Initialize the return buffer to 64K.
    Buffer = new StringBuilder(64 * 1024);

    // Try to load the filter for the path given.
    hresult = LoadIFilter(Path, new IntPtr(0), ref filter);
    if (hresult == 0)
    {
        IFILTER_FLAGS uflags;

        // Init the filter provider.
        rtn = filter.Init(
            IFILTER_INIT.IFILTER_INIT_CANON_PARAGRAPHS |
            IFILTER_INIT.IFILTER_INIT_CANON_HYPHENS |
            IFILTER_INIT.IFILTER_INIT_CANON_SPACES |
            IFILTER_INIT.IFILTER_INIT_APPLY_INDEX_ATTRIBUTES |
            IFILTER_INIT.IFILTER_INIT_INDEXING_ONLY,
            0, new IntPtr(0), out uflags);
        if (rtn == IFilterReturnCodes.S_OK)
        {
            STAT_CHUNK statChunk;
iFilter Test
Test PDF content:
Text
Image/OCR Text
Bookmark
Annotation
Dictionary Metadata
XMP Metadata
PDF Attachment
iFilter Test Results
Filters compared: Adobe iFilter, PDFlib iFilter, Foxit iFilter, Microsoft Format Handler
Content tested: Body Text, Bookmarks, Dictionary Metadata, Annotations, XMP Metadata, PDF Attachment
Dealing with Image and Mixed-Mode PDFs
Classify:
Image-Only
Born-Digital
Part Image-Only, Part Born-Digital
Previously OCRed
Objectives:
Ensure Full Searchability
Avoid Text to Image Processing
Dealing with Image and Mixed-Mode PDFs
Process:
Capture Time?
Scheduled In-Place?
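The classification step can be sketched in C# as a small decision function. This is only an illustration under stated assumptions: PageInfo, PdfKind and PdfClassifier are hypothetical names introduced here, and the two per-page flags would have to be filled in by whatever PDF library is available. The sketch covers the decision logic only, not the PDF parsing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-page summary; a real pipeline would populate it
// by inspecting each page with a PDF library.
public sealed class PageInfo
{
    public bool HasTextLayer { get; }
    public bool HasFullPageImage { get; }

    public PageInfo(bool hasTextLayer, bool hasFullPageImage)
    {
        HasTextLayer = hasTextLayer;
        HasFullPageImage = hasFullPageImage;
    }
}

// The four classes named on the slide.
public enum PdfKind { ImageOnly, BornDigital, Mixed, PreviouslyOcred }

public static class PdfClassifier
{
    public static PdfKind Classify(IReadOnlyList<PageInfo> pages)
    {
        if (pages == null || pages.Count == 0)
            throw new ArgumentException("At least one page is required.", nameof(pages));

        // Every page is a scan that already carries a text layer: OCR was run earlier.
        if (pages.All(p => p.HasFullPageImage && p.HasTextLayer))
            return PdfKind.PreviouslyOcred;

        // No page has any text: nothing is searchable until OCR is run.
        if (pages.All(p => !p.HasTextLayer))
            return PdfKind.ImageOnly;

        // Every page has text (some may be OCRed scans): already fully searchable.
        if (pages.All(p => p.HasTextLayer))
            return PdfKind.BornDigital;

        // Some pages have text and some do not.
        return PdfKind.Mixed;
    }

    // Only documents with at least one text-free page need OCR. Skipping the
    // rest avoids converting existing text to images and back again.
    public static bool NeedsOcr(PdfKind kind)
    {
        return kind == PdfKind.ImageOnly || kind == PdfKind.Mixed;
    }
}
```

For a mixed document, a page-level check on HasTextLayer would let OCR be limited to the image-only pages, which serves both objectives above.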
PDF Metadata in SharePoint
Review Requirements:
Text Search vs Metadata Search
Crawled vs Managed Properties
Consider Automation:
Dictionary Metadata
XMP Metadata
Entity Extraction
PDF Metadata in SharePoint
Crawled vs Managed Properties
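As an illustration of automating XMP metadata capture, the sketch below pulls the first XMP packet out of a PDF's raw bytes by looking for the xpacket markers. XmpPacketReader and ExtractXmpPacket are hypothetical names introduced here. The approach assumes the metadata stream is stored unfiltered (uncompressed and unencrypted), which is common but not guaranteed; when that does not hold, the method returns null and a full PDF parser would be needed.

```csharp
using System;
using System.Text;

public static class XmpPacketReader
{
    // Returns the first XMP packet found in the file bytes, or null if none is found.
    // Assumption: the packet is stored as plain, unfiltered bytes in the file.
    public static string ExtractXmpPacket(byte[] pdfBytes)
    {
        if (pdfBytes == null) throw new ArgumentNullException(nameof(pdfBytes));

        // ISO-8859-1 maps every byte to exactly one char, so string offsets
        // found below are also valid byte offsets into pdfBytes.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string raw = latin1.GetString(pdfBytes);

        int start = raw.IndexOf("<?xpacket begin", StringComparison.Ordinal);
        if (start < 0) return null;

        int endMarker = raw.IndexOf("<?xpacket end", start, StringComparison.Ordinal);
        if (endMarker < 0) return null;

        // The packet finishes at the "?>" that closes the end marker.
        int close = raw.IndexOf("?>", endMarker, StringComparison.Ordinal);
        if (close < 0) return null;

        int length = close + 2 - start;

        // Decode the packet as UTF-8, the usual encoding for XMP embedded in PDF.
        return Encoding.UTF8.GetString(pdfBytes, start, length);
    }
}
```

The returned string is an XML document, so individual properties could then be read with a standard XML parser and written to SharePoint columns, for example from an event receiver.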
PDF Metadata in SharePoint: Using Event Receivers
Event Receivers can enable metadata assignment
PDF Metadata in SharePoint
Entity Extraction
Configuration
SharePoint 2010
SharePoint 2013
SharePoint 2010 PDF Configuration
Missing icon and iFilter
http://www.adobe.com/devnet-docs/acrobatetk/tools/AdminGuide/Acrobat_Reader_IFilter_configuration.pdf
SharePoint 2010 PDF Configuration
Default for PDF: 'X-Download-Options: noopen' added to the HTTP response header
SharePoint PDF Configuration
SharePoint 2013 and PDF Configuration
PDF Format Handler Support
Currently no iFilter Support for PDF!?
SharePoint 2013 and PDF Configuration
Inline PDF Viewing in SharePoint 2013
http://stevemannspath.blogspot.co.uk/2012/10/sharepoint-2013-pdf-preview-in-search.html
http://stevemannspath.blogspot.co.uk/2013/04/sharepoint-2013-pdf-support-and.html
Summary
Microsoft SharePoint Server: 125 million licenses sold
SharePoint to be a natural target for PDF storage
PDF as a SharePoint “First Class Citizen”
Contact : neil.pitman@aquaforest.com