
PDF Association Technical Conference, June 18-19, 2013

PDF and Microsoft Sharepoint Hurdles to Overcome

Neil Pitman, Aquaforest Limited

Version 1.120613

Objective

- Microsoft Sharepoint Server - 125 million licenses sold
- Sharepoint to be a natural target for PDF storage
- PDF as a Sharepoint “First Class Citizen”

Agenda

- Objectives
- Sharepoint Overview
- PDF Capture
- PDF Search
  - iFilters
  - Handling Image and Mixed Mode PDFs
- PDF Metadata
  - Dictionary, XMP and Entity Extraction
- Configuration
  - Sharepoint 2010, 2013
- Summary

Sharepoint Overview

- What is Sharepoint?
  - On-Premise and Cloud-based Collaboration & Document Management Platform
- Origin - 2001
- Usage
  - Focus on MS Office Documents
  - Typically distributed capture
- Sharepoint Editions (2010, 2013)
  - Foundation
  - Standard
  - Enterprise
- Office 365 / Sharepoint Online
- Ecosystem
  - Partner Products
  - Office / Sharepoint Marketplace

Sharepoint Architecture Overview

- MS Web-based (IIS)
- MS Office Integration
- SQL Server Storage
  - List or library data in a site collection is stored in a SQL Server database table, which uses queries, indexes and locks to maintain overall performance, sharing, and accuracy.
  - Filtered views with column indexes (and other operations) create database queries that identify a subset of columns and rows and return this subset to your computer.
  - Thresholds and limits help throttle operations and balance resources for many simultaneous users.
  - Privileged developers can use object model overrides to temporarily increase thresholds and limits for custom applications.
  - Administrators can specify dedicated time windows for all users to do unlimited operations during off-peak hours.
  - Information workers can use appropriate views, styles, and page limits to speed up the display of data on the page.

Microsoft Technology Stack

- Windows Server 2008/12
- Internet Information Server (IIS)
- .Net Framework
- SQL Server
- MS Office

PDF Capture for Sharepoint

- Options
  - Sharepoint UI
  - Acrobat XI
  - Load Tools
  - Custom Code
  - Workflow & Event Receivers

    using System.IO;
    using System.Net;

    // Upload a file to a Sharepoint document library with an HTTP PUT.
    // destUrl is the full URL of the target file; fileBytes holds its content.
    WebRequest request = WebRequest.Create(destUrl);
    request.Credentials = CredentialCache.DefaultCredentials;
    request.Method = "PUT";

    // Write the file content into the request body.
    using (Stream stream = request.GetRequestStream())
    {
        stream.Write(fileBytes, 0, fileBytes.Length);
    }

    // GetResponse throws a WebException if the server rejects the upload.
    using (WebResponse response = request.GetResponse())
    {
        Logging.Log("Upload successful");
    }

Acrobat XI Sharepoint Integration

http://www.adobe.com/uk/products/acrobat/pdf-version-control-sharepoint-integration.html

PDF Search in Sharepoint Overview

iFilters scan documents for text and attributes, primarily in support of Microsoft Search technologies.

PDF Search in Sharepoint : iFilters

- Architecture
- Code Sample
- Suppliers
- Issues

iFilter Architecture

iFilter Configuration

iFilter Explorer

https://gist.github.com/jimschubert/1473904

Using iFilters directly in Code

    // Caller: extract the text of a PDF through the iFilter registered for it.
    StringBuilder Buffer = new StringBuilder();
    string PDFFile = @"C:\dev\PDF Conference\s.pdf";
    FilterCode f = new FilterCode();
    f.GetTextFromDocument(PDFFile, ref Buffer);
    Console.WriteLine(Buffer);

    // P/Invoke declaration for LoadIFilter in query.dll.
    [DllImport("query.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern int LoadIFilter(string pwcsPath,
        [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
        ref IFilter ppIUnk);

    public void GetTextFromDocument(string Path, ref StringBuilder Buffer)
    {
        IFilter filter = null;
        int hresult;
        IFilterReturnCodes rtn;

        // Initialize the return buffer to 64K.
        Buffer = new StringBuilder(64 * 1024);

        // Try to load the filter for the path given.
        hresult = LoadIFilter(Path, new IntPtr(0), ref filter);
        if (hresult == 0)
        {
            IFILTER_FLAGS uflags;

            // Init the filter provider.
            rtn = filter.Init(
                IFILTER_INIT.IFILTER_INIT_CANON_PARAGRAPHS |
                IFILTER_INIT.IFILTER_INIT_CANON_HYPHENS |
                IFILTER_INIT.IFILTER_INIT_CANON_SPACES |
                IFILTER_INIT.IFILTER_INIT_APPLY_INDEX_ATTRIBUTES |
                IFILTER_INIT.IFILTER_INIT_INDEXING_ONLY,
                0, new IntPtr(0), out uflags);
            if (rtn == IFilterReturnCodes.S_OK)
            {
                STAT_CHUNK statChunk;

iFilter Test

[Figure: test PDF containing a bookmark, a PDF attachment, XMP metadata, text, image/OCR text, dictionary metadata and an annotation]

iFilter Test Results

[Table: Adobe iFilter, PDFLib iFilter, FoxIt iFilter and Microsoft Format Handler compared on Body Text, Bookmarks, Dictionary Metadata, Annotations, XMP Metadata and PDF Attachment]

Dealing with Image and Mixed-Mode PDFs

- Objectives:
  - Ensure Full Searchability
  - Avoid Text to Image Processing
- Process:
  - Classify:
    - Image-Only
    - Born-Digital
    - Part Image-Only, Part Born-Digital
    - Previously OCRed
  - Capture Time?
  - Scheduled In-Place?
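The classification step can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: `PdfKind`, `PageInfo` and `PdfClassifier` are hypothetical names, and the three per-page flags are assumed to have been gathered beforehand with a PDF library of your choice. The idea is that only pages carrying an image and no text of any kind need OCR, so born-digital and previously OCRed pages are left alone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical categories mirroring the "Classify" step.
public enum PdfKind { ImageOnly, BornDigital, Mixed, PreviouslyOcred }

// Hypothetical per-page summary; a PDF library would have to supply these flags.
public sealed class PageInfo
{
    public bool HasVisibleText;    // text drawn normally (born-digital content)
    public bool HasInvisibleText;  // hidden text layer, as typically left by OCR
    public bool HasImage;          // page carries a scanned image
}

public static class PdfClassifier
{
    // A page needs OCR when it has an image but no text of any kind.
    public static bool NeedsOcr(PageInfo p)
    {
        return p.HasImage && !p.HasVisibleText && !p.HasInvisibleText;
    }

    public static PdfKind Classify(IList<PageInfo> pages)
    {
        int needOcr = pages.Count(p => NeedsOcr(p));
        if (needOcr == 0)
        {
            // Everything is already searchable; report whether OCR was involved.
            bool hasOcrLayer = pages.Any(p => p.HasImage && p.HasInvisibleText);
            return hasOcrLayer ? PdfKind.PreviouslyOcred : PdfKind.BornDigital;
        }
        return needOcr == pages.Count ? PdfKind.ImageOnly : PdfKind.Mixed;
    }

    // Zero-based indexes of the pages that still need OCR.
    public static List<int> PagesToOcr(IList<PageInfo> pages)
    {
        return Enumerable.Range(0, pages.Count)
                         .Where(i => NeedsOcr(pages[i]))
                         .ToList();
    }
}
```

For a mixed-mode file, `PagesToOcr` returns only the image-only pages, which keeps existing text out of any text-to-image round trip.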

PDF Metadata In Sharepoint

- Text Search vs Metadata Search
- Crawled vs Managed Properties
- Review Requirements
  - Dictionary Metadata
  - XMP Metadata
  - Entity Extraction
- Consider Automation
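As an illustration of reading XMP metadata without a PDF library, the sketch below scans the raw file bytes for the XMP packet wrapper (`<?xpacket begin` ... `<?xpacket end`). It assumes the metadata stream is stored unfiltered, which is common but not guaranteed; a compressed or encrypted stream will simply not be found. `XmpSketch` and `ExtractXmpPacket` are hypothetical names.

```csharp
using System;
using System.Text;

public static class XmpSketch
{
    // Returns the first XMP packet found in the file, or null if none is visible.
    public static string ExtractXmpPacket(byte[] pdfBytes)
    {
        // ISO-8859-1 maps every byte to exactly one char, so string offsets
        // equal byte offsets in the original file.
        string raw = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes);

        int start = raw.IndexOf("<?xpacket begin", StringComparison.Ordinal);
        if (start < 0) return null;

        int endMarker = raw.IndexOf("<?xpacket end", start, StringComparison.Ordinal);
        if (endMarker < 0) return null;

        // The packet ends with the "?>" that closes the trailing processing instruction.
        int close = raw.IndexOf("?>", endMarker, StringComparison.Ordinal);
        if (close < 0) return null;

        // Assumption: the packet is UTF-8 encoded, the usual case in PDF files.
        return Encoding.UTF8.GetString(pdfBytes, start, close + 2 - start);
    }
}
```

The returned string is XML and can be loaded into an XML parser to pick out individual properties for mapping onto Sharepoint columns.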

PDF Metadata In Sharepoint : Crawled vs Managed Properties

PDF Metadata In Sharepoint : Using Event Receivers

- Event Receivers can enable Metadata assignment
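A minimal sketch of such an event receiver is shown below. It is an illustration only: `PdfMetadataReceiver` is a hypothetical class, the title rule (file name without extension) is a placeholder for real PDF metadata extraction, and the code compiles only against Microsoft.SharePoint.dll and runs only once deployed to a farm and bound to a document library.

```csharp
using System;
using System.IO;
using Microsoft.SharePoint;

// Hypothetical receiver that fills in the Title column for uploaded PDFs.
public class PdfMetadataReceiver : SPItemEventReceiver
{
    public override void ItemAdded(SPItemEventProperties properties)
    {
        base.ItemAdded(properties);

        SPListItem item = properties.ListItem;
        if (item == null || item.File == null) return;
        if (!item.File.Name.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase)) return;

        // Placeholder rule: derive the title from the file name. A real receiver
        // would read the PDF content (item.File.OpenBinary()) and extract
        // dictionary or XMP metadata instead.
        string title = Path.GetFileNameWithoutExtension(item.File.Name);

        // Suspend event firing so the update does not trigger further receivers.
        EventFiringEnabled = false;
        try
        {
            item["Title"] = title;
            item.SystemUpdate(false); // no new version, no change to Modified
        }
        finally
        {
            EventFiringEnabled = true;
        }
    }
}
```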

PDF Metadata In Sharepoint : Entity Extraction
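The idea behind dictionary-based entity extraction can be sketched in a few lines: scan the document text for terms from a known list and record which ones occur, so they can be stored as metadata. This is a simplified stand-in, not Sharepoint's own extraction pipeline; `EntitySketch`, `FindEntities` and the sample dictionary are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntitySketch
{
    // Returns the dictionary terms that occur in the text as whole words,
    // ignoring case. Terms that start or end with punctuation would need
    // different boundary handling than \b.
    public static List<string> FindEntities(string text, IEnumerable<string> dictionary)
    {
        var found = new List<string>();
        foreach (string term in dictionary)
        {
            string pattern = @"\b" + Regex.Escape(term) + @"\b";
            if (Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase))
            {
                found.Add(term);
            }
        }
        return found;
    }
}
```

For example, calling `FindEntities` on text extracted by an iFilter with a dictionary of company names yields the companies mentioned in the document, which could then populate a managed property.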

Configuration

- Sharepoint 2010
- Sharepoint 2013

Sharepoint 2010 PDF Configuration

- Missing icon and iFilter

http://www.adobe.com/devnet-docs/acrobatetk/tools/AdminGuide/Acrobat_Reader_IFilter_configuration.pdf

- Default for PDF: "X-Download-Options: noopen" added to HTTP Response Header
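One common way to stop Sharepoint 2010 adding this header for PDFs is to add `application/pdf` to the web application's list of MIME types allowed to open inline. The sketch below uses the server object model; `PdfInlineConfig` and `AllowInlinePdf` are hypothetical names, and the code needs Microsoft.SharePoint.dll and farm administrator rights on a Sharepoint server, so treat it as a configuration fragment to adapt rather than a drop-in tool.

```csharp
using System;
using Microsoft.SharePoint.Administration;

public static class PdfInlineConfig
{
    // Allows PDFs to open in the browser for the given web application
    // while leaving Browser File Handling at its default (Strict).
    public static void AllowInlinePdf(Uri webAppUrl)
    {
        SPWebApplication webApp = SPWebApplication.Lookup(webAppUrl);

        if (!webApp.AllowedInlineDownloadedMimeTypes.Contains("application/pdf"))
        {
            webApp.AllowedInlineDownloadedMimeTypes.Add("application/pdf");
            webApp.Update(); // persist the change to the farm configuration
        }
    }
}
```

The alternative is switching Browser File Handling to Permissive in Central Administration, which relaxes the restriction for every file type rather than PDF alone.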

Sharepoint 2013 and PDF Configuration

- PDF Format Handler Support
- Currently no iFilter Support for PDF !?!?!!

Inline Viewing PDF in Sharepoint 2013

http://stevemannspath.blogspot.co.uk/2012/10/sharepoint-2013-pdf-preview-in-search.html
http://stevemannspath.blogspot.co.uk/2013/04/sharepoint-2013-pdf-support-and.html

Summary

- Microsoft Sharepoint Server - 125 million licenses sold
- Sharepoint to be a natural target for PDF storage
- PDF as a Sharepoint “First Class Citizen”

Contact : neil.pitman@aquaforest.com
