PDF and Microsoft SharePoint: Hurdles to Overcome


PDF Association Technical Conference, June 18-19, 2013


Neil Pitman, Aquaforest Limited

Version 1.120613


Objective

PDF as a SharePoint “First Class Citizen”


Agenda

- Objectives
- SharePoint Overview
- PDF Capture
- PDF Search
  - iFilters
  - Handling Image and Mixed Mode PDFs
- PDF Metadata
  - Dictionary, XMP and Entity Extraction
- Configuration
  - SharePoint 2010, 2013
- Summary


SharePoint Overview

Microsoft SharePoint Server - 125 million licenses sold
SharePoint to be a natural target for PDF storage

- What is SharePoint?
  - On-Premise and Cloud-based Collaboration & Document Management Platform
- Origin - 2001
- Usage
  - Focus on MS Office Documents
  - Typically distributed capture


SharePoint Overview

- SharePoint Editions (2010, 2013)
  - Foundation
  - Standard
  - Enterprise
- Office 365 / SharePoint Online
- Ecosystem
  - Partner Products
  - Office / SharePoint Marketplace


SharePoint Architecture Overview

MS Web-based (IIS)

MS Office Integration

SQL Server Storage

List or library data in a site collection is stored in a SQL Server database table, which uses queries, indexes and locks to maintain overall performance, sharing, and accuracy.

Filtered views with column indexes (and other operations) create database queries that identify a subset of columns and rows and return this subset to your computer.

Thresholds and limits help throttle operations and balance resources for many simultaneous users.

Privileged developers can use object model overrides to temporarily increase thresholds and limits for custom applications.

Administrators can specify dedicated time windows for all users to do unlimited operations during off-peak hours.

Information workers can use appropriate views, styles, and page limits to speed up the display of data on the page.

Microsoft Technology Stack

- Windows Server 2008/2012
- Internet Information Services (IIS)
- .NET Framework
- SQL Server
- MS Office


PDF Capture for SharePoint

- Options
  - SharePoint UI
  - Acrobat XI
  - Load Tools
  - Custom Code
  - Workflow & Event Receivers

    // Upload a PDF to a SharePoint document library with an HTTP PUT.
    // destUrl (target document URL) and fileBytes (PDF content) are supplied by the caller.
    // Requires: using System.IO; using System.Net;
    WebRequest request = WebRequest.Create(destUrl);
    request.Credentials = CredentialCache.DefaultCredentials;
    request.Method = "PUT";

    // Write the file content to the request body.
    using (Stream stream = request.GetRequestStream())
    {
        stream.Write(fileBytes, 0, fileBytes.Length);
    }

    // GetResponse sends the request and throws a WebException if the upload fails.
    using (WebResponse response = request.GetResponse())
    {
        Logging.Log("Upload successful");
    }


Acrobat XI SharePoint Integration

http://www.adobe.com/uk/products/acrobat/pdf-version-control-sharepoint-integration.html


PDF Search in SharePoint Overview


iFilter Architecture

iFilters scan documents for text and attributes – primarily in support of Microsoft Search technologies.


iFilter Configuration

- Architecture
- Code Sample
- Suppliers
- Issues


PDF Search in SharePoint: iFilters

- iFilter Explorer


https://gist.github.com/jimschubert/1473904

Using iFilters directly in Code

    // Usage: extract the text of a PDF through the iFilter registered for the file type.
    StringBuilder Buffer = new StringBuilder();
    string PDFFile = @"C:\dev\PDF Conference\s.pdf";
    FilterCode f = new FilterCode();
    f.GetTextFromDocument(PDFFile, ref Buffer);
    Console.WriteLine(Buffer);

    // P/Invoke declaration: query.dll loads the iFilter registered for the given path.
    [DllImport("query.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern int LoadIFilter(
        string pwcsPath,
        [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
        ref IFilter ppIUnk);

    public void GetTextFromDocument(string Path, ref StringBuilder Buffer)
    {
        IFilter filter = null;
        int hresult;
        IFilterReturnCodes rtn;

        // Initialize the return buffer to 64K.
        Buffer = new StringBuilder(64 * 1024);

        // Try to load the filter for the path given.
        hresult = LoadIFilter(Path, new IntPtr(0), ref filter);
        if (hresult == 0)
        {
            IFILTER_FLAGS uflags;

            // Init the filter provider.
            rtn = filter.Init(
                IFILTER_INIT.IFILTER_INIT_CANON_PARAGRAPHS |
                IFILTER_INIT.IFILTER_INIT_CANON_HYPHENS |
                IFILTER_INIT.IFILTER_INIT_CANON_SPACES |
                IFILTER_INIT.IFILTER_INIT_APPLY_INDEX_ATTRIBUTES |
                IFILTER_INIT.IFILTER_INIT_INDEXING_ONLY,
                0, new IntPtr(0), out uflags);

            if (rtn == IFilterReturnCodes.S_OK)
            {
                STAT_CHUNK statChunk;
                // (the listing on the slide ends here)


iFilter Test

- Bookmark
- PDF Attachment
- XMP Metadata
- Text
- Image/OCR Text
- Dictionary Metadata
- Annotation


iFilter Test Results

Filters compared: Adobe iFilter, PDFlib iFilter, Foxit iFilter, Microsoft Format Handler
Content types tested: Body Text, Bookmarks, Dictionary Metadata, Annotations, XMP Metadata, PDF Attachment
[Results table: the per-filter check marks did not survive extraction]


Dealing with Image and Mixed-Mode PDFs

- Classify:
  - Image-Only
  - Born-Digital
  - Part Image-Only, Part Born-Digital
  - Previously OCRed


Dealing with Image and Mixed-Mode PDFs

- Objectives:
  - Ensure Full Searchability
  - Avoid Text to Image Processing
- Process:
  - Capture Time?
  - Scheduled In-Place?
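The classification step above can be sketched in C#. This is only an illustration under stated assumptions: the `PageInfo` type and its two flags are hypothetical and would have to be filled in by whatever PDF library is in use, and the rules for each category are one plausible reading of the four classes on the slide, not the author's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-page summary; a real PDF library would have to supply these flags.
public sealed class PageInfo
{
    public bool HasText { get; }
    public bool HasFullPageImage { get; }

    public PageInfo(bool hasText, bool hasFullPageImage)
    {
        HasText = hasText;
        HasFullPageImage = hasFullPageImage;
    }
}

public enum PdfKind { ImageOnly, BornDigital, Mixed, PreviouslyOcred }

public static class PdfClassifier
{
    public static PdfKind Classify(IReadOnlyList<PageInfo> pages)
    {
        if (pages == null || pages.Count == 0)
            throw new ArgumentException("At least one page is required.", nameof(pages));

        // Scanned pages with no text layer are the only pages that need OCR.
        int scansWithoutText = pages.Count(p => p.HasFullPageImage && !p.HasText);
        // Scanned pages that already carry text are assumed to have been OCRed before.
        int scansWithText = pages.Count(p => p.HasFullPageImage && p.HasText);

        if (scansWithoutText == pages.Count) return PdfKind.ImageOnly;
        if (scansWithoutText > 0) return PdfKind.Mixed;
        if (scansWithText == pages.Count) return PdfKind.PreviouslyOcred;
        return PdfKind.BornDigital;
    }

    // Only documents with at least one unsearchable page are sent to OCR, so
    // born-digital and previously OCRed files are never converted to images.
    public static bool NeedsOcr(PdfKind kind)
    {
        return kind == PdfKind.ImageOnly || kind == PdfKind.Mixed;
    }
}
```

In this sketch a document is routed to OCR only when `NeedsOcr` returns true, which serves both objectives: every page ends up searchable, and pages that already contain text are left alone. The same check could run at capture time or in a scheduled in-place pass.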


PDF Metadata in SharePoint

- Text Search vs Metadata Search
- Crawled vs Managed Properties
- Review Requirements
  - Dictionary Metadata
  - XMP Metadata
  - Entity Extraction
- Consider Automation
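As an illustration of reading XMP metadata so that it can be mapped to SharePoint properties, here is a minimal C# sketch. It assumes the XMP packet has already been pulled out of the PDF as an XML string (that step depends on the PDF library and is not shown), and the `XmpReader` class and `ReadDublinCore` method are hypothetical names. It covers only Dublin Core properties stored as RDF arrays, such as `dc:title` and `dc:creator`.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmpReader
{
    // Standard namespace URIs for Dublin Core and RDF.
    static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";
    static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns the rdf:li values found under a Dublin Core property,
    // for example "title" or "creator". Returns an empty list if absent.
    public static List<string> ReadDublinCore(string xmpPacket, string property)
    {
        XDocument doc = XDocument.Parse(xmpPacket);
        return doc.Descendants(Dc + property)
                  .Descendants(Rdf + "li")
                  .Select(li => li.Value.Trim())
                  .Where(value => value.Length > 0)
                  .ToList();
    }
}
```

The returned values could then be written to list columns, for example from an event receiver, so that they surface as managed properties rather than only as crawled text.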


PDF Metadata in SharePoint

Crawled vs Managed Properties


PDF Metadata in SharePoint: Using Event Receivers

- Event Receivers can enable Metadata assignment


PDF Metadata in SharePoint

Entity Extraction
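Entity extraction derives metadata values from the document text itself. The C# sketch below is a deliberately simple, pattern-based illustration: the `EntityExtractor` class, the two regular expressions and the "Emails" and "Dates" keys are assumptions made for this example, and production entity extraction typically needs much more than regular expressions.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class EntityExtractor
{
    // Illustrative patterns only: simple email addresses and ISO-style dates.
    static readonly Regex Email = new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}");
    static readonly Regex IsoDate = new Regex(@"\b\d{4}-\d{2}-\d{2}\b");

    // Returns the distinct matches of each pattern, keyed by a candidate property name.
    public static Dictionary<string, List<string>> Extract(string text)
    {
        return new Dictionary<string, List<string>>
        {
            ["Emails"] = DistinctMatches(Email, text),
            ["Dates"] = DistinctMatches(IsoDate, text),
        };
    }

    static List<string> DistinctMatches(Regex pattern, string text)
    {
        return pattern.Matches(text)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .Distinct()
                      .ToList();
    }
}
```

The text passed to `Extract` could come from an iFilter or an OCR pass, and the resulting values could be assigned to SharePoint columns in the same way as dictionary or XMP metadata.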


Configuration

- SharePoint 2010
- SharePoint 2013


SharePoint 2010 PDF Configuration

- Missing icon and iFilter

http://www.adobe.com/devnet-docs/acrobatetk/tools/AdminGuide/Acrobat_Reader_IFilter_configuration.pdf


SharePoint 2010 PDF Configuration


SharePoint PDF Configuration

- Default for PDF: 'X-Download-Options: noopen' added to the HTTP response header


SharePoint 2013 and PDF Configuration

- PDF Format Handler Support
- Currently no iFilter Support for PDF!?


SharePoint 2013 and PDF Configuration

Inline Viewing of PDF in SharePoint 2013

http://stevemannspath.blogspot.co.uk/2012/10/sharepoint-2013-pdf-preview-in-search.html
http://stevemannspath.blogspot.co.uk/2013/04/sharepoint-2013-pdf-support-and.html


Summary

- Microsoft SharePoint Server - 125 million licenses sold
- SharePoint to be a natural target for PDF storage
- PDF as a SharePoint “First Class Citizen”

Contact: neil.pitman@aquaforest.com

