Full-Text Indexing for PDFs

In this post I’ll show how to enable full-text indexing of PDF documents using SQL Server 2008 (64-bit), and the Adobe iFilter component.

The iFilter component is required to bridge the gap between the Windows Indexing Service and the PDF format. I should point out that this solution requires the full file to be saved to the database. This may not be appropriate in some cases.

Start by creating a table in your database to store the PDF files. You will need at least 3 columns, one for the primary key, one to hold the binary data (VARBINARY(MAX)), and one to hold the file extension. This must include the “.” prefix. SQL Server will use this value to decide which iFilter to use to handle the data.

To use a third party iFilter, SQL Server must first be configured to look for them:

sp_fulltext_service 'load_os_resources', 1
sp_fulltext_service 'verify_signature', 0

Download and install the Adobe iFilter 9 for 64-bit platforms:
http://www.adobe.com/support/downloads/detail.jsp?ftpID=4025

Once installed, find the newly created “bin” folder. Add this to the “PATH” environment variable for the PC:

System Properties -> Advanced -> Environment Variables

Restart the machine.

Check that the file type has been correctly registered with SQL Server:

SELECT * from sys.fulltext_document_types

You should see:

Create a full-text index on your table by right-clicking the table and launching the Full-Text Indexing Wizard. Where you select the table columns, ensure you tick the column with the binary data and set it’s Type Column to the one containing the file extension (this will always be “.pdf” in this example). Use the default settings for the remaining steps of the wizard.

Load the PDF files into the table. The following can be used as a starting point:

var path = @"C:\Test.pdf";
byte[] bytes = File.ReadAllBytes(path);

using (var dc = new MyDatabaseDataContext())
{
  var entity = new SavedFile();
  entity.FileName = "Test";
  entity.Extension = ".pdf";
  entity.Content = bytes;

  dc.SavedFiles.InsertOnSubmit(entity);
  dc.SubmitChanges();
}

Everything should now be ready. The PDF can be searched using a query like this:

SELECT
  sf.*
FROM
  SavedFiles sf
JOIN FREETEXTTABLE(SavedFiles, Content, 'keyword') as ftt
  ON ftt.[Key] = sf.FileID
ORDER BY
  ftt.[Rank] DESC

References:
Full-Text Indexing a PDF file with Sql Server 2005 December CTP (aka Yukon)
Adobe PDF iFilter 9 for 64-bit Platforms
Sql Server Full-Text Search Protips Part 3: Getting RANKed

Advertisements
This entry was posted in Reference and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s