File Document Specifications

Dovetail Seeker uses the file document specification to index files that exist on the file system. It searches through the directories listed in the specification and indexes the content of the files it finds. Which files are indexed can be controlled by explicitly including or excluding them in the specification.

File document specifications define:

Security Note: The user account that the Dovetail Seeker Windows service or console runs under must have access to the paths in the file document specification in order to index their content.

File Text Extraction

To extract text from files, Dovetail Seeker uses a library called Apache Tika. Check the Apache Tika website for the supported file formats.

Example File Document Specification

The following is a specification for searching for documentation files:

<fileDocumentSpecification description="paths to where your documents live" tags="docs">
  <identification displayName="documentation" />
  <directories>
    <directory path="c:\documentation">
      <include name="*pdf" />
      <exclude name="bigfiles" />
    </directory>
  </directories>
</fileDocumentSpecification>

This file specification looks in the directory c:\documentation for all files ending in pdf and excludes any files that exist in a directory named bigfiles.

Directories

The file document specification can have one or more directory nodes that control which directories are indexed as part of this document specification. Each directory can be filtered with include or exclude patterns, allowing fine-grained control over which files are indexed. Regular expressions can be used for the patterns to allow for maximum flexibility.

Include

<include>

The items to include in the index:

Attribute   Required   Description
name        Yes        The pattern or file name to include

Exclude

<exclude>

The items to exclude from the index:

Attribute   Required   Description
name        Yes        The pattern or file name to exclude

Example

The following specification indexes files with the extension pdf but excludes any whose path contains secret.

<fileDocumentSpecification description="paths to where your documents live" tags="docs">
  <identification displayName="documentation" />
  <directories>
    <directory path="c:\documentation">
      <include name="*.pdf" />
      <exclude name="secret" />
    </directory>
    <directory path="\\server\share\directory">
      <include name="*.docx" />
      <include name="*.xlsx" />
    </directory>
  </directories>
</fileDocumentSpecification>

For the c:\documentation directory, pdf files are indexed, except any that have secret in the path.

For the \\server\share\directory directory, only files with the extension docx or xlsx are indexed.

Here is a table with example matches based on the above patterns:

Pattern   Match

*.pdf     Matches all files that have the extension pdf

          Does match:
            • readme.pdf
            • documentation/manual.pdf

          Does not match:
            • readme.doc

secret    Matches all files that have secret in the path or filename

          Does match:
            • path/secret/sensitive.doc
            • secret.txt
            • path/secret/file.pdf

          Does not match:
            • file.doc
            • public/documents/readme.txt

*.xlsx    Matches all .xlsx files

*.docx    Matches all .docx files

If there are no explicit include or exclude nodes, all files will be indexed.
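
As a minimal sketch of that default, the following specification has a directory node with no include or exclude children, so every file found under it would be indexed. The description, displayName and path values are reused from the examples above and are placeholders only.

<fileDocumentSpecification description="paths to where your documents live" tags="docs">
  <identification displayName="documentation" />
  <directories>
    <!-- no include or exclude nodes: all files in this directory are indexed -->
    <directory path="c:\documentation" />
  </directories>
</fileDocumentSpecification>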

Fields

Document fields found on file documents are fixed. The following fields are present on each file document in the search index.

Identification Fields

Scheme

All Dovetail Seeker documents have a scheme field. File documents have a scheme field value of file.

Domain

All Dovetail Seeker documents have a domain field. File documents have a domain field whose value matches the identification element's displayName attribute.
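
For illustration, assuming the identification element used in the example specifications above, the relationship looks like this (the displayName value is only a placeholder):

<!-- file documents indexed by this specification get a domain field value of "documentation" -->
<identification displayName="documentation" />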

Id

All Dovetail Seeker documents have an id field. File documents have an id field whose value is the full file system path to the file.

File Document Fields

All file documents have the following fields present.

Required Fields

Required fields are common to all search documents in your Dovetail Seeker search index.

Summary and Contents

The summary and contents fields are populated with the text extracted from the file being indexed. The length of the text copied to the summary field can be controlled via configuration.

Title

The title field is populated from the metadata extracted from the file. When no title is found in the metadata, the file name is used.

File Detail Fields

All file documents have the following fields, which allow search users to search for file documents by their file-specific characteristics.

Path

All file documents have a path field whose value is the full file system path to the file that was indexed. Example: \\server\share\filename.pdf

This value is used by the Dovetail Seeker Web Service to serve the file when requested to do so.

File Name

All file documents have a filename field whose value is the filename of the file which was indexed. Example: filename.pdf

Extension

All file documents have an extension field whose value is the extension of the file that was indexed. Example: pdf

Content Type

All file documents have a contentType field whose value is the MIME type of the file which was indexed. Example: application/pdf

Content Length

All file documents have a contentLength field whose value is the number of bytes representing the size of the file on disk.