File Document Specifications
Dovetail Seeker uses the file document specification to index files that exist on the file system. It searches through the directories listed in the specification and indexes the content of the files it finds. Which files are indexed can be controlled by explicitly including or excluding them in the specification.
File document specifications define:
- Directories - One or more directories whose files should be indexed.
- Fields - The specification's domain name, which is used to administer and filter file documents.
Security Note: The user account that the Dovetail Seeker Windows service or console runs under must have access to the paths in the file document specification in order to index the content.
File Text Extraction
To extract the text from files, Dovetail Seeker uses a library called Apache Tika. Check the Apache Tika website for the supported file formats.
Example File Document Specification
The following is a specification for searching for documentation files:
<fileDocumentSpecification description="paths to where your documents live" tags="docs">
  <identification displayName="documentation" />
  <directories>
    <directory path="c:\documentation">
      <include name="*pdf" />
      <exclude name="bigfiles" />
    </directory>
  </directories>
</fileDocumentSpecification>
This file specification looks in the directory c:\documentation for all files ending in pdf and excludes any files that exist in a directory named bigfiles.
Directories
The file document specification can have one or more directory nodes that control which directories are indexed as part of this document specification. Each directory can be filtered with include or exclude patterns, allowing fine-grained control over which files are indexed. Regular expressions can be used for the patterns to allow for maximum flexibility.
Include
<include>
The items to include in the index:
Attribute | Required | Description |
---|---|---|
name | Yes | The pattern or file name to include |
Exclude
<exclude>
The items to exclude from the index:
Attribute | Required | Description |
---|---|---|
name | Yes | The pattern or file name to exclude |
Example
The following specification indexes files with the extension pdf but excludes any that contain secret in the path.
<fileDocumentSpecification description="paths to where your documents live" tags="docs">
  <identification displayName="documentation" />
  <directories>
    <directory path="c:\documentation">
      <include name="*.pdf" />
      <exclude name="secret" />
    </directory>
    <directory path="\\server\share\directory">
      <include name="*.docx" />
      <include name="*.xlsx" />
    </directory>
  </directories>
</fileDocumentSpecification>
The c:\documentation directory will include PDF files and will exclude any PDF files that have secret in the path.
The \\server\share\directory directory will index only files with the extension docx or xlsx.
Here is a table with example matches based on the above patterns:
Pattern | Match |
---|---|
*.pdf | Matches all files that have the extension pdf |
secret | Matches all files that have secret in the path or filename |
*.xlsx | Matches all .xlsx files |
*.docx | Matches all .docx files |
If there are no explicit include or exclude nodes, all files will be indexed.
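As an illustration of that default, the following sketch shows a specification whose directory node carries no include or exclude nodes, so every file found under the path would be indexed. The description text is a placeholder, and the example assumes that an empty directory element is accepted by Dovetail Seeker; the other element and attribute names are taken from the examples above.

```xml
<fileDocumentSpecification description="index every file in the directory" tags="docs">
  <identification displayName="documentation" />
  <directories>
    <!-- No include or exclude nodes, so all files under this path are indexed -->
    <directory path="c:\documentation">
    </directory>
  </directories>
</fileDocumentSpecification>
```

Adding a single include node to this directory would narrow indexing to the files matching that pattern.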
Fields
Document fields found on file documents are fixed. The following fields are present on each file document in the search index.
Identification Fields
Scheme
All Dovetail Seeker documents have a scheme field. File documents have a scheme field value of file.
Domain
All Dovetail Seeker documents have a domain field. File documents have a domain field whose value matches the identification element's displayName attribute.
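To make the mapping concrete, here is a minimal sketch of a specification whose displayName is policies. The displayName value, description, tag, and directory path are hypothetical and chosen only for illustration; file documents indexed from it would carry a domain field value of policies.

```xml
<fileDocumentSpecification description="company policy documents" tags="policies">
  <!-- The displayName below becomes the domain field value of each indexed file -->
  <identification displayName="policies" />
  <directories>
    <directory path="c:\policies">
      <include name="*.pdf" />
    </directory>
  </directories>
</fileDocumentSpecification>
```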
Id
All Dovetail Seeker documents have an id field. File documents have an id field whose value is the full file system path to the file.
File Document Fields
All file documents have the following fields present.
Required Fields
Required fields are common to all search documents in your Dovetail Seeker search index.
Summary and Contents
The Summary and Contents fields are populated with the text extracted from the file being indexed. The length of the text copied to the summary field can be controlled via configuration.
Title
The title field is populated from the metadata extracted from the file. When no title is found in the metadata, the name of the file is used.
File Detail Fields
All file documents have the following fields which allow search users to search for file documents by their file specific characteristics.
Path
All file documents have a path field whose value is the full file system path to the file which was indexed. Example: \\server\share\filename.pdf
This value is used by the Dovetail Seeker Web Service to serve the file when requested to do so.
File Name
All file documents have a filename field whose value is the filename of the file which was indexed. Example: filename.pdf
Extension
All file documents have an extension field whose value is the extension of the file which was indexed. Example: pdf
Content Type
All file documents have a contentType field whose value is the MIME type of the file which was indexed. Example: application/pdf
Content Length
All file documents have a contentLength field whose value is the number of bytes representing the size of the file on disk.