Investigating file content search issues

An overview of file content search issues.

Services covered by file content indexing

The following HCL Connections services are covered for file content indexing:

  • Files in the Files service
  • Wikis attachments
  • Activities attachments
  • Forums attachments
  • CCM files

Indexing schedule

Index scheduling is as follows:

  • Metadata from all files is indexed as part of the regular 10/15 minute indexing schedule.
  • File content extraction is handled by a separate process on its own schedule. Therefore, it might be up to 50 minutes after upload time before a file can be searched by its content.

Supported file types

search-config.xml defines the file types that are handled for file content indexing:

<mimeType name="application/msword" processor="" />  
<mimeType name="application/" processor="" />  
<mimeType name="application/" processor="" />  
<mimeType name="application/vnd.visio" processor="" />
<mimeType name="application/" processor="" />  
<mimeType name="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" processor="" />  
<mimeType name="application/vnd.openxmlformats-officedocument.presentationml.presentation" processor="" />  
<mimeType name="application/vnd.openxmlformats-officedocument.wordprocessingml.document" processor="" />  
<mimeType name="application/pdf" processor="" /> 
<mimeType name="application/postscript" processor="" />
<mimeType name="application/xhtml+xml" processor="" /> 
<mimeType name="application/xml" processor="" />
<mimeType name="text/html" processor="" />  
<mimeType name="text/htm" processor="" />  
<mimeType name="text/plain" processor="" />  
<mimeType name="text/richtext" processor="" />  
<mimeType name="text/xml" processor="" />  
<mimeType name="application/rtf" processor="" />  
<mimeType name="application/vnd.oasis.opendocument.text" processor="" />  
<mimeType name="application/vnd.oasis.opendocument.spreadsheet" processor="" />  
<mimeType name="application/vnd.oasis.opendocument.presentation" processor="" />  
<mimeType name="application/vnd.oasis.opendocument.text-master" processor="" />  

You can disable indexing of any of these file types by removing that entry from search-config.xml.
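To check which file types are currently enabled, it can help to list the mimeType entries programmatically. The following is a minimal sketch, not an HCL-provided tool: the function names and the `<fileTypes>` wrapper element in the sample are hypothetical, and the real search-config.xml may nest these entries differently or use an XML namespace (the sketch ignores namespaces for that reason).

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def listed_mime_types(xml_text):
    """Return the name attribute of every <mimeType> element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        el.get("name")
        for el in root.iter()
        if local_name(el.tag) == "mimeType" and el.get("name")
    ]


def is_content_indexed(xml_text, mime_type):
    """True if the given MIME type still has an entry in the configuration."""
    return mime_type in listed_mime_types(xml_text)


# Hypothetical fragment: <fileTypes> is a placeholder wrapper, not the real parent element.
SAMPLE = """
<fileTypes>
  <mimeType name="application/pdf" processor="" />
  <mimeType name="text/plain" processor="" />
</fileTypes>
"""

if __name__ == "__main__":
    print(listed_mime_types(SAMPLE))
    print(is_content_indexed(SAMPLE, "application/msword"))
```

In this sample, application/msword has no entry, so under the rule described above its content would not be indexed.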

General file content indexing switches

You can disable all file content indexing by removing all file type entries from search-config.xml.

You can also temporarily disable file content indexing by disabling the 20-minute file content retrieval scheduled task.

File size cutoff

The maxAttachmentSize setting in search-config.xml is the file size cutoff: the maximum size of a file whose content can be indexed. The content of any file that exceeds the cutoff is not indexed. By default, this setting is 52 MB.

Limit on indexed text

The limit on indexed text in search-config.xml caps the amount of extracted text that is indexed for each file. This limit prevents large files from adversely affecting search relevancy by pushing smaller, more relevant files down in the search results. The limit is configurable and defaults to 200 KB of extracted text, which matches the 204800-byte default of tikaFileConversion.maxContentSize described later in this article.
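The practical effect of this limit is that a term appearing only late in a very large document may not be searchable. The sketch below illustrates the idea under stated assumptions: it treats the limit as a count of UTF-8 bytes taken from the start of the extracted text, which is an assumption about how the limit is applied, and `indexed_portion` is a hypothetical helper rather than part of HCL Connections.

```python
# 200 KB expressed in bytes (200 * 1024 = 204800).
MAX_CONTENT_BYTES = 200 * 1024


def indexed_portion(extracted_text, limit=MAX_CONTENT_BYTES):
    """Return the leading part of the text that fits within `limit` UTF-8 bytes.

    Assumption: the limit counts bytes of UTF-8 text from the beginning of the
    document. A multi-byte character cut in half at the boundary is dropped.
    """
    data = extracted_text.encode("utf-8")[:limit]
    return data.decode("utf-8", errors="ignore")


def term_is_searchable(extracted_text, term, limit=MAX_CONTENT_BYTES):
    """True if `term` occurs inside the portion that would be indexed."""
    return term in indexed_portion(extracted_text, limit)


if __name__ == "__main__":
    text = "intro " + ("x" * 300_000) + " appendix"
    print(term_is_searchable(text, "intro"))
    print(term_is_searchable(text, "appendix"))
```

Here "intro" falls inside the first 204800 bytes and "appendix" does not, so only the former would be found by a content search under these assumptions.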

Unsupported files

The following files are never indexed:

  • Encrypted files
  • Password-protected files
  • Corrupted files of any type

Searching file content for accented characters

Searching file content for accented characters works for all the supported file types, except in the case of .txt files that do not have UTF-8 encoding. For example, if a .txt file has ANSI encoding, then any accented characters it contains are not found by a full text search. To resolve this, save the file using UTF-8 encoding and then upload it again.
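For many files, re-saving by hand is impractical, so the conversion can be scripted. The following is a minimal sketch that assumes "ANSI" means the Windows-1252 code page (`cp1252`), which is typical on Western European Windows systems but is an assumption here; pass a different `source_encoding` if your files use another code page. The function names are illustrative.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def reencode_to_utf8(src, dst, source_encoding="cp1252"):
    """Read `src` using `source_encoding` and write the same text to `dst` as UTF-8.

    Assumption: the caller knows the original encoding. A wrong guess yields
    incorrect characters rather than an error for most single-byte code pages.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return dst


if __name__ == "__main__":
    import sys

    source, target = sys.argv[1], sys.argv[2]
    if is_valid_utf8(source):
        print("Already valid UTF-8; no conversion needed.")
    else:
        reencode_to_utf8(source, target)
        print("Wrote UTF-8 copy to", target)
```

After conversion, upload the UTF-8 copy again so that its content is re-extracted.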

General parameters for Apache Tika file conversions

HCL Connections uses the Apache Tika conversion libraries to convert business documents to plain text before they are indexed. Several properties control this conversion. The default values are typically sufficient, but you can change them if required.

These parameters can be updated in the search-config.xml file.

Note: The following properties might not be present in search-config.xml. A property that is missing from the file uses the default value documented in this article. To specify a different value for any of these properties, edit search-config.xml, find the <propertySettings> element at the end of the file, and add the following at the bottom of its list of <property> elements:

        <property name="tikaFileConversion">
            <propertyField name="tikaFileConversion.javaOptions" value=""/>
            <propertyField name="tikaFileConversion.maxContentSize" value="204800"/>
            <propertyField name="tikaFileConversion.maxConversionSeconds" value="180"/>
            <propertyField name="tikaFileConversion.maxDocConversionsPerProcess" value="100"/>
            <propertyField name="tikaFileConversion.tempDirectory" value="${SEARCH_INDEX_DIR}/filesTemp"/>
            <propertyField name="tikaFileConversion.deleteTempFiles" value="true"/>
            <propertyField name="tikaFileConversion.maxConversionThreads" value="5"/>
        </property>

Save the file, perform a full synchronization for all nodes, and restart the Search application for any new values to take effect.

| Property | Default value | Description |
| --- | --- | --- |
| tikaFileConversion.deleteTempFiles | true | The conversion process creates temporary files that contain the extracted plain text. These files are deleted by default. To review them while debugging, change this value to false. |
| tikaFileConversion.javaOptions | (empty) | Update only if specific JVM options are required, such as increasing the process heap size. |
| tikaFileConversion.maxContentSize | 204800 | Maximum number of bytes to index from the beginning of a document. |
| tikaFileConversion.maxConversionSeconds | 180 | Number of seconds that the Connections server waits for a Tika process to respond to a conversion request. |
| tikaFileConversion.maxDocConversionsPerProcess | 100 | Number of conversions that each Tika server process can execute before the process is recycled and restarted. |
| tikaFileConversion.maxConversionThreads | 5 | Maximum number of threads that can run conversions simultaneously. This is also the maximum number of Tika processes that run simultaneously. |
| tikaFileConversion.tempDirectory | ${SEARCH_INDEX_DIR}/filesTemp | Directory used to store temporary conversion file output. |
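Because any property that is absent from search-config.xml falls back to its documented default, it can be useful to compute the effective values for a given configuration. The sketch below is illustrative only: `effective_tika_settings` is a hypothetical helper, the defaults are copied from this article, and the parsing ignores XML namespaces because the exact layout of a real search-config.xml is not assumed.

```python
import xml.etree.ElementTree as ET

# Defaults as documented in this article (kept as strings, as in the XML).
DEFAULTS = {
    "tikaFileConversion.javaOptions": "",
    "tikaFileConversion.maxContentSize": "204800",
    "tikaFileConversion.maxConversionSeconds": "180",
    "tikaFileConversion.maxDocConversionsPerProcess": "100",
    "tikaFileConversion.tempDirectory": "${SEARCH_INDEX_DIR}/filesTemp",
    "tikaFileConversion.deleteTempFiles": "true",
    "tikaFileConversion.maxConversionThreads": "5",
}


def effective_tika_settings(xml_text):
    """Merge <propertyField> overrides found in the XML over the documented defaults."""
    root = ET.fromstring(xml_text)
    settings = dict(DEFAULTS)
    for el in root.iter():
        # Compare on the local tag name so a namespaced file also works.
        if el.tag.rsplit("}", 1)[-1] != "propertyField":
            continue
        name, value = el.get("name"), el.get("value")
        if name in settings and value is not None:
            settings[name] = value
    return settings


# Hypothetical minimal configuration that overrides a single property.
SAMPLE = """
<propertySettings>
  <property name="tikaFileConversion">
    <propertyField name="tikaFileConversion.maxConversionSeconds" value="300"/>
  </property>
</propertySettings>
"""

if __name__ == "__main__":
    for key, value in sorted(effective_tika_settings(SAMPLE).items()):
        print(f"{key} = {value!r}")
```

Only maxConversionSeconds is overridden in the sample; every other property keeps its default value.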

Parent topic: The indexing process

Related information

Troubleshooting when files content is not found after searching