Best Practices for organizing content sources to improve search results

How to determine what size of content source you need

by Agnes Molnar on 3/20/2014

Applies to: Microsoft SharePoint, Search, Crawl

Years of working with search have given me a lot of experience with Content Sources, and they have raised a lot of questions, such as: "Should I create one huge content source, or would it be better to split it up into smaller sources?" "Can I merge my small content sources into one big one?" "How do I schedule the crawls for each of my content sources?"

As usual, there is no silver-bullet answer to these questions. The only general answer is: it depends.

Hey, but I know the phrase "it depends" is not why you started reading this article, so let me give you at least some idea of what you should consider when facing these questions.

You have to know, though: sometimes you don't have any choice. You can only have one big content source that cannot be split. A big database is an example.

You might also think about integrating all your small content sources into one big content source, for example, when you have small file shares that are similar enough (again, similarity must be based on your own definition).

In many cases, the situation is that you can split your huge content source into two or more smaller ones and/or you can join the small ones into big(ger) ones. For example, a huge file share can be split into subfolders (or subfolder groups); or a SharePoint farm can be split by Web Applications, Site Collections, or even sites (although it's a rare situation); or a third-party document management system can be split by the repositories. At the opposite extreme, small file shares can be treated as one big content source, as can small SharePoint sites. You get the point.
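To make the splitting idea concrete, here is a small illustrative sketch (plain Python, not any SharePoint API; the server and folder names are made up): given the documents under one huge file share, it derives one would-be content source per top-level subfolder.

```python
# Illustrative sketch only -- not a SharePoint API. Paths are hypothetical.
from collections import defaultdict

def split_share_by_subfolder(root, paths):
    """Group document paths under a root share by their first subfolder,
    yielding one start-address group per would-be content source."""
    sources = defaultdict(list)
    for path in paths:
        relative = path[len(root):].lstrip("\\")   # strip the root prefix
        top = relative.split("\\")[0]              # first subfolder name
        sources[top].append(path)
    return {root + "\\" + top: items for top, items in sources.items()}

docs = [
    r"\\fileserver\docs\HR\policy.docx",
    r"\\fileserver\docs\HR\handbook.pdf",
    r"\\fileserver\docs\Finance\budget.xlsx",
]
grouped = split_share_by_subfolder(r"\\fileserver\docs", docs)
# Two groups: one per top-level subfolder, each a candidate content source.
```

Each resulting group (here, `HR` and `Finance`) could then be crawled on its own schedule.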

The real question is when to keep or create one big content source and when to create multiple smaller ones--IF it is possible to split a source. Considerations to make here:

  • Content Source types - In SharePoint, the following content source types are available:

    • SharePoint Sites       

    • Web Sites       

    • File Shares

    • Exchange Public Folders       

    • Line of Business Data       

    • Custom Repository       

These types cannot be mixed. For example, you cannot have a SharePoint site and a file share in the same content source. But within the same type, you can add more than one start address of your choice. This might be a good option if you have multiple small content sources.
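The "one type, many start addresses" rule can be sketched as a tiny model (this is a hedged illustration, not the actual SharePoint object model; the class and names are invented):

```python
# Hedged sketch, not the SharePoint object model: a content source holds
# exactly one type plus any number of start addresses, and rejects mixing.
from dataclasses import dataclass, field

@dataclass
class ContentSource:
    name: str
    source_type: str                 # e.g. "File Shares", "SharePoint Sites"
    start_addresses: list = field(default_factory=list)

    def add_start_address(self, address, address_type):
        if address_type != self.source_type:
            raise ValueError(
                f"{address_type!r} cannot go into a {self.source_type!r} source")
        self.start_addresses.append(address)

shares = ContentSource("Small shares", "File Shares")
shares.add_start_address(r"\\server1\docs", "File Shares")
shares.add_start_address(r"\\server2\docs", "File Shares")
# shares.add_start_address("http://intranet", "SharePoint Sites")  # would raise
```

Collecting several small shares as start addresses of one content source is exactly the aggregation option described above.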

  • Crawling time and schedule - The more changes there have been since the last crawl, the longer the crawl takes. The more often you crawl, the fewer changes you have to process during an incremental crawl. The more often you run an incremental crawl, the less idle time your system will have.

    But, the more often you crawl, the more resources you consume on both the crawler and the source system. And the more often you crawl, the bigger the chance that you won't be able to finish one crawl before the next one needs to start. The result is worse content freshness and worse search performance than you expect.

    Moreover, if you have multiple content sources, you have to align their schedules to keep your system from overloading with multiple parallel crawls.    
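One simple way to align schedules is to serialize the crawls: start each one only after the previous one is expected to finish, plus a safety gap. A minimal sketch (the crawl names, durations, and gap are assumptions for illustration):

```python
# Illustrative scheduling sketch; estimated durations are assumptions.
def stagger(crawls, start=0, gap=15):
    """crawls: list of (name, estimated_minutes).
    Returns serialized (name, start_minute, end_minute) tuples with a
    safety gap between consecutive crawls, so they never overlap."""
    schedule, t = [], start
    for name, duration in crawls:
        schedule.append((name, t, t + duration))
        t += duration + gap
    return schedule

plan = stagger([("File shares", 60), ("SharePoint", 120), ("Archive", 30)])
for name, s, e in plan:
    print(f"{name}: starts at minute {s}, ends by minute {e}")
```

If the estimated durations are realistic, this guarantees at most one crawl is hitting the crawler and the source systems at any moment.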

  • Performance effect on the crawler components - This is an obvious one: Crawling takes resources. The more you crawl, the more resources you use. If you crawl more content sources in parallel, it takes more resources. If you run one huge crawl, it takes resources for a longer time. If you don't have enough resources, the crawl might fail or run forever, which has an effect on other crawls.    

  • Performance effect on the source system - This is usually the least considered one: Crawling uses resources on the source system, as well!

  • Bandwidth - Crawling pulls data from the source system that will be processed on the indexer components. This data must be transferred, and that takes bandwidth. In many cases, this is the bottleneck in the whole crawling process, even if the source system and crawler perform well. The more crawling processes you run at the same time and the more parallel threads they have, the more bandwidth will be needed. Serialized crawls mean more balanced bandwidth requirements.
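A back-of-the-envelope check makes the bandwidth point tangible. The throughput numbers below are invented assumptions, not measurements from any real farm:

```python
# Back-of-the-envelope sketch; all numbers are illustrative assumptions.
def required_mbps(parallel_crawls, threads_per_crawl, mbps_per_thread):
    """Rough bandwidth demand: parallel crawls multiply per-thread throughput."""
    return parallel_crawls * threads_per_crawl * mbps_per_thread

link_capacity = 1000  # assume a 1 Gbps link between crawler and sources
demand = required_mbps(parallel_crawls=3, threads_per_crawl=8, mbps_per_thread=20)
print(demand, "Mbps needed;", "OK" if demand <= link_capacity else "bottleneck")
```

Halving the number of parallel crawls halves the demand, which is why serializing crawls smooths out bandwidth requirements.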

  • Similar content sources? - At the same time, you might have similar content sources that should be treated the same way. For example, if you have small file shares, you might aggregate them into one content source so that their crawls can be managed together. You definitely have to do a detailed inventory for this.

  • Live content vs. Archive - Live content changes often, while archive either doesn't change at all or changes very rarely. Live content has to get crawled often, while archive doesn't need incremental crawls to run very often.

    Remember, after the initial full crawl, content is in the index, and due to the rare changes, it can be considered pretty up-to-date. So if you have a system of any kind with both live and archive content, it would be better to split them and crawl only the live content often. The archive doesn't need any special attention after the initial full crawl.

  • Automated jobs running on the content source - On many systems, automated jobs create or update content. In most cases, these jobs are time-scheduled, running in the late evenings or early mornings, for example. Because these jobs are predictable, you can align your crawl schedules with them: avoid crawling while a job is running, and start an incremental crawl right after the job window ends so the new content gets indexed promptly.

It isn't an easy decision, is it?

During the planning phase of a search project, each of these points should be evaluated, and the result would be something like this table:

Source system     | Amount of content | Content Source(s)
------------------|-------------------|--------------------------
file share        |                   |
file share        |                   |
file share        |                   |
SharePoint site   |                   | Local SharePoint Content
SharePoint site   |                   |

Topic: Search
