Technology · March 16, 2026 · Serdar Ketenci

Copilot Studio Accelerator: SharePoint to Azure AI Search Custom Connector Guide

Teams building solutions with Copilot Studio frequently hit the limits of the native SharePoint connection. Microsoft's new open-source accelerator provides a custom push connector pattern that indexes SharePoint Online documents into Azure AI Search via Microsoft Graph API. This guide walks through the pipeline architecture, key technical decisions, and recommendations for taking it to production.


Limitations of the Native Connection

Copilot Studio offers built-in support for using SharePoint as a knowledge source. However, this native connection encounters significant limitations in enterprise scenarios:

  • No private endpoint support — Cannot be used in corporate environments that require network isolation
  • Conditional Access incompatibility — Connection issues arise when conditional access policies are enforced
  • No SLA guarantee — No service-level commitment from Microsoft for the preview connector
  • Limited format control — No customization over how different file formats are processed
  • 7 MB file size limit — SharePoint files cannot exceed 7 MB in environments without a Microsoft 365 Copilot license
  • Modern pages only — Classic ASPX pages and modern pages containing SPFx components are not supported

These limitations push teams toward alternative solutions, especially in scenarios requiring advanced grounding, full control, or generative AI workflows.

Why a Custom Connector?

In response to this need, Microsoft released the SharePoint → Azure AI Search Connector accelerator as open source. It provides a starting point for building your own custom push connector.

Key differences between the native and custom connector:

  • Full control — You decide which files get indexed, how they are chunked, and what metadata is attached
  • Network compatibility — Works seamlessly with private endpoints and Conditional Access
  • Format flexibility — Over 25 file formats supported, from PDF tables to emails, ZIP archives to EPUBs
  • No size limit — Files up to 100 MB can be processed (configurable)
  • Security trimming — Entra ID permissions are stored at the index level, enabling query-time filtering
  • Production-grade — Your SLA depends on your own Azure infrastructure, not a preview dependency

Pipeline Architecture

The connector runs on an Azure Function with a timer trigger. Each run follows these steps:

  1. Timer fires — Azure Functions runtime invokes sharepoint_indexer on a CRON schedule (default: every hour)
  2. Index check — The search index schema is created or updated (idempotent)
  3. File discovery — Microsoft Graph API lists files across configured document libraries. In incremental mode, files are filtered by lastModifiedDateTime
  4. Parallel processing per file (configurable concurrency):
       • Freshness check — Verifies whether the existing index record is current (±1 second tolerance)
       • Download — File bytes are fetched via Graph API
       • Text extraction — The correct extractor is dispatched based on file extension
       • Chunking — Text is split into overlapping chunks with intelligent boundary detection
       • Embedding generation — Chunks are sent to Azure OpenAI in batches of 16
       • Delete old chunks — Previous chunks for the file are removed from the index
       • Upload new chunks — Pushed to the index with embeddings, metadata, and permissions
  5. Reconciliation (full reindex only) — Compares parent_id values in the index against current SharePoint files and cleans up orphaned chunks
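Two of the details above — the ±1 second freshness tolerance and the batches of 16 for embedding calls — can be sketched in a few lines. This is a minimal illustration; the function names are ours, not the accelerator's actual code:

```python
from datetime import datetime, timedelta, timezone

EMBED_BATCH = 16  # chunks per Azure OpenAI embedding request

def is_fresh(indexed_at: datetime, last_modified: datetime,
             tolerance_s: float = 1.0) -> bool:
    """Freshness check: the indexed record counts as current if its
    timestamp matches SharePoint's within the ±1 second tolerance."""
    return abs((indexed_at - last_modified).total_seconds()) <= tolerance_s

def batched(items: list, size: int = EMBED_BATCH):
    """Yield successive embedding batches of up to EMBED_BATCH chunks."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

now = datetime.now(timezone.utc)
print(is_fresh(now, now + timedelta(milliseconds=500)))  # True: within tolerance
print([len(b) for b in batched(list(range(40)))])        # [16, 16, 8]
```

The tolerance matters because SharePoint and the index may round timestamps differently; an exact-equality check would re-process every file on every run.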

Technical Decisions and Rationale

The accelerator's architecture reflects deliberate choices:

Push Connector (Not Pull)

The accelerator uses a push model rather than Azure AI Search's built-in pull indexer. This gives full control over what gets indexed and when, and removes the dependency on Azure's preview SharePoint indexer infrastructure.

No Blob Intermediary

Files go straight from SharePoint to memory to the index. No intermediate Blob Storage layer. This simplifies the architecture, lowers cost, and removes additional storage to manage.

System-Assigned Managed Identity

All Azure services use managed identity — zero secrets to manage. Only Graph API requires special permission assignments (Sites.Read.All and Files.Read.All).

HNSW Vector Search

HNSW (Hierarchical Navigable Small World) is the industry-standard approximate nearest neighbor algorithm, offering a good balance between query speed and recall.

text-embedding-3-large (1536 dimensions)

OpenAI's highest-quality embedding model. Truncating its output to 1536 dimensions offers a good quality-to-cost tradeoff (the model natively produces 3072-dimensional vectors).
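The two choices above meet in the index definition: a vector field whose dimensionality matches the embedding model, wired to an HNSW profile. The fragment below follows the Azure AI Search REST API shape, expressed as a Python dict; the field, algorithm, and profile names are illustrative, not the accelerator's exact schema:

```python
# Vector search section of an Azure AI Search index definition
# (REST API shape; names are illustrative)
index_definition = {
    "vectorSearch": {
        "algorithms": [{
            "name": "hnsw-default",
            "kind": "hnsw",
            # HNSW knobs: graph connectivity (m) and candidate list
            # sizes at build time (efConstruction) and query time (efSearch)
            "hnswParameters": {"m": 4, "efConstruction": 400,
                               "efSearch": 500, "metric": "cosine"},
        }],
        "profiles": [{"name": "vector-profile",
                      "algorithm": "hnsw-default"}],
    },
    "fields": [{
        "name": "content_vector",
        "type": "Collection(Edm.Single)",
        "searchable": True,
        "dimensions": 1536,  # must match the truncated embedding size
        "vectorSearchProfile": "vector-profile",
    }],
}
print(index_definition["fields"][0]["dimensions"])  # 1536
```

If you later switch to the full 3072 dimensions, the field's `dimensions` value must change with it — the index will reject vectors of the wrong length.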

Supported File Formats

The connector supports over 25 file formats:

  • PDF — Page-by-page text extraction with PyMuPDF
  • Word (.docx, .docm) — Paragraph text via python-docx
  • Excel (.xlsx, .xlsm) — All sheets via openpyxl
  • PowerPoint (.pptx, .pptm) — All text frames via python-pptx
  • Plain text (.txt, .md) — With encoding fallback chain
  • CSV, JSON, XML/KML — Using built-in libraries
  • HTML — Script/style cleanup with BeautifulSoup
  • Email (.eml, .msg) — Subject, sender, date, and body
  • EPUB, OpenDocument (.odt, .ods, .odp) — Full text extraction
  • Archives (.zip, .gz) — Extracts and processes inner files (depth limit: 3)
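Dispatch by file extension typically comes down to a lookup table mapping extensions to extractor functions. A self-contained sketch using only stdlib-backed formats (the real connector registers 25+ entries, most backed by the third-party libraries listed above):

```python
import csv
import io
import json

def extract_txt(data: bytes) -> str:
    # Encoding fallback chain, as the plain-text path described above
    for enc in ("utf-8", "utf-16", "latin-1"):
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace")

def extract_json(data: bytes) -> str:
    return json.dumps(json.loads(data), indent=2)

def extract_csv(data: bytes) -> str:
    rows = csv.reader(io.StringIO(extract_txt(data)))
    return "\n".join(" | ".join(row) for row in rows)

# Dispatch table keyed on lowercased file extension
EXTRACTORS = {".txt": extract_txt, ".md": extract_txt,
              ".json": extract_json, ".csv": extract_csv}

def extract_text(filename: str, data: bytes) -> str:
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    extractor = EXTRACTORS.get(ext)
    if extractor is None:
        raise ValueError(f"Unsupported format: {ext}")
    return extractor(data)

print(repr(extract_text("notes.csv", b"a,b\n1,2")))  # 'a | b\n1 | 2'
```

The table-based design makes the layer easy to swap out: replacing an entry (or the whole table) with calls to Azure Document Intelligence leaves the rest of the pipeline untouched.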

For production environments with complex documents (PDFs with tables, scanned images, forms), replacing this layer with Azure Document Intelligence or similar services is recommended.

Incremental Synchronization

The connector operates in two modes:

Incremental Mode

The recommended mode for production. Uses an hourly CRON with a 65-minute window (60-minute interval + 5-minute overlap). The Graph API query is filtered by lastModifiedDateTime. Only changed files are processed.

The 5-minute overlap compensates for:

  • Clock skew between SharePoint and the Function App
  • Files modified in the last seconds of the previous window
  • Timer drift in serverless environments
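The 65-minute lookback window reduces to a small timestamp calculation. A sketch of how the Graph filter expression might be built (the exact query shape the accelerator sends is not shown here; only the window arithmetic is from the text above):

```python
from datetime import datetime, timedelta, timezone

INTERVAL_MIN = 60  # hourly CRON interval
OVERLAP_MIN = 5    # safety margin for clock skew and timer drift

def incremental_filter(now: datetime) -> str:
    """Build a lastModifiedDateTime filter for the 65-minute window."""
    cutoff = now - timedelta(minutes=INTERVAL_MIN + OVERLAP_MIN)
    # Graph expects ISO-8601 UTC timestamps
    return f"lastModifiedDateTime ge {cutoff.strftime('%Y-%m-%dT%H:%M:%SZ')}"

run_time = datetime(2026, 3, 16, 12, 0, tzinfo=timezone.utc)
print(incremental_filter(run_time))
# lastModifiedDateTime ge 2026-03-16T10:55:00Z
```

A file changed at 10:58 is caught by both the 11:00 run and the 12:00 run; the freshness check makes the second pass a no-op, so the overlap costs almost nothing.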

Full Reindex

With INCREMENTAL_MINUTES=0, all files are processed. The freshness check remains active, so current files are skipped. Additionally, reconciliation runs: chunks for files that no longer exist in SharePoint are cleaned up from the index.
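Reconciliation is essentially a set difference between the parent_id values present in the index and the file IDs currently visible in SharePoint. A minimal sketch (the helper name is ours):

```python
def find_orphaned_parents(indexed_parent_ids: set[str],
                          sharepoint_file_ids: set[str]) -> set[str]:
    """Parent IDs present in the index but no longer in SharePoint.
    All chunks carrying these parent_ids should be deleted."""
    return indexed_parent_ids - sharepoint_file_ids

orphans = find_orphaned_parents({"f1", "f2", "f3"}, {"f1", "f3"})
print(orphans)  # {'f2'} — chunks with parent_id 'f2' get cleaned up
```

Because deletions in SharePoint leave no trace in a lastModifiedDateTime filter, this comparison is the only way the incremental design ever learns a file is gone — which is why reconciliation runs only during a full reindex.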

Security Trimming

The connector stores Entra ID object IDs (users and groups) in the permission_ids field for each file. This enables query-time security filtering.

For each file, GET /beta/drives/{driveId}/items/{itemId}/permissions is called. Identity IDs are extracted from grantedToV2 (direct permissions) and grantedToIdentitiesV2 (sharing links).

At query time, the user's Entra object ID and group memberships are used to filter the permission_ids field. This ensures users only see results from files they have access to.
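In Azure AI Search, such query-time trimming is expressed as an OData filter over the collection field using `search.in`. A sketch of how that filter string could be assembled (the permission_ids field name is from the accelerator; the helper is ours):

```python
def security_filter(user_oid: str, group_oids: list[str]) -> str:
    """OData filter restricting results to documents whose
    permission_ids collection contains the user's object ID
    or one of their group IDs."""
    ids = ",".join([user_oid] + group_oids)
    return f"permission_ids/any(p: search.in(p, '{ids}', ','))"

print(security_filter("user-oid-123", ["group-a", "group-b"]))
# permission_ids/any(p: search.in(p, 'user-oid-123,group-a,group-b', ','))
```

`search.in` is the recommended form for large ID lists: it stays efficient where a long chain of `eq` comparisons would hit filter-complexity limits.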

Important note: Permissions are a snapshot from when the file was indexed. Permission changes in SharePoint are not reflected until the file is re-indexed (on its next modification or next full reindex).

Recommendations for Enterprises

This accelerator is intentionally kept simple — it is a starting point. When taking it to production, consider the following:

    • Document Intelligence integration — Basic text extraction will not suffice for PDFs with tables, scanned documents, and forms. Integrate Azure Document Intelligence or Unstructured.io
    • Customize the chunk strategy — The default 2000 characters / 200 character overlap may not be optimal for every document type. Test and adjust with your own documents
    • Extend the index schema — Add additional metadata fields based on your business requirements
    • Monitoring and alerting — Monitor run summaries via Application Insights and set up alerts for error rates
    • Cost optimization — The Flex Consumption plan scales to zero when idle. However, track embedding costs; the initial full indexing of large libraries can be significant
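As a baseline for that tuning, the default 2000/200 setting behaves like a sliding character window. The naive splitter below illustrates the window arithmetic only — it deliberately omits the accelerator's intelligent boundary detection, which prefers sentence and paragraph breaks:

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size chunks.
    Each chunk repeats the last `overlap` characters of the previous one."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 4500)
print([len(c) for c in chunks])  # [2000, 2000, 900]
```

When experimenting, remember the knobs interact with cost: halving chunk size roughly doubles the number of embedding calls for the same corpus.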

Conclusion

While Copilot Studio's native SharePoint connection is sufficient for simple scenarios, it reaches its limits as enterprise requirements grow. Microsoft's accelerator provides a solid foundation for building your own pipeline.

The custom connector approach requires more initial setup work but delivers full control, network compatibility, broad format support, and security trimming — capabilities that are critical for enterprise deployments.

You can access the source code on GitHub and review detailed information about Copilot Studio SharePoint limits on Microsoft Learn.

Frequently Asked Questions

Can this accelerator be used directly in production?

The accelerator is designed as a starting point. While the basic text extraction layer works for simple documents, Azure Document Intelligence integration is recommended for complex documents like PDFs with tables and scanned images. Chunk size and index schema should also be tuned to your own documents.

What is the performance difference between the native and custom connector?

The custom connector gives you full control over indexing timing and prioritization through its push model. Large libraries are efficiently indexed with parallel processing (default 4 workers). Incremental synchronization processes only changed files, preventing unnecessary load.

What Azure resources are required?

At minimum, you need Azure AI Search (Basic tier or above — Free tier does not support vector search), Azure OpenAI or Foundry (with an embedding model), Azure Functions (Flex Consumption), and a Microsoft Entra ID tenant. All infrastructure can be deployed with a single command using the provided Bicep template.

What happens when permissions change in SharePoint?

Permissions are a snapshot from when the file was indexed. Permission changes are not reflected in the index until the file is re-indexed (on its next modification or next full reindex). For sensitive scenarios, scheduling more frequent full reindexes is recommended.

How are costs calculated?

Azure Functions Flex Consumption plan means zero cost when idle. Runtime costs depend on the number of files, their sizes, and embedding API calls. Azure AI Search Basic tier incurs a fixed monthly cost. The largest variable cost item is Azure OpenAI embedding calls — keep this in mind for the initial full indexing of large libraries.