Hello,
I would like to be able to count the number of source documents in my Index. Of course, the index shows the number of 'documents,' but this is really document chunks. I am using the UI in the portal to 'Import and Vectorize.' I see a discrepancy in the count between the number of documents in my Azure blob storage container and the number of documents processed by the Indexer.
I would like a way to be able to validate this with a distinct count of source documents in my Index. I have the metadata_storage_path exposed as a field. It seems pretty simple. I just want a distinct count. However, I cannot find a way to do this.
Thank you
Hi petermcnally,
Welcome to the Microsoft Q&A Platform!
To count distinct source documents in your Azure AI Search index using the metadata_storage_path field:
1. Ensure metadata_storage_path is indexed and set as facetable.
2. Use Search Explorer or the REST API to query facet counts, using the facets parameter on metadata_storage_path:
{
  "search": "*",
  "facets": ["metadata_storage_path,count:1"]
}
The response will show the count of unique documents.
3. Check for skipped or failed documents if there's a discrepancy with blob storage.
References:
https://learn.microsoft.com/en-us/azure/search/search-faceted-navigation
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents
Hi petermcnally,
Thank you for your response.
The discrepancy in counts suggests that the metadata_storage_path field might not be configured properly for this purpose, or that the data itself might not be normalized correctly.
The Search Explorer has limitations in how it processes facet counts, and it may not be the ideal tool for this task. To obtain an accurate distinct count, you likely need to use the REST API or process data outside the index.
The REST API provides more flexibility and may return the full set of distinct values. Use this query.
POST https://<search-service-name>.search.windows.net/indexes/<index-name>/docs/search?api-version=2021-04-30-Preview
Content-Type: application/json
api-key: <api-key>

{
  "search": "*",
  "facets": ["metadata_storage_path,count:1000"],
  "top": 0
}
This queries all documents without returning any of them ("top": 0 suppresses the result set and returns only the facets). The facet count is set higher (1000) to account for more unique values. The metadata_storage_path field must be set to retrievable in your index schema and marked as facetable to enable distinct counts.
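As a sketch, the same facet request can be built and issued from Python; the service name, index name, and API key below are placeholders, and the helper function is illustrative rather than part of any SDK:

```python
def build_facet_request(service, index, api_key, facet_field, max_buckets=1000):
    """Build the URL, headers, and body for a facet-only search request.
    top=0 returns no documents, only the facet buckets."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/docs/search?api-version=2021-04-30-Preview")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    body = {
        "search": "*",
        "facets": [f"{facet_field},count:{max_buckets}"],
        "top": 0,
    }
    return url, headers, body

url, headers, body = build_facet_request(
    "<search-service-name>", "<index-name>", "<api-key>", "metadata_storage_path")
# Send with: requests.post(url, headers=headers, json=body)
print(body["facets"])  # ['metadata_storage_path,count:1000']
```

The response's @search.facets section then carries one bucket per distinct metadata_storage_path value.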
Alternatively, export the data and process it outside the index:
1. Export the documents using the Azure CLI:
az search document list \
  --service-name <search-service-name> \
  --index-name <index-name> \
  --api-key <api-key> > exported-documents.json
2. Process the exported data with a script (e.g., Python) to count unique metadata_storage_path values.
Let me know if you need any further assistance.
@Shree Hima Bindu Maganti Thanks again for following up. Unfortunately, I get the same result using the REST API call as in Search Explorer when attempting to get a count using the facets parameter.
I am not sure you understand what I am trying to do. I want the total count of unique documents (identified by the metadata_storage_path). When I use facets as you have explained, I am getting the number of chunks per source document. This is akin to using the metadata_storage_path in the 'group by' clause of a SQL select command, i.e.
select metadata_storage_path, count(*)
from index
group by metadata_storage_path
I want something like:
select count(distinct metadata_storage_path)
from index
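In facet terms, the SQL COUNT(DISTINCT ...) above corresponds to counting the number of facet *buckets* returned, not summing the per-bucket counts. A minimal sketch, assuming the standard @search.facets response shape and a count:N large enough to cover every unique path (the sample response below is mocked):

```python
def count_distinct_paths(response_json):
    """Count distinct metadata_storage_path values by counting facet
    buckets, ignoring each bucket's per-chunk count."""
    facets = response_json.get("@search.facets", {})
    buckets = facets.get("metadata_storage_path", [])
    return len({b["value"] for b in buckets})

# Mocked facet response: 3 chunks of doc1.pdf, 2 chunks of doc2.pdf
sample = {
    "@search.facets": {
        "metadata_storage_path": [
            {"value": "https://acct.blob.core.windows.net/docs/doc1.pdf", "count": 3},
            {"value": "https://acct.blob.core.windows.net/docs/doc2.pdf", "count": 2},
        ]
    }
}
print(count_distinct_paths(sample))  # 2 distinct source documents
```

The caveat is that this only works if the facet's count:N exceeds the true number of unique source documents; otherwise the bucket list is truncated.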
As a side note, I am curious what you mean by 'the data itself might not be normalized correctly.' In what sense are you using the term 'normalized?' The source data are documents in blob storage. There isn't any data normalization as in a database with tables. I assume you are referring to another way to normalize data I am unfamiliar with.
See below for my index. It is pretty straightforward to set up a field for the metadata_storage_path and make it facetable. Also, by browsing the data, I see the field is populating correctly.
Hi petermcnally,
Thank you for your response.
Thank you for the clarification and for sharing additional details about your requirement. I understand that you want to count the total number of unique source documents based on the metadata_storage_path field, rather than aggregating chunks per document.
Approach to count distinct metadata_storage_path values in Azure AI Search: Azure AI Search does not provide a direct way to compute a distinct count the way SQL does with COUNT(DISTINCT column). The facets parameter returns groups of values, but it won't provide a total unique count directly. Instead, you can retrieve all metadata_storage_path values and process them externally to compute the distinct count.
Use the Azure CLI, SDK, or REST API to export all the metadata_storage_path values.
az search document list \
  --service-name <search-service-name> \
  --index-name <index-name> \
  --api-key <api-key> > exported-documents.json
Use a script to parse the results and compute the distinct count.
import json

# Load exported documents
with open("exported-documents.json", "r") as file:
    documents = json.load(file)

# Extract metadata_storage_path and count distinct values
unique_paths = {doc["metadata_storage_path"] for doc in documents}
print(f"Distinct metadata_storage_path count: {len(unique_paths)}")
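One caveat with the export approach: a single search request returns a limited page of results (50 by default, up to 1,000 per request), so collecting every metadata_storage_path requires paging. A hedged sketch of top/skip paging, where the fetch_page callable stands in for the actual REST call and fake_fetch below is purely illustrative:

```python
def collect_unique_paths(fetch_page, page_size=1000):
    """Page through all results, collecting distinct metadata_storage_path
    values. fetch_page(skip, top) must return a list of document dicts."""
    unique, skip = set(), 0
    while True:
        page = fetch_page(skip, page_size)
        if not page:  # an empty page means we've read everything
            break
        unique.update(doc["metadata_storage_path"] for doc in page)
        skip += page_size
    return unique

# Demo with a fake two-page index: 5 chunks from 3 source documents
fake_index = [{"metadata_storage_path": p} for p in
              ["a.pdf", "a.pdf", "b.pdf", "c.pdf", "c.pdf"]]
def fake_fetch(skip, top):
    return fake_index[skip:skip + top]

paths = collect_unique_paths(fake_fetch, page_size=2)
print(len(paths))  # 3
```

Note that Azure AI Search caps $skip at 100,000, so very large indexes need a different strategy (e.g., range filters on a sortable key) rather than skip-based paging.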
Ensure the metadata_storage_path field is set as Retrievable to include it in search results, and Facetable to enable grouping, then confirm these settings via the Azure portal or index schema.
Regarding your question about "data normalization": I was referring to the possibility of duplicate metadata_storage_path values caused by issues such as duplicates in blob storage, or the indexer reprocessing the same document into multiple chunks. If the data appears correct, normalization concerns can be ruled out.
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents
@Shree Hima Bindu Maganti thanks again for another response. That didn't quite work, but it got me in the right direction. Below is the code block I used to get the unique count of the values in metadata_storage_path.
Hopefully this can help others. The count of unique source documents in the Azure AI Search UI would be much more valuable than the count of document chunks that is currently there.
import json

# response_json holds the parsed JSON returned by the facet search request
# (with "facets": ["metadata_storage_path,count:<N large enough>"])

# Write the response JSON to a file
with open('response.json', 'w') as json_file:
    json.dump(response_json, json_file, indent=4)

# Extract and count distinct values in metadata_storage_path
if '@search.facets' in response_json and 'metadata_storage_path' in response_json['@search.facets']:
    distinct_values = set()
    for item in response_json['@search.facets']['metadata_storage_path']:
        distinct_values.add(item['value'])
    print("Distinct metadata_storage_path count:", len(distinct_values))
else:
    print("metadata_storage_path facet not found in the response.")
Hi petermcnally,
Thank you for your response.
I'm glad that you were able to resolve your issue and thank you for posting your solution so that others experiencing the same thing can easily reference this! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others ", I'll repost your solution in case you'd like to accept the answer.
Issue: How to count the distinct number of source documents in Azure AI Search Index
I would like to be able to count the number of source documents in my Index. Of course, the index shows the number of 'documents,' but this is really document chunks. I am using the UI in the portal to 'Import and Vectorize.' I see a discrepancy in the count between the number of documents in my Azure blob storage container and the number of documents processed by the Indexer.
I would like a way to be able to validate this with a distinct count of source documents in my Index. I have the metadata_storage_path exposed as a field. It seems pretty simple. I just want a distinct count. However, I cannot find a way to do this.
Solution: The count of unique source documents in the Azure AI Search UI would be much more valuable than the count of document chunks that is currently there.
# Write the response JSON to a file
with open('response.json', 'w') as json_file:
    json.dump(response_json, json_file, indent=4)

# Extract and count distinct values in metadata_storage_path
if '@search.facets' in response_json and 'metadata_storage_path' in response_json['@search.facets']:
    distinct_values = set()
    for item in response_json['@search.facets']['metadata_storage_path']:
        distinct_values.add(item['value'])
    print("Distinct metadata_storage_path count:", len(distinct_values))
else:
    print("metadata_storage_path facet not found in the response.")
If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful?"