RCSB PDB Search API

Reference Documentation: RCSB PDB Search API Reference
Query Editor: RCSB PDB Search API Query Editor
Examples: RCSB PDB Search API Examples

Stay current with API announcements by subscribing to the RCSB PDB API mailing list:

sign in with an existing Google account and subscribe,
or send an email to api+subscribe@rcsb.org.

Introduction

The Search API accepts HTTP GET or POST requests with JSON payloads. The base URI of the search endpoint is https://search.rcsb.org/rcsbsearch/v2/query. With the GET method, the search request is sent as a URL-encoded query string in the json parameter: https://search.rcsb.org/rcsbsearch/v2/query?json={search-request}.
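As a rough illustration of the GET form, the sketch below serializes a request object and URL-encodes it into the json parameter using only the Python standard library. The `build_get_url` helper and the sample request (including its `type` and `service` fields) are illustrative assumptions; check them against the reference documentation.

```python
import json
from urllib.parse import quote

SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_get_url(search_request: dict) -> str:
    """Return the GET URL with the request URL-encoded into the json parameter."""
    payload = json.dumps(search_request, separators=(",", ":"))
    return f"{SEARCH_URL}?json={quote(payload, safe='')}"


# Assumed sample request; the query syntax is covered in the Query Language section.
sample_request = {
    "query": {"type": "terminal", "service": "text"},
    "return_type": "entry",
}
get_url = build_get_url(sample_request)
print(get_url)
```

With POST, the same JSON document would be sent as the request body instead of being encoded into the URL.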

The query syntax for the {search-request} is detailed in the Query Language section of this guide. See the Build Your Search section for general information on how to construct the {search-request} object.

The Search API is designed to return only the identifiers of relevant hits, together with additional metadata. See the Return Type section for the identifier types that can be requested and the Response Body section for the structure of the response. If you need details such as release date, macromolecules, organisms, resolution, modified residues, or ligands, use the RCSB Data API: https://data.rcsb.org.

Build Your Search

A search request is a complete specification of what should be returned in a result set. The search request is represented as a JSON object. The building blocks of the request are:

Context	Description
`return_type`	Required. Specifies the type of the returned identifiers, e.g. entry, polymer entity, assembly, etc. See Return Type section for more information.
`query`	Optional. Specifies the search expression. Can be omitted when the request should return only facets or counts rather than identifiers. In that case the request must be configured via the `request_options` context.
`request_options`	Optional. Controls various aspects of the search request, including pagination, sorting, scoring and faceting. If omitted, the default parameters for sorting, scoring and pagination are applied.
`request_info`	Optional. Specifies additional information about the query, e.g. `query_id`. It is used internally at RCSB PDB for logging purposes. When `query_id` is sent with the search request, it is included in the corresponding response object.

The query context may consist of two types of clauses:

Terminal node - performs an atomic search operation, e.g. searches for a particular value in a particular field. Parameters in the terminal query clause provide match criteria for finding relevant search hits. The set of parameters differs for different search services.
Group node - wraps other terminal or group nodes and is used to combine multiple queries in a logical fashion.

The simplest query requires only the return_type parameter and the query context. If the parameters property is left out of the query object, the query matches all documents; with return_type set to "entry", it returns PDB IDs.
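A minimal sketch of such a match-all request is shown below as a Python dictionary. The `type` and `service` fields of the terminal node are assumptions here, since this guide does not spell them out at this point.

```python
import json

# Assumption: a terminal node identifies itself with "type" and names its
# search service with "service"; verify both against the API reference.
match_all_request = {
    "query": {"type": "terminal", "service": "text"},  # no "parameters": matches all documents
    "return_type": "entry",  # hits are returned as PDB IDs
}
print(json.dumps(match_all_request, indent=2))
```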

Refer to the Examples section for more examples.

Search Services

The RCSB PDB Search API consolidates requests to heterogeneous search services. The list of available services is below:

Service	Description
`text`	Performs attribute searches against textual annotations associated with PDB structures. Refer to Structure Attributes Search page for a full list of annotations.
`text_chem`	Performs attribute searches against textual annotations associated with PDB molecular definitions. Refer to Chemical Attributes Search page for a full list of annotations.
`full_text`	Performs unstructured searches against textual annotations associated with PDB structures or molecular definitions. An unstructured search is a full-text search against multiple text attributes.
`sequence`	Employs the MMseqs2 software to perform fast, BLAST-like sequence matching searches based on a user-provided FASTA sequence (with E-value or % identity cutoffs). The following searches are available: `protein` (search for protein sequences), `dna` (search for DNA sequences), `rna` (search for RNA sequences).
`seqmotif`	Performs short motif searches against nucleotide or protein sequences, using three different types of input format: `simple` (e.g., CXCXXL), `prosite` (e.g., C-X-C-X(2)-[LIVMYFWC]), `regex` (e.g., CXCX{2}[LIVMYFWC]).
`structure`	Performs fast searches matching a global 3D shape of assemblies or chains of a given entry (identified by PDB ID), in either strict (`strict_shape_match`) or relaxed (`relaxed_shape_match`) modes, using a BioZernike descriptor strategy.
`strucmotif`	Performs structure motif searches on all available PDB structures.
`chemical`	Enables queries of small-molecule constituents of PDB structures, based on chemical formula and chemical structure. Both molecular formula and formula range searches are supported. Queries for matching and similar chemical structures can be performed using SMILES and InChI descriptors as search targets. Graph and chemical fingerprint searches are implemented using tools from the OpenEye Chemical Toolkit.

Descriptor matching criteria for the `chemical` service:

The following graph matching searches use a fingerprint prefilter, so they are designed to find only similar molecules. For each of them, graph matching is performed on the subset of molecules satisfying a fingerprint screening search, and results include isomorphic and substructure matches within this screened subset:

`graph-exact`: atom type, formal charge, bond order, atom and bond chirality, aromatic assignment, valence degree, and atom hydrogen count are used as matching criteria.
`graph-strict`: atom type, formal charge, bond order, atom and bond chirality, aromatic assignment, ring membership, and valence degree are used as matching criteria.
`graph-relaxed`: atom type, formal charge and bond order are used as matching criteria.
`graph-relaxed-stereo`: atom type, formal charge, bond order, atom and bond chirality are used as matching criteria.

`fingerprint-similarity`: Tanimoto similarity is used as the matching criterion for molecular fingerprints. Matches include molecules with scores exceeding 0.6 for TREE type fingerprints or 0.9 for MACCS type fingerprints.

The following graph matching searches perform an exhaustive substructure search with no pre-screening:

`sub-struct-graph-exact`: atom type, formal charge, aromaticity, bond order, atom/bond stereochemistry, valence degree, atom hydrogen count.
`sub-struct-graph-strict`: atom type, formal charge, aromaticity, bond order, atom/bond stereochemistry, ring membership, valence degree.
`sub-struct-graph-relaxed`: atom type, formal charge, bond type.
`sub-struct-graph-relaxed-stereo`: atom type, formal charge, bond type, atom/bond stereochemistry.

Return Type

The search can return one of the following result types:

Type	Description
`entry`	Returns a list of PDB IDs.
`assembly`	Returns a list of PDB IDs appended with assembly IDs in the format [pdb_id]-[assembly_id], corresponding to biological assemblies.
`polymer_entity`	Returns a list of PDB IDs appended with entity IDs in the format [pdb_id]_[entity_id], corresponding to polymeric molecular entities.
`non_polymer_entity`	Returns a list of PDB IDs appended with entity IDs in the format [pdb_id]_[entity_id], corresponding to non-polymeric entities (or ligands).
`polymer_instance`	Returns a list of PDB IDs appended with asym IDs in the format [pdb_id].[asym_id], corresponding to instances of certain polymeric molecular entities, also known as chains. Note that asym_id in the instance identifier corresponds to _label_asym_id from the mmCIF schema (assigned by the PDB). It can differ from _auth_asym_id (selected by the author at the time of deposition).
`mol_definition`	Returns a list of molecular definition identifiers that include: chemical component entries identified by the alphanumeric code COMP ID (e.g. ATP, ZN); BIRD entries identified by BIRD ID (e.g. PRD_000154).

Query Language

The Search API provides a full query DSL (domain-specific language) based on JSON to define queries.

Basic Search

The query language supports unstructured (basic) searches. An unstructured query searches the textual annotations associated with PDB structures when the field name is unknown. Such a query searches across all searchable fields and returns a hit if a match occurs in any of them.

To perform an unstructured search, you should send the parameters object without an explicit attribute property:
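A sketch of an unstructured request follows. The `basic_search` helper is hypothetical, and the `type` and `service` fields (with the `full_text` service named in the Search Services table) are assumed from the public API schema.

```python
def basic_search(value: str, return_type: str = "entry") -> dict:
    """Build a full-text request whose parameters carry only a value."""
    return {
        "query": {
            "type": "terminal",      # assumed node type field
            "service": "full_text",  # unstructured search service
            "parameters": {"value": value},  # no "attribute": all fields are searched
        },
        "return_type": return_type,
    }


basic_request = basic_search("thymidine kinase")
```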

Refer to the Examples section for more examples.

By default, all terms are optional, as long as one term matches: the query thymidine kinase is translated as thymidine OR kinase. To change the boolean logic to AND, wrap the input value in double quotes, i.e. "thymidine kinase".

Complex boolean queries in the basic search can be built with the following operators:

+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
( and ) signify precedence

For example, the query string interferon + response + factor is equivalent to the search interferon AND response AND factor.

You can use ( and ) to signify precedence. For example, searching with a query string isopeptide + ( collagen | fibrinogen ) returns structures that contain isopeptide AND either collagen OR fibrinogen.

Attribute Search

An attribute query searches for terms in a specific attribute. To perform an attribute search, send the parameters object with the attribute property set to a field name, the value property set to a search term, and the operator property set to a search operator.
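The following sketch assembles such a parameters object. The `attribute_query` helper is hypothetical, and the attribute `exptl.method` with the value `ELECTRON MICROSCOPY` is an assumed example; valid names and values are listed on the Structure Attributes Search page.

```python
def attribute_query(attribute: str, operator: str, value=None, **extra) -> dict:
    """Build a terminal node for the text service (node fields are assumed)."""
    parameters = {"attribute": attribute, "operator": operator}
    if value is not None:  # some operators, such as exists, take no value
        parameters["value"] = value
    parameters.update(extra)  # e.g. negation=True or case_sensitive=True
    return {"type": "terminal", "service": "text", "parameters": parameters}


attribute_request = {
    "query": attribute_query("exptl.method", "exact_match", "ELECTRON MICROSCOPY"),
    "return_type": "entry",
}
```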

Refer to the Examples section for more examples.

When using attribute search, you must observe the following rules:

The field must be a valid field name listed in Structure Attributes Search or Chemical Attributes Search.
The operator must be compatible with the field. The full list of operators is available in the Search Operators section.
The values entered must match the type of the field and be compatible with the operator. Date values should be specified in ISO 8601 formats:
- Date: YYYY-mm-DD
- Date and Time: YYYY-mm-DD'T'HH:MM:SS'Z', where the 'Z' means UTC

Negation

To perform negation on the operator, the negation property should be set to true in the query parameters object. The following search returns non-protein polymeric entities:
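One way such a request might look is sketched below. The attribute name `entity_poly.rcsb_entity_polymer_type` and the value `Protein` are assumptions, as are the `type` and `service` fields; confirm them on the Structure Attributes Search page.

```python
non_protein_request = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "entity_poly.rcsb_entity_polymer_type",  # assumed attribute
            "operator": "exact_match",
            "value": "Protein",
            "negation": True,  # invert the match: polymer type is NOT "Protein"
        },
    },
    "return_type": "polymer_entity",
}
```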

Refer to the Examples section for more examples.

Case-Sensitive Search

By default, searches performed using exact match operators are case-insensitive. You can make your search case-sensitive by setting the case_sensitive property in the query parameters object to true. This option is useful when capitalization conveys additional information. For example, gene symbols can differ in capitalization between homologs from different species: human gene symbols are conventionally written in upper case.

The following search returns human kinases encoded by the ABL1 gene. It excludes results where the case doesn't match, such as non-receptor tyrosine-protein kinase from mouse encoded by the Abl1 gene.
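A hedged sketch of the gene-name part of such a search is given below. The attribute `rcsb_entity_source_organism.rcsb_gene_name.value` is an assumption, and a complete query would likely combine this node with further constraints (for instance on the organism).

```python
abl1_node = {
    "type": "terminal",
    "service": "text",
    "parameters": {
        "attribute": "rcsb_entity_source_organism.rcsb_gene_name.value",  # assumed
        "operator": "exact_match",
        "value": "ABL1",
        "case_sensitive": True,  # "Abl1" (mouse) no longer matches
    },
}
abl1_request = {"query": abl1_node, "return_type": "polymer_entity"}
```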

Refer to the Examples section for more examples.

Boolean Expressions

The query language supports two boolean operators: AND and OR. A boolean operator is added to a group node as the logical_operator property. Group nodes can be used to logically combine search expressions (terminal nodes) or other group nodes:
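The nesting can be sketched with two small hypothetical helpers. The `type` and `nodes` fields, the lowercase spelling of the operator values, and the `exptl.method` attribute are assumptions based on the public schema rather than on this guide.

```python
def terminal(service: str, **parameters) -> dict:
    """Terminal node: one atomic search operation."""
    return {"type": "terminal", "service": service, "parameters": parameters}


def group(logical_operator: str, *nodes: dict) -> dict:
    """Group node: combines terminal or group nodes with AND / OR."""
    if logical_operator not in ("and", "or"):
        raise ValueError("logical_operator must be 'and' or 'or'")
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}


# kinase AND (X-ray OR electron microscopy)
boolean_query = group(
    "and",
    terminal("full_text", value="kinase"),
    group(
        "or",
        terminal("text", attribute="exptl.method", operator="exact_match", value="X-RAY DIFFRACTION"),
        terminal("text", attribute="exptl.method", operator="exact_match", value="ELECTRON MICROSCOPY"),
    ),
)
```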

Refer to the Examples section for more examples.

Scoring Strategy

You can customize how scores from different services affect the final relevancy ranking of your search results by setting a scoring_strategy in the request_options context. The following scoring strategies are available: combined (default), sequence, seqmotif, strucmotif, structure, chemical, and text. For example, to boost search results based on the relevance score produced by the sequence search service, use the sequence scoring strategy.

The final relevancy score is calculated as a weighted sum of the normalized scores produced by the different search services. All search result scores are rescaled to the interval [0, 1]; a hit with a score of 0 still met the criteria of the search. When the combined strategy is used, equal weights are applied. For the other strategies, a higher weight is given to the scores of the selected service, which increases their contribution to the final score and promotes a ranking influenced by that service.

Sorting

Sorting is determined by the sort object in the request_options context. It allows you to add one or more sorting conditions to control the order of the search result hits. The sort operation is defined per field; the special field name score sorts by relevancy score (the default).

See the Structure Attributes Search and Chemical Attributes Search pages to find all searchable attributes. Any attribute listing exact_match or equals operators can be used for sorting.

By default, sorting is done in descending order ("desc"). The sort can be reversed by setting the direction property to "asc". This example demonstrates how to sort the search results by release date:
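A possible shape for these options is sketched here. The `sort_by` key, the list form of `sort`, and the attribute `rcsb_accession_info.initial_release_date` are assumptions; `sort_condition` is a hypothetical helper.

```python
def sort_condition(sort_by: str = "score", direction: str = "desc") -> dict:
    """One sorting condition; 'score' is the special relevancy field."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    return {"sort_by": sort_by, "direction": direction}


sort_options = {
    "sort": [sort_condition("rcsb_accession_info.initial_release_date", "desc")]
}
```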

Refer to the Examples section for more examples.

Pagination

By default, only the first 10 hits are included in the search result list. Pagination can be configured with the start and rows parameters of the paginate object in the request_options context.

Note that the maximum number of hits that can be retrieved in a single pagination request, with start and rows, is 10,000.

To retrieve all hits use the return_all_hits parameter in the request_options context. Please note that returning all hits is generally not desirable and may be the source of performance issues.
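When a result set is walked page by page, the paginate objects could be generated as in this sketch. `page_windows` is a hypothetical helper, and reading the 10,000 limit as a cap on rows per request is an assumption.

```python
MAX_ROWS = 10_000  # per-request cap described above (interpretation assumed)


def page_windows(total_count: int, rows: int = 100):
    """Yield paginate objects that together cover total_count hits."""
    if not 0 < rows <= MAX_ROWS:
        raise ValueError(f"rows must be between 1 and {MAX_ROWS}")
    for start in range(0, total_count, rows):
        yield {"start": start, "rows": min(rows, total_count - start)}


for paginate in page_windows(250, rows=100):
    request_options = {"paginate": paginate}
    print(request_options)
```

The total number of hits for a query is reported as total_count in the response body, so the first response can be used to plan the remaining pages.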

Refer to the Examples section for more examples.

Counting Results

By default, the search results contain a list of matched identifiers and additional metadata. See Search Results for more details. The return_counts flag in the request_options context allows you to execute a search query and get back only the number of matches for that query. The following query returns the number of current structures in the PDB archive:
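A count-only request might be put together as below, reusing a match-all terminal node. The `type` and `service` fields are assumed, as elsewhere in these sketches.

```python
import json

count_request = {
    "query": {"type": "terminal", "service": "text"},  # match-all query
    "request_options": {"return_counts": True},        # return only the number of matches
    "return_type": "entry",
}
count_payload = json.dumps(count_request)  # Python True is serialized as JSON true
print(count_payload)
```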

Refer to the Examples section for more examples.

Include Computed Models

RCSB PDB has integrated Computed Structure Models from AlphaFold DB and ModelArchive. To include Computed Structure Models in your search results, add the results_content_type parameter to the request_options context. This parameter specifies a content type filter that can include experimental structures, computational structures, or both.
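The option could be set as in the sketch below. The literal values `experimental` and `computational` and the list form of the parameter are assumptions to verify against the reference documentation; `content_type_option` is a hypothetical helper.

```python
ALLOWED_CONTENT_TYPES = ("experimental", "computational")  # assumed literal values


def content_type_option(*content_types: str) -> dict:
    """Build the results_content_type fragment of request_options."""
    unknown = [c for c in content_types if c not in ALLOWED_CONTENT_TYPES]
    if unknown or not content_types:
        raise ValueError(f"expected one or more of {ALLOWED_CONTENT_TYPES}")
    return {"results_content_type": list(content_types)}


both_options = content_type_option("experimental", "computational")
```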

Refer to the Examples section for more examples.

Faceted queries (or facets) provide you with the ability to group and perform calculations and statistics on PDB data by using a simple search query. Facets are the arrangement of search results into categories (buckets) based on the requested field values.

If the facets property is specified in the request_options context, the search results are presented along with numerical counts of how many matching IDs were found for each term requested in the facets. If the query context is omitted in the search request with facets specified, the response will contain only the facet counts.

This example calculates the breakdown by experimental method of PDB structures, released after 2019-08-20:
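A request along these lines might look as follows. The facet keys `name` and `aggregation_type`, and the attributes `rcsb_accession_info.initial_release_date` and `exptl.method`, are assumptions drawn from the public schema, not from this guide.

```python
methods_facet_request = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_accession_info.initial_release_date",  # assumed
            "operator": "greater",
            "value": "2019-08-20",
        },
    },
    "request_options": {
        "facets": [
            {
                "name": "Methods",            # label chosen by the caller
                "aggregation_type": "terms",  # assumed key and value
                "attribute": "exptl.method",  # assumed attribute
            }
        ]
    },
    "return_type": "entry",
}
```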

By default, searches containing a faceted query return both search hits and aggregation results. To return only aggregation results, set rows to 0 in the paginate object:

Refer to the Examples section for more examples.

Terms faceting is a multi-bucket aggregation where buckets are dynamically built - one per unique value. For each bucket terms faceting counts the documents (entry, polymer_entity, etc.) that contain a given value in a given field. For example, you can run the terms aggregation on the field rcsb_primary_citation.rcsb_journal_abbrev which holds the abbreviated name of a journal associated with an entry. In return, we have buckets for each journal, each with their PDB entry counts.

You can specify a minimum count that a bucket must reach in order to be returned by using the min_interval_population parameter. In this example, only journals associated with at least 1000 entries are returned:
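The facet definition could be written as in this sketch. The `name` and `aggregation_type` keys are assumed; the attribute and the threshold come from the text above.

```python
journal_facet = {
    "name": "Journals",           # caller-chosen label (assumed key)
    "aggregation_type": "terms",  # assumed key and value
    "attribute": "rcsb_primary_citation.rcsb_journal_abbrev",
    "min_interval_population": 1000,  # drop journals with fewer than 1000 entries
}
journal_options = {"facets": [journal_facet]}
```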

You can also control the number of returned buckets using the max_num_intervals parameter (up to a limit of 65536). Larger values of max_num_intervals use more memory to compute and push the whole aggregation close to the limit. You’ll know you’ve gone too large if the request fails with a message about max_buckets.

Refer to the Examples section for more examples.

Histogram faceting is a multi-bucket aggregation that can be applied to numeric values. It builds fixed-size buckets (intervals) over the values. For example, for the rcsb_polymer_entity.formula_weight field, which holds the formula mass (kDa) of the entity, we can configure this aggregation to build buckets with an interval of 50 kDa:
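A histogram facet of this kind might be declared as below. `histogram_facet` is a hypothetical helper; the `name` and `aggregation_type` keys and the `histogram` value are assumptions.

```python
def histogram_facet(name: str, attribute: str, interval: float, min_population=None) -> dict:
    """Build a histogram facet; optional settings are left out when unset."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    facet = {
        "name": name,
        "aggregation_type": "histogram",  # assumed key and value
        "attribute": attribute,
        "interval": interval,
    }
    if min_population is not None:
        facet["min_interval_population"] = min_population
    return facet


weight_facet = histogram_facet("Formula Weight", "rcsb_polymer_entity.formula_weight", 50)
```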

You can use the min_interval_population parameter to return only buckets whose count is greater than or equal to the given value.

Refer to the Examples section for more examples.

Date histogram faceting is a multi-bucket aggregation similar to histogram faceting, but it can only be used with date values. Calendar-aware intervals are configured with the interval parameter. For example, we can configure this aggregation to build buckets with 1 year intervals:

Refer to the Examples section for more examples.

Range faceting is a multi-bucket aggregation that enables the user to define a set of numeric ranges, each representing a bucket. Note that this aggregation includes the from value and excludes the to value for each range. Omitting the from or to parameter creates a bucket that is open-ended on that side. Example:
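The sketch below pairs an assumed facet definition with a local check of the boundary rule (from inclusive, to exclusive). The `ranges`, `name` and `aggregation_type` keys and the attribute `rcsb_entry_info.resolution_combined` are assumptions; `in_bucket` only mimics the rule locally and is not part of the API.

```python
resolution_facet = {
    "name": "Resolution",
    "aggregation_type": "range",                         # assumed key and value
    "attribute": "rcsb_entry_info.resolution_combined",  # assumed attribute
    "ranges": [{"to": 2.0}, {"from": 2.0, "to": 3.0}, {"from": 3.0}],
}


def in_bucket(value: float, bucket: dict) -> bool:
    """from is inclusive, to is exclusive; a missing bound is open-ended."""
    lower = bucket.get("from", float("-inf"))
    upper = bucket.get("to", float("inf"))
    return lower <= value < upper


# A value of exactly 2.0 falls into the second bucket only.
hits = [in_bucket(2.0, bucket) for bucket in resolution_facet["ranges"]]
```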

Refer to the Examples section for more examples.

Date range faceting is a multi-bucket aggregation similar to range faceting, but dedicated to date values. The main difference from the normal range aggregation is that the from and to values can be expressed as date math expressions. Example:
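A date range facet using date math might be declared as follows. The `date_range` value, the `ranges`, `name` and `aggregation_type` keys, and the attribute name are assumptions; the expression syntax follows the Date Math Expressions section.

```python
release_date_facet = {
    "name": "Release Date",
    "aggregation_type": "date_range",                         # assumed key and value
    "attribute": "rcsb_accession_info.initial_release_date",  # assumed attribute
    "ranges": [
        {"to": "2020-06-01||-12M"},                        # more than 12 months before the anchor
        {"from": "2020-06-01||-12M", "to": "2020-06-01"},  # the 12 months before the anchor
        {"from": "2020-06-01"},                            # from the anchor date onwards
    ],
}
```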

Refer to the Examples section for more examples.

Cardinality faceting is a single-value metrics aggregation that calculates the count of distinct values returned for a given field. For example, you can count unique source organism name assignments in the PDB archive:
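A cardinality facet over the whole archive could be requested as sketched here. The query context is omitted, so only facet results would come back. The attribute `rcsb_entity_source_organism.ncbi_scientific_name` and the facet keys are assumptions.

```python
organism_count_request = {
    "request_options": {
        "facets": [
            {
                "name": "Organism Names Count",
                "aggregation_type": "cardinality",  # assumed key and value
                "attribute": "rcsb_entity_source_organism.ncbi_scientific_name",  # assumed
            }
        ]
    },
    "return_type": "polymer_entity",
}
```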

Refer to the Examples section for more examples.

As its name suggests, the filter aggregation restricts the documents that contribute to the bucket counts. In the example below, we keep only protein chains that adopt 2 different beta propeller arrangements according to the CATH classification:

Refer to the Examples section for more examples.

Complex, multi-dimensional aggregations are possible as in the example below:

Refer to the Examples section for more examples.

Search Operators

Search operators are commands that help you make your search more specific and focused. The following operators can be used to perform a field search:

Exact Match Operators

Exact match operators indicate that the input value should match a field value exactly, including whitespace and special characters. Case handling depends on the operator, as described below.

exact_match

You can use the exact_match operator to find exact occurrences of the input value. Comparisons with exact_match operator are case-insensitive by default. See the Case-Sensitive Search paragraph of the Attribute Search section to learn how to configure case-sensitive exact searches.

A single value input is required for this operator and must be a string.

in

The in operator allows you to specify multiple values in a single search expression. It returns results if any value in a list of input values matches. It can be used instead of multiple OR conditions. Comparisons with in operator are case-sensitive.

An input value is required for this operator and it must be a list of strings, numbers or dates.

Full-Text Operators

The full-text operators enable you to perform linguistic searches against text data by operating on words and phrases. The input text is analyzed before performing a search. The analysis includes the following transformations:

Most punctuation is removed
The remaining content is broken into individual words, called tokens
Tokens are lowercased which makes search case-insensitive

The standard grammar based tokenization is used to break input text into tokens. Refer to the Unicode Text Segmentation documentation for more information on tokenization rules.

contains_words

The contains_words operator performs a full-text search by operating on the words in the provided text. After the text is broken into tokens, a basic query is constructed for each token and the queries are combined with OR boolean logic. For example, "actin-binding protein" will be interpreted as "actin" OR "binding" OR "protein". The search will return results if any of these tokens match. This operator can match multiple tokens in any order.

A single value input is required for this operator and it must be a string.

contains_phrase

The contains_phrase operator performs a full-text search by operating on phrases. The following criteria must be fulfilled for results to be returned:

All the tokens must appear in the field
They must have the same order as in the input text

For example, "actin-binding protein" will be interpreted as "actin" AND "binding" AND "protein" occurring in a given order.

A single value input is required for this operator and it must be a string.

Comparison Operators

The greater, less, greater_or_equal, less_or_equal, and equals operators match fields whose values are, respectively, larger than, smaller than, larger than or equal to, smaller than or equal to, or equal to the given input value.

A single value input is required for these operators and it must be a number or date.

Range Operator

The range operator can be used to match values within a provided range.

A single value input is required for this operator and it must be an object as follows:

By default, lower and upper bounds are excluded. They can be included by setting include_lower and include_upper to true respectively. An inclusive bound means that the boundary point itself is included in the range as well, while an exclusive bound means that the boundary point is not included in the range.
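The value object and the effect of the two flags can be sketched as follows. The `from` and `to` keys are assumed by analogy with the range facets; `range_value` and `matches` are hypothetical helpers that only mirror the boundary rule locally.

```python
def range_value(lower, upper, include_lower: bool = False, include_upper: bool = False) -> dict:
    """Value object for the range operator; bounds are exclusive by default."""
    return {
        "from": lower,  # assumed key
        "to": upper,    # assumed key
        "include_lower": include_lower,
        "include_upper": include_upper,
    }


def matches(x, value: dict) -> bool:
    """Local re-statement of the inclusive/exclusive boundary rule."""
    above = x >= value["from"] if value["include_lower"] else x > value["from"]
    below = x <= value["to"] if value["include_upper"] else x < value["to"]
    return above and below


resolution_range = range_value(1.0, 2.0, include_lower=True)
```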

Refer to the Examples section for more examples.

Exists Operator

The exists operator is a logical operator that checks whether a given field contains any value. To be deemed non-existent, the value must be null or []. The following values indicate that the field does exist:

Empty strings, such as " " or "-"
Arrays containing null and another value, such as [null, "foo"]

The operator doesn't require a value.

Date Math Expressions

Comparison and range operators support date math expressions. The expression starts with an "anchor" date, which can be: a) now or b) a date string (in the applicable format) ending with ||. The anchor can then be followed by a math expression using + and -, e.g. "2020-06-01||-12M", "now-1w".

The units supported are:

y (year)
M (month)
w (week)
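To make the syntax concrete, the hypothetical helper below composes expressions from an anchor and offsets, accepting only the three units listed above.

```python
import re

_OFFSET = re.compile(r"[+-]\d+[yMw]")  # sign, amount, one of the listed units


def date_math(anchor: str, *offsets: str) -> str:
    """Compose a date math expression from an anchor and +/- offsets."""
    for offset in offsets:
        if not _OFFSET.fullmatch(offset):
            raise ValueError(f"unsupported offset: {offset!r}")
    prefix = "now" if anchor == "now" else f"{anchor}||"
    return prefix + "".join(offsets)


print(date_math("2020-06-01", "-12M"))
print(date_math("now", "-1w"))
```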

Search Attributes

The attributes available for search include the annotations described by the mmCIF dictionary, annotations coming from external resources, and attributes added by RCSB PDB. Both internal additions to the mmCIF dictionary and annotations from external resources are prefixed with rcsb_.

Refer to the Structure Attributes Search and Chemical Attributes Search pages for a full list of the attributes that are available for text searches.

Search Results

The HTTP 200 (OK) status code indicates that the search API request has been processed successfully and that the server returns search results data. The response data is formatted as JSON and its structure is determined by parameters in the query. Query parameters can be used to structure the result set in the following ways:

Specify the granularity of the returned identifiers. See Return Type.
Order results. See Sorting.
Limit the number of hits in the results (10 by default). See Pagination.
Include only the results count. See Counting Results.
Include search facets. See Requesting Facets.

Response Body

The search response body provides details about the search execution itself as well as an array of the individual search hits. The following information is available in the search results response body:

Name	Description
`query_id`	Required. Unique query ID assigned to the request or passed as a query parameter.
`result_type`	Required. Specifies the granularity of the returned identifiers requested in the query. See Return Type.
`total_count`	Required. The total number of matched identifiers.
`explain_metadata`	Optional. Contains details on the query execution time (in milliseconds).
`result_set`	Optional. Search results set is returned as PDB identifiers and accompanying metadata.
`group_set`	Optional. Search results are returned as groups.
`facets`	Optional. Facets array contains search facets for requested attributes.

An example of search response is shown below:

Results Set

Results set is an array of objects representing search hits. Each hit contains the matching identifier, score, and metadata produced by search services.

Result Identifiers

While a search query might include a large number of attributes, only the matching PDB identifiers, representing a desired level of granularity, are included in the result set. The following notation is used for PDB identifiers:

[pdb_id] - for PDB entries (e.g. 4HHB)
[pdb_id]_[entity_id] - for polymer, branched, or non-polymer entities (e.g. 4HHB_1)
[pdb_id].[asym_id] - for polymer, branched, or non-polymer entity instances (e.g. 4HHB.A)
[pdb_id]-[assembly_id] - for biological assemblies (e.g. 4HHB-1)
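Since the separator depends on the requested return type, an identifier can be split once the return type is known. The `split_identifier` helper below is a hypothetical sketch; it splits on the last separator, on the assumption that the PDB ID part itself might contain the separator character.

```python
SEPARATORS = {
    "entry": None,
    "assembly": "-",
    "polymer_entity": "_",
    "non_polymer_entity": "_",
    "polymer_instance": ".",
}


def split_identifier(identifier: str, return_type: str):
    """Split a result identifier into (pdb_id, suffix) for a given return type."""
    separator = SEPARATORS[return_type]
    if separator is None:
        return identifier, None
    pdb_id, found, suffix = identifier.rpartition(separator)
    if not found or not pdb_id or not suffix:
        raise ValueError(f"{identifier!r} is not a valid {return_type} identifier")
    return pdb_id, suffix


print(split_identifier("4HHB_1", "polymer_entity"))
```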

Relevancy Score

The final relevancy score is calculated as a weighted sum of the normalized scores produced by the different search services. By default, scores from all services are weighted equally. See the Scoring Strategy section for more details on how to configure scoring. The higher the score, the more relevant the hit is.

Service Metadata

Different search services produce different metadata and use different scoring metrics. Set the results verbosity level to verbose to return the additional metadata and raw scores described below:

Name	Description
`node_id`	Required. Distinct numeric ID is assigned to results produced by each search service.
`original_score`	Required. The original (raw) score produced by a search service chosen as relevance score for this service. For example, the bit score of the alignment is chosen as raw relevance score for a sequence search service.
`norm_score`	Required. The original score transformed onto a scale between 0 and 1 using min-max normalization algorithm (higher means more significant).
`match_context`	Optional. Additional metadata produced by search services. The match context is included only for select return types. For example, if a sequence search was performed and `polymer_entity` is specified as the return type, the results include `match_context` with additional metadata such as sequence identity, E-value, bit-score values and the residue boundary positions of the matching sequence. The `match_context` is not included if the same search is performed but the return type is set to `entry` or `assembly`.

The following snippet shows an example of search results for a query that combines 4 different search services. Here, the search results set contains one search hit at the granularity of PDB entry:

Results Verbosity Level

By default, search results are returned with additional metadata (see Search Results for more details). Results verbosity level can be adjusted by setting the results_verbosity parameter in the request_options context. The results' verbosity levels from the most verbose to the least are as follows:

verbose - every search hit is returned in a format described in Result Identifiers with all metadata items set
minimal (default) - every search hit is returned in a format described in Result Identifiers with only a relevancy score set
compact - every search hit is returned as a simple string, e.g. "4HHB", with no additional metadata

Empty Results

The HTTP Status 204 (No Content) status code indicates that the search API request has been processed successfully but no search hits were found.
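Client code therefore has to treat 204 as a successful but empty answer rather than trying to parse a body. A minimal sketch follows; `parse_search_response` is hypothetical, and the sample body, including the `identifier` and `score` keys of each hit, is made up for illustration.

```python
import json


def parse_search_response(status: int, body: str) -> dict:
    """Turn an HTTP status and body into a result dictionary."""
    if status == 204:  # processed successfully, but no hits and no body
        return {"total_count": 0, "result_set": []}
    if status != 200:
        raise RuntimeError(f"search request failed with HTTP status {status}")
    return json.loads(body)


# Made-up sample body, not real API output.
sample_body = (
    '{"query_id": "hypothetical-id", "result_type": "entry", "total_count": 1, '
    '"result_set": [{"identifier": "4HHB", "score": 1.0}]}'
)
found = parse_search_response(200, sample_body)
empty = parse_search_response(204, "")
```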

Dealing with Redundancy

The PDB archive includes multiple structures of the same molecule, providing snapshots of the structure, interactions, and functions of that molecule, for example the same protein studied in different experimental conditions or with different ligands bound. This data redundancy may present challenges in bioinformatics analyses. It is helpful to be able to remove redundancy and group search results, as this helps ensure that similar and homologous proteins that appear in high numbers in a set of results do not introduce undesirable biases. Also, as the size of the PDB continues to grow, reducing redundancy helps when one seeks to obtain smaller datasets of distinct representatives.

Redundancy occurs at many levels (such as the level of sequence or structure similarity), and different grouping methods can be applied to PDB data in order to provide a non-redundant view.

Group By Parameters

To enable results grouping, the group_by parameters must be defined in the request_options context. Different grouping methods are available for a given Return Type:

Return Type	Grouping Options
`entry`	`matching_deposit_group_id` - grouping on the basis of a common identifier for a group of entries deposited as a collection. Such entries enter the PDB archive via the GroupDep system, which allows for parallel deposition of 10s–100s of related structures (typically the same protein with different bound ligands).
`polymer_entity`	`sequence_identity` - grouping on the basis of protein sequence clusters that meet a predefined identity threshold. Six levels of sequence identity are defined: `100%`, `95%`, `90%`, `70%`, `50%`, `30%`. Mutual sequence identity is determined by the MMseqs2 software. `matching_uniprot_accession` - grouping on the basis of a common UniProt accession. UniProtKB assigns a unique accession to each protein product encoded by one gene in a given species.

Group By Return Type

The group_by_return_type parameter in the request_options context controls the form in which the grouped results are returned. The following options are available:

representatives (default) - a single representative is selected from each group and a flat list of representatives is returned in the main results format. The representative is the top-ranked group member. The ranking criterion is controlled by the ranking_criteria_type parameter (see Group Members Ranking).
groups - search results are divided into groups, and each group is returned with all associated search hits (members of that group that satisfy the given search constraints).
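A minimal sketch of how the two options might appear in a request body. The `group_by` object and its `aggregation_method` key are assumptions here, as is the `full_text` service used for the illustrative query; the helper name `grouped_request` is hypothetical. Check the reference documentation for the exact schema.

```python
import json

ALLOWED_GROUP_RETURN_TYPES = {"representatives", "groups"}

def grouped_request(query, return_type, aggregation_method,
                    group_by_return_type="representatives"):
    """Build a search request body that asks for grouped results."""
    if group_by_return_type not in ALLOWED_GROUP_RETURN_TYPES:
        raise ValueError(f"unsupported group_by_return_type: {group_by_return_type}")
    return {
        "query": query,
        "return_type": return_type,
        "request_options": {
            # Assumed layout: grouping settings live under "group_by".
            "group_by": {"aggregation_method": aggregation_method},
            "group_by_return_type": group_by_return_type,
        },
    }

# Illustrative query only; the search term is arbitrary.
query = {"type": "terminal", "service": "full_text", "parameters": {"value": "kinase"}}

flat = grouped_request(query, "entry", "matching_deposit_group_id")
nested = grouped_request(query, "entry", "matching_deposit_group_id", "groups")
print(json.dumps(nested, indent=2))
```

The first request would yield a flat list of representatives; the second would yield each group together with its members.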

Return Grouped Results

It can be useful to study the variability among similar (redundant) search hits. You can use the group_by parameters in combination with the group_by_return_type parameter set to groups to return results as groups of similar objects. A few examples are listed below:

Group By Sequence Identity

This example groups together identical human sequences from high-resolution (1.0-2.0Å) structures determined by X-ray crystallography. Among the resulting groups, there is a cluster of human glutathione transferases in complex with different substrates.
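One way such a request could be assembled is sketched below. The attribute names, the `range` value layout, and the `aggregation_method` and `similarity_cutoff` keys are assumptions that should be verified against the Search Attributes and reference documentation; `terminal` is a hypothetical helper.

```python
import json

def terminal(attribute, operator, value):
    """Wrap one attribute condition as a text-service terminal node."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }

request = {
    "query": {
        "type": "group",
        "logical_operator": "and",
        "nodes": [
            terminal("rcsb_entity_source_organism.taxonomy_lineage.name",
                     "exact_match", "Homo sapiens"),
            terminal("exptl.method", "exact_match", "X-RAY DIFFRACTION"),
            # Resolution between 1.0 and 2.0 Å, both bounds included.
            terminal("rcsb_entry_info.resolution_combined", "range",
                     {"from": 1.0, "to": 2.0,
                      "include_lower": True, "include_upper": True}),
        ],
    },
    "return_type": "polymer_entity",
    "request_options": {
        # A cutoff of 100 groups identical sequences together.
        "group_by": {"aggregation_method": "sequence_identity",
                     "similarity_cutoff": 100},
        "group_by_return_type": "groups",
    },
}
print(json.dumps(request, indent=2))
```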

Group By UniProt Accession

This example demonstrates how to use matching_uniprot_accession grouping to get distinct Spike protein S1 proteins released since the beginning of 2020. Here, all entities are represented by distinct groups of SARS-CoV, SARS-CoV-2 and Pangolin coronavirus spike proteins.
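A hedged sketch of a request along these lines follows. The description attribute, the date attribute, the date format, and the `aggregation_method` key are assumptions chosen for illustration and may differ from the query used on the original page.

```python
import json

def terminal(attribute, operator, value):
    """Wrap one attribute condition as a text-service terminal node."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }

request = {
    "query": {
        "type": "group",
        "logical_operator": "and",
        "nodes": [
            # Assumed attribute holding the polymer entity description.
            terminal("rcsb_polymer_entity.pdbx_description",
                     "contains_phrase", "Spike protein S1"),
            # Assumed attribute and timestamp format for the release date.
            terminal("rcsb_accession_info.initial_release_date",
                     "greater_or_equal", "2020-01-01T00:00:00Z"),
        ],
    },
    "return_type": "polymer_entity",
    "request_options": {
        "group_by": {"aggregation_method": "matching_uniprot_accession"},
        "group_by_return_type": "groups",
    },
}
payload = json.dumps(request)
print(payload)
```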

Although it’s true that a search hit will only appear once within a grouped set of search hits, it’s important to note that in some cases multiple groups can contain the same search hit. For example, when results are grouped by the UniProt accession, chimeric entities will appear in multiple groups.

Remove Redundant Results

It can be useful to remove redundant search hits from your results. You can use the group_by parameters in combination with the group_by_return_type parameter set to representatives to return only a single representative from each of the resulting groups. For example, you may want to remove similar sequences with specific levels of mutual sequence identity. The non-redundant result set consists solely of representative search hits from the original redundant search results that satisfy the given search constraints.

This example shows how to retrieve a set of polymer entities from protein-protein complexes with the following constraints:

Must be from a protein-protein complex, not a single protein
Complexes must consist of proteins only
Experimental Method: X-ray or EM
Resolution: <= 2 Angstrom
R-observed <= 0.2
Sequence identity cutoff to remove redundancy: 30%
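The constraints above could be translated into a request roughly as follows. Every attribute name and value in this sketch is an assumption to be confirmed against the Search Attributes listing, and the `group_by` keys are likewise assumed; only the overall shape is the point.

```python
import json

def terminal(attribute, operator, value):
    """Wrap one attribute condition as a text-service terminal node."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }

constraints = [
    # More than one protein entity, i.e. a complex rather than a single protein.
    terminal("rcsb_entry_info.polymer_entity_count_protein", "greater_or_equal", 2),
    # Proteins only, no nucleic acids.
    terminal("rcsb_entry_info.selected_polymer_entity_types", "exact_match", "Protein (only)"),
    # X-ray or EM.
    terminal("exptl.method", "in", ["X-RAY DIFFRACTION", "ELECTRON MICROSCOPY"]),
    terminal("rcsb_entry_info.resolution_combined", "less_or_equal", 2.0),
    terminal("refine.ls_R_factor_obs", "less_or_equal", 0.2),
]

request = {
    "query": {"type": "group", "logical_operator": "and", "nodes": constraints},
    "return_type": "polymer_entity",
    "request_options": {
        # One representative per 30% sequence identity cluster.
        "group_by": {"aggregation_method": "sequence_identity",
                     "similarity_cutoff": 30},
        "group_by_return_type": "representatives",
    },
}
print(json.dumps(request, indent=2))
```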

Group Members Ranking

Group members ranking orders the search hits in each of the resulting groups so that the most relevant, useful hits come first and you can more easily find what you’re looking for.

The ranking system is made up of a series of options:

ranking by member attribute - this option works in the same way as Sorting. You can use this option to order group members by any property that is available for sorting, for example, resolution, release date, etc.
score (default) - this option orders group members so that the hits most relevant to a given search query come first.
ranking options specific to aggregation method - these options are predefined for each aggregation method and typically involve pre-computation based on certain metrics.

For example, you can search for rhodopsins and rhodopsin-like proteins, request that all proteins sharing at least 50% sequence identity be grouped, and order polymer entities within each group by sequence similarity score.

Examples of ranking options specific to aggregation method are detailed below:

Ranking Options For UniProt Groups

coverage - the percent coverage of the UniProt sequence by the PDB polymer entity sequence
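The ranking options might be expressed as in the sketch below. The `sort_by` and `direction` keys inside `ranking_criteria_type`, and the `aggregation_method` and `similarity_cutoff` keys, are assumptions; `ranking` is a hypothetical helper.

```python
def ranking(sort_by, direction=None):
    """Build a ranking_criteria_type object (assumed key names)."""
    criteria = {"sort_by": sort_by}
    if direction is not None:
        if direction not in ("asc", "desc"):
            raise ValueError("direction must be 'asc' or 'desc'")
        criteria["direction"] = direction
    return criteria

# Option specific to UniProt groups: rank members by sequence coverage.
by_coverage = {
    "aggregation_method": "matching_uniprot_accession",
    "ranking_criteria_type": ranking("coverage"),
}

# Ranking by member attribute: best (lowest) resolution first.
by_resolution = {
    "aggregation_method": "sequence_identity",
    "similarity_cutoff": 50,
    "ranking_criteria_type": ranking("rcsb_entry_info.resolution_combined", "asc"),
}

# Default behaviour made explicit: rank by relevancy score.
by_score = {
    "aggregation_method": "sequence_identity",
    "similarity_cutoff": 50,
    "ranking_criteria_type": ranking("score"),
}
```

Any of these objects would be placed under `group_by` in the request_options context.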

Faceting Upon Grouped Results

By default, facet counts are based upon the original query results, not the grouped results. This means that whether or not you turn grouping on for a query, the facet counts will be the same.

To return non-redundant facet counts, the group_by_return_type parameter must be set to representatives.
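As a sketch, a faceted request over representatives might look like this. The facet object layout (`name`, `aggregation_type`, `attribute`), the attribute names, and the `group_by` keys are assumptions; see the Faceted Queries section for the actual facet syntax.

```python
import json

request = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_entity_source_organism.taxonomy_lineage.name",
            "operator": "exact_match",
            "value": "Homo sapiens",
        },
    },
    "return_type": "polymer_entity",
    "request_options": {
        "group_by": {"aggregation_method": "sequence_identity",
                     "similarity_cutoff": 30},
        # Facet counts are computed over representatives only.
        "group_by_return_type": "representatives",
        "facets": [
            {
                "name": "Experimental Method",  # arbitrary label
                "aggregation_type": "terms",
                "attribute": "exptl.method",
            }
        ],
    },
}
print(json.dumps(request, indent=2))
```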

Sorting Grouped Results

Sorting interacts with grouping in an important way. By default, groups are sorted by the number of search hits they contain, in descending order; you can reverse this order. Inside each group, the search hits are sorted by the ranking score, whose type is specified by the ranking_criteria_type parameter.

Note also that multi-sort operations are not enabled for grouped results.

Paging Grouped Results

The Pagination section describes how the Search API uses the rows parameter to determine how many search hits to return for a search query. When grouped results are requested, this parameter limits how many groups are returned, and the start parameter controls paging through the available groups. There is no paging through the results within a group; all search hits in each group are returned.
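The sketch below generates one request body per page of groups. It assumes that start and rows sit inside a `paginate` object in request_options and that the total number of groups is already known (for example from a count request); `group_pages` is a hypothetical helper.

```python
import copy

def group_pages(base_request, total_groups, rows=25):
    """Yield a copy of the request for each page of groups."""
    for start in range(0, total_groups, rows):
        page = copy.deepcopy(base_request)
        options = page.setdefault("request_options", {})
        # With grouped results, start/rows count groups, not individual hits.
        options["paginate"] = {"start": start, "rows": rows}
        yield page

base = {
    "query": {"type": "terminal", "service": "full_text",
              "parameters": {"value": "rhodopsin"}},
    "return_type": "polymer_entity",
    "request_options": {
        "group_by": {"aggregation_method": "sequence_identity",
                     "similarity_cutoff": 50},
        "group_by_return_type": "groups",
    },
}

pages = list(group_pages(base, total_groups=60, rows=25))
starts = [p["request_options"]["paginate"]["start"] for p in pages]
print(starts)
```

Each yielded body is independent of the others, so pages can be requested in any order.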

Counting Grouped Results

The Counting Results section of this guide describes the parameter that allows returning only the total count of hits returned by the query. When using it with grouped results, it returns a total count of all resulting groups or representatives.
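A minimal sketch of a count-only request on grouped results, assuming the counting parameter is a boolean `return_counts` in request_options and that grouping is configured through an assumed `group_by` object:

```python
import json

def count_request(query, return_type, group_by, group_by_return_type):
    """Build a request that asks only for the number of groups or representatives."""
    return {
        "query": query,
        "return_type": return_type,
        "request_options": {
            "group_by": group_by,
            "group_by_return_type": group_by_return_type,
            "return_counts": True,  # assumed parameter name
        },
    }

query = {
    "type": "terminal",
    "service": "text",
    "parameters": {"attribute": "exptl.method",
                   "operator": "exact_match",
                   "value": "X-RAY DIFFRACTION"},
}
request = count_request(
    query,
    "polymer_entity",
    {"aggregation_method": "sequence_identity", "similarity_cutoff": 100},
    "groups",
)
print(json.dumps(request, indent=2))
```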

API Clients

Python

The rcsb-api package provides a Python interface to the RCSB PDB Search and Data APIs (an overview has been published in the Journal of Molecular Biology). Use the rcsbapi.search module to fetch lists of PDB IDs corresponding to advanced query searches, and the rcsbapi.data module to fetch data about a given set of structure IDs. RCSB PDB maintains the current version of this package on GitHub.

You can find example use cases demonstrating how to utilize this package in scripting workflows in the py-rcsb-api GitHub repository. These examples provide practical implementations of common tasks, helping users understand how to integrate the package into their own applications. The notebooks serve as a reference for building custom workflows using RCSB resources.

Examples

This section demonstrates how to use the RCSB PDB Search API to perform complex searches.

Biological Assembly Search

This query finds symmetric dimers having a twofold rotation with the DNA-binding domain of a heat-shock transcription factor.

X-Ray Structures Search

This query finds PDB structures of viral thymidine kinase in complex with substrates or inhibitors, determined by X-ray crystallography at a resolution better than 2.5 Å.

Protein Sequence Search

In this example, using sequence search, we find macromolecular PDB entities that share 90% sequence identity with GTPase HRas protein from Gallus gallus (Chicken).

3D-shape Search

This example demonstrates how structure search can be used to find PDB structures of calmodulin with conformational changes upon Ca²⁺ binding. Calmodulin (CaM) protein has two homologous globular domains connected by a flexible linker. Ca²⁺ binding to each globular domain causes a change from a “closed” to an “open” conformation. This query finds calmodulin structures in “open” conformation.

As a structure query input parameter we will use the crystal structure of Ca²⁺-loaded calmodulin (PDB entry 1CLL). This query is combined with the text search for CA chemical component ID. Note: if you leave out the query clause matching Ca²⁺ ions, you will also get calmodulin structures in complex with other metals (e.g. strontium in 4BW7).

Free Ligand Search

Ligands are considered “free ligands” when they interact non-covalently with macromolecules. This example shows how to find non-polymer entities of the ATP molecule that occur as “free ligands”.

Sequence Motif Search

A sequence motif search finds macromolecular PDB entities that contain a specific sequence motif. This example retrieves occurrences of the His₂/Cys₂ Zinc Finger DNA-binding domain as represented by its PROSITE signature.
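A sketch of such a request is shown below, using the PROSITE pattern for the C2H2 zinc finger (PS00028). The service name `seqmotif` and the parameter names `pattern_type` and `sequence_type` are assumptions to check against the Search Services section.

```python
import json

# PROSITE PS00028 signature for the C2H2-type zinc finger domain.
ZINC_FINGER_C2H2 = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"

request = {
    "query": {
        "type": "terminal",
        "service": "seqmotif",
        "parameters": {
            "value": ZINC_FINGER_C2H2,
            "pattern_type": "prosite",
            "sequence_type": "protein",
        },
    },
    "return_type": "polymer_entity",
}
print(json.dumps(request, indent=2))
```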

Chemical Similarity Search

This example demonstrates how to find molecular definitions chemically similar to Tylenol, defined by its InChI string. Note that the parameter match_type="graph-strict" does not imply an exact structure match: the result set contains acetaminophen molecules (TYL) together with methoxy (T9V) and ethoxy (N4E) analogs.
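The request could be sketched as follows. The InChI string is the standard one for acetaminophen; the service name `chemical`, the `type` and `descriptor_type` parameters, and the `mol_definition` return type are assumptions, while `match_type` and its value come from the description above.

```python
import json

# Standard InChI for acetaminophen (paracetamol, Tylenol).
ACETAMINOPHEN_INCHI = (
    "InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)"
)

request = {
    "query": {
        "type": "terminal",
        "service": "chemical",
        "parameters": {
            "type": "descriptor",
            "descriptor_type": "InChI",
            "value": ACETAMINOPHEN_INCHI,
            "match_type": "graph-strict",
        },
    },
    "return_type": "mol_definition",
}
print(json.dumps(request, indent=2))
```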

Search by UniProt Accession

This example shows how to search for PDB entities using an associated UniProt accession code.

Structure Motif Search

A structure motif search finds macromolecular PDB assemblies that contain a small number of residues in a specific geometric arrangement (e.g. residues that constitute the catalytic center or a binding site). This example retrieves occurrences of the enolase superfamily, a group of proteins diverse in sequence and structure that are all capable of abstracting a proton from a carboxylic acid. Position-specific exchanges are crucial to represent this superfamily accurately.

Combining Search Services

This example shows how to compose text, sequence, structure, and chemical queries employing the Boolean operator AND. The search yields structures (entries) matching all criteria, including co-crystal structures with the desired bound inhibitor, matching the SMILES string for a small-molecule inhibitor designated 7J (QYS).

Sequence Cluster Statistics

This example shows how to get the number of distinct protein sequences in the PDB archive.

Newly Released Structures

This example shows how to get a list of all PDB IDs for this week's newly released structures.
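The sketch below builds a GET URL for such a query, following the json query-string convention from the Introduction. The release-date attribute, the `now-1w` date math expression, and the `return_all_hits` option are assumptions; see Date Math Expressions and Pagination for the supported forms.

```python
import json
from urllib.parse import quote, unquote

BASE_URL = "https://search.rcsb.org/rcsbsearch/v2/query"

request = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_accession_info.initial_release_date",
            "operator": "greater",
            "value": "now-1w",  # assumed date math: one week before now
        },
    },
    "return_type": "entry",
    # Assumed option asking for every hit instead of a single page.
    "request_options": {"return_all_hits": True},
}

# URL-encode the JSON payload and pass it in the json parameter.
url = f"{BASE_URL}?json={quote(json.dumps(request))}"
print(url)
```

The same `request` dictionary can be sent unchanged as the JSON body of a POST request to the base URI.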

Membrane Proteins

This example shows how to get a list of PDB IDs of entries that are annotated as membrane proteins by at least one relevant external resource.

Symmetry and Enzyme Classification

This example shows how to get assembly counts per symmetry type, further broken down by Enzyme Classification (EC) classes. The assemblies are first filtered to homo-oligomers only.

Computed Structure Models

This example shows how to find PDB structures and Computed Structure Models for a given UniProt sequence.

Structure Search with Custom Data

This example showcases how to search with structures not deposited in the PDB archive by pointing to external URLs such as predictions from AlphaFold DB, ModelArchive, or SWISS-MODEL. Any publicly available URL can be referenced. This feature can be used for structure (3D-shape) and strucmotif (structure motif) searches. Required inputs are the file location (url) and format ('cif' or 'bcif' for BinaryCIF). Gzipped content is supported as well.
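A minimal sketch of a 3D-shape query node that points to an external file. The `url` and `format` inputs come from the description above; the service name `structure`, the `operator` parameter and its value, and the example URL are assumptions or placeholders.

```python
import json

SUPPORTED_FORMATS = {"cif", "bcif"}

def custom_structure_query(url, file_format, operator="strict_shape_match"):
    """Build a structure-search node for a file hosted outside the PDB archive."""
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"format must be one of {sorted(SUPPORTED_FORMATS)}")
    return {
        "type": "terminal",
        "service": "structure",
        "parameters": {
            "value": {"url": url, "format": file_format},
            "operator": operator,  # assumed parameter name and value
        },
    }

# Placeholder URL; replace with any publicly reachable model file.
node = custom_structure_query("https://example.org/models/my_model.cif", "cif")
request = {"query": node, "return_type": "entry"}
print(json.dumps(request, indent=2))
```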

Integrative Structures

The Search API delivers integrative structures alongside experimental structures. Integrative/hybrid methods (IHM) structures combine data from multiple experimental methods (e.g., X-ray crystallography, cryo-EM, NMR, SAXS, crosslinking MS, etc.) to produce structural models. IHMs expand structural coverage to systems that are difficult to solve using a single method, such as macromolecular machines and dynamic complexes.

The rcsb_entry_info.structure_determination_methodology field indicates the methodology used to determine the structure. Its value determines whether the structure is:

experimental - determined using experimental techniques such as X-ray crystallography, NMR, cryo-EM, etc
integrative - determined using a combination of experimental and computational methods
computational (predicted) - generated purely through computational prediction methods, without direct experimental data

Use this field to distinguish between different types of structure determination approaches in your searches.

Find all IHM entries currently released in the PDB archive.
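A sketch of a query on this field is given below. The field name and the value `integrative` are taken from the list above; the `exact_match` operator on this field is an assumption, and additional request options may be needed that are not shown here.

```python
import json

METHODOLOGIES = {"experimental", "integrative", "computational"}

def methodology_query(methodology):
    """Build a text-service node filtering on the structure determination methodology."""
    if methodology not in METHODOLOGIES:
        raise ValueError(f"unknown methodology: {methodology}")
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_entry_info.structure_determination_methodology",
            "operator": "exact_match",
            "value": methodology,
        },
    }

request = {"query": methodology_query("integrative"), "return_type": "entry"}
print(json.dumps(request, indent=2))
```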

Find all IHM entries of human proteins that use crosslinking mass spectrometry as part of the modeling process.

Migration Guides

Migrating from Legacy Search API

Applications written on top of the Legacy Search APIs no longer work because these services have been discontinued. This migration guide describes the steps necessary to convert applications from the Legacy Search API Web Service to the new RCSB Search API.

Migrating from v1 to v2

The following guide contains the information you need to migrate from the deprecated API version v1 to the newer version v2.

Acknowledgements

To cite this service, please reference:

Rose, Y., Duarte, J. M., Lowe, R., Segura, J., Bi, C., Bhikadiya, C., ... & Westbrook, J. D. (2021). RCSB Protein Data Bank: architectural advances towards integrated searching and efficient access to macromolecular structure data from the PDB archive. Journal of Molecular Biology, 433(11), 166704. DOI: 10.1016/j.jmb.2020.11.003
Bittrich, S., Bhikadiya, C., Bi, C., Chao, H., Duarte, J. M., Dutta, S., ... & Rose, Y. (2023). RCSB Protein Data Bank: Efficient Searching and Simultaneous Access to One Million Computed Structure Models Alongside the PDB Structures Enabled by Architectural Advances. Journal of Molecular Biology, 167994. DOI: 10.1016/j.jmb.2023.167994

Related publications:

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., ... & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Research, 28(1), 235-242. DOI: 10.1093/nar/28.1.235
Burley, S. K., Berman, H. M., Bhikadiya, C., Bi, C., Chen, L., Di Costanzo, L., ... & Zardecki, C. (2019). RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Research, 47(D1), D464-D474. DOI: 10.1093/nar/gky1004
Burley, S. K., Bhikadiya, C., Bi, C., Bittrich, S., Chao, H., Chen, L., ... & Zardecki, C. (2023). RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Research, 51(D1), D488-D508. DOI: 10.1093/nar/gkac1077

Contact Us

Contact info@rcsb.org with questions or feedback about this service.
