RCSB PDB Search API
- Reference Documentation: RCSB PDB Search API Reference
- Query Editor: RCSB PDB Search API Query Editor
- Examples: RCSB PDB Search API Examples
Stay current with API announcements by subscribing to the RCSB PDB API mailing list:
- sign in with an existing Google account and subscribe
- or send an email to api+subscribe@rcsb.org
Introduction
The Search API accepts HTTP GET or POST requests with JSON payloads. The base URI of the search endpoint is https://search.rcsb.org/rcsbsearch/v2/query. With the GET method, the search request should be sent as a URL-encoded query string in the json parameter: https://search.rcsb.org/rcsbsearch/v2/query?json={search-request}.
Query syntax for the {search-request} is detailed in the Query Language section of this guide. See the Build Your Search section for general information on how to construct the {search-request} object.
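A minimal sketch of how the GET form of a request could be assembled with the Python standard library. The helper name build_get_url is hypothetical, and the "type" and "service" keys in the sample request are assumptions that should be checked against the reference documentation.

```python
import json
from urllib.parse import parse_qs, quote, urlsplit

SEARCH_ENDPOINT = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_get_url(search_request: dict) -> str:
    """Serialize the request to JSON and URL-encode it into the json parameter."""
    payload = json.dumps(search_request, separators=(",", ":"))
    return f"{SEARCH_ENDPOINT}?json={quote(payload, safe='')}"


# Illustrative request; the "type" and "service" keys are assumptions.
request = {"query": {"type": "terminal", "service": "text"}, "return_type": "entry"}
url = build_get_url(request)

# Decoding the query string recovers the original request object.
decoded = json.loads(parse_qs(urlsplit(url).query)["json"][0])
```

For POST, the same JSON document would be sent as the request body instead of being URL-encoded.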
The Search API is designed to return only the identifiers of relevant hits (see the Return Type section for more information on the identifier types that can be requested) and additional metadata. See the Response Body section for more information. If you need to extract information on release date, macromolecules, organisms, resolution, modified residues, ligands, etc., use the RCSB Data API: https://data.rcsb.org.
Build Your Search
A search request is a complete specification of what should be returned in a result set. The search request is represented as a JSON object. The building blocks of the request are:
Context | Description |
---|---|
return_type | Required. Specifies the type of the returned identifiers, e.g. entry, polymer entity, assembly, etc. See the Return Type section for more information. |
query | Optional. Specifies the search expression. Can be omitted if a facets or count operation should be performed instead of ID retrieval. In this case the request must be configured via the request_options context. |
request_options | Optional. Controls various aspects of the search request including pagination, sorting, scoring and faceting. If omitted, the default parameters for sorting, scoring and pagination will be applied. |
request_info | Optional. Specifies additional information about the query, e.g. query_id. This property is used internally at RCSB PDB for logging purposes. When query_id is sent with the search request, it will be included in the corresponding response object. |
The query context may consist of two types of clauses:
- Terminal node - performs an atomic search operation, e.g. searches for a particular value in a particular field. Parameters in the terminal query clause provide match criteria for finding relevant search hits. The set of parameters differs for different search services.
- Group node - wraps other terminal or group nodes and is used to combine multiple queries in a logical fashion.
The simplest query requires specifying only the return_type parameter and the query context. When the parameters property is left unspecified in the query object, the query matches all documents, returning PDB IDs if the return_type property is set to "entry".
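As a rough sketch, a match-all request could be built as below. The return types are the ones listed in the Return Type section; the "type" and "service" keys of the terminal node and the helper name match_all_request are assumptions, not confirmed by this guide.

```python
import json

RETURN_TYPES = {
    "entry", "assembly", "polymer_entity",
    "non_polymer_entity", "polymer_instance", "mol_definition",
}


def match_all_request(return_type: str = "entry") -> dict:
    """Build a request whose query node carries no parameters property."""
    if return_type not in RETURN_TYPES:
        raise ValueError(f"unsupported return_type: {return_type}")
    # "type" and "service" are assumed key names for a terminal node.
    return {"query": {"type": "terminal", "service": "text"}, "return_type": return_type}


request = match_all_request("entry")
body = json.dumps(request)  # JSON payload for a POST request
```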
Refer to the Examples section for more examples.
Search Services
The RCSB PDB Search API consolidates requests to heterogeneous search services. The list of available services is below:
Service | Description |
---|---|
text | Performs attribute searches against textual annotations associated with PDB structures. Refer to the Structure Attributes Search page for a full list of annotations. |
text_chem | Performs attribute searches against textual annotations associated with PDB molecular definitions. Refer to the Chemical Attributes Search page for a full list of annotations. |
full_text | Performs unstructured searches against textual annotations associated with PDB structures or molecular definitions. An unstructured search performs a full-text search against multiple text attributes. |
sequence | This service employs the MMseqs2 software and performs fast sequence matching searches (BLAST-like) based on a user-provided FASTA sequence (with E-value or % Identity cutoffs). Following searches are available: |
seqmotif | Performs short motif searches against nucleotide or protein sequences, using three different types of input format: |
structure | Performs fast searches matching a global 3D shape of assemblies or chains of a given entry (identified by PDB ID), in either strict (strict_shape_match) or relaxed (relaxed_shape_match) modes, using a BioZernike descriptor strategy. |
strucmotif | Performs structure motif searches on all available PDB structures. |
chemical | Enables queries of small-molecule constituents of PDB structures, based on chemical formula and chemical structure. Both molecular formula and formula range searches are supported. Queries for matching and similar chemical structures can be performed using SMILES and InChI descriptors as search targets. Graph and chemical fingerprint searches are implemented using tools from the OpenEye Chemical Toolkit. Descriptor Matching Criteria: The following graph matching searches use a fingerprint prefilter so these are designed to find only similar molecules. These graph matching comparisons include: The following graph matching searches perform an exhaustive substructure search with no pre-screening. These substructure graph matching comparisons include: |
Return Type
The search can return one of the following result types:
Type | Description |
---|---|
entry | Returns a list of PDB IDs. |
assembly | Returns a list of PDB IDs appended with assembly IDs in the format [pdb_id]-[assembly_id], corresponding to biological assemblies. |
polymer_entity | Returns a list of PDB IDs appended with entity IDs in the format [pdb_id]_[entity_id], corresponding to polymeric molecular entities. |
non_polymer_entity | Returns a list of PDB IDs appended with entity IDs in the format [pdb_id]_[entity_id], corresponding to non-polymeric entities (or ligands). |
polymer_instance | Returns a list of PDB IDs appended with asym IDs in the format [pdb_id].[asym_id], corresponding to instances of certain polymeric molecular entities, also known as chains. Note that asym_id in the instance identifier corresponds to the _label_asym_id from the mmCIF schema (assigned by the PDB). It can differ from _auth_asym_id (selected by the author at the time of deposition). |
mol_definition | Returns a list of molecular definition identifiers that include: |
Query Language
The Search API provides a full query DSL (domain-specific language) based on JSON to define queries.
Basic Search
The query language allows you to perform unstructured (basic) searches. An unstructured query searches the textual annotations associated with PDB structures when the field name is unknown. Such a query searches across all fields available for search and returns a hit if a match occurs in any field.
To perform an unstructured search, send the parameters object without an explicit attribute property:
Refer to the Examples section for more examples.
Complex boolean queries in the basic search can be built with the following operators:
- + signifies AND operation
- | signifies OR operation
- - negates a single token
- " wraps a number of tokens to signify a phrase for searching
- ( and ) signify precedence
For example, using the query string interferon + response + factor is equivalent to running an interferon AND response AND factor search.
You can use ( and ) to signify precedence. For example, searching with the query string isopeptide + ( collagen | fibrinogen ) returns structures that contain isopeptide AND either collagen OR fibrinogen.
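An unstructured search that uses these operators might be expressed as the request below. This is a sketch: the guide states only that the parameters object carries a value without an attribute; the "type" key and the choice of the full_text service for this node are assumptions.

```python
def full_text_request(query_string: str, return_type: str = "entry") -> dict:
    """Build an unstructured (basic) search: a value with no attribute property."""
    return {
        "query": {
            "type": "terminal",       # assumed key name
            "service": "full_text",   # assumed to be the service for basic searches
            "parameters": {"value": query_string},
        },
        "return_type": return_type,
    }


# Query string taken from the precedence example in the text.
request = full_text_request("isopeptide + ( collagen | fibrinogen )")
```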
Attribute Search
An attribute query searches for terms in a specific attribute. To perform an attribute search, send the parameters object with an explicit attribute property set to a field name, the value property set to a search term, and the operator property set to a search operator.
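A minimal sketch of a terminal node for an attribute search. The attribute name is one mentioned elsewhere in this guide; the value "Nature", the helper name attribute_node, and the "type" and "service" keys are illustrative assumptions.

```python
def attribute_node(attribute: str, operator: str, value, service: str = "text") -> dict:
    """Build a terminal node with the attribute, operator and value properties."""
    return {
        "type": "terminal",  # assumed key name
        "service": service,  # assumed key name
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


request = {
    "query": attribute_node(
        "rcsb_primary_citation.rcsb_journal_abbrev", "exact_match", "Nature"
    ),
    "return_type": "entry",
}
```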
Refer to the Examples section for more examples.
When using attribute search, you must observe the following rules:
- The field must be a valid field name listed in Structure Attributes Search or Chemical Attributes Search.
- The operator must be compatible with the field. Full list of the operators is available in the Search Operators section.
- The values entered must match the type of the field and be compatible with the operator. Date values should be specified in ISO 8601 formats:
  - Date: YYYY-mm-DD
  - Date and Time: YYYY-mm-DD'T'HH:MM:SS'Z', where the 'Z' means UTC
Negation
To negate an operator, set the negation property to true in the query parameters object. The following search returns non-protein polymeric entities:
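A negated search for non-protein polymeric entities could look roughly like this. The attribute name entity_poly.rcsb_entity_polymer_type and the value "Protein" are assumptions that should be verified against the Structure Attributes Search page, as are the "type" and "service" keys.

```python
request = {
    "query": {
        "type": "terminal",   # assumed key name
        "service": "text",    # assumed key name
        "parameters": {
            # Assumed attribute and value; verify against the attribute listing.
            "attribute": "entity_poly.rcsb_entity_polymer_type",
            "operator": "exact_match",
            "value": "Protein",
            "negation": True,  # inverts the exact match
        },
    },
    "return_type": "polymer_entity",
}
```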
Refer to the Examples section for more examples.
Case-Sensitive Search
By default, searches performed using exact match operators are case-insensitive. You can make your search case-sensitive by setting the case_sensitive property in the query parameters object to true. This option can be useful when capitalization conveys additional information. For example, gene symbols can differ in capitalization between homologs from different species, e.g. human gene symbols are always upper case.
The following search returns human kinases encoded by the ABL1 gene. It excludes results where the case doesn't match, such as the non-receptor tyrosine-protein kinase from mouse encoded by the Abl1 gene.
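The sketch below builds a case-sensitive exact match node and mimics the comparison locally to show the difference. The attribute name is an assumption, local_exact_match is a hypothetical helper that only approximates the server-side behaviour, and the "type" and "service" keys are assumed.

```python
def exact_match_node(attribute: str, value: str, case_sensitive: bool = False) -> dict:
    """Build an exact_match terminal node, optionally case-sensitive."""
    parameters = {"attribute": attribute, "operator": "exact_match", "value": value}
    if case_sensitive:
        parameters["case_sensitive"] = True
    return {"type": "terminal", "service": "text", "parameters": parameters}


def local_exact_match(field_value: str, value: str, case_sensitive: bool = False) -> bool:
    """Local illustration of how the flag changes the comparison."""
    if case_sensitive:
        return field_value == value
    return field_value.lower() == value.lower()


# Assumed attribute name for gene symbols; verify against the attribute listing.
node = exact_match_node(
    "rcsb_entity_source_organism.rcsb_gene_name.value", "ABL1", case_sensitive=True
)
```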
Refer to the Examples section for more examples.
Boolean Expressions
The query language supports two boolean operators: AND and OR. Boolean operators should be added to the group node as the logical_operator property.
The group nodes can be used to logically combine search expressions (terminal nodes) or other group nodes:
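A sketch of nesting group nodes. The guide names only the logical_operator property; the "type" and "nodes" keys, the lowercase operator values, and the sample attribute values are assumptions.

```python
def terminal(attribute: str, operator: str, value) -> dict:
    """Build a terminal node for the text service (key names assumed)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def group(logical_operator: str, *nodes: dict) -> dict:
    """Combine terminal or group nodes with AND / OR."""
    if logical_operator not in ("and", "or"):
        raise ValueError("logical_operator must be 'and' or 'or'")
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}


journal = "rcsb_primary_citation.rcsb_journal_abbrev"
# (journal is Nature OR Science) AND formula weight > 50 kDa; values are illustrative.
query = group(
    "and",
    group("or", terminal(journal, "exact_match", "Nature"),
          terminal(journal, "exact_match", "Science")),
    terminal("rcsb_polymer_entity.formula_weight", "greater", 50),
)
```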
Refer to the Examples section for more examples.
Scoring Strategy
You can customize how scores from different services impact the final relevancy ranking of your search results by setting a scoring_strategy in the request_options context. The following scoring strategies are available: combined (default), sequence, seqmotif, strucmotif, structure, chemical, and text. For example, to boost search results based on the relevance score produced by the sequence search service, use the sequence scoring strategy.
The final relevancy score is calculated as a weighted sum of normalized scores produced by the different search services (all search result scores are rescaled to the interval [0, 1]; a score of 0 still means the hit met the search criteria). When the combined strategy is used, equal weights are applied. For the other strategies, a higher weight is given to the scores of the selected service, increasing their contribution to the final score and promoting a ranking influenced by that service.
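Setting the strategy on an existing request could be sketched as follows. The helper name with_scoring_strategy is hypothetical, and the base request's "type" and "service" keys are assumptions; the strategy names are the ones listed above.

```python
SCORING_STRATEGIES = {
    "combined", "sequence", "seqmotif", "strucmotif", "structure", "chemical", "text",
}


def with_scoring_strategy(request: dict, strategy: str) -> dict:
    """Return a copy of the request with scoring_strategy set in request_options."""
    if strategy not in SCORING_STRATEGIES:
        raise ValueError(f"unknown scoring strategy: {strategy}")
    options = dict(request.get("request_options", {}))
    options["scoring_strategy"] = strategy
    return {**request, "request_options": options}


base = {"query": {"type": "terminal", "service": "text"}, "return_type": "entry"}
boosted = with_scoring_strategy(base, "sequence")
```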
Sorting
Sorting is determined by the sort object in the request_options context. It allows you to add one or more sorting conditions to control the order of the search result hits. The sort operation is defined on a per-field level, with a special field name, score, for sorting by score (the default). Refer to the Structure Attributes Search and Chemical Attributes Search pages to find all searchable attributes. Any attribute listing the exact_match or equals operators can be used for sorting.
By default sorting is done in descending order ("desc"). The sort can be reversed by setting direction property to "asc". This example demonstrates how to sort the search results by release date:
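Sorting by release date might be requested as sketched here. The sort_by key and the attribute name rcsb_accession_info.initial_release_date are assumptions; only the sort object and the direction property are named in this guide.

```python
request = {
    "query": {"type": "terminal", "service": "text"},  # assumed key names
    "request_options": {
        "sort": [
            {
                # Assumed key and attribute name; verify against the reference docs.
                "sort_by": "rcsb_accession_info.initial_release_date",
                "direction": "desc",  # "asc" reverses the order
            }
        ]
    },
    "return_type": "entry",
}
```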
Refer to the Examples section for more examples.
Pagination
By default, only the first 10 hits are included in the search result list. Pagination can be configured with the start and rows parameters of the paginate object in the request_options context. To retrieve all hits, use the return_all_hits parameter in the request_options context.
Please note that returning all hits is generally not desirable and may be the source of performance issues.
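Paging through a large result set with start and rows could be sketched as below, assuming total_count has been read from an earlier response. The generator name page_requests is hypothetical, and the base request's "type" and "service" keys are assumptions.

```python
def page_requests(request: dict, total_count: int, rows: int = 100):
    """Yield one request per page, advancing start by rows each time."""
    for start in range(0, total_count, rows):
        options = dict(request.get("request_options", {}))
        options["paginate"] = {"start": start, "rows": rows}
        yield {**request, "request_options": options}


base = {"query": {"type": "terminal", "service": "text"}, "return_type": "entry"}
pages = list(page_requests(base, total_count=250, rows=100))
starts = [page["request_options"]["paginate"]["start"] for page in pages]
```

Fetching pages this way keeps each response small, which avoids the performance cost of return_all_hits noted above.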
Refer to the Examples section for more examples.
Counting Results
By default, the search results contain a list of matched identifiers and additional metadata. See Search Results for more details. The return_counts flag in the request_options context allows you to execute a search query and get back only the number of matches for that query. The following query returns the number of current structures in the PDB archive:
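A count-only request might look like the sketch below: a match-all query with return_counts enabled. The "type" and "service" keys are assumptions.

```python
import json

request = {
    "query": {"type": "terminal", "service": "text"},  # assumed key names; matches all entries
    "request_options": {"return_counts": True},
    "return_type": "entry",
}
payload = json.dumps(request)  # body for a POST request
```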
Refer to the Examples section for more examples.
Include Computed Models
RCSB PDB has integrated Computed Structure Models from AlphaFold DB and ModelArchive. To include Computed Structure Models in your search results, add the results_content_type parameter to the request_options context. This parameter specifies a content type filter that can include experimental structures, computational structures, or both.
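A sketch of a request that includes both content types. The literal values "experimental" and "computational" and the list form of the parameter are assumptions, as are the "type" and "service" keys.

```python
request = {
    "query": {"type": "terminal", "service": "text"},  # assumed key names
    "request_options": {
        # Assumed values and list form; verify against the reference docs.
        "results_content_type": ["experimental", "computational"],
    },
    "return_type": "entry",
}
```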
Refer to the Examples section for more examples.
Faceted Queries
Faceted queries (or facets) provide you with the ability to group and perform calculations and statistics on PDB data by using a simple search query. Facets are the arrangement of search results into categories (buckets) based on the requested field values.
If the facets property is specified in the request_options context, the search results are presented along with numerical counts of how many matching IDs were found for each term requested in the facets. If the query context is omitted in a search request with facets specified, the response will contain only the facet counts.
This example calculates the breakdown by experimental method of PDB structures released after 2019-08-20:
By default, searches containing a faceted query return both search hits and aggregation results. To return only aggregation results, set rows to 0 in the paginate object:
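The two cases above could be sketched as follows: a faceted query that returns hits and counts, and a variant with rows set to 0 that returns counts only. The facet keys "name" and "aggregation_type" and the attribute names exptl.method and rcsb_accession_info.initial_release_date are assumptions.

```python
import copy

request = {
    "query": {
        "type": "terminal",  # assumed key names
        "service": "text",
        "parameters": {
            "attribute": "rcsb_accession_info.initial_release_date",  # assumed attribute
            "operator": "greater",
            "value": "2019-08-20",
        },
    },
    "request_options": {
        "facets": [
            {
                "name": "Experimental Methods",  # hypothetical label
                "aggregation_type": "terms",     # assumed key and value
                "attribute": "exptl.method",     # assumed attribute
            }
        ]
    },
    "return_type": "entry",
}

# Variant returning aggregation results only: no hits are requested.
facets_only = copy.deepcopy(request)
facets_only["request_options"]["paginate"] = {"start": 0, "rows": 0}
```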
Refer to the Examples section for more examples.
Terms Facets
Terms faceting is a multi-bucket aggregation where buckets are built dynamically, one per unique value. For each bucket, terms faceting counts the documents (entry, polymer_entity, etc.) that contain a given value in a given field. For example, you can run the terms aggregation on the field rcsb_primary_citation.rcsb_journal_abbrev, which holds the abbreviated name of the journal associated with an entry. In return, we get a bucket for each journal, each with its PDB entry count.
You can specify a minimum count that a bucket must reach to be returned. Use the min_interval_population parameter; e.g. in this example only journals associated with at least 1000 entries are returned:
You can also control the number of returned buckets using the max_num_intervals parameter (up to a limit of 65536). Larger values of max_num_intervals use more memory to compute and push the whole aggregation closer to the limit. You’ll know you’ve gone too large if the request fails with a message about max_buckets.
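A terms facet with both parameters might be built as in this sketch. The helper name terms_facet and the keys "name" and "aggregation_type" are assumptions; the attribute, the threshold of 1000, and the 65536 limit come from the text above.

```python
MAX_NUM_INTERVALS_LIMIT = 65536


def terms_facet(name, attribute, min_interval_population=None, max_num_intervals=None):
    """Build a terms facet, validating max_num_intervals against the documented limit."""
    facet = {"name": name, "aggregation_type": "terms", "attribute": attribute}
    if min_interval_population is not None:
        facet["min_interval_population"] = min_interval_population
    if max_num_intervals is not None:
        if not 1 <= max_num_intervals <= MAX_NUM_INTERVALS_LIMIT:
            raise ValueError("max_num_intervals must be between 1 and 65536")
        facet["max_num_intervals"] = max_num_intervals
    return facet


facet = terms_facet(
    "Journals",  # hypothetical label
    "rcsb_primary_citation.rcsb_journal_abbrev",
    min_interval_population=1000,
    max_num_intervals=50,
)
```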
Refer to the Examples section for more examples.
Histogram Facets
Histogram faceting is a multi-bucket aggregation that can be applied to numeric values. It builds fixed-size (a.k.a. interval) buckets over the values. For example, for the rcsb_polymer_entity.formula_weight field, which holds the formula mass (kDa) of the entity, we can configure this aggregation to build buckets with an interval of 50 kDa:
You can use the min_interval_population parameter to request only buckets whose count is greater than or equal to the given value.
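The sketch below pairs a possible facet specification with a local illustration of fixed-size bucketing. The keys "name" and "aggregation_type" are assumptions, and histogram_buckets is a hypothetical helper that assumes each bucket is keyed by its lower bound and includes it; the service's exact boundary handling is not stated in this guide.

```python
import math
from collections import Counter

facet = {
    "name": "Formula Weight",         # hypothetical label
    "aggregation_type": "histogram",  # assumed key and value
    "attribute": "rcsb_polymer_entity.formula_weight",
    "interval": 50,
    "min_interval_population": 1,
}


def histogram_buckets(values, interval, min_interval_population=0):
    """Count values per fixed-size bucket, keyed by the bucket's lower bound."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return {k: n for k, n in sorted(counts.items()) if n >= min_interval_population}


weights = [12.5, 49.9, 50.0, 75.2, 310.0]  # illustrative formula weights in kDa
buckets = histogram_buckets(weights, facet["interval"], facet["min_interval_population"])
```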
Refer to the Examples section for more examples.
Date Histogram Facets
This multi-bucket aggregation is similar to the histogram aggregation, but it can only be used with date values. Calendar-aware intervals are configured with the interval parameter. For example, we can configure this aggregation to build buckets with 1 year intervals:
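A possible facet specification with yearly buckets, followed by a local illustration of what such bucketing computes. The interval value "year", the keys "name" and "aggregation_type", and the attribute name are assumptions; yearly_buckets is a hypothetical helper.

```python
from collections import Counter
from datetime import date

facet = {
    "name": "Release Date",                 # hypothetical label
    "aggregation_type": "date_histogram",   # assumed key and value
    "attribute": "rcsb_accession_info.initial_release_date",  # assumed attribute
    "interval": "year",                     # assumed value for 1 year intervals
}


def yearly_buckets(iso_dates):
    """Local illustration: count ISO 8601 dates per calendar year."""
    counts = Counter(date.fromisoformat(d).year for d in iso_dates)
    return dict(sorted(counts.items()))


buckets = yearly_buckets(["2019-08-20", "2019-12-31", "2020-01-01"])
```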
Refer to the Examples section for more examples.
Range Facets
A multi-bucket aggregation that enables the user to define a set of numeric ranges, each representing a bucket. Note that this aggregation includes the from value and excludes the to value for each range. Omitting the from or to parameter creates a bucket bounded by the minimum or maximum value, respectively. Example:
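The from-inclusive, to-exclusive rule can be illustrated locally as below. The facet keys "name", "aggregation_type" and "ranges" and the attribute name are assumptions; bucket_for is a hypothetical helper, not part of the API.

```python
ranges = [{"to": 2.0}, {"from": 2.0, "to": 3.0}, {"from": 3.0}]

facet = {
    "name": "Resolution",          # hypothetical label
    "aggregation_type": "range",   # assumed key and value
    "attribute": "rcsb_entry_info.resolution_combined",  # assumed attribute
    "ranges": ranges,              # assumed key
}


def bucket_for(value, ranges):
    """Return the index of the range containing value: from inclusive, to exclusive."""
    for index, r in enumerate(ranges):
        lower_ok = "from" not in r or value >= r["from"]
        upper_ok = "to" not in r or value < r["to"]
        if lower_ok and upper_ok:
            return index
    return None
```

A value of exactly 2.0 therefore lands in the second bucket rather than the first.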
Refer to the Examples section for more examples.
Date Range Facets
This multi-bucket aggregation is similar to the range aggregation but dedicated to date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed as date math expressions. Example:
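A date range facet with boundaries written as date math expressions might look like this sketch. The keys "name", "aggregation_type" and "ranges" and the attribute name are assumptions; the expression syntax follows the Date Math Expressions section.

```python
facet = {
    "name": "Release Date",            # hypothetical label
    "aggregation_type": "date_range",  # assumed key and value
    "attribute": "rcsb_accession_info.initial_release_date",  # assumed attribute
    "ranges": [
        {"to": "2020-06-01||-12M"},                        # more than a year before the anchor
        {"from": "2020-06-01||-12M", "to": "2020-06-01"},  # the year leading up to the anchor
        {"from": "2020-06-01"},                            # from the anchor onwards
    ],
}
```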
Refer to the Examples section for more examples.
Cardinality Facets
Cardinality faceting is a single-value metrics aggregation that calculates a count of distinct values returned for a given field. For example, you can count unique source organism name assignments in the PDB archive:
Refer to the Examples section for more examples.
Filter Facets
As its name suggests, the filter aggregation helps you filter documents that contribute to bucket count. In the example below, we are filtering only protein chains which adopt 2 different beta propeller arrangements according to the CATH classification:
Refer to the Examples section for more examples.
Multi-Dimensional Facets
Complex, multi-dimensional aggregations are possible as in the example below:
Refer to the Examples section for more examples.
Search Operators
Search operators are commands that help you make your search more specific and focused. The following operators can be used to perform a field search:
Exact Match Operators
Exact match operators indicate that the input value should match a field value exactly (including whitespaces, special characters and case).
exact_match
You can use the exact_match operator to find exact occurrences of the input value. Comparisons with the exact_match operator are case-insensitive by default. See the Case-Sensitive Search paragraph of the Attribute Search section to learn how to configure case-sensitive exact searches.
A single value input is required for this operator and must be a string.
in
The in operator allows you to specify multiple values in a single search expression. It returns results if any value in a list of input values matches. It can be used instead of multiple OR conditions. Comparisons with the in operator are case-sensitive.
An input value is required for this operator and it must be a list of strings, numbers or dates.
Full-Text Operators
The full-text operators enable you to perform linguistic searches against text data by operating on words and phrases. The input text is analyzed before performing a search. The analysis includes the following transformations:
- Most punctuation is removed
- The remaining content is broken into individual words, called tokens
- Tokens are lowercased which makes search case-insensitive
The standard grammar based tokenization is used to break input text into tokens. Refer to the Unicode Text Segmentation documentation for more information on tokenization rules.
contains_words
The contains_words operator performs a full-text search by operating on words in the provided text. After the text is broken into tokens, more basic queries are constructed and OR boolean logic is used to interpret the query. For example, "actin-binding protein" will be interpreted as "actin" OR "binding" OR "protein". The search will return results if any of these tokens match.
This operator can match multiple tokens in any order.
A single value input is required for this operator and it must be a string.
contains_phrase
The contains_phrase operator performs a full-text search by operating on phrases. The operator requires the following criteria to be fulfilled in order to return results:
- All the tokens must appear in the field
- They must have the same order as in the input text
For example, "actin-binding protein" will be interpreted as "actin" AND "binding" AND "protein" occurring in a given order.
A single value input is required for this operator and it must be a string.
Comparison Operators
The greater, less, greater_or_equal, less_or_equal, and equals operators match fields whose values are, respectively, larger than, smaller than, larger than or equal to, smaller than or equal to, or equal to the given input value.
A single value input is required for these operators and it must be a number or date.
Range Operator
The range operator can be used to match values within a provided range.
A single value input is required for this operator and it must be an object as follows:
By default, lower and upper bounds are excluded. They can be included by setting include_lower and include_upper to true, respectively. An inclusive bound means that the boundary point itself is included in the range, while an exclusive bound means that the boundary point is not included in the range.
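A sketch of the value object and of the bound semantics described above. The include_lower and include_upper names come from the text; the "from" and "to" keys are assumptions, and range_value and matches are hypothetical helpers used only to illustrate inclusive versus exclusive bounds locally.

```python
def range_value(lower, upper, include_lower=False, include_upper=False):
    """Build the value object for the range operator (bounds excluded by default)."""
    return {
        "from": lower,  # assumed key name
        "to": upper,    # assumed key name
        "include_lower": include_lower,
        "include_upper": include_upper,
    }


def matches(value, rv):
    """Local illustration of inclusive and exclusive bounds."""
    above = value >= rv["from"] if rv["include_lower"] else value > rv["from"]
    below = value <= rv["to"] if rv["include_upper"] else value < rv["to"]
    return above and below


exclusive = range_value(1.0, 2.0)
inclusive = range_value(1.0, 2.0, include_lower=True, include_upper=True)
```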
Refer to the Examples section for more examples.
Exists Operator
The exists operator is a logical operator that allows you to check whether a given field contains any value. To be deemed non-existent, the value must be null or []. The following values will indicate the field does exist:
- Empty strings, such as " " or "-"
- Arrays containing null and another value, such as [null, "foo"]
The operator doesn't require a value.
Date Math Expressions
Comparison and range operators support date math expressions. The expression starts with an "anchor" date, which can be: a) now or b) a date string (in the applicable format) ending with ||. The anchor can then be followed by a math expression, supporting + and -, e.g. "2020-06-01||-12M", "now-1w".
The units supported are:
- y (year)
- M (month)
- w (week)
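The expression syntax above can be captured in a small helper. This is a sketch: date_math is a hypothetical function, and it accepts only a single offset with the three units listed here.

```python
import re

_OFFSET = re.compile(r"[+-]\d+[yMw]")  # sign, amount, then one of the listed units


def date_math(anchor: str, offset: str = "") -> str:
    """Build a date math expression from an anchor ("now" or a date) and one offset."""
    if offset and not _OFFSET.fullmatch(offset):
        raise ValueError(f"unsupported offset: {offset}")
    # A date-string anchor must end with "||"; the now anchor takes no separator.
    prefix = "now" if anchor == "now" else f"{anchor}||"
    return prefix + offset


a_year_earlier = date_math("2020-06-01", "-12M")
last_week = date_math("now", "-1w")
```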
Search Attributes
The attributes available for search include the annotations described by the mmCIF dictionary, annotations coming from external resources, and attributes added by RCSB PDB. Both internal additions to the mmCIF dictionary and external resource annotations are prefixed with rcsb_.
Refer to the Structure Attributes Search and Chemical Attributes Search pages for a full list of the attributes that are available for text searches.
Search Results
The HTTP 200 (OK) status code indicates that the Search API request has been processed successfully and that the server returns search result data. The response data is formatted in JSON and its structure is determined by parameters in the query. Query parameters can be used to structure the result set in the following ways:
- Specify the granularity of the returned identifiers. See Return Type.
- Order results. See Sorting.
- Limit the number of hits in the results (10 by default). See Pagination.
- Include only the results count. See Counting Results.
- Include search facets. See Requesting Facets.
Response Body
The search response body provides details about the search execution itself as well as an array of the individual search hits. The following information is available in the search results response body:
Name | Description |
---|---|
query_id | Required. Unique query ID assigned to the request or passed as a query parameter. |
result_type | Required. Specifies the granularity of the returned identifiers requested in the query. See Return Type. |
total_count | Required. The total number of matched identifiers. |
explain_metadata | Optional. Contains details on the query execution time (in milliseconds). |
result_set | Optional. Search results set is returned as PDB identifiers and accompanying metadata. |
group_set | Optional. Search results are returned as groups. |
facets | Optional. Facets array contains search facets for requested attributes. |
An example of search response is shown below:
Results Set
Results set is an array of objects representing search hits. Each hit contains the matching identifier, score, and metadata produced by search services.
Result Identifiers
While a search query might include a large number of attributes, only the matching PDB identifiers, representing a desired level of granularity, are included in the result set. The following notation is used for PDB identifiers:
- [pdb_id] - for PDB entries (e.g. 4HHB)
- [pdb_id]_[entity_id] - for polymer, branched, or non-polymer entities (e.g. 4HHB_1)
- [pdb_id].[asym_id] - for polymer, branched, or non-polymer entity instances (e.g. 4HHB.A)
- [pdb_id]-[assembly_id] - for biological assemblies (e.g. 4HHB-1)
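The notation above can be parsed with a few regular expressions. This sketch assumes classic four-character PDB IDs only; other identifier schemes (for example those of computed models) are not handled, and parse_identifier is a hypothetical helper.

```python
import re

_PDB_ID = r"(?P<pdb_id>[A-Za-z0-9]{4})"
_PATTERNS = [
    ("entry", re.compile(rf"^{_PDB_ID}$")),
    ("entity", re.compile(rf"^{_PDB_ID}_(?P<part>\w+)$")),
    ("instance", re.compile(rf"^{_PDB_ID}\.(?P<part>\w+)$")),
    ("assembly", re.compile(rf"^{_PDB_ID}-(?P<part>\w+)$")),
]


def parse_identifier(identifier: str):
    """Split an identifier into (kind, pdb_id, part); part is None for entries."""
    for kind, pattern in _PATTERNS:
        match = pattern.match(identifier)
        if match:
            return kind, match.group("pdb_id"), match.groupdict().get("part")
    raise ValueError(f"unrecognized identifier: {identifier}")
```

For instance, parse_identifier("4HHB.A") separates the entry ID 4HHB from the asym ID A.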
Relevancy Score
The final relevancy score is calculated as a weighted sum of normalized scores produced by different search services. By default, scores from all services are weighted equally. See the Scoring Strategy section for more details on how to configure scoring. The higher the score, the more relevant the hit.
Service Metadata
Different search services produce different metadata and use different scoring metrics. Set the results verbosity level to verbose to return the additional metadata and raw scores reported as described below:
Name | Description |
---|---|
node_id | Required. Distinct numeric ID is assigned to results produced by each search service. |
original_score | Required. The original (raw) score produced by a search service chosen as relevance score for this service. For example, the bit score of the alignment is chosen as raw relevance score for a sequence search service. |
norm_score | Required. The original score transformed onto a scale between 0 and 1 using the min-max normalization algorithm (higher means more significant). |
match_context | Optional. Additional metadata produced by search services. Match context will be included only for select return types. For example, if a sequence search was performed and polymer_entity is specified as the return type, the results will include matching_context with additional metadata such as sequence identity, E-value, bit-score values and the residue boundary positions of the matching sequence. The matching_context will not be included if the same search is performed but the return type is set to entry or assembly. |
The following snippet shows an example of search results for a query that combines 4 different search services. Here, the search results set contains one search hit at the granularity of PDB entry:
Results Verbosity Level
By default, search results are returned with additional metadata (see Search Results for more details). The results verbosity level can be adjusted by setting the results_verbosity parameter in the request_options context. The verbosity levels, from the most verbose to the least, are as follows:
- verbose - every search hit is returned in a format described in Result Identifiers with all metadata items set
- minimal (default) - every search hit is returned in a format described in Result Identifiers with only a relevancy score set
- compact - every search hit is returned as a simple string, e.g. "4HHB", with no additional metadata
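Client code that must cope with any verbosity level could normalize hits as sketched below. The "identifier" and "score" keys of a non-compact hit are assumptions about the response format, and extract_identifiers is a hypothetical helper.

```python
def extract_identifiers(result_set):
    """Return identifiers whether hits are plain strings (compact) or objects."""
    # "identifier" is an assumed key name for minimal and verbose hits.
    return [hit if isinstance(hit, str) else hit["identifier"] for hit in result_set]


compact_hits = ["4HHB", "1ABC"]  # illustrative identifiers
minimal_hits = [
    {"identifier": "4HHB", "score": 1.0},
    {"identifier": "1ABC", "score": 0.5},
]
```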
Empty Results
The HTTP Status 204 (No Content) status code indicates that the search API request has been processed successfully but no search hits were found.
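Because a 204 response carries no body, client code should not try to parse it as JSON. A minimal sketch of status handling, where parse_search_response is a hypothetical helper and the empty-result shape it returns is a convenience chosen here, not something the API sends:

```python
import json


def parse_search_response(status_code: int, body: str) -> dict:
    """Turn an HTTP status and body into a result dict, treating 204 as no hits."""
    if status_code == 204:
        # No content: synthesize an empty result instead of parsing an empty body.
        return {"total_count": 0, "result_set": []}
    if status_code == 200:
        return json.loads(body)
    raise RuntimeError(f"Search API request failed with HTTP status {status_code}")


empty = parse_search_response(204, "")
found = parse_search_response(200, '{"total_count": 1, "result_set": ["4HHB"]}')
```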
Dealing with Redundancy
The PDB archive includes multiple structures of the same molecule, providing snapshots of the structure, interactions, and functions of that molecule: for example, the same protein studied under different experimental conditions or with different ligands bound. This leads to data redundancy that may present some challenges in bioinformatics analyses. It is helpful to be able to remove redundancy and group search results, as this helps ensure that similar and homologous proteins that appear in high numbers in a set of results do not introduce undesirable biases. Also, as the size of the PDB continues to grow, reducing redundancy helps when one seeks to obtain smaller datasets of distinct representatives.
Redundancy occurs at many levels (such as the level of sequence or structure similarity), and different grouping methods can be applied to PDB data in order to provide a non-redundant view.
Group By Parameters
To enable results grouping, the group_by parameters must be defined in the request_options context. Different grouping methods are available for a given Return Type:
Return Type | Grouping Options |
---|---|
entry |
|
polymer_entity |
|
Group By Return Type
The group_by_return_type parameter in the request_options context controls the form in which the grouped results are returned. The following options are available:
- representatives (default) - a single representative is selected from each group and a flat list of representatives is returned in the main results format. The representative is the top-ranked group member. The ranking criteria are controlled by the ranking_criteria_type parameter (see Group Members Ranking).
- groups - search results are divided into groups and each group is returned with all associated search hits (members of that group that satisfy the given search constraints).
Return Grouped Results
It can be useful to study the variability among similar (redundant) search hits. You can use the group_by parameters in combination with the group_by_return_type parameter set to groups to return results as groups of similar objects. A few examples are listed below:
Group By Sequence Identity
This example groups together identical human sequences from high-resolution (1.0-2.0Å) structures determined by X-ray crystallography. Among the resulting groups, there is a cluster of human glutathione transferases in complex with different substrates.
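Grouping options for such a search could be attached as sketched below. The aggregation_method and similarity_cutoff keys and the value "sequence_identity" are assumptions; group_by and group_by_return_type are named in this guide. The search constraints themselves are left out, so the base query is only a placeholder.

```python
def group_by_sequence_identity(request, similarity_cutoff, group_by_return_type="groups"):
    """Return a copy of the request with sequence identity grouping enabled."""
    if group_by_return_type not in ("groups", "representatives"):
        raise ValueError("group_by_return_type must be 'groups' or 'representatives'")
    options = dict(request.get("request_options", {}))
    options["group_by"] = {
        "aggregation_method": "sequence_identity",  # assumed key and value
        "similarity_cutoff": similarity_cutoff,     # assumed key; 100 groups identical sequences
    }
    options["group_by_return_type"] = group_by_return_type
    return {**request, "request_options": options}


# Placeholder query; the organism, method and resolution constraints are omitted.
base = {"query": {"type": "terminal", "service": "text"}, "return_type": "polymer_entity"}
grouped = group_by_sequence_identity(base, similarity_cutoff=100)
```

Passing "representatives" instead of "groups" would request one hit per group, as described under Group By Return Type.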
Group By UniProt Accession
This example demonstrates how to use matching_uniprot_accession grouping to get distinct Spike protein S1 proteins released since the beginning of 2020. Here, all entities are represented by distinct groups of SARS-CoV, SARS-CoV-2 and Pangolin coronavirus spike proteins.
Although it’s true that a search hit will only appear once within a grouped set of search hits, it’s important to note that in some cases multiple groups can contain the same search hit. For example, when results are grouped by the UniProt accession, chimeric entities will appear in multiple groups.
Remove Redundant Results
It can be useful to remove redundant search hits from your results. You can use the group_by parameters in combination with the group_by_return_type parameter set to representatives to return only a single representative from each of the resulting groups. For example, you may want to remove similar sequences with specific levels of mutual sequence identity. The non-redundant result set will consist solely of representative search hits from the original redundant search results that satisfy the given search constraints.
This example shows how to retrieve a set of polymer entities from protein-protein complexes with the following constraints:
- Must be from a protein-protein complex, not a single protein
- Complexes must consist of proteins only
- Experimental Method: X-ray or EM
- Resolution: <= 2 Angstrom
- R-observed <= 0.2
- Sequence identity cutoff to remove redundancy: 30%
Group Members Ranking
Group members ranking is designed to order the search hits in each of the resulting groups to present most relevant, useful hits first so that you can more easily find what you’re looking for.
The ranking system is made up of a series of options:
- ranking by member attribute - this option works in the same way as Sorting. You can use this option to order group members by any property that is available for sorting, for example, resolution, release date, etc.
- score (default) - this option orders group members so that the hits most relevant to a given search query come first.
- ranking options specific to aggregation method - these options are predefined for each aggregation method and typically involve pre-computation based on certain metrics.
For example, you can search for rhodopsins and rhodopsin-like proteins, request all proteins related by sharing at least 50% sequence identity to be grouped and order polymer entities within each group by sequence similarity score:
Examples of ranking options specific to aggregation method are detailed below:
Ranking Options For UniProt Groups
- coverage - the percent coverage of the UniProt sequence by the PDB polymer entity sequence
Faceting Upon Grouped Results
By default, facet counts are based upon the original query results, not the grouped results. This means that whether or not you turn grouping on for a query, the facet counts will be the same.
To return non-redundant facet counts, the group_by_return_type parameter must be set to representatives.
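As a rough illustration of the rule above, the sketch below attaches a terms facet to a grouped request and switches between redundant and non-redundant counts through group_by_return_type. The helper name, the facet name and the facet attribute are assumptions chosen for the example.

```python
import json


def faceted_group_request(query, facet_attribute, non_redundant=True, cutoff=30):
    """Build a grouped request whose facet counts are optionally non-redundant."""
    return {
        "query": query,
        "request_options": {
            "group_by": {
                "aggregation_method": "sequence_identity",
                "similarity_cutoff": cutoff,
            },
            # "representatives" makes facet counts reflect one hit per group.
            "group_by_return_type": "representatives" if non_redundant else "groups",
            "facets": [
                {
                    "name": "facet_counts",  # hypothetical label for the facet
                    "aggregation_type": "terms",
                    "attribute": facet_attribute,
                }
            ],
        },
        "return_type": "polymer_entity",
    }


query = {"type": "terminal", "service": "full_text", "parameters": {"value": "kinase"}}
facet_request = faceted_group_request(
    query, "rcsb_entity_source_organism.ncbi_scientific_name"  # assumed attribute
)
print(json.dumps(facet_request, indent=2))
```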
Sorting Grouped Results
Sorting interacts with grouping in an important way. By default, groups are sorted by the number of search hits they contain, in descending order; you can reverse this order. Inside each group, the search hits are sorted by ranking score, whose type is specified by the ranking_criteria_type parameter.
Another important difference is that multi-sort operations are not enabled for grouped results.
Paging Grouped Results
The Pagination section describes how the Search API uses the rows parameter to determine how many search hits to return for a search query. When grouped results are requested, this parameter limits how many groups are returned, and the start parameter controls paging through the available groups. There is no paging through the results within a group: all search hits in each group are returned.
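The paging behavior described above can be sketched with a small helper that derives the request for a given page of groups. The helper name `group_page` and the page size are illustrative; start and rows sit under the paginate context of request_options.

```python
import copy
import json


def group_page(request, page, rows=25):
    """Return a copy of `request` that asks for the given zero-based page of groups."""
    paged = copy.deepcopy(request)  # leave the caller's request untouched
    options = paged.setdefault("request_options", {})
    # For grouped results, `rows` counts groups and `start` skips groups.
    options["paginate"] = {"start": page * rows, "rows": rows}
    return paged


base_request = {
    "query": {"type": "terminal", "service": "full_text", "parameters": {"value": "kinase"}},
    "request_options": {
        "group_by": {"aggregation_method": "sequence_identity", "similarity_cutoff": 30},
        "group_by_return_type": "groups",
    },
    "return_type": "polymer_entity",
}

third_page = group_page(base_request, page=2, rows=10)
print(json.dumps(third_page["request_options"]["paginate"]))
```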
Counting Grouped Results
The Counting Results section of this guide describes the parameter that allows returning only the total count of hits returned by the query. When using it with grouped results, it returns a total count of all resulting groups or representatives.
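A count-only request can be derived from an existing request by setting return_counts. The sketch below also includes a minimal standard-library helper for POSTing a request to the query endpoint; the helper names are hypothetical, dropping paginate for count requests is an assumption, and the empty-body branch assumes the service may answer with no content when nothing matches.

```python
import copy
import json
import urllib.request

SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def as_count_request(request):
    """Return a copy of `request` that asks only for the total count."""
    counted = copy.deepcopy(request)
    options = counted.setdefault("request_options", {})
    options["return_counts"] = True
    # ASSUMPTION: paging is meaningless for a count, so drop it.
    options.pop("paginate", None)
    return counted


def post_search(request, url=SEARCH_URL, timeout=30):
    """POST a search request as JSON; return parsed JSON, or None for an empty body."""
    data = json.dumps(request).encode("utf-8")
    http_request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(http_request, timeout=timeout) as response:
        body = response.read()
    return json.loads(body) if body else None


grouped = {
    "query": {"type": "terminal", "service": "full_text", "parameters": {"value": "kinase"}},
    "request_options": {
        "group_by": {"aggregation_method": "sequence_identity", "similarity_cutoff": 30},
        "group_by_return_type": "groups",
        "paginate": {"start": 0, "rows": 25},
    },
    "return_type": "polymer_entity",
}
count_request = as_count_request(grouped)
print(json.dumps(count_request["request_options"], indent=2))
# To execute against the live service: result = post_search(count_request)
```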
API Clients
Python
The rcsbsearchapi package provides a Python interface to the RCSB PDB Search API. You can use it to fetch lists of PDB IDs corresponding to advanced query searches. This package was originally developed by Spencer Bliven, and a new version is now maintained by RCSB PDB on GitHub.
Examples
This section demonstrates how to use the RCSB PDB Search API to perform complex searches.
Biological Assembly Search
This query finds symmetric dimers having a twofold rotation with the DNA-binding domain of a heat-shock transcription factor.
X-Ray Structures Search
This query finds PDB structures of viral thymidine kinase with substrates/inhibitors, determined by X-ray crystallography at a resolution better than 2.5 Å.
Protein Sequence Search
In this example, we use the sequence search to find macromolecular PDB entities that share at least 90% sequence identity with the GTPase HRas protein from Gallus gallus (Chicken).
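A sequence search of this kind could be expressed roughly as below. The function name is hypothetical and the E-value cutoff is an illustrative default; the sequence passed in the usage line is a placeholder, not the HRas sequence, which should be taken from a trusted source such as UniProt.

```python
import json


def sequence_identity_request(sequence, identity_cutoff=0.9, evalue_cutoff=1):
    """Build a protein sequence search returning polymer entities."""
    if not sequence:
        raise ValueError("a query sequence is required")
    return {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "evalue_cutoff": evalue_cutoff,
                # Identity is given as a fraction: 0.9 means at least 90%.
                "identity_cutoff": identity_cutoff,
                "sequence_type": "protein",
                "value": sequence,
            },
        },
        "return_type": "polymer_entity",
    }


# PLACEHOLDER ONLY: not a real protein sequence; substitute the query sequence.
PLACEHOLDER_SEQUENCE = "ACDEFGHIKLMNPQRSTVWY"
sequence_request = sequence_identity_request(PLACEHOLDER_SEQUENCE)
print(json.dumps(sequence_request, indent=2))
```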
3D-shape Search
This example demonstrates how structure search can be used to find PDB structures of calmodulin with conformational changes upon Ca2+ binding.
Calmodulin (CaM) protein has two homologous globular domains connected by a flexible linker. Ca2+ binding to each globular domain causes a change from a “closed” to an “open” conformation. This query finds calmodulin structures in the “open” conformation.
As the structure query input parameter we use the crystal structure of Ca2+-loaded calmodulin (PDB entry 1CLL). This query is combined with a text search for the CA chemical component ID. Note: if you leave out the query clause matching Ca2+ ions, you will also get calmodulin structures in complex with other metals (e.g. strontium in 4BW7).
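The combination of a 3D-shape query with a chemical component filter might look like the sketch below. The assembly ID "1", the strict shape-match operator and the text attribute used to match the CA component are assumptions; consult the attribute reference for the attribute that best expresses "entry contains this component".

```python
import json


def shape_search_with_component(entry_id, comp_id, assembly_id="1",
                                operator="strict_shape_match"):
    """AND together a structure (3D-shape) query and a chemical component filter."""
    structure_node = {
        "type": "terminal",
        "service": "structure",
        "parameters": {
            "value": {"entry_id": entry_id, "assembly_id": assembly_id},
            "operator": operator,
        },
    }
    component_node = {
        "type": "terminal",
        "service": "text",
        "parameters": {
            # ASSUMPTION: attribute holding chemical component identifiers.
            "attribute": "rcsb_chem_comp_container_identifiers.comp_id",
            "operator": "exact_match",
            "value": comp_id,
        },
    }
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [structure_node, component_node],
        },
        "return_type": "entry",
    }


calmodulin_request = shape_search_with_component("1CLL", "CA")
print(json.dumps(calmodulin_request, indent=2))
```

Removing `component_node` from the group reproduces the broader search mentioned in the note, which also returns calmodulin bound to other metals.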
Free Ligand Search
Ligands are considered “free ligands” when they interact non-covalently with macromolecules. This example shows how to find non-polymer entities of ATP that occur as a “free ligand”.
Sequence Motif Search
A sequence motif search finds macromolecular PDB entities that contain a specific sequence motif. This example retrieves occurrences of the His2/Cys2 Zinc Finger DNA-binding domain as represented by its PROSITE signature.
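A sequence motif request for this example could be put together as follows. The pattern is the PROSITE signature for the C2H2-type zinc finger (PS00028); the function name and the client-side check of pattern types are illustrative additions.

```python
import json

# PROSITE signature PS00028 (C2H2-type zinc finger), in PROSITE pattern syntax.
ZINC_FINGER_C2H2 = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."

PATTERN_TYPES = {"simple", "prosite", "regex"}


def seqmotif_request(pattern, pattern_type="prosite", sequence_type="protein"):
    """Build a sequence motif search returning polymer entities."""
    if pattern_type not in PATTERN_TYPES:
        raise ValueError(f"pattern_type must be one of {sorted(PATTERN_TYPES)}")
    return {
        "query": {
            "type": "terminal",
            "service": "seqmotif",
            "parameters": {
                "value": pattern,
                "pattern_type": pattern_type,
                "sequence_type": sequence_type,
            },
        },
        "return_type": "polymer_entity",
    }


motif_request = seqmotif_request(ZINC_FINGER_C2H2)
print(json.dumps(motif_request, indent=2))
```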
Chemical Similarity Search
This example demonstrates how to find molecular definitions chemically similar to Tylenol, defined by its InChI string. Note that the parameter match_type="graph-strict" does not imply an exact structure match: the result set contains acetaminophen (TYL) together with its methoxy (T9V) and ethoxy (N4E) analogs.
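The chemical similarity query described above might be written as in this sketch. The InChI string is the standard InChI for acetaminophen; the function name is hypothetical, and the parameter layout should be confirmed against the chemical service reference.

```python
import json

# Standard InChI for acetaminophen (Tylenol).
ACETAMINOPHEN_INCHI = (
    "InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)"
)


def chemical_descriptor_request(descriptor, descriptor_type="InChI",
                                match_type="graph-strict"):
    """Build a chemical descriptor search returning molecular definitions."""
    if descriptor_type not in {"InChI", "SMILES"}:
        raise ValueError("descriptor_type must be 'InChI' or 'SMILES'")
    return {
        "query": {
            "type": "terminal",
            "service": "chemical",
            "parameters": {
                "value": descriptor,
                "type": "descriptor",
                "descriptor_type": descriptor_type,
                "match_type": match_type,
            },
        },
        "return_type": "mol_definition",
    }


chemical_request = chemical_descriptor_request(ACETAMINOPHEN_INCHI)
print(json.dumps(chemical_request, indent=2))
```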
Search by UniProt Accession
This example shows how to search for PDB entities using an associated UniProt accession code.
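One way to express a UniProt accession lookup is to require both the accession and the database name on the reference sequence identifiers, as sketched below. The attribute names are assumptions to verify against the attribute reference, and P69905 is used purely as an illustrative accession.

```python
import json

# ASSUMPTION: attribute prefix for reference sequence cross-references.
REF_PREFIX = "rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers"


def uniprot_accession_request(accession):
    """Build a text search for polymer entities mapped to a UniProt accession."""
    def exact(attribute, value):
        return {
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": attribute,
                "operator": "exact_match",
                "value": value,
            },
        }

    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                exact(f"{REF_PREFIX}.database_accession", accession),
                exact(f"{REF_PREFIX}.database_name", "UniProt"),
            ],
        },
        "return_type": "polymer_entity",
    }


uniprot_request = uniprot_accession_request("P69905")  # illustrative accession
print(json.dumps(uniprot_request, indent=2))
```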
Structure Motif Search
A structure motif search finds macromolecular PDB assemblies that contain a small number of residues in a specific geometric arrangement (e.g. residues that constitute the catalytic center or a binding site). This example retrieves occurrences of the enolase superfamily, a group of proteins diverse in sequence and structure that are all capable of abstracting a proton from a carboxylic acid. Position-specific exchanges are crucial to represent this superfamily accurately.
Combining Search Services
This example shows how to compose text, sequence, structure, and chemical queries employing the Boolean operator AND.
The search yields structures (entries) matching all criteria, including co-crystal structures with the desired bound inhibitor, matching the SMILES string for a small-molecule inhibitor designated 7J (QYS).
Sequence Cluster Statistics
This example shows how to get the number of distinct protein sequences in the PDB archive.
Newly Released Structures
This example shows how to get a list of all PDB IDs for this week's newly released structures.
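A request for recently released entries might be sketched as below. It assumes that the release date attribute is rcsb_accession_info.initial_release_date and that the service accepts relative date expressions such as "now-1w"; both should be verified against the reference documentation. return_all_hits is set so that the list is not truncated by default pagination.

```python
import json


def newly_released_request(since="now-1w"):
    """Build a request for all entries released after the given (relative) date."""
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                # ASSUMPTION: attribute name and relative-date syntax.
                "attribute": "rcsb_accession_info.initial_release_date",
                "operator": "greater",
                "value": since,
            },
        },
        # Ask for every hit rather than a single page of results.
        "request_options": {"return_all_hits": True},
        "return_type": "entry",
    }


release_request = newly_released_request()
print(json.dumps(release_request, indent=2))
```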
Membrane Proteins
This example shows how to get a list of PDB IDs of entries that are annotated as membrane proteins by at least one relevant external resource.
Symmetry and Enzyme Classification
This example shows how to get assembly counts per symmetry type, further broken down by Enzyme Classification (EC) class. The assemblies are first filtered to homo-oligomers only.
Computed Structure Models
This example shows how to find PDB structures and Computed Structure Models for a given UniProt sequence.
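To include Computed Structure Models alongside experimental structures, a request can set results_content_type in request_options. The helper below is a hypothetical convenience that adds this option to any existing request; the attribute in the sample query is an assumption and the accession is illustrative.

```python
import copy
import json


def include_computed_models(request):
    """Return a copy of `request` that searches both CSMs and experimental structures."""
    combined = copy.deepcopy(request)
    options = combined.setdefault("request_options", {})
    options["results_content_type"] = ["computational", "experimental"]
    return combined


# Sample query (assumed attribute, illustrative accession).
experimental_only = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_polymer_entity_container_identifiers"
                         ".reference_sequence_identifiers.database_accession",
            "operator": "exact_match",
            "value": "P69905",
        },
    },
    "return_type": "polymer_entity",
}

with_models = include_computed_models(experimental_only)
print(json.dumps(with_models["request_options"]))
```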
Structure Search with Custom Data
This example showcases how to search with structures not deposited in the PDB archive by pointing to external URLs such as predictions from AlphaFold DB, ModelArchive, or SWISS-MODEL. Any publicly available URL can be referenced.
This feature can be used for structure (3D-shape) and strucmotif (structure motif) searches. Required inputs are the file location (url) and format ('cif' or 'bcif' for BinaryCIF). Gzipped content is supported as well.
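A structure search against custom data replaces the PDB identifiers in the query value with the url and format inputs described above. In the sketch below the function name, the shape-match operator and the return type are assumptions, and the URL in the usage line is a placeholder rather than a real model file.

```python
import json

SUPPORTED_FORMATS = {"cif", "bcif"}  # mmCIF text and BinaryCIF


def custom_structure_request(url, file_format="cif",
                             operator="strict_shape_match", return_type="entry"):
    """Build a structure (3D-shape) search whose query structure is fetched from a URL."""
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"format must be one of {sorted(SUPPORTED_FORMATS)}")
    return {
        "query": {
            "type": "terminal",
            "service": "structure",
            "parameters": {
                "value": {"url": url, "format": file_format},
                "operator": operator,
            },
        },
        "return_type": return_type,
    }


# PLACEHOLDER URL: point this at a publicly reachable model file.
custom_request = custom_structure_request("https://example.org/models/model.cif")
print(json.dumps(custom_request, indent=2))
```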
Migration Guides
Migrating from Legacy Search API
Applications written on top of the Legacy Search APIs no longer work because these services have been discontinued. This migration guide describes the steps needed to convert applications from the Legacy Search API Web Service to the new RCSB Search API.
Migrating from v1 to v2
This guide contains the information you need to migrate from the deprecated API version v1 to the newer version v2.
Acknowledgements
To cite this service, please reference:
- Rose, Y., Duarte, J. M., Lowe, R., Segura, J., Bi, C., Bhikadiya, C., ... & Westbrook, J. D. (2021). RCSB Protein Data Bank: architectural advances towards integrated searching and efficient access to macromolecular structure data from the PDB archive. Journal of molecular biology, 433(11), 166704. DOI: 10.1016/j.jmb.2020.11.003
- Bittrich, S., Bhikadiya, C., Bi, C., Chao, H., Duarte, J. M., Dutta, S., ... & Rose, Y. (2023). RCSB Protein Data Bank: Efficient Searching and Simultaneous Access to One Million Computed Structure Models Alongside the PDB Structures Enabled by Architectural Advances. Journal of Molecular Biology, 167994. DOI: 10.1016/j.jmb.2023.167994
Related publications:
- Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., ... & Bourne, P. E. (2000). The protein data bank. Nucleic acids research, 28(1), 235-242. DOI: 10.1093/nar/28.1.235
- Burley, S. K., Berman, H. M., Bhikadiya, C., Bi, C., Chen, L., Di Costanzo, L., ... & Zardecki, C. (2019). RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic acids research, 47(D1), D464-D474. DOI: 10.1093/nar/gky1004
- Burley, S. K., Bhikadiya, C., Bi, C., Bittrich, S., Chao, H., Chen, L., ... & Zardecki, C. (2023). RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Research, 51(D1), D488-D508. DOI: 10.1093/nar/gkac1077
Contact Us
Contact info@rcsb.org with questions or feedback about this service.