RCSB PDB: Search API Documentation

  • Search API Basics
  • Query Language
  • Faceted Aggregations
  • Search Operators
  • Date Math Expressions
  • Search Attributes
  • Search Results
  • API Clients
  • Examples
  • Acknowledgements
  • Contact Us
  • RCSB PDB Search API

    This document explains how to use the RCSB PDB Search API, which allows users to run queries across RCSB PDB Search Services and retrieve a list of relevant identifiers such as PDB IDs, entity IDs, assembly IDs, etc.

    The Search API is a RESTful API over HTTP with JSON payloads. It accepts HTTP GET or POST requests. Refer to the RCSB PDB Search API Full Reference for complete API documentation.

    Introduction

    The base URI for calls to the Search API is https://search.rcsb.org/rcsbsearch/v1/query.

    The search request body should be passed as a URL-encoded query string in the json parameter: https://search.rcsb.org/rcsbsearch/v1/query?json={search-request}. The query syntax for the {search-request} is detailed in the Query Language section of this guide. See the Build Your Search section for general information on how to construct the {search-request} body.
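
    As a minimal sketch of the encoding step described above, the snippet below serializes a request object to JSON and percent-encodes it into the json parameter. The helper name build_query_url is hypothetical, and the "type" and "service" keys in the sample request are assumptions about the node layout, not something stated in this section.

```python
import json
from urllib.parse import quote, unquote

BASE_URI = "https://search.rcsb.org/rcsbsearch/v1/query"

def build_query_url(search_request: dict) -> str:
    """Return a GET URL with the request serialized into the json parameter."""
    # Compact separators keep the URL short; quote() percent-encodes the JSON.
    payload = json.dumps(search_request, separators=(",", ":"))
    return f"{BASE_URI}?json={quote(payload, safe='')}"

# Hypothetical sample request; "type" and "service" are assumed key names.
request = {"query": {"type": "terminal", "service": "text"}, "return_type": "entry"}
url = build_query_url(request)

# Round trip: decoding the parameter yields the original request object.
decoded = json.loads(unquote(url.split("json=", 1)[1]))
print(url)
```

    The URL is only constructed here, not sent; any HTTP client can issue the GET request.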

    The Search API is designed to return only the identifiers of relevant hits (see the Return Type section for the identifier types that can be requested) together with additional metadata. See the Response Body section for more information. If you need to extract information on release date, macromolecules, organisms, resolution, modified residues, ligands, etc., use the RCSB Data API: https://data.rcsb.org.

    Build Your Search

    A search request is a complete specification of what should be returned in a result set. The search request is represented as a JSON object. The building blocks of the request are:

    Context Description
    return_type Required. Specifies the type of the returned identifiers, e.g. entry, polymer entity, assembly, etc. See Return Type section for more information.
    query Optional. Specifies the search expression. Can be omitted when facets or a count operation should be performed instead of ID retrieval. In this case the request must be configured via the request_options context.
    request_options Optional. Controls various aspects of the search request including pagination, sorting, scoring and faceting. If omitted, the default parameters for sorting, scoring and pagination will be applied.
    request_info Optional. Specifies additional information about the query, e.g. query_id. It is used internally at RCSB PDB for logging purposes. When query_id is sent with the search request, it is included in the corresponding response object.
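    The query context may consist of two types of clauses: terminal nodes, each holding a single search expression, and group nodes, which logically combine terminal nodes or other group nodes (see Boolean Expressions).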

    The simplest query requires only the return_type parameter and the query context. When the parameters property is left out of the query object, the query matches all documents and returns PDB IDs if return_type is set to "entry".
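
    The sketch below shows what such a minimal request could look like and wraps it in an unsent POST object. It assumes that a POST body carries the same JSON object as the json parameter and that query nodes use "type" and "service" keys; neither detail is spelled out in the text above, and the helper name as_post is hypothetical.

```python
import json
from urllib.request import Request

QUERY_URL = "https://search.rcsb.org/rcsbsearch/v1/query"

# Assumed node layout: "type" and "service" are not named in the text above.
match_all_request = {
    "query": {"type": "terminal", "service": "text"},  # no "parameters": match all
    "return_type": "entry",
}

def as_post(search_request: dict) -> Request:
    """Wrap the request in an (unsent) HTTP POST object with a JSON body."""
    body = json.dumps(search_request).encode("utf-8")
    return Request(QUERY_URL, data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

post = as_post(match_all_request)
print(post.get_method(), post.full_url)
```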

    Refer to the Examples section for more examples.

    Search Services

    The RCSB PDB Search API consolidates requests to heterogeneous search services. The list of available services is below:

    Service Description
    text Performs linguistic searches against textual annotations associated with PDB structures. Refer to Search Attributes page for a full list of annotations.
    sequence Employs the MMseqs2 software to perform fast, BLAST-like sequence matching searches based on a user-provided FASTA sequence (with E-value or % identity cutoffs). The following search targets are available:
    • pdb_protein_sequence: all current protein sequences in PDB
    • pdb_dna_sequence: all current DNA sequences in PDB
    • pdb_rna_sequence: all current RNA sequences in PDB
    seqmotif Performs short motif searches against nucleotide or protein sequences, using three different types of input format:
    • simple (e.g., CXCXXL)
    • prosite (e.g., C-X-C-X(2)-[LIVMYFWC])
    • regex (e.g., CXCX{2}[LIVMYFWC])
    structure Performs fast searches matching a global 3D shape of assemblies or chains of a given entry (identified by PDB ID), in either strict (strict_shape_match) or relaxed (relaxed_shape_match) modes, using a BioZernike descriptor strategy.
    strucmotif Performs structural motif searches on all available PDB structures.
    chemical

    Enables queries of small-molecule constituents of PDB structures, based on chemical formula and chemical structure. Both molecular formula and formula range searches are supported. Queries for matching and similar chemical structures can be performed using SMILES and InChI descriptors as search targets. Graph and chemical fingerprint searches are implemented using tools from the OpenEye Chemical Toolkit.

    Descriptor Matching Criteria:

    The following graph matching searches use a fingerprint prefilter, so they are designed to find only similar molecules. These graph matching comparisons include:

    • graph-exact: atom type, formal charge, bond order, atom and bond chirality, aromatic assignment, valence degree, and atom hydrogen count are used as matching criteria for this search type. Graph matching is performed on the subset of molecules satisfying a fingerprint screening search. Results will include isomorphic and substructure matches within this screened subset.
    • graph-strict: atom type, formal charge, bond order, atom and bond chirality, aromatic assignment, ring membership, and valence degree are used as matching criteria for this search type. Graph matching is performed on the subset of molecules satisfying a fingerprint screening search. Results will include isomorphic and substructure matches within this screened subset.
    • graph-relaxed: atom type, formal charge and bond order are used as matching criteria for this search type. Graph matching is performed on the subset of molecules satisfying a fingerprint screening search. Results will include isomorphic and substructure matches within this screened subset.
    • graph-relaxed-stereo: atom type, formal charge, bond order, atom and bond chirality are used as matching criteria for this search type. Graph matching is performed on the subset of molecules satisfying a fingerprint screening search. Results will include isomorphic and substructure matches within this screened subset.
    • fingerprint-similarity: Tanimoto similarity is used as the matching criterion for molecular fingerprints. Matches include molecules with scores exceeding 0.6 for TREE type fingerprints or 0.9 for MACCS type fingerprints.
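
    To illustrate the fingerprint-similarity criterion, here is a small sketch of the Tanimoto coefficient on fingerprints modeled as sets of "on" bit positions, with the two thresholds quoted above. The set representation and the function names are illustrative assumptions; the service itself relies on OpenEye fingerprints, which are not reproduced here.

```python
from typing import Set

# Thresholds quoted in the list above; a match must exceed them.
THRESHOLDS = {"TREE": 0.6, "MACCS": 0.9}

def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_match(score: float, fingerprint_type: str) -> bool:
    return score > THRESHOLDS[fingerprint_type]

score = tanimoto({1, 2, 3}, {2, 3, 4})  # 2 shared bits out of 4 distinct bits
print(score, is_match(score, "TREE"))
```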

    The following graph matching searches perform an exhaustive substructure search with no pre-screening. These substructure graph matching comparisons include:

    • sub-struct-graph-exact (atom type, formal charge, aromaticity, bond order, atom/bond stereochemistry, valence degree, atom hydrogen count)
    • sub-struct-graph-strict (atom type, formal charge, aromaticity, bond order, atom/bond stereochemistry, ring membership, valence degree)
    • sub-struct-graph-relaxed (atom type, formal charge, bond type)
    • sub-struct-graph-relaxed-stereo (atom type, formal charge, bond type, atom/bond stereochemistry)

    Return Type

    The search can return one of the following result types:

    Type Description
    entry Returns a list of PDB IDs.
    assembly Returns a list of PDB IDs appended with assembly IDs in the format of a [pdb_id]-[assembly_id], corresponding to biological assemblies.
    polymer_entity Returns a list of PDB IDs appended with entity IDs in the format of a [pdb_id]_[entity_id], corresponding to polymeric molecular entities.
    non_polymer_entity Returns a list of PDB IDs appended with entity IDs in the format of a [pdb_id]_[entity_id], corresponding to non-polymeric entities (or ligands).
    polymer_instance Returns a list of PDB IDs appended with asym IDs in the format of a [pdb_id].[asym_id], corresponding to instances of certain polymeric molecular entities, also known as chains. Note that asym_id in the instance identifier corresponds to the _label_asym_id from the mmCIF schema (assigned by the PDB). It can differ from _auth_asym_id (selected by the author at the time of deposition).
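
    The identifier formats in the table above can be split back into their parts with a few lines of code. The following is a sketch only: the function name split_identifier is hypothetical and the sample identifiers are illustrative.

```python
from typing import Optional, Tuple

# Separator per return type, as listed in the table above.
SEPARATORS = {
    "assembly": "-",
    "polymer_entity": "_",
    "non_polymer_entity": "_",
    "polymer_instance": ".",
}

def split_identifier(identifier: str, return_type: str) -> Tuple[str, Optional[str]]:
    """Split a result identifier into (pdb_id, sub_id); entries have no sub_id."""
    if return_type == "entry":
        return identifier, None
    pdb_id, _, sub_id = identifier.partition(SEPARATORS[return_type])
    if not sub_id:
        raise ValueError(f"{identifier!r} is not a valid {return_type} identifier")
    return pdb_id, sub_id

print(split_identifier("4HHB-1", "assembly"))
print(split_identifier("4HHB.A", "polymer_instance"))
```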

    Query Language

    The Search API provides a full query DSL (domain-specific language) based on JSON to define queries.

    Basic Search

    The query language supports unstructured (basic) searches. An unstructured query searches the textual annotations associated with PDB structures when the field name is unknown. Such a query searches across all searchable fields and returns a hit if a match occurs in any field.

    To perform an unstructured search, you should send the parameters object without an explicit attribute property:
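
    A request of this kind might be assembled as in the sketch below. The "type" and "service" keys are assumptions about the node layout, the helper name basic_search is hypothetical, and the search text is only an illustration.

```python
import json

def basic_search(text: str, return_type: str = "entry") -> dict:
    """Build an unstructured query: parameters carry a value but no attribute."""
    return {
        "query": {
            "type": "terminal",   # assumed key and value
            "service": "text",    # assumed key; "text" is a service listed earlier
            "parameters": {"value": text},
        },
        "return_type": return_type,
    }

request = basic_search("thymidine kinase")
print(json.dumps(request, indent=2))
```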

    Refer to the Examples section for more examples.

    Complex boolean queries in the basic search can be built with operators such as + (AND) and | (OR).

    For example, using an interferon + response + factor query string is equivalent to running an interferon AND response AND factor search.

    You can use ( and ) to signify precedence. For example, searching with a query string isopeptide + ( collagen | fibrinogen ) returns structures that contain isopeptide AND either collagen OR fibrinogen.

    Attribute Search

    An attribute query searches for terms in a specific attribute. To perform an attribute search, send the parameters object with the attribute property set to a field name, the value property set to a search term, and the operator property set to a search operator.
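
    The sketch below builds such a parameters object. The attribute and operator names are taken from elsewhere in this guide; the surrounding "type" and "service" keys, the helper name attribute_search, and the threshold value are assumptions for illustration.

```python
import json

def attribute_search(attribute: str, operator: str, value, return_type: str = "entry") -> dict:
    """Build a text-service query that targets one attribute explicitly."""
    return {
        "query": {
            "type": "terminal",   # assumed key and value
            "service": "text",    # assumed key
            "parameters": {"attribute": attribute, "operator": operator, "value": value},
        },
        "return_type": return_type,
    }

# Entities with a formula weight above 100 (the threshold is an arbitrary example).
request = attribute_search("rcsb_polymer_entity.formula_weight", "greater", 100,
                           return_type="polymer_entity")
print(json.dumps(request, indent=2))
```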

    Refer to the Examples section for more examples.

    When using attribute search, you must observe the following rules:

    Negation

    To negate an operator, set the negation property to true in the query parameters object. The following search returns non-protein polymeric entities:
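
    A request along these lines is sketched below. The attribute name entity_poly.rcsb_entity_polymer_type and the value "Protein" are assumptions that should be checked against the Search Attributes page, as are the "type" and "service" keys.

```python
import json

negated_request = {
    "query": {
        "type": "terminal",   # assumed key and value
        "service": "text",    # assumed key
        "parameters": {
            "attribute": "entity_poly.rcsb_entity_polymer_type",  # assumed attribute name
            "operator": "exact_match",
            "value": "Protein",
            "negation": True,  # invert the match: everything that is NOT a protein
        },
    },
    "return_type": "polymer_entity",
}

# json.dumps renders Python's True as JSON true.
print(json.dumps(negated_request, indent=2))
```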

    Refer to the Examples section for more examples.

    Case-Sensitive Search

    By default, searches performed using exact match operators are case-insensitive. You can make your search case-sensitive by setting the case_sensitive property in the query parameters object to true. This option can be useful when capitalization conveys additional information. For example, gene symbols can differ in capitalization between homologs from different species: human gene symbols are always upper case.

    The following search returns human kinases encoded by the ABL1 gene. It excludes results where the case doesn't match, such as non-receptor tyrosine-protein kinase from mouse encoded by the Abl1 gene.
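
    One way to express this search is sketched below. The attribute name used for the gene symbol is an assumption (consult the Search Attributes page for the exact name), as are the "type" and "service" keys.

```python
import json

case_sensitive_request = {
    "query": {
        "type": "terminal",   # assumed key and value
        "service": "text",    # assumed key
        "parameters": {
            # Assumed attribute name for the source-organism gene symbol.
            "attribute": "rcsb_entity_source_organism.rcsb_gene_name.value",
            "operator": "exact_match",
            "value": "ABL1",
            "case_sensitive": True,  # "Abl1" (mouse) no longer matches
        },
    },
    "return_type": "polymer_entity",
}

print(json.dumps(case_sensitive_request, indent=2))
```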

    Refer to the Examples section for more examples.

    Boolean Expressions

    The query language supports two boolean operators: AND and OR. A boolean operator is added to a group node as its logical_operator property. Group nodes can be used to logically combine search expressions (terminal nodes) or other group nodes:
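
    The nesting of group and terminal nodes can be sketched with two small helpers. The "type", "service", and "nodes" keys and the lowercase spelling of the operator values are assumptions; the helper names and the journal names are purely illustrative.

```python
import json

def terminal(attribute: str, operator: str, value) -> dict:
    """A terminal node: one search expression against the text service."""
    return {"type": "terminal", "service": "text",
            "parameters": {"attribute": attribute, "operator": operator, "value": value}}

def group(logical_operator: str, *nodes: dict) -> dict:
    """A group node combining terminal nodes or other group nodes."""
    if logical_operator not in ("and", "or"):  # assumed lowercase spelling
        raise ValueError("logical_operator must be 'and' or 'or'")
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}

journal = "rcsb_primary_citation.rcsb_journal_abbrev"
query = group(
    "and",
    terminal("rcsb_polymer_entity.formula_weight", "greater", 50),
    group("or",
          terminal(journal, "exact_match", "Nature"),
          terminal(journal, "exact_match", "Science")),
)
print(json.dumps({"query": query, "return_type": "entry"}, indent=2))
```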

    Refer to the Examples section for more examples.

    Scoring Strategy

    You can customize how scores from different services affect the final relevancy ranking of your search results by setting scoring_strategy in the request_options context. The following scoring strategies are available: combined (default), sequence, seqmotif, strucmotif, structure, chemical, and text. For example, to boost search results based on the relevance score produced by the sequence search service, use the sequence scoring strategy.

    The final relevancy score is calculated as a weighted sum of the normalized scores produced by the different search services. With the combined strategy, equal weights are applied. With any other strategy, the scores of the selected service receive a higher weight, so that service contributes more to the final score and has more influence on the ranking.
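
    The effect of the weighting can be illustrated with a toy calculation. The weights and scores below are invented for illustration; the actual weights applied by the service are not given in this guide.

```python
from typing import Dict

def final_score(norm_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-service normalized scores."""
    return sum(weights[service] * score for service, score in norm_scores.items())

# Hypothetical normalized scores of one hit from two services.
norm_scores = {"text": 0.4, "sequence": 0.9}

# Equal weights, as in the combined strategy.
combined = final_score(norm_scores, {"text": 0.5, "sequence": 0.5})
# Hypothetical weights favoring the sequence service.
boosted = final_score(norm_scores, {"text": 0.2, "sequence": 0.8})
print(round(combined, 2), round(boosted, 2))
```

    With the sequence score weighted more heavily, a hit that scores well in sequence search ends up with a higher final score than under equal weights.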

    Sorting

    Sorting is determined by the sort object in the request_options context. It lets you add one or more sorting conditions to control the order of the search result hits. Sorting is defined per field; the special field name score sorts by relevancy score (the default).

    Refer to the Search Attributes page to find all searchable attributes. Any attribute listing exact_match or equals operators can be used for sorting.

    By default, sorting is done in descending order ("desc"). The sort can be reversed by setting the direction property to "asc". This example demonstrates how to sort the search results by release date:
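
    A sort configuration could be assembled as in the following sketch. The "sort_by" key, the list wrapper around the sort conditions, and the release-date attribute name are assumptions to be verified against the API reference; only "direction" and its two values come from the text above.

```python
import json

def sort_options(field: str = "score", direction: str = "desc") -> dict:
    """Build a request_options fragment with a single sort condition."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    return {"sort": [{"sort_by": field, "direction": direction}]}  # "sort_by" is assumed

request = {
    "query": {"type": "terminal", "service": "text"},  # assumed node layout
    # Assumed attribute name for the release date.
    "request_options": sort_options("rcsb_accession_info.initial_release_date", "desc"),
    "return_type": "entry",
}
print(json.dumps(request, indent=2))
```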

    Refer to the Examples section for more examples.

    Pagination

    By default, only the first 10 hits are included in the search result list. Pagination can be configured with the start and rows parameters of the pager object in the request_options context.

    Returning all hits is generally not desirable and may cause performance issues. However, if you need to retrieve all matched hits, consider adding the return_all_hits parameter to the request_options context.
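
    As a sketch of paging through a result list, the helpers below generate successive pager objects and merge them into a request. They assume that start is a zero-based offset, which this section does not state explicitly; the helper names are hypothetical.

```python
from typing import Dict, Iterator

def pager_windows(total_count: int, rows: int = 10) -> Iterator[Dict[str, int]]:
    """Yield pager objects that together cover total_count hits."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    for start in range(0, total_count, rows):  # assumes a zero-based offset
        yield {"start": start, "rows": rows}

def with_pager(base_request: dict, pager: Dict[str, int]) -> dict:
    """Return a copy of the request with the pager set in request_options."""
    options = dict(base_request.get("request_options", {}), pager=pager)
    return dict(base_request, request_options=options)

base = {"query": {"type": "terminal", "service": "text"}, "return_type": "entry"}
pages = [with_pager(base, pager) for pager in pager_windows(25, rows=10)]
print([page["request_options"]["pager"] for page in pages])
```

    The total number of hits would normally come from the total_count field of the first response.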

    Refer to the Examples section for more examples.

    Counting Results

    By default, the search results contain a list of matched identifiers and additional metadata. See Search Results for more details. The return_counts flag in the request_options context allows you to execute a search query and get back only the number of matches for that query. The following query returns the number of current structures in the PDB archive:

    Refer to the Examples section for more examples.

    Faceted Aggregations

    Faceted aggregations (or facets) provide you with the ability to group and perform calculations and statistics on PDB data by using a simple search query. Facets are the arrangement of search results into categories (buckets) based on the requested field values.

    If the facets property is specified in the request_options context, the search results are presented along with numerical counts of how many matching IDs were found for each term requested in the facets. If the query context is omitted in the search request with facets specified, the response will contain only the facet counts.

    This example calculates the breakdown by experimental method of PDB structures released after 2019-08-20:

    Refer to the Examples section for more examples.

    Terms Facets

    Terms faceting is a multi-bucket aggregation where buckets are dynamically built - one per unique value. For each bucket terms faceting counts the documents (entry, polymer_entity, etc.) that contain a given value in a given field. For example, you can run the terms aggregation on the field rcsb_primary_citation.rcsb_journal_abbrev which holds the abbreviated name of a journal associated with an entry. In return, we have buckets for each journal, each with their PDB entry counts. You can also specify a threshold for a document count, e.g. here only journals associated with at least 1000 entries are returned:

    Refer to the Examples section for more examples.

    Histogram Facets

    Histogram faceting is a multi-bucket aggregation that can be applied to numeric values. It builds fixed-size buckets (intervals) over the values. For example, for the rcsb_polymer_entity.formula_weight field, which holds the formula mass (kDa) of the entity, we can configure this aggregation to build buckets with an interval of 50 kDa:

    Refer to the Examples section for more examples.

    Date Histogram Facets

    This multi-bucket aggregation is similar to the histogram aggregation, but it can only be used with date values. Calendar-aware intervals are configured with the interval parameter. For example, we can configure this aggregation to build buckets with 1 year intervals:

    Refer to the Examples section for more examples.

    Range Facets

    A multi-bucket aggregation that lets the user define a set of numeric ranges, each representing a bucket. Note that this aggregation includes the from value and excludes the to value for each range. Omitting the from or to parameter creates a bucket that extends to the minimum or maximum value, respectively. Example:
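
    The bucket boundaries described above (from included, to excluded, missing bounds left open) can be reproduced locally with a short sketch. This only illustrates the semantics; it is not how the service computes facets, and the sample values are invented.

```python
from typing import List, Optional, Sequence, Tuple

Range = Tuple[Optional[float], Optional[float]]  # (from, to); None means open-ended

def bucket_counts(values: Sequence[float], ranges: Sequence[Range]) -> List[int]:
    """Count values per range, including the lower and excluding the upper bound."""
    counts = [0] * len(ranges)
    for value in values:
        for i, (lower, upper) in enumerate(ranges):
            if (lower is None or value >= lower) and (upper is None or value < upper):
                counts[i] += 1
    return counts

# Invented sample values; 2.0 and 3.0 sit exactly on bucket boundaries.
values = [1.5, 2.0, 2.5, 3.0]
ranges = [(None, 2.0), (2.0, 3.0), (3.0, None)]
counts = bucket_counts(values, ranges)
print(counts)
```

    The value 2.0 lands in the second bucket and 3.0 in the third, because each boundary value belongs to the range that starts there.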

    Refer to the Examples section for more examples.

    Date Range Facets

    This multi-bucket aggregation is similar to the range aggregation but dedicated for date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in date math expressions. Example:

    Refer to the Examples section for more examples.

    Cardinality Facets

    Cardinality faceting is a single-value metrics aggregation that calculates a count of distinct values returned for a given field. For example, you can count unique source organism name assignments in the PDB archive:

    Refer to the Examples section for more examples.

    Filter Facets

    As its name suggests, the filter aggregation helps you filter documents that contribute to bucket count. In the example below, we are filtering only protein chains which adopt 2 different beta propeller arrangements according to the CATH classification:

    Refer to the Examples section for more examples.

    Multi-Dimensional Facets

    Complex, multi-dimensional aggregations are possible as in the example below:

    Refer to the Examples section for more examples.

    Search Operators

    Search operators are commands that help you make your search more specific and focused. The following operators can be used to perform a field search:

    Exact Match Operators

    Exact match operators indicate that the input value should match a field value exactly (including whitespaces, special characters and case).

    exact_match

    You can use the exact_match operator to find exact occurrences of the input value. Comparisons with exact_match operator are case-sensitive.

    A single value input is required for this operator and must be a string.

    in

    The in operator allows you to specify multiple values in a single search expression. It returns results if any value in a list of input values matches. It can be used instead of multiple OR conditions. Comparisons with in operator are case-sensitive.

    An input value is required for this operator and it must be a list of strings, numbers or dates.

    Full-Text Operators

    The full-text operators enable you to perform linguistic searches against text data by operating on words and phrases. The input text is analyzed before a search is performed. The analysis includes the following transformations:

    The standard grammar based tokenization is used to break input text into tokens. Refer to the Unicode Text Segmentation documentation for more information on tokenization rules.

    contains_words

    The contains_words operator performs a full-text search by operating on the words in the provided text. After the text is broken into tokens, more basic queries are constructed and combined with OR boolean logic. For example, "actin-binding protein" will be interpreted as "actin" OR "binding" OR "protein". The search returns results if any of these tokens match. This operator can match multiple tokens in any order.

    A single value input is required for this operator and it must be a string.

    contains_phrase

    The contains_phrase operator performs a full-text search by operating on phrases. The operator returns a result only if all tokens of the input are present in the field and occur in the same order as in the input.

    For example, "actin-binding protein" will be interpreted as "actin" AND "binding" AND "protein" occurring in a given order.

    A single value input is required for this operator and it must be a string.

    Comparison Operators

    The greater, less, greater_or_equal, less_or_equal, and equals operators match fields whose values are, respectively, larger than, smaller than, larger than or equal to, smaller than or equal to, or equal to the given input value.

    A single value input is required for these operators and it must be a number or date.

    Range Operator

    The range operator can be used to match values within a provided range.

    A single value input is required for this operator and it must be an object as follows:

    By default, lower and upper bounds are excluded. They can be included by setting include_lower and include_upper to true respectively. An inclusive bound means that the boundary point itself is included in the range as well, while an exclusive bound means that the boundary point is not included in the range.
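
    The bound handling can be sketched as a local check that mirrors the defaults described above (both bounds excluded unless include_lower or include_upper is set). The "from" and "to" keys in the sample value object are assumptions, since the object layout is not reproduced in this section.

```python
def in_range(x: float, lower: float, upper: float,
             include_lower: bool = False, include_upper: bool = False) -> bool:
    """Return True if x lies within the range under the given bound settings."""
    above = x >= lower if include_lower else x > lower
    below = x <= upper if include_upper else x < upper
    return above and below

# Hypothetical value object; "from" and "to" are assumed key names.
range_value = {"from": 2.0, "to": 3.0, "include_lower": True, "include_upper": False}

hit = in_range(2.0, range_value["from"], range_value["to"],
               include_lower=range_value["include_lower"],
               include_upper=range_value["include_upper"])
print(hit)
```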

    Refer to the Examples section for more examples.

    Exists Operator

    The exists operator is a logical operator that checks whether a given field contains any value. To be deemed non-existent, the value must be null or []. The following values indicate that the field does exist:

    The operator doesn't require a value.

    Date Math Expressions

    Comparison and range operators support date math expressions. An expression starts with an "anchor" date, which can be either now or a date string (in the applicable format) ending with ||. The anchor can be followed by a math expression using + and -, e.g. "2020-06-01||-12M".
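
    To make the notation concrete, the sketch below evaluates a small subset of such expressions locally. It is an illustration, not the service's parser: only M (months) appears in the text above, so the other unit letters (y, w, d) are assumptions, as is the clamping of the day at month ends.

```python
import calendar
import re
from datetime import date, timedelta
from typing import Optional

_TERM = re.compile(r"([+-])(\d+)([yMwd])")  # unit letters other than M are assumed

def _shift_months(day: date, months: int) -> date:
    index = day.year * 12 + (day.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(day.day, last_day))  # clamp, e.g. Jan 31 + 1M

def evaluate(expression: str, today: Optional[date] = None) -> date:
    """Evaluate 'now<math>' or '<ISO date>||<math>' to a calendar date."""
    if expression.startswith("now"):
        current, math = today or date.today(), expression[3:]
    else:
        anchor, _, math = expression.partition("||")
        current = date.fromisoformat(anchor)
    if not re.fullmatch(r"(?:[+-]\d+[yMwd])*", math):
        raise ValueError(f"unsupported date math: {math!r}")
    for sign, amount, unit in _TERM.findall(math):
        n = int(amount) if sign == "+" else -int(amount)
        if unit == "y":
            current = _shift_months(current, 12 * n)
        elif unit == "M":
            current = _shift_months(current, n)
        else:  # weeks or days
            current = current + timedelta(days=n * (7 if unit == "w" else 1))
    return current

print(evaluate("2020-06-01||-12M"))
```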

    The units supported are:

    Search Attributes

    The attributes available for search include the annotations described by mmCIF dictionary, annotations coming from external resources and attributes added by RCSB PDB. Both internal additions to the mmCIF dictionary and external resources annotations are prefixed with rcsb_.

    Refer to the Search Attributes page for a full list of the attributes that are available for text search.

    Search Results

    The HTTP status code 200 (OK) indicates that the Search API request has been processed successfully and that the server returns search results data. The response data is formatted as JSON, and its structure is determined by parameters in the query. Query parameters can be used to structure the result set in the following ways:

    Response Body

    The search response body provides details about the search execution itself as well as an array of the individual search hits. The following information is available in the search results response body:

    Name Description
    query_id Required. Unique query ID assigned to the request or passed as a query parameter.
    result_type Required. Specifies the granularity of the returned identifiers requested in the query. See Return Type.
    total_count Required. The total number of matched identifiers.
    explain_meta_data Required. Contains details on the query execution time (in milliseconds).
    result_set Optional. Search results set is returned as PDB identifiers and accompanying metadata.
    drilldown Optional. Drilldown array contains search facets for requested attributes.

    All responses have this general structure (here result_set and drilldown may or may not be included depending on the query parameters):

    Results Set

    Results set is an array of objects representing search hits. Each hit contains the matching identifier, score, and metadata produced by search services.

    Result Identifiers

    While a search query might include a large number of attributes, only the matching PDB identifiers, representing the desired level of granularity, are included in the result set. The following notation is used for PDB identifiers:

    Relevancy Score

    The final relevancy score is calculated as a weighted sum of the normalized scores produced by the different search services. By default, scores from all services are weighted equally. See the Scoring Strategy section for more details on how to configure scoring. The higher the score, the more relevant the hit.

    Service Metadata

    Different search services produce different metadata and use different scoring metrics. This metadata and raw scores are reported as described below:

    Name Description
    node_id Required. Distinct numeric ID is assigned to results produced by each search service.
    original_score Required. The original (raw) score produced by a search service chosen as relevance score for this service. For example, the bit score of the alignment is chosen as raw relevance score for a sequence search service.
    norm_score Required. The original score transformed onto a scale between 0 and 1 using min-max normalization (higher means more significant).
    match_context Optional. Additional metadata produced by search services. The match context is included only for select return types. For example, if a sequence search is performed and polymer_entity is specified as the return type, the results include match_context with additional metadata such as sequence identity, E-value, bit score, and the residue boundary positions of the matching sequence. The match_context is not included if the same search is performed with the return type set to entry or assembly.
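
    Min-max normalization, as mentioned for norm_score, rescales raw scores so that the lowest becomes 0 and the highest becomes 1. The sketch below shows the standard formula on invented bit scores; how the service treats a result list with identical scores is not documented here, so the fallback in the code is an assumption.

```python
from typing import List, Sequence

def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores linearly onto [0, 1]: (s - min) / (max - min)."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # Assumption: when all scores are equal, treat every hit as top-ranked.
        return [1.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]

# Invented raw scores for three hits.
norm = min_max_normalize([50.0, 125.0, 200.0])
print(norm)
```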

    The following snippet shows an example of search results for a query that combines 4 different search services. Here, the search results set contains one search hit at the granularity of PDB entry:

    Empty Results

    The HTTP status code 204 (No Content) indicates that the Search API request has been processed successfully but no search hits were found.
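
    Client code therefore has to treat 204 as a valid, empty answer rather than trying to parse a body. A minimal sketch follows, assuming each hit in result_set stores its identifier under an "identifier" key (the key name is not given in this guide); the function takes the status code and body from whatever HTTP client is used.

```python
import json
from typing import List

def extract_identifiers(status_code: int, body: str) -> List[str]:
    """Return the matched identifiers, or an empty list for 204 (No Content)."""
    if status_code == 204:
        return []  # successful request, no hits, no JSON body to parse
    if status_code != 200:
        raise RuntimeError(f"search request failed with HTTP status {status_code}")
    data = json.loads(body)
    # "identifier" is an assumed key name for the ID of each hit.
    return [hit["identifier"] for hit in data.get("result_set", [])]

sample_body = '{"total_count": 1, "result_set": [{"identifier": "4HHB", "score": 1.0}]}'
print(extract_identifiers(200, sample_body))
print(extract_identifiers(204, ""))
```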

    API Clients

    Python

    The rcsbsearch package developed by Spencer Bliven provides a Python interface to the RCSB Search API. You can use it to fetch lists of PDB IDs corresponding to advanced query searches.

    Examples

    This section demonstrates how to use the RCSB PDB Search API to perform complex searches.

    Biological Assembly Search

    This query finds symmetric dimers having a twofold rotation with the DNA-binding domain of a heat-shock transcription factor.

    X-Ray Structures Search

    This query finds PDB structures of viral thymidine kinase in complex with substrates or inhibitors, determined by X-ray crystallography at a resolution better than 2.5 Å.

    Protein Sequence Search

    In this example, using sequence search, we find macromolecular PDB entities that share 90% sequence identity with GTPase HRas protein from Gallus gallus (Chicken).

    3D-shape Search

    This example demonstrates how structure search can be used to find PDB structures of calmodulin with conformational changes upon Ca2+ binding. Calmodulin (CaM) protein has two homologous globular domains connected by a flexible linker. Ca2+ binding to each globular domain causes a change from a “closed” to an “open” conformation. This query finds calmodulin structures in “open” conformation.

    As a structure query input parameter we will use the crystal structure of Ca2+-loaded calmodulin (PDB entry 1CLL). This query is combined with the text search for CA chemical component ID. Note: if you leave out the query clause matching Ca2+ ions, you will also get calmodulin structures in complex with other metals (e.g. strontium in 4BW7).

    Free Ligand Search

    Ligands are considered “free ligands” when they interact non-covalently with macromolecules. This example shows how to find non-polymeric entities of the ATP molecule that occur as “free ligands”.

    Sequence Motif Search

    A sequence motif search finds macromolecular PDB entities that contain a specific sequence motif. This example retrieves occurrences of the His2/Cys2 Zinc Finger DNA-binding domain as represented by its PROSITE signature.

    Chemical Similarity Search

    This example demonstrates how to find non-polymeric entities chemically similar to Tylenol, defined by its InChI string. Note that the parameter match_type="graph-strict" does not imply an exact structure match: the result set contains acetaminophen molecules (TYL) together with methoxy (T9V) and ethoxy (N4E) analogs.

    Search by UniProt Accession

    This example shows how to search for PDB entities using an associated UniProt accession code.

    Structural Motif Search

    A structural motif search finds macromolecular PDB assemblies that contain a small number of residues in a specific geometric arrangement (e.g. residues that constitute the catalytic center or a binding site). This example retrieves occurrences of the enolase superfamily, a group of proteins diverse in sequence and structure that are all capable of abstracting a proton from a carboxylic acid. Position-specific exchanges are crucial to represent this superfamily accurately.

    Combining Search Services

    This example shows how to compose text, sequence, structure, and chemical queries employing the Boolean operator AND. The search yields structures (entries) matching all criteria, including co-crystal structures with the desired bound inhibitor, matching the SMILES string for a small-molecule inhibitor designated 7J (QYS).

    Sequence Cluster Statistics

    This example shows how to get the number of distinct protein sequences in the PDB archive.

    Newly Released Structures

    This example shows how to get a list of all PDB IDs for this week's newly released structures.

    Acknowledgements

    To cite this service, please reference:

    Related publications:

    Contact Us

    Contact info@rcsb.org with questions or feedback about this service.
