genome_metadata
ensembl.io.genomio.genome_metadata
Genome metadata handling module.
PROVIDER_DATA = {'GenBank': {'assembly': {'provider_name': 'GenBank', 'provider_url': 'https://www.ncbi.nlm.nih.gov/datasets/genome'}, 'annotation': {'provider_name': 'GenBank', 'provider_url': 'https://www.ncbi.nlm.nih.gov/datasets/genome'}}, 'RefSeq': {'assembly': {'provider_name': 'RefSeq', 'provider_url': 'https://www.ncbi.nlm.nih.gov/datasets/genome'}, 'annotation': {'provider_name': 'RefSeq', 'provider_url': 'https://www.ncbi.nlm.nih.gov/datasets/genome'}}}
module-attribute
MetadataError
Bases: Exception
When a metadata value is not expected.
Source code in src/python/ensembl/io/genomio/genome_metadata/prepare.py
MissingNodeError
Bases: Exception
When a taxon XML node cannot be found.
Source code in src/python/ensembl/io/genomio/genome_metadata/prepare.py
add_assembly_version(genome_data)
Adds version number to the genome's assembly information if one is not present already.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genome_data | Dict | Genome information of assembly, accession and annotation. | required |
Source code in src/python/ensembl/io/genomio/genome_metadata/prepare.py
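The description above does not say where the version number comes from. A minimal sketch, assuming it is taken from the numeric suffix of an INSDC/RefSeq assembly accession; the helper name `add_version_from_accession` and the dictionary layout are hypothetical, not the module's own code:

```python
import re
from typing import Any, Dict


def add_version_from_accession(genome_data: Dict[str, Any]) -> None:
    """Set assembly["version"] from the accession suffix if it is missing (hypothetical helper)."""
    assembly = genome_data["assembly"]
    if "version" in assembly:
        return  # A version is already present: leave it untouched
    # Assumed accession shape: "GCA_" or "GCF_", digits, then ".<version>"
    match = re.match(r"^GC[AF]_\d+\.(\d+)$", assembly.get("accession", ""))
    if match:
        assembly["version"] = int(match.group(1))


genome = {"assembly": {"accession": "GCA_000001405.29"}}
add_version_from_accession(genome)
print(genome["assembly"]["version"])
```

An existing `"version"` value is never overwritten, which mirrors the "if one is not present already" condition.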
add_genebuild_metadata(genome_data)
Adds genebuild metadata to genome information if not present already.
The default convention is to use the current date as "version" and "start_date".
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genome_data | Dict | Genome information of assembly, accession and annotation. | required |
Source code in src/python/ensembl/io/genomio/genome_metadata/prepare.py
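The current-date convention can be sketched as follows. This is an illustration only: the helper name `add_genebuild_defaults`, the `"genebuild"` key and the ISO 8601 date format are assumptions rather than details confirmed by this page.

```python
import datetime
from typing import Any, Dict


def add_genebuild_defaults(genome_data: Dict[str, Any]) -> None:
    """Fill in genebuild "version" and "start_date" with today's date when absent (hypothetical helper)."""
    genebuild = genome_data.setdefault("genebuild", {})
    today = datetime.date.today().isoformat()  # Assumed format: YYYY-MM-DD
    # setdefault only writes the value when the key is missing
    genebuild.setdefault("version", today)
    genebuild.setdefault("start_date", today)


genome = {"genebuild": {"version": "2020-01-01"}}
add_genebuild_defaults(genome)
print(genome["genebuild"])
```

In this sketch the existing `"version"` is kept and only `"start_date"` is filled in.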
add_provider(genome_metadata, ncbi_data)
Updates the genome metadata adding provider information for assembly and gene models.
Assembly provider metadata will only be added if it is missing, i.e. neither "provider_name" nor "provider_url" is present. The gene model metadata will only be added if gff3_file is provided.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genome_metadata | | Genome information of assembly, accession and annotation. | required |
ncbi_data | Dict | Report data from NCBI datasets. | required |
Raises:
Type | Description |
---|---|
MetadataError | If the accession format in the genome metadata does not match a known provider. |
Source code in src/python/ensembl/io/genomio/genome_metadata/prepare.py
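A minimal sketch of the assembly half of this behaviour, assuming the provider is inferred from the accession prefix (`GCA_` for GenBank, `GCF_` for RefSeq). The names `PROVIDERS`, `UnknownProviderError` and `add_assembly_provider` are hypothetical stand-ins; the provider values mirror PROVIDER_DATA above.

```python
import re
from typing import Any, Dict

NCBI_URL = "https://www.ncbi.nlm.nih.gov/datasets/genome"
# Hypothetical lookup keyed by accession prefix
PROVIDERS = {
    "GCA": {"provider_name": "GenBank", "provider_url": NCBI_URL},
    "GCF": {"provider_name": "RefSeq", "provider_url": NCBI_URL},
}


class UnknownProviderError(Exception):
    """Stand-in for MetadataError in this sketch."""


def add_assembly_provider(genome_metadata: Dict[str, Any]) -> None:
    """Add provider fields to the assembly section only when both are absent."""
    assembly = genome_metadata["assembly"]
    if "provider_name" in assembly or "provider_url" in assembly:
        return  # Provider information already present
    match = re.match(r"^(GC[AF])_\d+", assembly["accession"])
    if not match:
        raise UnknownProviderError(f"Unrecognised accession: {assembly['accession']}")
    assembly.update(PROVIDERS[match.group(1)])


genome = {"assembly": {"accession": "GCF_000001405.40"}}
add_assembly_provider(genome)
print(genome["assembly"]["provider_name"])
```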
add_species_metadata(genome_metadata, ncbi_data)
Adds taxonomy ID, scientific name and strain (if present) from the NCBI dataset report.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genome_metadata | Dict | Genome information of assembly, accession and annotation. | required |
ncbi_data | Dict | Report data from NCBI datasets. | required |
Source code in src/python/ensembl/io/genomio/genome_metadata/prepare.py
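As an illustration of the copy step, the sketch below pulls the three fields out of a report dictionary. Every key name here is an assumption: the report keys (`organism`, `tax_id`, `organism_name`, `infraspecific_names`) and the target keys under `"species"` are guesses at a plausible layout, and `copy_species_fields` is a hypothetical helper.

```python
from typing import Any, Dict


def copy_species_fields(genome_metadata: Dict[str, Any], ncbi_data: Dict[str, Any]) -> None:
    """Copy taxonomy ID, scientific name and optional strain into the metadata (hypothetical helper)."""
    organism = ncbi_data.get("organism", {})
    species = genome_metadata.setdefault("species", {})
    if "tax_id" in organism:
        species["taxonomy_id"] = organism["tax_id"]
    if "organism_name" in organism:
        species["scientific_name"] = organism["organism_name"]
    # The strain is optional, so it is only copied when the report provides one
    strain = organism.get("infraspecific_names", {}).get("strain")
    if strain:
        species["strain"] = strain


report = {"organism": {"tax_id": 5833, "organism_name": "Plasmodium falciparum"}}
metadata: Dict[str, Any] = {}
copy_species_fields(metadata, report)
print(metadata["species"])
```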
amend_genome_metadata(genome_infile, genome_outfile, report_file=None, genbank_file=None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genome_infile | PathLike | Genome metadata following the | required |
genome_outfile | PathLike | Amended genome metadata file. | required |
report_file | Optional[PathLike] | INSDC/RefSeq sequences report file. | None |
genbank_file | Optional[PathLike] | INSDC/RefSeq GBFF file. | None |
Source code in src/python/ensembl/io/genomio/genome_metadata/extend.py
check_assembly_version(genome_metadata)
Updates the assembly version of the genome metadata provided.
If the version meta key is not an integer or is not available, the assembly accession's version will be used instead.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genome_metadata | dict[str, Any] | Nested metadata key values from the core metadata table. | required |
Raises:
Type | Description |
---|---|
ValueError | If both |
Source code in src/python/ensembl/io/genomio/genome_metadata/dump.py
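The fallback described above can be sketched like this. It is a simplified illustration: the helper name `resolve_assembly_version` is hypothetical, and raising `ValueError` when no integer can be derived from either source is an assumption of the sketch.

```python
from typing import Any, Dict


def resolve_assembly_version(genome_metadata: Dict[str, Any]) -> None:
    """Ensure assembly["version"] is an integer, falling back to the accession suffix."""
    assembly = genome_metadata["assembly"]
    try:
        assembly["version"] = int(assembly.get("version"))
        return
    except (TypeError, ValueError):
        pass  # Missing or non-integer version: try the accession instead
    prefix, dot, suffix = assembly.get("accession", "").rpartition(".")
    if not dot or not suffix.isdigit():
        raise ValueError("No usable assembly version or accession version")
    assembly["version"] = int(suffix)


genome = {"assembly": {"version": "v2a", "accession": "GCA_000002765.3"}}
resolve_assembly_version(genome)
print(genome["assembly"]["version"])
```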
check_genebuild_version(genome_metadata)
Updates the genebuild version (if not present) from the genebuild ID, removing the latter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genome_metadata | dict[str, Any] | Nested metadata key values from the core metadata table. | required |
Raises:
Type | Description |
---|---|
ValueError | If there is no genebuild version or ID available. |
Source code in src/python/ensembl/io/genomio/genome_metadata/dump.py
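A minimal sketch of this promotion of the ID to a version. The helper name `resolve_genebuild_version` is hypothetical, and two choices are assumptions of the sketch: metadata without a `"genebuild"` section is left alone, and the ID is dropped even when a version already exists.

```python
from typing import Any, Dict


def resolve_genebuild_version(genome_metadata: Dict[str, Any]) -> None:
    """Use genebuild["id"] as the version when no version exists, then drop the ID."""
    genebuild = genome_metadata.get("genebuild")
    if genebuild is None:
        return  # Nothing to check (assumption of this sketch)
    if "version" not in genebuild:
        if "id" not in genebuild:
            raise ValueError("No genebuild 'version' or 'id' found")
        genebuild["version"] = genebuild["id"]
    # The ID is redundant once a version is available
    genebuild.pop("id", None)


genome = {"genebuild": {"id": "2023-04"}}
resolve_genebuild_version(genome)
print(genome["genebuild"])
```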
filter_genome_meta(genome_metadata, metafilter, meta_update)
Returns a filtered metadata dictionary with only the predefined keys in METADATA_FILTER.
Also converts to expected data types (to follow the genome JSON schema).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genome_metadata | dict[str, Any] | Nested metadata key values from the core metadata table. | required |
metafilter | dict | None | Input JSON containing subset of meta table values to filter on. | required |
meta_update | bool | Deactivates additional meta updating. | required |
Source code in src/python/ensembl/io/genomio/genome_metadata/dump.py
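Filtering a nested dictionary against a whitelist while converting types might look like the sketch below. The `KEEP` mapping is a made-up stand-in for METADATA_FILTER, whose real contents are not shown on this page, and `filter_nested` is a hypothetical helper.

```python
from typing import Any, Callable, Dict

# Hypothetical whitelist: section -> key -> type conversion
KEEP: Dict[str, Dict[str, Callable[[Any], Any]]] = {
    "assembly": {"accession": str, "version": int},
    "species": {"taxonomy_id": int, "scientific_name": str},
}


def filter_nested(metadata: Dict[str, Any], keep: Dict[str, Dict[str, Callable[[Any], Any]]]) -> Dict[str, Any]:
    """Return only the whitelisted keys, each converted to its expected type."""
    filtered: Dict[str, Any] = {}
    for section, fields in keep.items():
        for key, cast in fields.items():
            if key in metadata.get(section, {}):
                filtered.setdefault(section, {})[key] = cast(metadata[section][key])
    return filtered


raw = {
    "assembly": {"accession": "GCA_000002765.3", "version": "3", "name": "ASM276v2"},
    "schema": {"version": "110"},
}
print(filter_nested(raw, KEEP))
```

Values read from a database table are typically strings, so the conversion step is what turns `"3"` into the integer a JSON schema would expect.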
get_additions(report_path, gbff_path)
Returns all seq_regions that are mentioned in the report but that are not in the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
report_path | PathLike | Path to the report file. | required |
gbff_path | Optional[PathLike] | Path to the GBFF file. | required |
Source code in src/python/ensembl/io/genomio/genome_metadata/extend.py
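Once both sets of names have been read, the comparison itself reduces to a difference. A minimal sketch on in-memory lists, with the hypothetical helper `missing_regions` standing in for the file-based function and made-up accessions as input:

```python
from typing import Iterable, List


def missing_regions(report_names: Iterable[str], gbff_names: Iterable[str]) -> List[str]:
    """Return report names absent from the GBFF data, keeping report order."""
    known = set(gbff_names)  # Set membership keeps the lookup fast
    return [name for name in report_names if name not in known]


report = ["CM000001.1", "CM000002.1", "KI270001.1"]
gbff = ["CM000001.1", "CM000002.1"]
print(missing_regions(report, gbff))
```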
get_gbff_regions(gbff_path)
Returns the seq_region data from a GBFF file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
gbff_path | Optional[PathLike] | GBFF file path to use. | required |
Source code in src/python/ensembl/io/genomio/genome_metadata/extend.py
get_genome_metadata(session, db_name)
Returns the meta table content from the core database in a nested dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
session | Session | Session for the current core. | required |
db_name | str | None | Target database name. | required |
Source code in src/python/ensembl/io/genomio/genome_metadata/dump.py
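To illustrate what a nested dictionary built from meta table content could look like, the sketch below splits dotted meta keys into a section and a field. It works on plain `(meta_key, meta_value)` tuples instead of a database session, `nest_meta` is a hypothetical helper, and repeated keys simply keep the last value, which may differ from the real function.

```python
from typing import Any, Dict, Iterable, Tuple


def nest_meta(rows: Iterable[Tuple[str, str]]) -> Dict[str, Any]:
    """Turn ("section.field", value) pairs into a two-level dictionary."""
    nested: Dict[str, Any] = {}
    for meta_key, meta_value in rows:
        section, dot, field = meta_key.partition(".")
        if dot:
            nested.setdefault(section, {})[field] = meta_value
        else:
            nested[meta_key] = meta_value  # Keys without a dot stay at the top level
    return nested


rows = [
    ("assembly.accession", "GCA_000002765.3"),
    ("assembly.name", "ASM276v2"),
    ("species.taxonomy_id", "36329"),
    ("schema_version", "110"),
]
print(nest_meta(rows))
```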
get_report_regions_names(report_path)
Returns a list of GenBank-RefSeq seq_region names from the assembly report file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
report_path | PathLike | Path to the assembly report file from INSDC/RefSeq. | required |
Source code in src/python/ensembl/io/genomio/genome_metadata/extend.py
metadata_dump_setup(db_url, input_filter, meta_update, append_db)
Sets up the main stages of the genome metadata dump from the user-provided input arguments.
Parameters:
Name | Description |
---|---|
db_url | Target core database URL. |
input_filter | Input JSON containing subset of meta table values to filter on. |
meta_update | Deactivates additional meta updating. |
append_db | Appends the target core database name to the output JSON. |
Source code in src/python/ensembl/io/genomio/genome_metadata/dump.py
prepare_genome_metadata(input_file, output_file, ncbi_meta)
Updates the genome metadata JSON file with additional information.
In particular, more information is added about the provider, the assembly and its gene build version, and the taxonomy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_file | PathLike | Path to JSON file with genome metadata. | required |
output_file | PathLike | Output directory where to generate the final | required |
ncbi_meta | PathLike | JSON file from NCBI datasets. | required |
Source code in src/python/ensembl/io/genomio/genome_metadata/prepare.py