Data Migration¶
Prerequisites
- Access to a CDF Project.
- Know how to use a terminal, so you can run
pygen
from the command line to generate the SDK. - Knowledge of your the data and data models.
Introduction to Problem¶
In the development of a solution, it is common that you move from one data model to another. Typically, the physicaly storage of data, the containers, stays the same, or are extended, and only the views of a data model change. In this case, there is no need for a data migration and all you do is update the solution code for the new model.
If you, however, need to change how the data is stored in the containers, you need to move the data from one model to the other. A common to do this is to use CDF Transformations. An alternative
ways is to use pygen
for data migration.
Advantage of CDF Transformations¶
- Can handle large volume of data, millions of instances.
- You can write the transformations in the UI using SQL.
Advantages of using pygen
¶
- Edges are automatically created.
- Can handle medium volume of data, hundreds of thousands of instances.
- Support doing very custom migration (what ever you can do in Python code).
Why would you chose pygen
over CDF Transformations? If you are more comfortable with Python
than writing SQL
, you have data models with a of one-to-many edges, the number of nodes and edges (instances) is in the order of hundreds of thousands.
Use Case¶
As we see in the illustrations above (click to enlarge), we have many more edges in the destination than the source model,
whiche means this is a good use case for using pygen
for the data migration.
In this guide, we will not do the entire migration, but instead focus on the APM_Template
to Template
migration.
Generating SDKs¶
For demo purposes we will generate the SDKs in this notebook, however, depending on your use case it might be useful to use the pygen
CLI to generate the SDKs.
Migration scrips are often a one of, meaning you are not expected to regenerat the SDK as the model changes, thus if you generate it locally, you can do changes to the generated code as you are not expecting this to change later. This can be useful if you want to do some customization to the generated code.
Note that we have set up a config.toml
with credentials to connect to CDF.
from cognite.pygen import generate_sdk_notebook, load_cognite_client_from_toml
client = load_cognite_client_from_toml("config.toml")
source = generate_sdk_notebook(("APM_AppData_4", "APM_AppData_4", "7"), client)
Successfully retrieved data model(s) ('APM_AppData_4', 'APM_AppData_4', '7') Writing SDK to C:\Users\ANDERS~1\AppData\Local\Temp\pygen Done! Added C:\Users\ANDERS~1\AppData\Local\Temp\pygen to sys.path to enable import Imported apm_app_data_4.client
destination = generate_sdk_notebook(("IntegrationTestsImmutable", "ApmAppData", "v3"), client)
Successfully retrieved data model(s) ('IntegrationTestsImmutable', 'ApmAppData', 'v3') Writing SDK to C:\Users\ANDERS~1\AppData\Local\Temp\pygen Done! C:\Users\ANDERS~1\AppData\Local\Temp\pygen already in sys.path Imported apm_app_data.client
Retrieving Data¶
It is common to store the nodes and edges, i.e., instances, in a differen space than the data model. To only get instances from an instance space, sourceSpace
,
we filter for it. In addition, it important that we retrieve all edges so we can connect the templates
with the templates_items
.
templates = source.apm_template.list(space="sourceSpace", retrieve_edges=True, limit=-1)
template_items = source.apm_template_item.list(space="sourceSpace", limit=-1)
len(templates), len(template_items)
(50, 105)
Converting the Data¶
We do the conversion in in the following steps
- We show some of the data of the source templates. This is usefult to have as a reference when writing the transformation code
- We put the
APMTemplateItem
s into a dictionary,template_items_by_external_id
so we can easily reference it. - We include a few other dictionaries,
visiblity_by_item_external_id
,schedules_by_item_external_id
,measurements_by_external_id
, andvisiblity_by_external_id
. This is to show how we can include data from different sources. - We iterate through all the
APMTemplates
and do the transformation for eachtemplate_item
(s)s andtemplate
# Importing the new destination data classes
from apm_app_data.data_classes import TemplateItemWrite, TemplateWrite, TemplateWriteList
templates.to_pandas().head(5)
external_id | assigned_to | created_by_external_id | is_archived | root_location_external_id | status | title | updated_by_external_id | template_items | space | version | last_updated_time | created_time | deleted_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 108c5106-f638-4d9c-8ef2-4b470ed94495 | None | 80d32fb4-bcbc-44ae-abed-fee53b148d82 | False | 31d9f834-a1ab-452a-b2a5-875f0e9f384a | OnHold | good | f1e7d13d-3fd2-42e4-bebc-68ad2aa905eb | [bef8179e-9e0c-44a8-8577-cf6e5ce0bd8f, 479ef47... | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
1 | 32f92434-e5fb-49c1-b4c4-1a454ed699ea | None | 6dbb225f-5059-4ff2-9543-7ab400e235e9 | False | dac4077b-61d6-445c-a4f4-f6f3247125ab | OnHold | three | df561fd7-dad6-484e-95f7-ae967fb00ed3 | [f6ce45c7-b5a6-4a3b-b549-91e4a05580a7, bbbdda6... | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
2 | 6b072855-8408-4c21-b1a3-5caf4bd22204 | None | 863a0cd1-f55e-4529-ac50-34090596a279 | True | 1b582c43-bc07-4ab7-959e-cf7e434711a0 | Completed | also | 37d17cd1-2bbe-416a-b7b1-111e44c64b36 | [8b3670e5-a6e4-4636-8066-ee54c0a52582, 29989d8... | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
3 | 95c51139-438b-4fd8-a5ce-b80dbcd77db4 | None | ac5c7d3f-564b-44d2-bdc6-6eaf23b075ce | True | 1b582c43-bc07-4ab7-959e-cf7e434711a0 | Completed | likely | caba29e9-fa98-438d-b5b7-6d43677764d3 | [cc2adb6c-34e8-4258-8634-3213c789854c, ffede5f... | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
4 | 9544b2c9-9abe-4a9b-b106-3db9510363c2 | None | b3fd00cb-8b06-4365-aac5-b0fcf6f8044c | True | 41ad2ff1-88b6-4c45-9bfd-a76d0c0ba9c6 | Cancelled | thing | ac5c7d3f-564b-44d2-bdc6-6eaf23b075ce | [502250e9-237b-4c87-85a3-ee100601bdba, 1881a55... | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
template_items.to_pandas().sample(5)
external_id | asset_external_id | created_by_external_id | labels | order | template_external_id | title | updated_by_external_id | space | version | last_updated_time | created_time | deleted_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
68 | 79bd1733-afa5-4b6c-9d2f-e8458f042e7e | d810d514-7fbf-4e2d-a8f8-cfc4770b4539 | 09e50d8e-9fe8-4340-916a-349e722eaf73 | [#f495e3, #7fe0a9, #dd6388] | 9134 | 78bbec25-03a3-4f4d-bfb5-7e047717d27c | car | None | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
83 | 156fc529-b365-4341-8d7b-c87cb155b50c | 72786d8a-bbb4-4e83-8cf0-bca2fc3ed796 | 8a717d87-09c0-4b1c-abdb-721e1ff5dfa5 | [] | 4868 | bb479412-b9bf-44c6-a067-58df6ba233c3 | decision | None | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
79 | 11ec48cc-b2d2-43dd-9f43-60f3f8b3b1b4 | d810d514-7fbf-4e2d-a8f8-cfc4770b4539 | bb5f75ed-e6ce-4269-ad5c-3ef4da6598b0 | [#dd6388] | 5920 | acd07034-830d-420f-b163-080145aa7844 | hard | None | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
0 | bef8179e-9e0c-44a8-8577-cf6e5ce0bd8f | 4dc95bc8-49e8-4ee8-a5f5-c7aa47f139de | 3caa43df-2fc3-4bba-b267-6e57c2ab826a | [#f495e3, #8516fc] | 4071 | 108c5106-f638-4d9c-8ef2-4b470ed94495 | half | None | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
2 | a92cfb02-09cd-4b47-b870-a7aded8aed07 | 72786d8a-bbb4-4e83-8cf0-bca2fc3ed796 | 2b0f4e82-f515-42ee-9545-03e08799c7c7 | [#7fe0a9, #dd6388] | 5920 | 108c5106-f638-4d9c-8ef2-4b470ed94495 | behind | None | sourceSpace | 1 | 2023-11-05 10:53:03.351000+00:00 | 2023-11-05 10:53:03.351000+00:00 | None |
template_item_by_external_id = {item.external_id: item for item in template_items}
# The visiblity, schedules, and measurements are changes to the new template,
# and thus the source of these needs to come from elsewhere. The point of including
# these data structers is to show how
visiblity_by_item_external_id: dict[str, str] = {}
schedules_by_item_external_id: dict[str, list[str]] = {}
measurements_by_external_id: dict[str, list[str]] = {}
visiblity_by_external_id: dict[str, str] = {}
destination_templates = TemplateWriteList([])
for template in templates:
destination_items = []
for item_external_id in template.template_items or []:
item = template_item_by_external_id[item_external_id]
destination_item = TemplateItemWrite(
external_id=item.external_id,
space="destinationSpace",
title=item.title,
labels=item.labels,
visibility=visiblity_by_item_external_id.get(item.external_id),
created_by=item.created_by_external_id,
updated_by=item.updated_by_external_id,
is_archived=template.is_archived,
order=item.order,
asset=item.asset_external_id,
schedules=schedules_by_item_external_id.get(item.external_id),
measurements=measurements_by_external_id.get(item.external_id),
)
destination_items.append(destination_item)
destination_template = TemplateWrite(
external_id=template.external_id,
space="destinationSpace",
title=template.title,
visibility=visiblity_by_external_id.get(template.external_id),
created_by=template.created_by_external_id,
updated_by=template.updated_by_external_id,
is_archived=template.is_archived,
status=template.status,
labels=list({label for item in destination_items for label in item.labels}),
root_location=template.root_location_external_id,
assigned_to=template.assigned_to,
template_items=destination_items,
)
destination_templates.append(destination_template)
How pygen
Creates Edges¶
pygen
create edges by looking at the relationships in the generated data classes.
For example, in the current example we have the TemplateItem
node which have the fields schedules
and measurements
. When writing to CDF pygen
iterates through all the TemplateItems
and check these fields, and if there is a string
it is assumed to be the external_id
of the end node, and pygen
creates an edge to it. If it is an object
that object is assumed to be the end node and pygen
creates an edge to it.
We can inspect what pygen is doing by calling the .to_instances_write()
method and check the results.
destination_templates[0]
value | |
---|---|
space | destinationSpace |
external_id | 108c5106-f638-4d9c-8ef2-4b470ed94495 |
existing_version | None |
title | good |
description | None |
labels | [#8516fc, #7fe0a9, #dd6388, #f4d89f, #f495e3] |
visibility | None |
created_by | 80d32fb4-bcbc-44ae-abed-fee53b148d82 |
updated_by | f1e7d13d-3fd2-42e4-bebc-68ad2aa905eb |
is_archived | False |
status | OnHold |
root_location | 31d9f834-a1ab-452a-b2a5-875f0e9f384a |
assigned_to | None |
template_items | [{'space': 'destinationSpace', 'external_id': ... |
len(destination_templates[0].template_items)
4
instances = destination_templates[0].to_instances_write()
instances.nodes
instance_type | space | external_id | sources | |
---|---|---|---|---|
0 | node | destinationSpace | 108c5106-f638-4d9c-8ef2-4b470ed94495 | [{'properties': {'title': 'good', 'labels': ['... |
1 | node | destinationSpace | bef8179e-9e0c-44a8-8577-cf6e5ce0bd8f | [{'properties': {'title': 'half', 'labels': ['... |
2 | node | destinationSpace | 479ef472-a2c1-4f03-b38d-7e29e4f90227 | [{'properties': {'title': 'agreement', 'labels... |
3 | node | destinationSpace | a92cfb02-09cd-4b47-b870-a7aded8aed07 | [{'properties': {'title': 'behind', 'labels': ... |
4 | node | destinationSpace | 54e8a840-624d-41cb-91a3-1b978e292565 | [{'properties': {'title': 'wall', 'labels': ['... |
instances.edges
instance_type | space | external_id | type | start_node | end_node | |
---|---|---|---|---|---|---|
0 | edge | destinationSpace | 108c5106-f638-4d9c-8ef2-4b470ed94495:bef8179e-... | {'space': 'IntegrationTestsImmutable', 'extern... | {'space': 'destinationSpace', 'external_id': '... | {'space': 'destinationSpace', 'external_id': '... |
1 | edge | destinationSpace | 108c5106-f638-4d9c-8ef2-4b470ed94495:479ef472-... | {'space': 'IntegrationTestsImmutable', 'extern... | {'space': 'destinationSpace', 'external_id': '... | {'space': 'destinationSpace', 'external_id': '... |
2 | edge | destinationSpace | 108c5106-f638-4d9c-8ef2-4b470ed94495:a92cfb02-... | {'space': 'IntegrationTestsImmutable', 'extern... | {'space': 'destinationSpace', 'external_id': '... | {'space': 'destinationSpace', 'external_id': '... |
3 | edge | destinationSpace | 108c5106-f638-4d9c-8ef2-4b470ed94495:54e8a840-... | {'space': 'IntegrationTestsImmutable', 'extern... | {'space': 'destinationSpace', 'external_id': '... | {'space': 'destinationSpace', 'external_id': '... |
We can see here that for one Template
item with 4 TemplateItem
s pygen
creates one node for each =5
, and an edge connecting each of the TemplateItem
to Template
=4
.
all_instances = destination_templates.to_instances_write()
By doing the .to_instances_write
call on the all the destination Templates
we can see how
many nodes and edges we are expecting to create.
len(all_instances.nodes), len(all_instances.edges)
(155, 105)
Note that for the use case we are using in this example one of the changes to the new model is that a lot of the relationships are made explicit. For example, in the old template we have root_location_external_id
, while now we have direct link root_location
. Due to the way pygen
automatically creates edges this becomes an easier migration as if pygen
knows that there should be an edge and have a string
it is assumed to be an external_id
.
Writing Data¶
We are no ready to write the new Template
data to the new model.
We use the destination
client and call .upsert
.
results = destination.upsert(destination_templates)
len(results.nodes), len(results.edges)
(155, 105)
Note that it is not necessary to do a call to destination.template_items.apply(...)
as the above call iterates recursively through all linked objects and finds the nodes and edges.
We can verify that the data has been uploaded correctly by calling the .list()
method on the destination client.
created = destination.template.list(space="destinationSpace", limit=-1)
len(created)
50
created.to_pandas().head()
external_id | title | description | labels | visibility | created_by | updated_by | is_archived | status | root_location | assigned_to | template_items | space | version | last_updated_time | created_time | deleted_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 108c5106-f638-4d9c-8ef2-4b470ed94495 | good | None | [#8516fc, #7fe0a9, #dd6388, #f4d89f, #f495e3] | None | 80d32fb4-bcbc-44ae-abed-fee53b148d82 | f1e7d13d-3fd2-42e4-bebc-68ad2aa905eb | False | OnHold | 31d9f834-a1ab-452a-b2a5-875f0e9f384a | None | [bef8179e-9e0c-44a8-8577-cf6e5ce0bd8f, 479ef47... | destinationSpace | 1 | 2023-11-05 14:57:29.950000+00:00 | 2023-11-05 14:57:29.950000+00:00 | None |
1 | 32f92434-e5fb-49c1-b4c4-1a454ed699ea | three | None | [#8516fc, #7fe0a9, #dd6388, #f4d89f, #f495e3] | None | 6dbb225f-5059-4ff2-9543-7ab400e235e9 | df561fd7-dad6-484e-95f7-ae967fb00ed3 | False | OnHold | dac4077b-61d6-445c-a4f4-f6f3247125ab | None | [f6ce45c7-b5a6-4a3b-b549-91e4a05580a7, bbbdda6... | destinationSpace | 1 | 2023-11-05 14:57:29.950000+00:00 | 2023-11-05 14:57:29.950000+00:00 | None |
2 | 6b072855-8408-4c21-b1a3-5caf4bd22204 | also | None | [#8516fc, #dd6388, #7fe0a9] | None | 863a0cd1-f55e-4529-ac50-34090596a279 | 37d17cd1-2bbe-416a-b7b1-111e44c64b36 | True | Completed | 1b582c43-bc07-4ab7-959e-cf7e434711a0 | None | [8b3670e5-a6e4-4636-8066-ee54c0a52582, 29989d8... | destinationSpace | 1 | 2023-11-05 14:57:29.950000+00:00 | 2023-11-05 14:57:29.950000+00:00 | None |
3 | 95c51139-438b-4fd8-a5ce-b80dbcd77db4 | likely | None | [#8516fc, #f4d89f, #7fe0a9] | None | ac5c7d3f-564b-44d2-bdc6-6eaf23b075ce | caba29e9-fa98-438d-b5b7-6d43677764d3 | True | Completed | 1b582c43-bc07-4ab7-959e-cf7e434711a0 | None | [cc2adb6c-34e8-4258-8634-3213c789854c, ffede5f... | destinationSpace | 1 | 2023-11-05 14:57:29.950000+00:00 | 2023-11-05 14:57:29.950000+00:00 | None |
4 | 9544b2c9-9abe-4a9b-b106-3db9510363c2 | thing | None | [#dd6388, #7fe0a9] | None | b3fd00cb-8b06-4365-aac5-b0fcf6f8044c | ac5c7d3f-564b-44d2-bdc6-6eaf23b075ce | True | Cancelled | 41ad2ff1-88b6-4c45-9bfd-a76d0c0ba9c6 | None | [502250e9-237b-4c87-85a3-ee100601bdba, 1881a55... | destinationSpace | 1 | 2023-11-05 14:57:29.950000+00:00 | 2023-11-05 14:57:29.950000+00:00 | None |