Data Population¶
Prerequisites
- Access to a CDF Project.
- Know how to use a terminal, so you can run pygen from the command line to generate the SDK.
- Knowledge of your data and data model.
Introduction to Problem¶
pygen can be used to ingest data into an existing data model. It is well suited to source data that is nested and comes in a format such as JSON.
Before you can ingest data you need the following:
- A data model deployed to CDF.
- An SDK generated for it.
In this guide, we will use some windmill data as an example. We have already deployed a model and generated an SDK for it.
The SDK was generated with the following config from pyproject.toml:
[tool.pygen]
data_models = [
["power-models", "Windmill", "1"],
]
top_level_package = "windmill"
client_name = "WindmillClient"
The model is illustrated in Cognite Data Fusion's interface below:
First, we will inspect some of the data we have available:
from tests.constants import WindMillFiles
print(WindMillFiles.Data.wind_mill_json.read_text()[:500])
[ { "name": "hornsea_1_mill_3", "windfarm": "Hornsea 1", "capacity": 7.0, "rotor": { "rotor_speed_controller": "V52-WindTurbine.ROT", "rpm_low_speed_shaft": "V52-WindTurbine.cnt0" }, "nacelle": { "gearbox": { "displacement_x": "V52-WindTurbine.Gear_D_X", "displacement_y": "V52-WindTurbine.Gear_D_Y", "displacement_z": "V52-WindTurbine.Gear_D_Z" },
As the snippet above shows, this is nested data, which is well suited for ingestion with pygen.
External ID Hook¶
All data in CDF data models needs to have an external_id set. Often, source data does not come with an external_id, so pygen comes with a built-in hook that lets you set the external_id when you are ingesting the data. This hook is called external_id_factory, and you set it on the DomainModelWrite class imported from your generated data classes.
from cognite.pygen.utils.external_id_factories import create_incremental_factory, uuid_factory
from windmill.data_classes import DomainModelWrite
DomainModelWrite.external_id_factory = uuid_factory
The external_id_factory is a function that takes two arguments: first a type, which is the data class for the object, and then a dict with the data for that particular object. pygen comes with a few generic external ID factories you can use, see External ID factory. These can be good for testing and exploration, but we recommend that you write your own factory function for (at least) the most important classes.
In the example below, we write a factory function that sets the ID for all windmills. Looking at the data snippet above, we note that each windmill has a name from the source system, so we would like to use this as the external_id.
from windmill.data_classes import WindmillWrite
incremental_factory = create_incremental_factory()
def windmill_factory(domain_cls: type, data: dict) -> str:
    if domain_cls is WindmillWrite:
        # Windmills carry a name from the source system; use it as the ID
        return data["name"]
    else:
        # Fallback to incremental
        return incremental_factory(domain_cls, data)
# Finally, we set the new factory
DomainModelWrite.external_id_factory = windmill_factory
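When several classes have a natural key in the source data, the if/else above can grow unwieldy. The following is a minimal sketch of one way to generalize it with a lookup table, assuming only the two-argument factory signature described above. The names make_keyed_factory and uuid_fallback, and the stand-in classes, are hypothetical and not part of pygen.

```python
import uuid
from typing import Callable

# Hypothetical stand-ins for generated write classes, so that the sketch
# runs without a generated SDK.
class WindmillWrite: ...
class RotorWrite: ...

IdFactory = Callable[[type, dict], str]

def make_keyed_factory(key_fields: dict[type, str], fallback: IdFactory) -> IdFactory:
    """Build a factory that reads the ID from a configured source field."""
    def factory(domain_cls: type, data: dict) -> str:
        field = key_fields.get(domain_cls)
        value = data.get(field) if field is not None else None
        if value:
            return str(value)
        # No configured field, or the field is missing or empty in the data
        return fallback(domain_cls, data)
    return factory

def uuid_fallback(domain_cls: type, data: dict) -> str:
    # Illustrative fallback: class name plus a random UUID
    return f"{domain_cls.__name__.lower()}:{uuid.uuid4()}"

factory = make_keyed_factory({WindmillWrite: "name"}, uuid_fallback)
windmill_id = factory(WindmillWrite, {"name": "hornsea_1_mill_3"})
rotor_id = factory(RotorWrite, {"rotor_speed_controller": "V52-WindTurbine.ROT"})
```

A factory built this way could then be assigned to DomainModelWrite.external_id_factory in the same way as windmill_factory.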
Ingesting the Data¶
After we have set the external_id_factory, we are good to go. pygen generates pydantic data classes, which means we can use pydantic's built-in support for JSON validation. We note that we have a list of windmills; in pydantic we use a TypeAdapter to parse a list of objects:
from pydantic import TypeAdapter
windmills = TypeAdapter(list[WindmillWrite]).validate_json(WindMillFiles.Data.wind_mill_json.read_text())
pygen also supports pydantic v1. The equivalent of the line above for v1 is:
from pydantic import parse_raw_as
windmills = parse_raw_as(list[WindmillWrite], WindMillFiles.Data.wind_mill_json.read_text())
from windmill.data_classes import WindmillWriteList
# The WindmillWriteList has a few helper methods and nicer display than a regular list
windmills = WindmillWriteList(windmills)
windmills
 | space | external_id | blades | capacity | metmast | nacelle | name | rotor | windfarm | node_type | data_record |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | windmill-instances | hornsea_1_mill_3 | [{'space': 'windmill-instances', 'external_id'... | 7.0 | [] | {'space': 'windmill-instances', 'external_id':... | hornsea_1_mill_3 | {'space': 'windmill-instances', 'external_id':... | Hornsea 1 | None | {'existing_version': None} |
1 | windmill-instances | hornsea_1_mill_2 | [{'space': 'windmill-instances', 'external_id'... | 7.0 | [] | {'space': 'windmill-instances', 'external_id':... | hornsea_1_mill_2 | {'space': 'windmill-instances', 'external_id':... | Hornsea 1 | None | {'existing_version': None} |
2 | windmill-instances | hornsea_1_mill_1 | [{'space': 'windmill-instances', 'external_id'... | 7.0 | [] | {'space': 'windmill-instances', 'external_id':... | hornsea_1_mill_1 | {'space': 'windmill-instances', 'external_id':... | Hornsea 1 | None | {'existing_version': None} |
3 | windmill-instances | hornsea_1_mill_4 | [{'space': 'windmill-instances', 'external_id'... | 7.0 | [] | {'space': 'windmill-instances', 'external_id':... | hornsea_1_mill_4 | {'space': 'windmill-instances', 'external_id':... | Hornsea 1 | None | {'existing_version': None} |
4 | windmill-instances | hornsea_1_mill_5 | [{'space': 'windmill-instances', 'external_id'... | 7.0 | [] | {'space': 'windmill-instances', 'external_id':... | hornsea_1_mill_5 | {'space': 'windmill-instances', 'external_id':... | Hornsea 1 | None | {'existing_version': None} |
We note that the external_id field is set to the name of the windmill. If we check the other objects, we see that these get an external_id of the form class_name.lower():counter.
windmills[0].nacelle
 | value |
---|---|
space | windmill-instances |
external_id | nacellewrite:1 |
data_record | {'existing_version': None} |
node_type | None |
acc_from_back_side_x | V52-WindTurbine.Acc1N |
acc_from_back_side_y | V52-WindTurbine.Acc2N |
acc_from_back_side_z | V52-WindTurbine.Acc3N |
gearbox | {'space': 'windmill-instances', 'external_id':... |
generator | {'space': 'windmill-instances', 'external_id':... |
high_speed_shaft | {'space': 'windmill-instances', 'external_id':... |
main_shaft | {'space': 'windmill-instances', 'external_id':... |
power_inverter | {'space': 'windmill-instances', 'external_id':... |
yaw_direction | V52-WindTurbine.yaw |
yaw_error | V52-WindTurbine.YawErr |
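The class_name.lower():counter pattern seen in these fallback IDs could be produced by a factory along the following lines. This is an illustrative sketch only, not pygen's actual implementation of create_incremental_factory; the function name make_incremental_factory and the stand-in classes are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical stand-ins for generated write classes
class NacelleWrite: ...
class BladeWrite: ...

def make_incremental_factory():
    """Return a factory that keeps one counter per data class."""
    counters: dict[type, int] = defaultdict(int)

    def factory(domain_cls: type, data: dict) -> str:
        counters[domain_cls] += 1
        return f"{domain_cls.__name__.lower()}:{counters[domain_cls]}"

    return factory

demo_factory = make_incremental_factory()
ids = [
    demo_factory(NacelleWrite, {}),
    demo_factory(BladeWrite, {}),
    demo_factory(BladeWrite, {}),
]
```

IDs built from a counter depend on the order in which objects are processed, which is one reason to prefer keys from the source system for the classes that matter most.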
We can now upload this data by creating a domain client and calling its upsert method.
from windmill import WindmillClient
wind = WindmillClient.from_toml("config.toml")
result = wind.upsert(windmills)
print(f"{len(result.nodes)} nodes and {len(result.edges)} uploaded")
145 nodes and 105 uploaded
Note that the write classes generated by pygen have the method .to_instances_write(), which you can use to check which nodes and edges will be created.
We note that pygen created 145 nodes in total and 105 edges between these nodes.
The edges were of 2 different types, and the nodes were ingested into 10 different views.
instances = windmills.to_instances_write()
len(instances.nodes), len(instances.edges)
(145, 105)
unique = set(edge.type.external_id for edge in instances.edges)
len(unique), unique
(2, {'Blade.sensor_positions', 'Windmill.blades'})
unique = set([source.source for node in instances.nodes for source in node.sources])
len(unique), unique
(10, {ViewId(space='power-models', external_id='Blade', version='1'), ViewId(space='power-models', external_id='Gearbox', version='1'), ViewId(space='power-models', external_id='Generator', version='1'), ViewId(space='power-models', external_id='HighSpeedShaft', version='1'), ViewId(space='power-models', external_id='MainShaft', version='1'), ViewId(space='power-models', external_id='Nacelle', version='1'), ViewId(space='power-models', external_id='PowerInverter', version='1'), ViewId(space='power-models', external_id='Rotor', version='1'), ViewId(space='power-models', external_id='SensorPosition', version='1'), ViewId(space='power-models', external_id='Windmill', version='1')})
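Beyond the set of distinct types, it can be useful to see how many instances fall into each one. A minimal sketch using collections.Counter follows. It relies only on the attribute paths used above (edge.type.external_id and source.source on each node source); the demo object is hand-built, hypothetical stand-in data rather than real output from to_instances_write().

```python
from collections import Counter
from types import SimpleNamespace as NS

def summarize(instances):
    """Count edges per edge type and nodes per view external ID."""
    edges_per_type = Counter(edge.type.external_id for edge in instances.edges)
    nodes_per_view = Counter(
        source.source.external_id
        for node in instances.nodes
        for source in node.sources
    )
    return edges_per_type, nodes_per_view

# Hypothetical stand-in objects mimicking the attribute layout used above
def _node(view: str):
    return NS(sources=[NS(source=NS(external_id=view))])

def _edge(edge_type: str):
    return NS(type=NS(external_id=edge_type))

demo = NS(
    nodes=[_node("Windmill"), _node("Blade"), _node("Blade")],
    edges=[_edge("Windmill.blades"), _edge("Windmill.blades")],
)
edges_per_type, nodes_per_view = summarize(demo)
```

With the real data, summarize(instances) would be called on the object returned by windmills.to_instances_write().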
instances.nodes
 | space | instance_type | external_id | sources |
---|---|---|---|---|
0 | windmill-instances | node | hornsea_1_mill_3 | [{'properties': {'capacity': 7.0, 'nacelle': {... |
1 | windmill-instances | node | bladewrite:1 | [{'properties': {'is_damaged': False, 'name': ... |
2 | windmill-instances | node | sensorpositionwrite:1 | [{'properties': {'flapwise_bend_mom': 'V52-Win... |
3 | windmill-instances | node | sensorpositionwrite:2 | [{'properties': {'edgewise_bend_mom_offset': '... |
4 | windmill-instances | node | sensorpositionwrite:3 | [{'properties': {'edgewise_bend_mom_crosstalk_... |
... | ... | ... | ... | ... |
140 | windmill-instances | node | generatorwrite:5 | [{'properties': {'generator_speed_controller':... |
141 | windmill-instances | node | highspeedshaftwrite:5 | [{'properties': {'bending_moment_y': 'V52-Wind... |
142 | windmill-instances | node | mainshaftwrite:5 | [{'properties': {'bending_x': 'V52-WindTurbine... |
143 | windmill-instances | node | powerinverterwrite:5 | [{'properties': {'active_power_total': 'V52-Wi... |
144 | windmill-instances | node | rotorwrite:5 | [{'properties': {'rotor_speed_controller': 'V5... |
145 rows × 4 columns
instances.edges
 | space | instance_type | external_id | type | start_node | end_node |
---|---|---|---|---|---|---|
0 | windmill-instances | edge | hornsea_1_mill_3:bladewrite:1 | {'space': 'power-models', 'external_id': 'Wind... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
1 | windmill-instances | edge | bladewrite:1:sensorpositionwrite:1 | {'space': 'power-models', 'external_id': 'Blad... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
2 | windmill-instances | edge | bladewrite:1:sensorpositionwrite:2 | {'space': 'power-models', 'external_id': 'Blad... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
3 | windmill-instances | edge | bladewrite:1:sensorpositionwrite:3 | {'space': 'power-models', 'external_id': 'Blad... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
4 | windmill-instances | edge | bladewrite:1:sensorpositionwrite:4 | {'space': 'power-models', 'external_id': 'Blad... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
... | ... | ... | ... | ... | ... | ... |
100 | windmill-instances | edge | bladewrite:15:sensorpositionwrite:86 | {'space': 'power-models', 'external_id': 'Blad... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
101 | windmill-instances | edge | bladewrite:15:sensorpositionwrite:87 | {'space': 'power-models', 'external_id': 'Blad... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
102 | windmill-instances | edge | bladewrite:15:sensorpositionwrite:88 | {'space': 'power-models', 'external_id': 'Blad... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
103 | windmill-instances | edge | bladewrite:15:sensorpositionwrite:89 | {'space': 'power-models', 'external_id': 'Blad... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
104 | windmill-instances | edge | bladewrite:15:sensorpositionwrite:90 | {'space': 'power-models', 'external_id': 'Blad... | {'space': 'windmill-instances', 'external_id':... | {'space': 'windmill-instances', 'external_id':... |
105 rows × 6 columns