# Data preprocessing - Extracting IDyOM output data

This tutorial covers how to extract selected IDyOM outputs (for analysis in Python) and export them in different
formats using py2lispIDyOM, given that you already have the `.dat` file output. For an overview of the py2lispIDyOM functionality, see the [README](../README.md).



With the `.dat` file in hand, you can extract specific properties of specific melodies from it.

We continue the example from [1_running_IDyOM_tutorial.ipynb](1_running_IDyOM_tutorial.ipynb) and extract some IDyOM outputs from that experiment, whose log folder is `experiment_history/21-05-22_17.05.05/`.


In [1]:
# Import ExperimentInfo from the extract module
import py2lispIDyOM
from py2lispIDyOM.extract import ExperimentInfo

## 1. Indicate the experiment log folder that you want to work with:

To start, indicate the experiment log folder that you want to work with by passing its path
as `experiment_folder_path`.

In [2]:
# Set experiment_folder_path:
my_experiment = ExperimentInfo(experiment_folder_path='experiment_history/21-05-22_17.05.05/')
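
Before passing the path on, it can help to confirm that the log folder really exists. The following is a minimal sketch using only the standard library; the helper name `check_experiment_folder` is hypothetical and not part of py2lispIDyOM, and the demo runs on a throwaway directory that merely mimics the folder layout used in this tutorial.

```python
from pathlib import Path
import tempfile


def check_experiment_folder(folder_path):
    """Return the folder as a Path, or raise if it is not an existing directory.

    Hypothetical helper for illustration; not part of py2lispIDyOM.
    """
    folder = Path(folder_path)
    if not folder.is_dir():
        raise FileNotFoundError(f"Experiment log folder not found: {folder}")
    return folder


# Demo on a temporary directory that mimics the tutorial's layout.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "experiment_history" / "21-05-22_17.05.05"
    log_dir.mkdir(parents=True)
    checked = check_experiment_folder(log_dir)
    folder_name = checked.name
```

In a real session, the checked path would then be passed as `experiment_folder_path`.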


## 2. Experiment Info: to access some melodies in that experiment:

There are two ways to access melodies in the experiment:
1. Access a specific melody using `melodies_dict` by passing the melody name.
 - This returns a DataFrame of all IDyOM outputs for this melody.
 
 
2. Access specific melodies using the method `access_melodies(starting_index=None, ending_index=None, melody_names=None)` 
 - This returns a list of DataFrames, one per selected melody, each containing all IDyOM outputs for that melody.
 

### 2.1 Access a melody using `melodies_dict`:


In [3]:
# Access the melody named '"chor-012"' using `melodies_dict`, with the melody name as the key

selected_melody = my_experiment.melodies_dict['"chor-012"']

print(selected_melody)

 dataset.id melody.id note.id melody.name vertint12 articulation \
0 6.605212e+13 12.0 1.0 "chor-012" NA 0.0 
1 6.605212e+13 12.0 2.0 "chor-012" NA 0.0 
2 6.605212e+13 12.0 3.0 "chor-012" NA 0.0 
3 6.605212e+13 12.0 4.0 "chor-012" NA 0.0 
4 6.605212e+13 12.0 5.0 "chor-012" NA 0.0 
5 6.605212e+13 12.0 6.0 "chor-012" NA 0.0 
6 6.605212e+13 12.0 7.0 "chor-012" NA 0.0 
7 6.605212e+13 12.0 8.0 "chor-012" NA 0.0 
8 6.605212e+13 12.0 9.0 "chor-012" NA 0.0 
9 6.605212e+13 12.0 10.0 "chor-012" NA 0.0 
10 6.605212e+13 12.0 11.0 "chor-012" NA 0.0 
11 6.605212e+13 12.0 12.0 "chor-012" NA 0.0 
12 6.605212e+13 12.0 13.0 "chor-012" NA 0.0 
13 6.605212e+13 12.0 14.0 "chor-012" NA 0.0 
14 6.605212e+13 12.0 15.0 "chor-012" NA 0.0 
15 6.605212e+13 12.0 16.0 "chor-012" NA 0.0 
16 6.605212e+13 12.0 17.0 "chor-012" NA 0.0 
17 6.605212e+13 12.0 18.0 "chor-012" NA 0.0 
18 6.605212e+13 12.0 19.0 "chor-012" NA 0.0 
19 6.605212e+13 12.0 20.0 "chor-012" NA 0.0 
20 6.605212e+13 12.0 21.0 "chor-012" NA 0.0 
21 6.60
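
Note that the key `'"chor-012"'` contains literal double quotes, matching the way the name appears in the `melody.name` column. Here is a small sketch of a helper that adds those quotes to a plain name. The name `quote_melody_name` is hypothetical, and a plain dict with placeholder values stands in for `melodies_dict`.

```python
def quote_melody_name(name):
    """Wrap a plain melody name in the double quotes used by the keys above.

    Hypothetical helper for illustration; not part of py2lispIDyOM.
    Names that are already quoted are returned unchanged.
    """
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name
    return f'"{name}"'


# A plain dict stands in for `melodies_dict`; the values are placeholders.
stand_in_dict = {
    '"chor-010"': "data for chor-010",
    '"chor-012"': "data for chor-012",
}

key = quote_melody_name("chor-012")
selected = stand_in_dict[key]
```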

### 2.2 Access a list of melodies using the `access_melodies` method by providing their names


In [4]:
# Access the melody named '"chor-010"' using the `access_melodies` method.

selected_melodies = my_experiment.access_melodies(melody_names = ['"chor-010"'])

print(selected_melodies)


[ dataset.id melody.id note.id melody.name vertint12 articulation \
0 6.605212e+13 10.0 1.0 "chor-010" NA 0.0 
1 6.605212e+13 10.0 2.0 "chor-010" NA 0.0 
2 6.605212e+13 10.0 3.0 "chor-010" NA 0.0 
3 6.605212e+13 10.0 4.0 "chor-010" NA 0.0 
4 6.605212e+13 10.0 5.0 "chor-010" NA 0.0 
5 6.605212e+13 10.0 6.0 "chor-010" NA 0.0 
6 6.605212e+13 10.0 7.0 "chor-010" NA 0.0 
7 6.605212e+13 10.0 8.0 "chor-010" NA 0.0 
8 6.605212e+13 10.0 9.0 "chor-010" NA 0.0 
9 6.605212e+13 10.0 10.0 "chor-010" NA 0.0 
10 6.605212e+13 10.0 11.0 "chor-010" NA 0.0 
11 6.605212e+13 10.0 12.0 "chor-010" NA 0.0 
12 6.605212e+13 10.0 13.0 "chor-010" NA 0.0 
13 6.605212e+13 10.0 14.0 "chor-010" NA 0.0 
14 6.605212e+13 10.0 15.0 "chor-010" NA 0.0 
15 6.605212e+13 10.0 16.0 "chor-010" NA 0.0 
16 6.605212e+13 10.0 17.0 "chor-010" NA 0.0 
17 6.605212e+13 10.0 18.0 "chor-010" NA 0.0 
18 6.605212e+13 10.0 19.0 "chor-010" NA 0.0 
19 6.605212e+13 10.0 20.0 "chor-010" NA 0.0 
20 6.605212e+13 10.0 21.0 "chor-010" NA 0.0 
21 6.6

### 2.3 Access consecutive melodies using the `access_melodies` method by providing the indices.


In [5]:
# Access the first 2 melodies using the `access_melodies` method.

selected_melodies = my_experiment.access_melodies(ending_index=2)

print(selected_melodies)


[ dataset.id melody.id note.id melody.name vertint12 articulation \
0 6.605212e+13 1.0 1.0 "chor-001" NA 0.0 
1 6.605212e+13 1.0 2.0 "chor-001" NA 0.0 
2 6.605212e+13 1.0 3.0 "chor-001" NA 0.0 
3 6.605212e+13 1.0 4.0 "chor-001" NA 0.0 
4 6.605212e+13 1.0 5.0 "chor-001" NA 0.0 
5 6.605212e+13 1.0 6.0 "chor-001" NA 0.0 
6 6.605212e+13 1.0 7.0 "chor-001" NA 0.0 
7 6.605212e+13 1.0 8.0 "chor-001" NA 0.0 
8 6.605212e+13 1.0 9.0 "chor-001" NA 0.0 
9 6.605212e+13 1.0 10.0 "chor-001" NA 0.0 
10 6.605212e+13 1.0 11.0 "chor-001" NA 0.0 
11 6.605212e+13 1.0 12.0 "chor-001" NA 0.0 
12 6.605212e+13 1.0 13.0 "chor-001" NA 0.0 
13 6.605212e+13 1.0 14.0 "chor-001" NA 0.0 
14 6.605212e+13 1.0 15.0 "chor-001" NA 0.0 
15 6.605212e+13 1.0 16.0 "chor-001" NA 0.0 
16 6.605212e+13 1.0 17.0 "chor-001" NA 0.0 
17 6.605212e+13 1.0 18.0 "chor-001" NA 0.0 
18 6.605212e+13 1.0 19.0 "chor-001" NA 0.0 
19 6.605212e+13 1.0 20.0 "chor-001" NA 0.0 
20 6.605212e+13 1.0 21.0 "chor-001" NA 0.0 
21 6.605212e+13 1.0 22.0 "c
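
To make the two selection modes concrete, here is a rough sketch of the same idea on a plain dict. It is not the library's implementation. The function `select_melodies` is hypothetical, and it assumes that the indices behave like a Python slice over the melodies in their stored order. That assumption is consistent with `ending_index=2` selecting the first 2 melodies, but is otherwise a guess.

```python
def select_melodies(melodies, starting_index=None, ending_index=None, melody_names=None):
    """Select values from `melodies` by name or by position.

    Hypothetical stand-in, NOT py2lispIDyOM's implementation. Assumption:
    indices act like a Python slice over the dict's insertion order.
    """
    if melody_names is not None:
        # Selection by name: look each requested key up directly.
        return [melodies[name] for name in melody_names]
    # Selection by position: slice the values in insertion order.
    return list(melodies.values())[starting_index:ending_index]


# Placeholder strings stand in for the per-melody DataFrames.
melodies = {
    '"chor-001"': "outputs for chor-001",
    '"chor-002"': "outputs for chor-002",
    '"chor-010"': "outputs for chor-010",
}

by_name = select_melodies(melodies, melody_names=['"chor-010"'])
first_two = select_melodies(melodies, ending_index=2)
```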

## 3. Melody Info: to further access melody-specific information

To get the IDyOM model outputs for each melody, use the `MelodyInfo` class.

### 3.1 Access the melody

For each melody in the experiment, all data are stored in an instance of the `MelodyInfo` class, which is essentially a `pandas.DataFrame`. To obtain a `MelodyInfo` instance, use `ExperimentInfo.melodies_dict` or `ExperimentInfo.access_melodies`, as shown in part 2.

In [6]:
# first, access the melody by creating an instance of `MelodyInfo`

selected_melody = my_experiment.melodies_dict['"chor-002"']


Note that here, `selected_melody` is an instance of `MelodyInfo`.

### 3.2 Check the valid IDyOM output keywords for the selected melody

For each melody, you can check the valid IDyOM output keywords with the `get_idyom_output_keyword_list()` method.

In [7]:
# check the valid IDyOM output keywords for this selected_melody:

valid_idyom_output_key_list = selected_melody.get_idyom_output_keyword_list()

print(valid_idyom_output_key_list)

['dataset.id', 'melody.id', 'note.id', 'melody.name', 'vertint12', 'articulation', 'comma', 'voice', 'ornament', 'dyn', 'phrase', 'bioi', 'deltast', 'accidental', 'mpitch', 'cpitch', 'barlength', 'pulses', 'tempo', 'mode', 'keysig', 'dur', 'onset', 'cpitch.order.ltm.cpitch', 'cpitch.order.stm.cpitch', 'cpitch.weight.ltm', 'cpitch.weight.stm', 'cpitch.weight.ltm.cpitch', 'cpitch.weight.stm.cpitch', 'cpitch.probability', 'cpitch.information.content', 'cpitch.entropy', 'cpitch.55', 'cpitch.57', 'cpitch.58', 'cpitch.59', 'cpitch.60', 'cpitch.62', 'cpitch.63', 'cpitch.64', 'cpitch.65', 'cpitch.66', 'cpitch.67', 'cpitch.68', 'cpitch.69', 'cpitch.70', 'cpitch.71', 'cpitch.72', 'cpitch.73', 'cpitch.74', 'cpitch.75', 'cpitch.76', 'cpitch.77', 'cpitch.78', 'cpitch.79', 'cpitch.81', 'cpitch.82', 'cpitch.83', 'cpitch.84', 'cpitch.85', 'cpitch.86', 'cpitch.88', 'onset.order.ltm.onset', 'onset.order.stm.onset', 'onset.weight.ltm', 'onset.weight.stm', 'onset.weight.ltm.onset', 'onset.weight.stm.onset

The list above shows all the valid IDyOM output keys available for the melody '"chor-002"'.
Next, we access some of these outputs, such as `cpitch.information.content`, `onset.information.content`, and `entropy`.
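
Since the keyword list is long, it can be handy to filter it and to check requested keywords before using them. The sketch below works on a plain list of strings. The shortened list is copied from the output above, and both helper names are hypothetical.

```python
def keywords_ending_with(keywords, suffix):
    """Return the keywords that end with `suffix`, in their original order."""
    return [key for key in keywords if key.endswith(suffix)]


def missing_keywords(requested, valid):
    """Return the requested keywords that are not in the valid list."""
    valid_set = set(valid)
    return [key for key in requested if key not in valid_set]


# A shortened copy of the keyword list printed above.
valid_keys = [
    "melody.name",
    "cpitch",
    "onset",
    "cpitch.probability",
    "cpitch.information.content",
    "cpitch.entropy",
]

entropy_keys = keywords_ending_with(valid_keys, "entropy")
# The second requested keyword is deliberately misspelled.
not_found = missing_keywords(["cpitch.entropy", "cpitch.entropi"], valid_keys)
```

In a real session, `valid_keys` would come from `get_idyom_output_keyword_list()`.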

### 3.3 Access IDyOM output data via keywords

To extract the output values, we will use the `MelodyInfo` method called `access_idyom_output_keywords`.

Pass a list of keywords to the method; it returns a DataFrame with one column per keyword.

In [8]:
# Accessing `onset.information.content`, `entropy` for '"chor-002"'

selected_idyom_outputs = selected_melody.access_idyom_output_keywords(['onset.information.content',
 'entropy'])

print(selected_idyom_outputs)

 onset.information.content entropy
0 2.880014 8.166577
1 3.583843 7.022451
2 0.902314 6.759365
3 1.487593 5.508813
4 2.377832 5.474818
5 3.686743 6.267564
6 3.251021 5.693554
7 1.028696 4.829847
8 0.899349 6.363213
9 1.044276 5.423235
10 5.122056 5.354833
11 2.590604 5.034352
12 3.083826 5.695037
13 0.624909 5.292224
14 0.973895 4.940844
15 1.630016 6.349461
16 1.662930 6.544269
17 1.441977 6.277746
18 1.441977 6.411900
19 1.220463 6.011344
20 0.978923 5.887434
21 4.352934 6.839952
22 3.806420 6.760173
23 3.806420 6.465666
24 3.806420 6.799645
25 3.806420 6.553230
26 3.806420 6.445196
27 3.806420 6.697904
28 3.806420 5.969671
29 3.806420 6.520202
30 3.806420 5.935160
31 3.806420 6.649951
32 3.806420 6.340725
33 3.806420 5.169728
34 3.806420 5.658613
35 3.806420 5.886921
36 3.806420 5.345791
37 3.806420 6.304415
38 3.806420 5.775718
39 3.806420 6.115581
40 3.806420 6.414777


To get the data as a NumPy array, use the `get_idyom_output_nparray` method.

In [9]:
# Get the values of `cpitch.entropy` as a numpy array for '"chor-002"'

selected_idyom_output_array = selected_melody.get_idyom_output_nparray('cpitch.entropy')

print(selected_idyom_output_array)

[4.8619256, 4.716329, 4.391901, 3.1552763, 4.0513153, 3.0147789, 3.5076356, 3.0511367, 4.0920258, 3.3372533, 3.1501474, 3.36347, 3.0170422, 3.643789, 2.8647585, 3.3451679, 3.466154, 3.354281, 3.4884357, 3.3005762, 3.4747243, 4.214637, 3.5233264, 3.2288165, 3.5627892, 3.3163767, 3.2083457, 3.4610538, 2.732824, 3.2833517, 2.698308, 3.4130876, 3.1038811, 1.9328707, 2.421765, 2.650072, 2.1089375, 3.0675576, 2.5388658, 2.878732, 3.1779296]
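
Once the values are extracted, summary statistics are straightforward. Here is a minimal sketch with the standard library, using the first five `cpitch.entropy` values printed above as a plain list. The variable names are illustrative only.

```python
import statistics

# First five `cpitch.entropy` values copied from the output above.
entropy_values = [4.8619256, 4.716329, 4.391901, 3.1552763, 4.0513153]

mean_entropy = statistics.mean(entropy_values)

# 0-based position of the note with the highest entropy.
most_uncertain_note = max(range(len(entropy_values)), key=entropy_values.__getitem__)

print(f"mean entropy: {mean_entropy:.3f}")
print(f"highest entropy at note index {most_uncertain_note}")
```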
