Hi everyone,

I am using the YouTube Analytics and Reporting API to get performance data from a YouTube channel and to store it programmatically with Azure Data Factory (ADF). From the YouTube API, I get the following JSON output format:

"kind": "youtubeAnalytics#resultTable", "columnHeaders": [ "name": string, "dataType": string, "columnType": string "rows": [ {value}, {value}, ...

See link here: https://developers.google.com/youtube/analytics/reference/reports/query#response
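To show how the two parts of that result table belong together, here is a minimal Python sketch (the file and variable names are mine, not anything ADF produces): the values in each entry of rows line up positionally with the name fields in columnHeaders.

    import json

    # Load a saved resultTable response (placeholder file name).
    with open("youtube_analytics_response.json") as f:
        result = json.load(f)

    # "columnHeaders" carries the column names; each entry in "rows" is a
    # positional list of values in the same order.
    column_names = [header["name"] for header in result["columnHeaders"]]
    records = [dict(zip(column_names, row)) for row in result["rows"]]

    print(records[0])  # e.g. {"day": "2024-01-01", "views": 123, ...}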

In a first step, I used the copy activity to copy the JSON into the blob storage. Everything fine so far. Now I wanted to create a data flow, flatten the JSON, and write the values into derived columns before saving them in a data sink.

Issue #1
In the first step of my data flow - defining the source - I select the JSON from the blob storage, and when I click on data preview I get this error message: SparkException: Malformed records are detected in schema inference. Parse Mode: FAILFAST, see screenshot:

I checked the JSON's schema on https://jsonformatter.curiousconcept.com/ and generally it seems to be fine, so I don't understand why ADF cannot read it properly.
I realized that ADF is putting my whole JSON into [ ] brackets, which is possibly causing the Spark exception. When I delete the [ ] brackets manually by editing the JSON in the blob storage (to get the native JSON structure of the API back again), the data preview works as expected. In a nutshell: ADF is adding something which it later on does not like any longer, hmm hmmm. Any ideas what can be done here?
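To illustrate what I mean, here is a rough Python sketch of a workaround I could run before the data flow reads the file (the file names are placeholders I made up): it simply unwraps the extra [ ] so the blob contains the native API structure again.

    import json

    # "wrapped.json" stands for the file the copy activity wrote, i.e.
    # [ { "kind": "youtubeAnalytics#resultTable", ... } ]
    with open("wrapped.json") as f:
        data = json.load(f)

    # If the single response object was wrapped in an array, unwrap it.
    if isinstance(data, list) and len(data) == 1:
        data = data[0]

    # Write the native, un-wrapped structure back out for the data flow.
    with open("unwrapped.json", "w") as f:
        json.dump(data, f)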

Issue #2
As soon as the parsing of the JSON is successful, I noticed that rows is not defined as an object but as an array (in the data source linked service it is even defined as a string), see both screenshots. Somehow it should be defined as objects so that I am able to call the values within the rows, as it is for the "columnHeaders". I hope you get my pain point. Can anyone please tell me what I need to do here?

Looking forward to your replies. Thanks in advance!

Okay, I am one step closer to the overall solution:

  • I realized that I screwed up one thing when copying the JSON into the blob storage. In the sink settings I had used the file pattern "Array of objects", which I think caused the additional [ ] brackets. When I change it to "Set of objects", the brackets are not added and the JSON schema can be read properly in the data flow.
  • The only thing which is still pending: how can I make the "rows" processable for ADF? I have a pretty ugly workaround where I just parse the whole row "as is", but ADF complains that this is a complex data structure. I would really like to define rows as an object or complex data type, but I don't know how. I have the impression that the "mapping" settings in the copy activity could be helpful, but I honestly have no idea how to apply them. A rough sketch of the reshaping I am after is at the end of this post.
Thanks,
Daniel
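To make it concrete, here is the reshaping I am after, sketched in Python (just to show the target structure; the file and variable names are made up, and this is not how I expect to do it inside ADF): each positional row gets turned into an object keyed by the column names, so that "rows" becomes an array of objects that a flatten transformation could unroll.

    import json

    with open("youtube_analytics_response.json") as f:
        result = json.load(f)

    # Column names come from "columnHeaders" and are positionally
    # aligned with the values in each row.
    column_names = [header["name"] for header in result["columnHeaders"]]

    # Before: "rows" is an array of arrays,  e.g. [["2024-01-01", 123], ...]
    # After:  "rows" is an array of objects, e.g. [{"day": "2024-01-01", "views": 123}, ...]
    result["rows"] = [dict(zip(column_names, row)) for row in result["rows"]]

    with open("reshaped_response.json", "w") as f:
        json.dump(result, f)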

    Hi Kiran,

    Thanks for your feedback. Yes, I want to use the flatten and derived column steps in the data flow, and actually I am able to do it for JSONs with a clear schema, but obviously my JSON example above is not recognized as having a clear schema.

    The value of the columnHeaders key is an array of strings, see the screenshot attached.
    I would expect the rows key to also be an array of strings, showing me the values in the same way as for columnHeaders, but this is not the case and I don't understand why. Any ideas? As long as the rows key is not defined as an array of strings, I cannot flatten and derive it into columns.

    If you look closer, the schema indeed mirrors the JSON: rows is an array of arrays, not an array of strings, and columnHeaders is an array of structures.

    Dataflow is showing the right schema. I am not sure what you are seeing in the dataset or copy activity.
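    To make that concrete, here is a small made-up sample (illustrative values only, not from your channel) showing the two shapes side by side:

        sample = {
            "columnHeaders": [  # array of structures: each entry names and types one column
                {"name": "day",   "columnType": "DIMENSION", "dataType": "STRING"},
                {"name": "views", "columnType": "METRIC",    "dataType": "INTEGER"},
            ],
            "rows": [           # array of arrays: positional values, no field names
                ["2024-01-01", 123],
                ["2024-01-02", 98],
            ],
        }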

    For anyone still searching for a solution on this issue: I find that ADF data flow is notoriously bad when dealing with JSON files. If for some reason you REALLY need to use JSON as your source data and your architecture and know-how permit switching to Databricks, then you should make the switch. Otherwise, try to use the standard copy activity to change the file type from JSON to CSV or Parquet, then use that as your data flow source. In most cases the Copy Activity parser will handle JSON better, but if you still run into problems, your best bet is to parse the JSON files through an Azure Function. Trust me, by always using Parquet or CSV as a source in data flow you'll avoid a lot of headaches.
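    If you go down the Azure Function route, the core of it can be as small as this sketch (plain Python with the trigger and bindings left out; the file names are placeholders): read the resultTable JSON and write it back out as CSV, with the column names as the header row.

        import csv
        import json

        # Read the raw resultTable response (placeholder file name).
        with open("youtube_analytics_response.json") as f:
            result = json.load(f)

        column_names = [header["name"] for header in result["columnHeaders"]]

        # Write header + rows to CSV; the data flow then reads this CSV instead of the JSON.
        with open("youtube_analytics.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(column_names)
            writer.writerows(result["rows"])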