
I wish to convert a standard JSON file to a format where each line contains a separate, self-contained valid JSON object. See JSON Lines

JSON_file =
[{u'index': 1,
  u'no': 'A',
  u'met': u'1043205'},
 {u'index': 2,
  u'no': 'B',
  u'met': u'000031043206'},
 {u'index': 3,
  u'no': 'C',
  u'met': u'0031043207'}]

To JSONL:

{u'index': 1, u'no': 'A', u'met': u'1043205'}
{u'index': 2, u'no': 'B', u'met': u'000031043206'}
{u'index': 3, u'no': 'C', u'met': u'0031043207'}

My current solution is to read the JSON file as a text file and remove the [ from the beginning and the ] from the end, creating a valid JSON object on each line rather than one nested object spanning several lines.

I wonder if there is a more elegant solution? I suspect something could go wrong using string manipulation on the file.

The motivation is to read json files into RDD on Spark. See related question - Reading JSON with Apache Spark - `corrupt_record`

That's not valid JSON input, nor valid JSON output. You are handling Python objects here, not JSON serialisation. Even if your output was valid JSON, it would not be valid JSONL because you have trailing commas. – Martijn Pieters Aug 12, 2016 at 10:01

Also, if the objects in the output would be valid JSON, there would be no trailing commas. – user824425 Aug 12, 2016 at 10:03

Your input appears to be a sequence of Python objects; it certainly is not a valid JSON document.

If you have a list of Python dictionaries, then all you have to do is dump each entry into a file separately, followed by a newline:

import json
with open('output.jsonl', 'w') as outfile:
    for entry in JSON_file:
        json.dump(entry, outfile)
        outfile.write('\n')

The default configuration for the json module is to output JSON without newlines embedded.

Assuming your A, B and C names are really strings, that would produce:

{"index": 1, "met": "1043205", "no": "A"}
{"index": 2, "met": "000031043206", "no": "B"}
{"index": 3, "met": "0031043207", "no": "C"}

If you started with a JSON document containing a list of entries, just parse that document first with json.load()/json.loads().
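A minimal sketch of that parse-then-dump round trip, using an in-memory buffer in place of the output file and a made-up two-entry document:

```python
import io
import json

# A JSON document containing a list of entries
document = '[{"index": 1, "no": "A"}, {"index": 2, "no": "B"}]'
entries = json.loads(document)   # parse the whole array first

buf = io.StringIO()              # stands in for the output file
for entry in entries:
    json.dump(entry, buf)        # one compact JSON object...
    buf.write('\n')              # ...per line

print(buf.getvalue())
# {"index": 1, "no": "A"}
# {"index": 2, "no": "B"}
```

Each output line is itself valid JSON, which is exactly the JSON Lines contract.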

A simple way to do this is with the jq command in your terminal.

To install jq on Debian and derivatives:

$ sudo apt-get install jq

CentOS/RHEL users should run:

$ sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install jq -y

Basic usage:

$ jq -c '.[]' some_json.json >> output.jsonl

If you need to handle huge files, I strongly recommend the --stream flag, which makes jq parse the JSON in streaming mode. Note that under --stream the document arrives as [path, value] events, so the filter has to reassemble the top-level array elements:

$ jq -cn --stream 'fromstream(1|truncate_stream(inputs))' some_json.json >> output.jsonl

But if you need to do this operation in Python, you can use bigjson, a useful library that parses the JSON in streaming mode:

$ pip3 install bigjson

To read a huge json (In my case, it was 40 GB):

import bigjson
import json

# Read the JSON file in streaming mode
with open('input_file.json', 'rb') as f:
    json_data = bigjson.load(f)
    # Open the output file
    with open('output_file.jsonl', 'w') as outfile:
        # Iterate over the input array
        for data in json_data:
            # Convert each element to a plain Python dict
            dict_data = data.to_python()
            # Write one compact JSON object per line
            outfile.write(json.dumps(dict_data) + "\n")

If you want, try parallelizing this code to improve performance, and post the result here :)

Documentation and source code: https://github.com/henu/bigjson

As of now this answer only works with Debian and derivatives. Are there other possible installation instructions for other operating systems? – TheTechRobo the Nerd Mar 19, 2021 at 15:53

Yes, but it is quite long, so follow this link to install on RHEL/CentOS: cyberithub.com/… – Jonas Ferreira Mar 19, 2021 at 17:39

Given this input in test.json (wrapped in a list so that json.load can parse it):

[{"index": 1, "no": "A", "met": "1043205"},
 {"index": 2, "no": "B", "met": "000031043206"},
 {"index": 3, "no": "C", "met": "0031043207"}]
import json
with open("test.json", 'r') as infile:
    data = json.load(infile)
    if len(data) > 0:
        print(json.dumps([t for t in data[0]]))
        for record in data:
            print(json.dumps([v for (k,v) in record.items()]))

result

["index", "no", "met"]
[1, "A", "1043205"]
[2, "B", "000031043206"]
[3, "C", "0031043207"]

Note that JSONL is compact JSON. You may need to pass separators without spaces:

# rs is an iterable of records; DateTimeEncoder is a custom
# JSONEncoder subclass defined elsewhere in the original answer
with open(output_file_jsonl, 'a', encoding='utf8') as json_file:
    for elem in rs:
        json_file.write(json.dumps(dict(elem), separators=(',', ':'), cls=DateTimeEncoder))
        json_file.write('\n')
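The effect of compact separators, as a quick check on a made-up record:

```python
import json

record = {"index": 1, "no": "A"}

# Default separators include a space after ',' and ':'
assert json.dumps(record) == '{"index": 1, "no": "A"}'

# Compact separators drop the spaces, as commonly used for JSONL
assert json.dumps(record, separators=(',', ':')) == '{"index":1,"no":"A"}'
```

Both forms are valid JSON Lines, so the compact separators are an optional space saving rather than a requirement.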

This is an edit to this answer which takes into account the possibility of special symbols or a different alphabet in the JSONL file. For example, I use Cyrillic, and without setting the encoding and ensure_ascii parameters I get really ugly escaped results. I thought it could be useful:

with open('output.jsonl', 'w', encoding='utf8') as outfile:
    for entry in dataset_donut:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')
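What ensure_ascii changes, shown on a sample Cyrillic string:

```python
import json

entry = {"text": "привет"}

# Default: non-ASCII characters are escaped to \uXXXX sequences
assert json.dumps(entry) == '{"text": "\\u043f\\u0440\\u0438\\u0432\\u0435\\u0442"}'

# ensure_ascii=False keeps the characters readable;
# the file must then be opened with a UTF-8 encoding
assert json.dumps(entry, ensure_ascii=False) == '{"text": "привет"}'
```

Both outputs parse back to the same string, so this is purely a readability (and file-size) choice.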
        
