#StackBounty: #javascript #json #parsing #tree #stream How to parse items from a large JSON stream in JavaScript?

Bounty: 250

So I have downloaded the Wikidata JSON dump and it’s about 90GB, too large to load into memory. It consists of a simple JSON structure like this:

[
  item,
  item,
  item,
  ...
]

Each "item" looks something like this:

{
  "type": "item",
  "id": "Q23",
  "labels": {
    "<lang>": obj
  },
  "descriptions": {
    "<lang>": {
      "language": "<lang>",
      "value": "<string>"
    },
  },
  "aliases": {
    "<key>": [
      obj,
      obj,
    ],
  },
  "claims": {
    "<keyID>": [
       {
        "mainsnak": {
          "snaktype": "value",
          "property": "<keyID>",
          "datavalue": {
            "value": {
              "entity-type": "<type>",
              "numeric-id": <num>,
              "id": "<id>"
            },
            "type": "wikibase-entityid"
          },
          "datatype": "wikibase-item"
        },
        "type": "statement",
        "id": "<anotherId>",
        "rank": "preferred",
        "references": [
          {
            "hash": "<hash>",
            "snaks": {
              "<keyIDX>": [
                {
                  "snaktype": "value",
                  "property": "P854",
                  "datavalue": obj,
                  "datatype": "url"
                }
              ]
            },
            "snaks-order": [
              "<propID>"
            ]
          }
        ]
      }
    ]
  },
  "sitelinks": {
    "<lang>wiki": {
      "site": "<lang>wiki",
      "title": "<string>",
      "badges": []
    }
  }
}

The JSON stream is configured like this:

const fs   = require('fs')
const zlib = require('zlib')
const { parser } = require('stream-json')

let stream = fs.createReadStream('./wikidata/latest-all.json.gz')
stream
  .pipe(zlib.createGunzip())
  .pipe(parser())
  .on('data', buildItem)

function buildItem(data) {
  switch (data.name) {
    case `startArray`:
      break
    case `startObject`:
      break
    case `startKey`:
      break
    case `stringChunk`:
      break
    case `endKey`:
      break
    case `keyValue`:
      break
    case `startString`:
      break
    case `endString`:
      break
    case `stringValue`:
      break
    case `endObject`:
      break
    case `endArray`:
      break
  }
}

Notice that buildItem receives the key information: it shows that the JSON stream emits token objects like this (these are the logs):

{ name: 'startArray' }
{ name: 'startObject' }
{ name: 'startKey' }
{ name: 'startString' }
{ name: 'stringValue', value: 'type' }
{ name: 'endString' }
...

How do you parse this into item objects like the one above? Parsing this linear token stream back into a tree is very difficult to comprehend.

A sample of output from the JSON stream is here, which you could use to test a parser if it helps.
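
For what it's worth, stream-json also ships streamer helpers that assemble values from the token stream for you; StreamArray in particular emits each element of a top-level array as a ready-made object, so you never have to rebuild the tree from startObject/endObject tokens yourself. A minimal sketch of that approach (my own, not from the question — double-check the module path against the stream-json version you have installed):

const fs   = require('fs')
const zlib = require('zlib')
const { parser } = require('stream-json')
const { streamArray } = require('stream-json/streamers/StreamArray')

fs.createReadStream('./wikidata/latest-all.json.gz')
  .pipe(zlib.createGunzip())
  .pipe(parser())
  .pipe(streamArray())                                  // assembles each array element from the tokens
  .on('data', ({ key, value }) => handleItem(value))    // key = array index, value = one item
  .on('end', () => console.log('done'))

function handleItem(item) {
  // `item` is one fully built object (e.g. item.id === 'Q23');
  // only one item is held in memory at a time
}

Each data event carries { key, value }, where value is the assembled item, so the big switch over token names is no longer needed.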


Get this bounty!!!

#StackBounty: #json #sed #split #jq Improving performance when using jq to process large files

Bounty: 250

Use Case

I need to split large files (~5G) of JSON data into smaller files with newline-delimited JSON in a memory efficient way (i.e., without having to read the entire JSON blob into memory). The JSON data in each source file is an array of objects.

Unfortunately, the source data is not newline-delimited JSON and in some cases there are no newlines in the files at all. This means I can’t simply use the split command to split the large file into smaller chunks by newline. Here are examples of how the source data is stored in each file:

Example of a source file with newlines.

[{"id": 1, "name": "foo"}
,{"id": 2, "name": "bar"}
,{"id": 3, "name": "baz"}
...
,{"id": 9, "name": "qux"}]

Example of a source file without newlines.

[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}, ...{"id": 9, "name": "qux"}]

Here’s an example of the desired format for a single output file:

{"id": 1, "name": "foo"}
{"id": 2, "name": "bar"}
{"id": 3, "name": "baz"}

Current Solution

I’m able to achieve the desired result by using jq and split as described in this SO Post. This approach is memory efficient thanks to the jq streaming parser. Here’s the command that achieves the desired result:

cat large_source_file.json \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
  | split --line-bytes=1m --numeric-suffixes - split_output_file

The Problem

The command above takes ~47 minutes to process the entire source file. This seems quite slow, especially when compared to sed, which can produce the same output much faster.

Here are some performance benchmarks to show processing time with jq vs. sed.

export SOURCE_FILE=medium_source_file.json  # smaller 250MB

# using jq
time cat ${SOURCE_FILE} \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
  | split --line-bytes=1m - split_output_file

real    2m0.656s
user    1m58.265s
sys     0m6.126s

# using sed
time cat ${SOURCE_FILE} \
  | sed -E 's#^\[##g' \
  | sed -E 's#^,{#{#g' \
  | sed -E 's#]$##g' \
  | sed 's#},{#}\n{#g' \
  | split --line-bytes=1m - sed_split_output_file

real    0m25.545s
user    0m5.372s
sys     0m9.072s

Questions

  1. Is this slower processing speed expected for jq compared to sed? It makes sense jq would be slower given it’s doing a lot of validation under the hood, but 4X slower doesn’t seem right.
  2. Is there anything I can do to improve the speed at which jq can process this file? I’d prefer to use jq to process files because I’m confident it could seamlessly handle other line output formats, but given I’m processing thousands of files each day, it’s hard to justify the speed difference I’ve observed.


Get this bounty!!!

#StackBounty: #javascript #json #google-apps-script #google-sheets #aws-lambda How to batch row data and send a single JSON payload?

Bounty: 100

I currently use a Google Apps Script on a Google Sheet that sends individual row data to AWS API Gateway to generate a screenshot. At the moment, the many single-payload JSON requests are causing some Lambda function failures. So I want to batch the row data and send it as a single payload, so that a single AWS Lambda function can then perform and complete multiple screenshots.

How can I batch the JSON payloads after iterating over each row of data in the code below?

function S3payload () {
  var PAYLOAD_SENT = "S3 SCREENSHOT DATA SENT";
  
  var sheet = SpreadsheetApp.getActiveSheet(); // Use data from the active sheet
  
  // Add temporary column header for Payload Status new column entries
  sheet.getRange('E1').activate();
  sheet.getCurrentCell().setValue('payload status');
  
  var startRow = 2;                            // First row of data to process
  var numRows = sheet.getLastRow() - 1;        // Number of rows to process
  var lastColumn = sheet.getLastColumn();      // Last column
  var dataRange = sheet.getRange(startRow, 1, numRows, lastColumn) // Fetch the data range of the active sheet
  var data = dataRange.getValues();            // Fetch values for each row in the range
  
  // Work through each row in the spreadsheet
  for (var i = 0; i < data.length; ++i) {
    var row = data[i];  
    // Assign each row a variable   
    var index = row[0];     // Col A: Index Sequence Number
    var img = row[1];   // Col B: Image Row
    var url = row[2];      // Col C: URL Row
    var payloadStatus = row[lastColumn - 1];  // Col E: Payload Status (has the payload been sent)
  
    var siteOwner = "email@example.com";
    
    // Prevent from sending payload duplicates
    if (payloadStatus !== PAYLOAD_SENT) {  
        
      /* Forward the Contact Form submission to the owner of the site
      var emailAddress = siteOwner; 
      var subject = "New contact form submission: " + name;
      var message = message;*/
      
      //Send payload body to AWS API GATEWAY
      //var sheetid = SpreadsheetApp.getActiveSpreadsheet().getId(); // get the actual id
      //var companyname = SpreadsheetApp.getActiveSpreadsheet().getName(); // get the name of the sheet (companyname)
      
      var payload = {
        "img": img,
        "url": url
      };
      
      var url = 'https://hfcrequestbin.herokuapp.com/vbxpsavb';
      var options = {
        'method': 'post',
        'payload': JSON.stringify(payload)
      };
      
      var response = UrlFetchApp.fetch(url,options);
      
      sheet.getRange(startRow + i, lastColumn).setValue(PAYLOAD_SENT); // Update the last column with "PAYLOAD_SENT"
      SpreadsheetApp.flush(); // Make sure the last cell is updated right away
      
      // Remove temporary column header for Payload Status    
      sheet.getRange('E1').activate();
      sheet.getCurrentCell().clear({contentsOnly: true, skipFilteredRows: true});
      
    }
  }
}

Example individual JSON payload

{"img":"https://s3screenshotbucket.s3.amazonaws.com/realitymine.com.png","url":"https://realitymine.com"}


Example desired output result

[
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/gavurin.com.png","url":"https://gavurin.com"},
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/google.com.png","url":"https://google.com"},
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/amazon.com","url":"https://www.amazon.com"},  
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/stackoverflow.com","url":"https://stackoverflow.com"},
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/duckduckgo.com","url":"https://duckduckgo.com"},
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/docs.aws.amazon.com","url":"https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-features.html"},  
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/github.com","url":"https://github.com"},  
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/github.com/shelfio/chrome-aws-lambda-layer","url":"https://github.com/shelfio/chrome-aws-lambda-layer"},  
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/gwww.youtube.com","url":"https://www.youtube.com"},   
    {"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/w3docs.com","url":"https://www.w3docs.com"}       
]
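
One way to go about it (a rough sketch only, not part of the question — it reuses the same hypothetical request-bin endpoint and column layout as the code above): push each unsent row's { img, url } object into an array inside the loop, then make a single UrlFetchApp.fetch call with the whole array once the loop has finished.

function S3payloadBatched() {
  var PAYLOAD_SENT = "S3 SCREENSHOT DATA SENT";
  var sheet = SpreadsheetApp.getActiveSheet();
  var startRow = 2;
  var numRows = sheet.getLastRow() - 1;
  var lastColumn = sheet.getLastColumn();
  var data = sheet.getRange(startRow, 1, numRows, lastColumn).getValues();

  // Collect one {img, url} object per row that has not been sent yet
  var batch = [];
  for (var i = 0; i < data.length; ++i) {
    var row = data[i];
    if (row[lastColumn - 1] !== PAYLOAD_SENT) {
      batch.push({ img: row[1], url: row[2] });   // Col B: image, Col C: URL
    }
  }

  if (batch.length === 0) return;

  // Single request carrying the whole batch as a JSON array
  var options = {
    method: 'post',
    contentType: 'application/json',
    payload: JSON.stringify(batch)
  };
  UrlFetchApp.fetch('https://hfcrequestbin.herokuapp.com/vbxpsavb', options);

  // Mark the processed rows as sent in one write instead of one setValue per row
  var sentColumn = data.map(function () { return [PAYLOAD_SENT]; });
  sheet.getRange(startRow, lastColumn, numRows, 1).setValues(sentColumn);
  SpreadsheetApp.flush();
}

The Lambda handler would then receive a JSON array in the request body and can loop over it to take each screenshot.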


Get this bounty!!!

#StackBounty: #json #swift #xml #augmented-reality #arkit ARKit 4.0 – Is it possible to convert ARWorldMap data to JSON file?

Bounty: 50

I’d like to know whether it is possible to convert the worldMap binary data (which stores a space-mapping state and a set of ARAnchors) to a JSON or XML file.

func writeWorldMap(_ worldMap: ARWorldMap, to url: URL) throws {

    let data = try NSKeyedArchiver.archivedData(withRootObject: worldMap, 
                                         requiringSecureCoding: true)
    try data.write(to: url)
}

If this is possible, what tools can I use for that?


Get this bounty!!!

#StackBounty: #javascript #json #parsing JSON.parse overwrite to handle passing in objects

Bounty: 50

I have a legacy system I’m maintaining. We are in the process of setting JSON response content to "application/json" (from "text/plain"). Since the responses were interpreted as plain text before, there is a fairly substantial number of places where the code calls JSON.parse. We found that when we make the change, the parsing breaks since the response is now interpreted as a JSON object, which cannot be passed to JSON.parse without an error.

Now the obvious solution is to go in and properly fix all these in coordination with the back-end changes, but it’s A LOT. As a stop-gap measure and to make sure nothing gets broken, I’d like to add the following code. The idea is, if the passed input is already an object it simply gets returned. Anything else goes to the proper JSON.parse.

  //handle loose JSON parsing
  JSON.strictParse = JSON.parse;
  JSON.parse = function (input, reviver) {
    if (typeof input == "object") return input;
    return JSON.strictParse(input, reviver);
  }
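
For illustration, with that patch in place both of the following calls yield an object (a made-up example, not from the question):

  var alreadyParsed = { ok: true };
  JSON.parse(alreadyParsed);     // already an object, returned untouched
  JSON.parse('{"ok":true}');     // a string, handed off to JSON.strictParse

One subtlety: typeof null is also "object", so null is returned as-is rather than parsed — which happens to give the same result as the original JSON.parse(null).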

My question is: is this a terrible idea, and if so, why? The only thing I can think of is if some library somehow relied on the original behaviour, but that seems fairly far-fetched.


Get this bounty!!!

#StackBounty: #java #json #jackson #jackson-databind Jackson multiple different schema for serialization of nested fields

Bounty: 50

I want to have several different schemas when serializing with Jackson.
Suppose that I have the following classes:

public class Department {
    private Person head;
    private Person deputy;
    private List<Person> staff;
    // getters and setters
}

public class Person {
    private String name;
    private int code;
    // getters and setters
}

Now, I want to have two different schemas for the Department class. The first one contains only head and deputy, where head includes both name and code, but deputy has only name. The second schema should include all fields recursively.

Thus, we will have two different JSON outputs. With the first schema:

{
    "head" : {
        "name" : "John",
        "code" : 123
     },
     "deputy" : { 
        "name" : "Jack"
     } 
}

, and with the second schema:

{
    "head" : {
        "name" : "John",
        "code" : 123
     },
     "deputy" : { 
        "name" : "Jack",
        "code" : "234"
     },
     "staff": [
        { 
            "name" : "Tom",
            "code" : "345"
         },
         { 
            "name" : "Matt",
            "code" : "456"
         }
     ]
}

QUESTION: How should I do it with Jackson?

NOTE: These classes are just examples. For this simple example, writing four different wrapper classes may be feasible, but think about a complex example with dozens of classes, each of which has several fields. Using wrapper classes, we would have to generate a lot of boilerplate code.

Any help would be appreciated!


Get this bounty!!!

#StackBounty: #google-chrome #firefox #console #json How do I open JSON events in Firefox console?

Bounty: 50

I’d like to view output in my console in Firefox.

Here’s an example that I’d like to see:
[screenshot: a JSON object logged in the Chrome console]

That’s an example on Chrome. It’s very easy to open the JSON object.

I can even go one level further and see this information:

[screenshot: the JSON object expanded one level further in the Chrome console]

Unfortunately in Firefox, this is all that it gives me:

[screenshot: the same output in the Firefox console]

Can I unfurl a JSON event in the Firefox console in the same way that I can in Chrome?


Get this bounty!!!

#StackBounty: #php #arrays #json #laravel Reading big arrays from big json file in php

Bounty: 500

I know my question has a lot of answers on the internet, but it seems I can't find a good answer for it, so I will try to explain what I have and hope for the best.

What I'm trying to do is read a big JSON file that might have a more complex structure ("nested objects with big arrays") than this, but as a simple example:

{
  "data": {
    "time": [
      1,
      2,
      3,
      4,
      5,
       ...
    ],
    "values": [
      1,
      2,
      3,
      4,
      6,
       ...
    ]
  }
}

This file might be 200 MB or more, and I'm using file_get_contents() and json_decode() to read the data from the file.

Then I put the result in a variable, loop over the time array, and use the current index to get the corresponding entry by index from the values array, then save the time and the value in the database. But this takes so much CPU and memory. Is there a better way to do this —

better functions to use, a better JSON structure, or maybe a better data format than JSON?

my code:

$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);

foreach ($data["time"] as $timeIndex => $timeValue) {
    saveInDataBase($timeValue, $data["values"][$timeIndex]);
}

Thanks in advance for any help.

Update 06/29/2020:

I have another, more complex JSON structure example:

{
      "data": {
        "set_1": {
          "sub_set_1": {
            "info_1": {
              "details_1": {
                "data_1": [1,2,3,4,5,...],
                "data_2": [1,2,3,4,5,...],
                "data_3": [1,2,3,4,5,...],
                "data_4": [1,2,3,4,5,...],
                "data_5": 10254552
              },
              "details_2": [
                [1,2,3,4,5,...],
                [1,2,3,4,5,...],
                [1,2,3,4,5,...],
              ]
            },
            "info_2": {
              "details_1": {
                "data_1": {
                  "arr_1": [1,2,3,4,5,...],
                  "arr_2": [1,2,3,4,5,...]
                },
                "data_2": {
                 "arr_1": [1,2,3,4,5,...],
                  "arr_2": [1,2,3,4,5,...]
                },
                "data_5": {
                  "text": "some text"
                }
              },
              "details_2": [1,2,3,4,5,...]
            }
          }, ...
        }, ...
      }
    } 

The file size might be around 500 MB or more, and the arrays inside this JSON file might hold around 100 MB of data or more.

My question is: how can I get any piece of this data and navigate between the nodes in the most efficient way, without using much RAM and CPU? I can't read the file line by line, because I need to be able to get any piece of the data when I have to.

Is Python, for example, more suitable for handling this big data more efficiently than PHP?

If you can provide a detailed answer, I think it will be a big help for everyone looking to do this kind of big-data work with PHP.


Get this bounty!!!

#StackBounty: #json #parsing #flutter #quicktype Accessing quicktype JSON object in flutter

Bounty: 50

I have a JSON string that is mapped, with code generated by quicktype, into an instance of "Pax". Quicktype generated some 4000 lines of code mapping this, so I'm happy and confident that it works to some extent. To start with, I now want to print a specific small part of this sea of data: a string located at pax.instructions.id.

final String paxRaw = response.body;
final Pax xa = paxFromJson(paxRaw);
    import 'dart:convert';
    
    Pax paxFromJson(String str) => Pax.fromJson(json.decode(str));
    
    String paxToJson(Pwa data) => json.encode(data.toJson());
    
    class Pax {
      Pax({
        this.greeting,
        this.instructions,
      });
    
      String greeting;
      List<Instruction> instructions;
    
      factory Pax.fromRawJson(String str) => Pax.fromJson(json.decode(str));
    
      String toRawJson() => json.encode(toJson());
    
      factory Pax.fromJson(Map<String, dynamic> json) => Pax(
        greeting: json["greeting"] == null ? null : json["greeting"],
        instructions: json["instructions"] == null ? null : List<Instruction>.from(json["instructions"].map((x) => Instruction.fromJson(x))),
      );
    
      Map<String, dynamic> toJson() => {
        "greeting": greeting == null ? null : greeting,
        "instructions": instructions == null ? null : List<dynamic>.from(instructions.map((x) => x.toJson())),
      };
    }

I want to access a data member of the list instructions that is called id.

print(xa);

This prints to the console:

I/flutter ( 4535): Instance of 'Pax'

I know instructions is a list, but how do I access the string called id in this list? My best guess is
print(xa.instructions<id>); but it doesn't work. There's clearly something built, but I can't figure out how to inspect "xa" at the debug level (in Android Studio). Grateful for any guidance.

UPDATE, still not working

  Future<Pwa> _futurePwa;

  Future<Pwa> getPwa() async {
    debugPrint("getPwa start");
[...]
    http.Response response = await http.get(baseUri);
    debugPrint('Response status: ${response.statusCode}');
    debugPrint(response.body);
    return Pwa.fromJson(json.decode(response.body));
  }
  @override
  void initState(){
    super.initState();
    setState(() {
      _futurePwa = getPwa();
    });
  }
Container (
                child: FutureBuilder<Pwa> (
                    future: _futurePax,
                    builder: (context, snapshot) {
                      debugPrint("Futurebuilder<Pax> buildpart");
                      debugPrint("Test snapshot content: ${snapshot.data.toString()}");
                      debugPrint("Test snapshot error: ${snapshot.error}");
                      debugPrint("Test snapshot has data (bool): ${snapshot.hasData}");
                      debugPrint(snapshot.data.instructions[0].id);
                      return Text("Snap: ${snapshot.data.instructions[0].id}");
                    }
              ),
              ),

Console:

Syncing files to device sdk gphone x86...
I/flutter ( 5126): Futurebuilder<Pax> buildpart
I/flutter ( 5126): Test snapshot content: Instance of 'Pax'
I/flutter ( 5126): Test snapshot error: null
I/flutter ( 5126): Test snapshot has data (bool): true

════════ Exception caught by widgets library ═══════════════════════════════════════════════════════
The following NoSuchMethodError was thrown building FutureBuilder<Pax>(dirty, state: _FutureBuilderState<Pax>#a2168):
The method '[]' was called on null.
Receiver: null
Tried calling: [](0)


Get this bounty!!!

#StackBounty: #mysql #amazon-rds #json #storage MySQL 8.0 table with JSON column that uses JSON "merge" operations has 500k r…

Bounty: 100

I have a database table that, according to TABLE STATUS, has 7MM rows, but when I SELECT COUNT(*) it only has 500k rows.

This is a problem because table growth is increasing and we’re running low on storage now.

Here is the schema (MySQL 8.0.15):

CREATE TABLE `tasks` (
  `task_id` binary(24) NOT NULL,
  `task` json NOT NULL,
  `task_kryo` mediumblob,
  `task_type` varchar(180) COLLATE utf8_bin GENERATED ALWAYS AS (json_unquote(json_extract(`task`,_utf8mb4'$.t'))) STORED,
  `created` datetime GENERATED ALWAYS AS (cast(left(json_unquote(json_extract(`task`,_utf8mb4'$._task.timestamp')),19) as datetime)) STORED,
  `last_updated` datetime GENERATED ALWAYS AS (cast(left(json_unquote(json_extract(`task`,_utf8mb4'$._task.latestStatus.timestamp')),19) as datetime)) STORED,
  `latest_status` varchar(180) COLLATE utf8_bin GENERATED ALWAYS AS (json_unquote(json_extract(`task`,_utf8mb4'$._task.latestStatus.t'))) STORED,
  `marker` binary(24) GENERATED ALWAYS AS (json_unquote(json_extract(`task`,_utf8mb4'$.marker'))) STORED,
  PRIMARY KEY (`task_id`),
  KEY `task_type_index` (`task_type`),
  KEY `created_index` (`created`),
  KEY `last_updated_index` (`last_updated`),
  KEY `latest_status_index` (`latest_status`),
  KEY `marker_index` (`marker`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

This is MySQL RDS on AWS. We have system backups disabled because this table is entirely disposable data. As a result, there is no binary log for this table; AWS apparently disables that when you disable system backups.

My suspicion is that we use JSON “Merge” operations in Update statements, and because binary logging is disabled by AWS (we don’t have any backups for this table as it is fully disposable / scratch data) somehow it is implementing the updates as inserts on the table (old records remain but are not deleted).

See
https://mysqlhighavailability.com/efficient-json-replication-in-mysql-8-0/

binlog-row-value-options=PARTIAL_JSON

Also, the MySQL log contains this warning on restart:

2020-05-21T20:02:08.506526Z 0 [Warning] [MY-013103] [Server] When binlog_row_image=FULL, the option binlog_row_value_options=PARTIAL_JSON will be used only for the after-image. Full values will be written in the before-image, so the saving in disk space due to binlog_row_value_options is limited to less than 50%.


Get this bounty!!!