Fun with jq
What is jq
?
jq
is a wonderful tool for processing JSON in the command line.
The prevalence of JSON in web APIs and data dumps makes jq
a very imporant tool for interactively manipulating JSON.
The tutorial is good enough to get you started with basic functionality, but I found that real usage quickly outgrew the scope of the tutorial.
In this post I will show examples of using jq
in slightly more complex scenarios than the tutorial.
The data in this post comes from Yelp’s Dataset Challenge, and I will be focusing on the business data file (that is yelp_academic_dataset_business.json
if you’re following at home).
Formatting
Usually I like to look at one row of a dataset to understand the data schema.
The Yelp data contains one entry per line, so we can use head -n 1
argument to get the first line of the data file and inspect the schema.
> head -n1 yelp_academic_dataset_business.json
{"business_id":"6iYb2HFDywm3zjuRg0shjw","name":"Oskar Blues Taproom","address":"921 Pearl St","city":"Boulder","state":"CO","postal_code":"80302","latitude":40.0175444,"longitude":-105.2833481,"stars":4.0,"review_count":86,"is_open":1,"attributes":{"RestaurantsTableService":"True","WiFi":"u'free'","BikeParking":"True","BusinessParking":"{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}","BusinessAcceptsCreditCards":"True","RestaurantsReservations":"False","WheelchairAccessible":"True","Caters":"True","OutdoorSeating":"True","RestaurantsGoodForGroups":"True","HappyHour":"True","BusinessAcceptsBitcoin":"False","RestaurantsPriceRange2":"2","Ambience":"{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': False, 'casual': True}","HasTV":"True","Alcohol":"'beer_and_wine'","GoodForMeal":"{'dessert': False, 'latenight': False, 'lunch': False, 'dinner': False, 'brunch': False, 'breakfast': False}","DogsAllowed":"False","RestaurantsTakeOut":"True","NoiseLevel":"u'average'","RestaurantsAttire":"'casual'","RestaurantsDelivery":"None"},"categories":"Gastropubs, Food, Beer Gardens, Restaurants, Bars, American (Traditional), Beer Bar, Nightlife, Breweries","hours":{"Monday":"11:0-23:0","Tuesday":"11:0-23:0","Wednesday":"11:0-23:0","Thursday":"11:0-23:0","Friday":"11:0-23:0","Saturday":"11:0-23:0","Sunday":"11:0-23:0"}}
😅 A little hard to read. Let’s pipe that into jq
> head -n1 yelp_academic_dataset_business.json | jq
{
"business_id": "6iYb2HFDywm3zjuRg0shjw",
"name": "Oskar Blues Taproom",
"address": "921 Pearl St",
"city": "Boulder",
"state": "CO",
"postal_code": "80302",
"latitude": 40.0175444,
"longitude": -105.2833481,
"stars": 4,
"review_count": 86,
"is_open": 1,
"attributes": {
"RestaurantsTableService": "True",
"WiFi": "u'free'",
"BikeParking": "True",
"BusinessParking": "{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}",
"BusinessAcceptsCreditCards": "True",
"RestaurantsReservations": "False",
"WheelchairAccessible": "True",
"Caters": "True",
"OutdoorSeating": "True",
"RestaurantsGoodForGroups": "True",
"HappyHour": "True",
"BusinessAcceptsBitcoin": "False",
"RestaurantsPriceRange2": "2",
"Ambience": "{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': False, 'casual': True}",
"HasTV": "True",
"Alcohol": "'beer_and_wine'",
"GoodForMeal": "{'dessert': False, 'latenight': False, 'lunch': False, 'dinner': False, 'brunch': False, 'breakfast': False}",
"DogsAllowed": "False",
"RestaurantsTakeOut": "True",
"NoiseLevel": "u'average'",
"RestaurantsAttire": "'casual'",
"RestaurantsDelivery": "None"
},
"categories": "Gastropubs, Food, Beer Gardens, Restaurants, Bars, American (Traditional), Beer Bar, Nightlife, Breweries",
"hours": {
"Monday": "11:0-23:0",
"Tuesday": "11:0-23:0",
"Wednesday": "11:0-23:0",
"Thursday": "11:0-23:0",
"Friday": "11:0-23:0",
"Saturday": "11:0-23:0",
"Sunday": "11:0-23:0"
}
}
Much easier to read!
Format the whole file
To completely format a file, we can pipe all lines into jq
and then redirect the result to a file:
> cat yelp_academic_dataset_business.json | jq > yelp_academic_dataset_business_formatted.json
Caution, this file is not valid JSON! It is only useful for reading through the data in a text editor.
Filtering
Let’s say we wanted to find all of the restaurants with ratings >= 4 and with more than 5 reviews. We can do it like this:
> cat yelp_academic_dataset_business.json | jq --compact-output 'select(.review_count > 5 and .stars >= 4)' > high_ratings.json
Note that I used
--compact-output
to preserve the 1-object-per-line structure of the original file. Without this option,high_ratings.json
would contain nicely formatted JSON.
The syntax is very easy: jq 'select(<boolean expression>)'
will filter objects where the boolean expression returns false.
Selecting
Or sometimes called projections, I will show 2 cases building on the previous filter example.
Select JSON to value
cat yelp_academic_dataset_business.json | jq --compact-output 'select(.review_count > 5 and .stars >= 4) | .business_id' > high_ratings.json
After the select
call, I pipe the resulting object into a property access .business_id
.
This produces a file that looks like:
"6iYb2HFDywm3zjuRg0shjw"
"tCbdrRPZA0oiIYSmHG3J0w"
"bvN78flM8NLprQ1a1y5dRg"
"PE9uqAjdw0E4-8mjGl3wVA"
Select JSON to JSON
cat yelp_academic_dataset_business.json | jq --compact-output 'select(.review_count > 5 and .stars >= 4) | {business_id}' > high_ratings.json
Note that the syntax | {business_id}
is a shortcut for | {"business_id": .business_id}
, the latter case can be useful if you want to rename a property or product a new property from different values.
This produces a file that looks like:
{"business_id":"6iYb2HFDywm3zjuRg0shjw"}
{"business_id":"tCbdrRPZA0oiIYSmHG3J0w"}
{"business_id":"bvN78flM8NLprQ1a1y5dRg"}
{"business_id":"PE9uqAjdw0E4-8mjGl3wVA"}
JSON to [C|T]SV
JSON is a very self-documenting format for data, which is great for humans but not very useful for computers. If I wanted to do some basic analysis on this data, the first thing I would do is load it into a spreadsheet program and start generating some graphs. To do that, we need to transform this file into a comma-separated value file or a tab-separated value file.
Unfortunately any command to transform JSON to CSV on anything but the most basic JSON has to make choices about what to do with object and array properties (recursively) so there is no easy answer.
The jq
filter we write depends on the source data and how we want the CSV to look like.
Or does it?
Denizens of the internet have graciously provided a jq
filter to perform this for arbitrary JSON.
This filter takes advantage of jq
’s programming features: that’s right, jq
is its own little programming language!
Gaze your eyes on this:
def json2header:
[paths(scalars)];
def json2array($header):
[$header[] as $p | getpath($p)];
# given an array of conformal objects, produce "CSV" rows, with a header row:
def json2csv:
(.[0] | json2header) as $h
| ([$h[]|join(".")], (.[] | json2array($h)))
| @csv ;
# `main`
json2csv
Put this into a file called json2csv.jq
and invoke it like this:
jq -rf json2csv.jq yelp_academic_dataset_business.json
Don’t forget to turn the json file into an array of objects first with something like:
> head -n 3 yelp_academic_dataset_business.json | jq -s '.' > high_ratings.json
Big thanks to this and this StackOverflow question.
Conclusion
jq
is great!