Command-line Tools

1. Getting Started

  • The command line provides a so-called read-eval-print-loop (REPL).

1) Environment

– Command-line Tools

– Terminal

– Shell

– Operating Systems

2) Types of Command-line Tools

– Binary Executables

  • programs in the classical sense.
  • created by compiling source code to machine code.

– Shell Built-ins

  • Examples
    • cd
    • help
  • Like binary executables, they cannot be easily inspected or changed.

– Scripts

  • Text file executed by a binary executable.
  • Examples
    • Bash script
    • Python script
    • R script
  • Advantage: you can read and change it.

– Shell functions

  • a function executed by the shell (Bash) itself.
  • tends to be more personal.

  • a configuration file for Bash (e.g., ~/.bashrc) is a good place to define your shell functions.

– Alias

  • Aliases are like macros.
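
A minimal sketch of a shell function and an alias as they might appear in ~/.bashrc. The names count_lines and ll are hypothetical, chosen only for illustration.

```shell
# Hypothetical shell function: count the lines arriving on standard input.
count_lines() {
    wc -l | tr -d ' '
}

# Hypothetical alias: a short name that expands to a longer command.
# (Bash expands aliases in interactive sessions by default.)
alias ll='ls -l'

# The function is used like any other command-line tool.
printf 'a\nb\nc\n' | count_lines
```

Because both live in the shell itself, they are only available in sessions that have loaded the file defining them.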

[note] type

  • type, a shell builtin, finds out the type of a command-line tool.
    • -a: displays all of the types for the given executable name.
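
A short sketch of how type might be used, assuming Bash. The exact output for ls and echo depends on the system and on any aliases defined in the session.

```shell
# Ask the shell what kind of command-line tool each name refers to.
type cd        # cd is a shell builtin
type ls        # usually a binary executable (or an alias in an interactive shell)

# -a lists every definition the shell knows for the name.
type -a echo
```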

3) Combining Command-line Tools

– pipe
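
As a small illustration of a pipe, the sketch below chains three tools so that the output of each becomes the input of the next. The choice of seq, grep, and wc is only an example.

```shell
# Generate the numbers 1..30, keep the lines containing a "3", and count them.
count=$(seq 30 | grep 3 | wc -l)
echo "$count"
```

The matching lines are 3, 13, 23, and 30, so the count is 4.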

4) Redirecting Input and Output

– redirect: > (overwrite), >> (append)

– pipe

– input redirect: < (pass a file's contents to the standard input of a command)
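
The redirection operators can be sketched together as follows. The file name numbers.txt is hypothetical, and a temporary directory is used so no existing file is touched.

```shell
tmpdir=$(mktemp -d)
out="$tmpdir/numbers.txt"   # hypothetical file name

seq 3 > "$out"      # > creates the file or overwrites it: 1 2 3
seq 4 5 >> "$out"   # >> appends to it: 1 2 3 4 5

# < passes the file's contents to the standard input of wc.
lines=$(wc -l < "$out")
echo "$lines"

rm -r "$tmpdir"
```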

5) Help!

– help (manual)

– help (help)

2. Obtaining Data

1) Remote Version of Data Science Toolbox

2) Decompressing Files

– compressing

– decompressing

– unpack
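
The notes do not say which tools are used for these three steps. The sketch below assumes gzip, gunzip, and tar, which are common choices, and a hypothetical file data.txt.

```shell
tmpdir=$(mktemp -d)

seq 100 > "$tmpdir/data.txt"
gzip "$tmpdir/data.txt"          # compress: data.txt becomes data.txt.gz
gunzip "$tmpdir/data.txt.gz"     # decompress: data.txt is restored

# Bundle the file into a gzipped tar archive, then unpack it elsewhere.
tar -czf "$tmpdir/archive.tar.gz" -C "$tmpdir" data.txt
mkdir "$tmpdir/unpacked"
tar -xzf "$tmpdir/archive.tar.gz" -C "$tmpdir/unpacked"

unpacked_lines=$(wc -l < "$tmpdir/unpacked/data.txt")
echo "$unpacked_lines"
rm -r "$tmpdir"
```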

3) Converting Microsoft Excel Spreadsheets

– extract

– cut

– csvcut, csvlook

4) Querying Relational Databases

– sql2csv

5) Downloading from the Internet

– cURL

– download & write

– shortened URL

bitly site: http://bit.ly/

HTTP header

6) Calling Web APIs

– JSON

– twitter api

reference: Create Your Own Dataset Consuming Twitter API

setup

load twitter api to JSON

3. Creating Reusable Command-Line Tools

1) Converting One-Liners into Shell Scripts

  • Get the top ten words of the ebook version of Adventures of Huckleberry Finn

– Description of each step

  • (1) Downloading the ebook using curl.
  • (2) Converting the entire text to lowercase using tr.
    • [:class:] Represents all characters belonging to the defined character class.
    • Class names are:
      alnum
      alpha
      blank
      cntrl
      digit
      graph
      ideogram
      lower
      phonogram
      print
      punct
      rune
      space
      special
      upper
      xdigit
  • (3) Extracting all the words using grep.

    • -o, --only-matching: prints only the matching part of the lines.
    • -E, --extended-regexp: interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
      – The \w metacharacter is used to find a word character.

      • A word character is a character from a-z, A-Z, 0-9, including the _ (underscore) character.
  • (4) Sorting these words in alphabetical order using sort.

  • (5) Removing all the duplicates and counting how often each word appears in the list using uniq.

    • -c: precede each output line with the count of the number of times the line occurred in the input, followed by a single space.
  • (6) Sorting this list of unique words by their count in descending order using sort

    • -n, --numeric-sort: compare according to string numerical value
    • -r, --reverse: reverse the result of comparisons
  • (7) Keeping only the top 10 lines (i.e., words) using head.
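
Steps (2) to (7) can be sketched as a single pipeline. As an assumption, a short inline sample replaces the ebook download of step (1), so the example runs without network access.

```shell
# Assumption: this sample text stands in for the downloaded ebook.
sample='The cat and the dog and THE bird'

# One tool per line, in the order of the steps described above:
#   tr lowercases, grep extracts the words, sort orders them,
#   uniq -c counts duplicates, sort -nr ranks by count, head keeps the top.
top=$(printf '%s\n' "$sample" |
    tr '[:upper:]' '[:lower:]' |
    grep -oE '\w+' |
    sort |
    uniq -c |
    sort -nr |
    head -n 10)
echo "$top"
```

For this sample, the first line of output reports "the" with a count of 3, followed by "and" with a count of 2.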

– Step 1. Copy and Paste

  • tips
    • !!: will be substituted with the command you just ran.
    • sudo !!: you can run the previous command with superuser privileges.
    • echo "!!" > scriptfilename.sh: you can save the previous command into a file.

– Step 2. Add Permission to Execute

  • Copy the file to a new one.

  • add execute permission to 'top-words-2.sh'

  • have a look at the access permissions of both files

  • now you can execute the file as follows:
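
The four bullets above might look like this in practice. The name top-words-1.sh for the original file is an assumption, and a short placeholder pipeline stands in for the saved one-liner.

```shell
start_dir=$PWD
tmpdir=$(mktemp -d)
cd "$tmpdir"

# Assumption: top-words-1.sh holds the command saved in Step 1.
cat > top-words-1.sh <<'EOF'
echo 'b a b' | grep -oE '\w+' | sort | uniq -c | sort -nr | head -n 10
EOF

cp top-words-1.sh top-words-2.sh       # copy the file to a new one
chmod u+x top-words-2.sh               # add execute permission for the owner
ls -l top-words-1.sh top-words-2.sh    # compare the access permissions
result=$(./top-words-2.sh)             # the copy can now be run directly
echo "$result"

cd "$start_dir"
rm -r "$tmpdir"
```

In the ls -l output, only top-words-2.sh should show an x in the owner's permission bits.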

– Step 3. Define Shebang

  • shebang: a hash (she) and an exclamation mark (bang)

  • python

    • #!/usr/bin/env python

– Step 4. Remove Fixed Input

– Step 5. Parameterize
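
Steps 3 to 5 can be sketched together in one script. The file name top-words.sh, the variable NUM_WORDS, and the default of 10 words are assumptions made for illustration.

```shell
tmpdir=$(mktemp -d)
script="$tmpdir/top-words.sh"   # hypothetical file name

# Write the script; the quoted 'EOF' keeps $1 from being expanded here.
cat > "$script" <<'EOF'
#!/usr/bin/env bash
# Step 5: the number of words to show is the first argument (default 10).
NUM_WORDS="${1:-10}"
# Step 4: no fixed input; the text is read from standard input.
tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
    uniq -c | sort -nr | head -n "$NUM_WORDS"
EOF
chmod u+x "$script"

# Any text can now be piped in, and the number of words is chosen per call.
result=$(printf 'b a B c b\n' | "$script" 1)
echo "$result"
rm -r "$tmpdir"
```

With the shebang on the first line, the system knows which interpreter to use, so the script no longer depends on the shell it is started from.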

2) Creating Command-Line Tools with Python

3) Processing Streaming Data from Standard Input

  • Most command-line tools pipe data to the next command-line tool in a streaming fashion.

  • Python can process data in a streaming manner by applying a function line by line.
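
The same line-by-line idea can be sketched as a shell loop; this is a shell analogue, not the Python version the notes refer to. The function name square_lines is hypothetical.

```shell
# Read standard input one line at a time and emit each result immediately,
# without waiting for the input to end.
square_lines() {
    while IFS= read -r n; do
        echo $((n * n))
    done
}

seq 1 5 | square_lines
```

Each number is squared as soon as it arrives, so the loop also works on input that never ends.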

4. Scrubbing (or Cleaning) Data

1) SED and AWK

– SED

  • Stream Editor
    • The sed utility reads the specified files or the standard input, modifying the input as specified by a list of commands.
    • The modified input is then written to the standard output.
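
A minimal sed sketch, using a substitution command on two sample lines that are made up for this example:

```shell
# Replace the first "hello" on each input line; every line, changed or not,
# is written to standard output.
result=$(printf 'hello world\ngoodbye world\n' | sed 's/hello/bye/')
echo "$result"
```

The first line becomes "bye world", while the second passes through unchanged because it does not contain "hello".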

– AWK

  • Pattern-directed scanning and processing language
    • Awk scans each input file for lines that match any of a set of patterns, specified either literally as a program string on the command line or in a file (given with -f).
    • With each pattern, there can be an associated action that will be performed when a line of a file matches the pattern.
    • Each line is matched against the pattern portion of every pattern-action statement; the associated action is performed for each matched pattern.
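
A minimal awk sketch with two pattern-action statements, run on sample data invented for this example:

```shell
# Each input line is tested against every pattern:
#   $2 > 1   matches lines whose second field exceeds 1
#   /^a/     matches lines that start with "a"
result=$(printf 'apple 3\nbanana 1\navocado 2\n' | awk '
    $2 > 1 { print $1 }
    /^a/   { print "starts with a: " $1 }
')
echo "$result"
```

The lines for apple and avocado match both patterns, so both actions run for each of them; the banana line matches neither and produces no output.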

awk and sed tutorials