WORKING WITH COMMAND-LINE TOOLS Danielle Cunniff Plumer School of Information The University of Texas at Austin Summer 2014.


1 WORKING WITH COMMAND-LINE TOOLS Danielle Cunniff Plumer School of Information The University of Texas at Austin Summer 2014

2 Download the dataset
We will be working with a smallish (34M) dataset consisting of US Trademark Application Images from the USPTO. We will only be working with images from January 4, 2008. The data is made available by PublicResource.org.
Wget: GNU Wget is a free utility for non-interactive download of files from the Web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Because we are downloading only a single file, you do not need to specify any options.
Open a terminal:
    bcadmin@ubuntu:~$ cd Downloads/
    bcadmin@ubuntu:~/Downloads$ wget https://bulk.resource.org/trademark/USTrademarkImages/hr080104.zip

3 Run a checksum on the zip file
md5sum: Print or check MD5 (128-bit) checksums. With no FILE, or when FILE is -, read standard input.
In terminal (make sure you are in the Downloads directory or whichever directory contains the zip file):
    $ md5sum hr080104.zip
Redirect the output to a file. Syntax: the command and its arguments, followed by > and the name of the output file.
In terminal:
    $ md5sum hr080104.zip > hr080104_md5sum.txt
    $ less hr080104_md5sum.txt
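
A saved checksum becomes useful when it is later used to verify the file. The following is a minimal sketch, assuming GNU md5sum: the -c option re-reads a checksum file and reports whether each listed file still matches. The file sample.bin and the temporary directory are stand-ins (not part of the dataset) so the snippet runs without the real download.

```shell
# Work in a throwaway directory; sample.bin stands in for hr080104.zip.
workdir=$(mktemp -d)
printf 'example payload\n' > "$workdir/sample.bin"

# Record the checksum next to the file, as on the slide.
(cd "$workdir" && md5sum sample.bin > sample_md5sum.txt)

# Verify the file against the recorded checksum.
check_result=$(cd "$workdir" && md5sum -c sample_md5sum.txt)
echo "$check_result"
```

For the real download, the equivalent check would be md5sum -c hr080104_md5sum.txt, run from the directory that holds the zip file.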

4 Unzip the file using unzip
unzip will list, test, or extract files from a ZIP archive, commonly found on MS-DOS systems. The default behavior (with no options) is to extract into the current directory (and subdirectories below it) all files from the specified ZIP archive.
Option: -d will extract into a directory (the directory does not need to exist).
In terminal:
    $ unzip hr080104.zip -d hr080104
tar is a more powerful archiving tool in general, but it doesn’t work for zip files. See man tar for details.

5 Inspect the files
Install tree. Tree is a recursive directory listing program that produces a depth-indented listing of files.
    $ sudo apt-get install tree
Look at the files in the unzipped directory:
    $ tree hr080104

6 Tree options
For the full list of options:
    $ man tree
-a  Includes hidden files (those beginning with a dot ‘.’).
-f  Prints the full path prefix for each file.
-i  Makes tree not print the indentation lines, useful when used in conjunction with the -f option.
-p  Print the file type and permissions for each file (as per ls -l).
-s  Print the size of each file in bytes along with the name.
-h  Print the size of each file but in a more human readable way.
-D  Print the date of the last modification time for the file listed.
-o filename  Send output to filename.
-r  Sort the output in reverse alphabetic order.
-t  Sort the output by last modification time instead of alphabetically.
Look at the files again:
    $ tree -afihD hr080104 -o hr080104.txt
    $ less hr080104.txt
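
tree shows the layout of the archive; a complementary question is how many files of each type it holds. The following is a rough sketch that uses find, sed, sort, and uniq rather than tree. The demo directory and the files in it are made up so the snippet runs on its own; substitute hr080104 for "$demo" to summarize the real dataset.

```shell
# Build a tiny stand-in directory tree (file names here are illustrative).
demo=$(mktemp -d)
mkdir -p "$demo/773621/77362188"
touch "$demo/773621/77362188/00000001.XML" \
      "$demo/773621/77362188/00000002.JPG" \
      "$demo/773621/77362188/00000003.JPG"

# List every file, keep only the extension after the last dot in the
# file name, then count how often each extension occurs.
ext_counts=$(find "$demo" -type f \
    | sed -n 's/.*\.\([^./]*\)$/\1/p' \
    | sort | uniq -c)
echo "$ext_counts"
```

Files without an extension are skipped by the sed expression, so the counts cover only files whose names contain a dot.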

7 Make a copy of a few files to play with
    $ mkdir temp
    $ cp hr080104/773621/77362188/* temp
    $ cd temp
    $ ls
Remember that you can use the Ubuntu autocomplete options to help avoid typing mistakes:
Tab will complete the name of a directory or a file after you’ve typed the first few characters, starting in the directory you’re currently in.
Tab Tab will show you which files match the characters you’ve entered so far.
The up and down arrows let you go back to commands you’ve previously entered.

8 Corrupt a file
Calculate a checksum on the .XML file:
    $ md5sum 00000001.XML > md5sum.txt
Open the file (for simplicity, we’ll use gedit). Be sure to enter the file name correctly; if you see an empty document, gedit has created a new document with nothing in it.
    $ gedit 00000001.XML
Change one character, save the file with a new name (Save as 00000001r.XML), and close gedit (either click the x in the top left, or press Ctrl-C in the terminal).
Run the checksum again, using >> to append the new output to the file you previously created:
    $ md5sum 00000001r.XML >> md5sum.txt
Compare the two checksums:
    $ less md5sum.txt
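
The same experiment can be scripted instead of done by hand in gedit. This is a minimal sketch, not the procedure from the slides: it uses sed to change a single character and then compares the two MD5 values. The XML content and the file names original.XML and changed.XML are invented for illustration.

```shell
# Create a small stand-in XML file (not a real record from the dataset).
workdir=$(mktemp -d)
printf '<mark><serial>77362188</serial></mark>\n' > "$workdir/original.XML"

# Change a single character (the first 7 becomes 8) and save under a new name.
sed 's/7/8/' "$workdir/original.XML" > "$workdir/changed.XML"

# Keep only the hash (first field of md5sum's output) for each file.
sum_original=$(md5sum "$workdir/original.XML" | cut -d' ' -f1)
sum_changed=$(md5sum "$workdir/changed.XML" | cut -d' ' -f1)
echo "original: $sum_original"
echo "changed:  $sum_changed"
```

Even though only one character differs between the files, the two hashes printed will not match, which is the property that makes checksums useful for detecting corruption.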

9 Corrupt an image file
Calculate a checksum on one of the .JPG files:
    $ md5sum 00000002.JPG > md5sum_jpg.txt
Open the file (for simplicity, we’ll use ghex):
    $ ghex 00000002.JPG
Change one character, save the file with a new name (Save as 00000002r.JPG), and close ghex.
Run the checksum again:
    $ md5sum 00000002r.JPG >> md5sum_jpg.txt
Compare the two checksums:
    $ less md5sum_jpg.txt
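
A single byte of a binary file can also be changed from the command line. The sketch below swaps in dd for ghex and uses a 64-byte file of zeros as a stand-in for an image (it is not a real JPEG); the file names and the offset 10 are arbitrary choices for illustration.

```shell
# Build a 64-byte stand-in "image" made of zero bytes.
workdir=$(mktemp -d)
head -c 64 /dev/zero > "$workdir/original.bin"
cp "$workdir/original.bin" "$workdir/changed.bin"

# Overwrite exactly one byte at offset 10 with 0xFF (octal 377).
# conv=notrunc keeps the rest of the file in place.
printf '\377' | dd of="$workdir/changed.bin" bs=1 seek=10 conv=notrunc 2>/dev/null

# cmp -l prints one line per differing byte; cmp exits non-zero when the
# files differ, so its status is deliberately ignored here.
cmp -l "$workdir/original.bin" "$workdir/changed.bin" > "$workdir/diff.txt" || true
diff_bytes=$(wc -l < "$workdir/diff.txt")
echo "bytes that differ: $diff_bytes"

# The two checksums differ even though the file size is unchanged.
md5sum "$workdir/original.bin" "$workdir/changed.bin"
```

Working on a copy, as the slides do, leaves the original file available for comparison.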

10 JHOVE
See http://jhove.sourceforge.net/using.html
Install JHOVE:
    $ sudo apt-get install jhove
Run JHOVE on the XML file in the directory that you DIDN’T edit:
    $ jhove 00000001.XML
Run JHOVE on the XML file in the directory that you corrupted:
    $ jhove 00000001r.XML
It might help to open these side-by-side in two terminal windows.
Repeat for the JPG files. What difference do you see? Why?

11 Extract metadata with ExifTool
See http://www.sno.phy.queensu.ca/~phil/exiftool/
Run exiftool on your uncorrupted image file:
    $ exiftool 00000002.JPG
Try it on the corrupted image file:
    $ exiftool 00000002r.JPG
Output exiftool results to CSV:
    $ cd ..
    $ exiftool -csv temp > out.csv
Open the results in LibreOffice Calc (be sure to select the “comma” option when importing).

12 Bulk metadata operations with ExifTool
Run exiftool over your complete download:
    $ exiftool -r -csv hr080104 > hr080104.csv
Open the results in LibreOffice Calc.
For more work with exiftool, see the video tutorials by AVPreserve: http://www.avpreserve.com/exiftool-tutorial-series/

13 FITS
FITS is a powerful set of tools for extracting and validating metadata. FITS includes:
Jhove
Exiftool
National Library of New Zealand Metadata Extractor
DROID
FFIdent
File Utility (windows)
To run FITS, locate the script fits.sh on your virtual machine. It is probably located in /home/bcadmin/Tools/fits/. Verify this:
    $ ls /home/bcadmin/Tools/fits/

14 FITS options
-i   The input file you want to examine
-o   The destination of the output XML file
-r   Process directories recursively when -i is a directory
-h   Prints the usage message
-v   Displays the FITS version number
-x   Convert FITS output to a standard metadata schema
-xc  Output using a standard metadata schema and include FITS xml
If -o is not specified then the output is sent to the console window.
The general syntax for our purposes is:
    $ /home/bcadmin/Tools/fits/fits.sh -i input_file -o output_file

15 Using FITS
From the directory containing the temp directory and the hr080104 directory, try the following commands:
    $ /home/bcadmin/Tools/fits/fits.sh -i temp/00000001.XML
You will probably see an error, followed by the output of the command printed to the screen. To save the output, add -o:
    $ /home/bcadmin/Tools/fits/fits.sh -i temp/00000001.XML -o xml_fits.txt
Convert the output to a standard metadata schema:
    $ /home/bcadmin/Tools/fits/fits.sh -x -i temp/00000001.XML -o xmlstd_fits.txt
Repeat for JPG files. Note the different standard metadata schemas.

16 Using FITS over directories
You can process an entire directory of files with FITS. You need to add the -r (recursive) option if there are sub-directories, and specify a folder to hold the output.
    $ mkdir fits_temp
    $ /home/bcadmin/Tools/fits/fits.sh -x -i temp/ -o fits_temp/
    $ mkdir fits_hr080104
    $ /home/bcadmin/Tools/fits/fits.sh -x -r -i hr080104/ -o fits_hr080104/
This will take a long time and you will see a lot of errors. Inspect the results. The main problem is that all the output files are stored in a single directory, so it’s difficult to see which FITS output goes with which file in the original directory.
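
One way to keep outputs matched to inputs is to call FITS once per file and recreate the source directory layout under the output folder. The following is a sketch under stated assumptions, not a tested workflow from the slides: mirror_fits is a hypothetical helper name, the .fits.xml suffix is an arbitrary choice, and FITS_CMD defaults to the fits.sh location given earlier. It relies only on the -i and -o options listed above and assumes file names contain no newlines.

```shell
# Path to fits.sh; the default is the location named on the slides.
FITS_CMD="${FITS_CMD:-/home/bcadmin/Tools/fits/fits.sh}"

# mirror_fits SRC_DIR OUT_DIR  (hypothetical helper)
# Runs FITS on every file under SRC_DIR and writes each report to the
# same relative path under OUT_DIR. The body runs in a subshell so its
# variables do not leak into the calling shell.
mirror_fits() (
    src="${1%/}"
    out="${2%/}"
    find "$src" -type f | while IFS= read -r file; do
        rel="${file#"$src"/}"                  # path relative to SRC_DIR
        mkdir -p "$out/$(dirname "$rel")"      # recreate the subdirectory
        # Redirect stdin so the command cannot consume the file list.
        "$FITS_CMD" -i "$file" -o "$out/$rel.fits.xml" < /dev/null \
            || echo "FITS failed on $file" >&2
    done
)

# Example call against the real dataset (left commented out):
# mirror_fits hr080104 fits_hr080104
```

Because FITS is started separately for each file, this is likely to be slower than a single recursive run; the trade-off is that every report sits at a path that mirrors its source file.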

17 bash scripting
Some of the problems we’ve seen (such as with the FITS output) can be solved by careful use of scripting.
For a good introduction to BASH, see The Linux Documentation Project. (n.d.) Bash Tutorial Intro & How-To. Available from http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-1.html
Other options include python and perl scripting. If you want to do this sort of work professionally, it’s highly recommended that you learn at least one of these.

