Taco Bell programming
- Do not overthink things
- Use pre-existing basic tools
- Functionality is an asset, but code is a liability
The task
- Write a web scraper that pulls down a million web pages
- Search all pages for references to a given phrase
- Parallelise the code to run on a 32-core machine
The solution
- EC2 (Elastic Compute Cloud)?
- Hadoop? A NoSQL database?
- SQS? ZeroMQ?
- Parallelising with OpenMPI or MapReduce?
Download a million webpages
cat webpages.txt | xargs wget
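That fetches the pages one after another. If the download itself is the bottleneck, the same xargs flags used later for processing also parallelise the fetch. A sketch, assuming webpages.txt holds one URL per line (guarded so it is a no-op when the file is absent):

```shell
# -n1 gives each wget a single URL; -P32 keeps 32 fetches in flight.
if [ -f webpages.txt ]; then
    xargs -n1 -P32 wget -q < webpages.txt
fi
```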
Search for references to a given phrase...
grep -l reddit *
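grep -l prints only the names of matching files, one per line — exactly the shape xargs wants in the next step. A quick demonstration with throwaway files (demo/ is just an example directory):

```shell
mkdir -p demo
printf 'a link from reddit\n' > demo/a.html
printf 'nothing relevant\n'  > demo/b.html
grep -l reddit demo/*.html   # prints: demo/a.html
```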
...and if found then do further processing
grep -l reddit * | xargs -n1 ./process.sh
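process.sh is whatever further processing the job needs; the only contract xargs -n1 imposes is that it accepts one filename argument. A hypothetical stand-in that just reports each matching page's size:

```shell
# Create a placeholder process.sh (hypothetical -- substitute real work).
cat > process.sh <<'EOF'
#!/bin/bash
page="$1"        # xargs -n1 passes exactly one filename per invocation
wc -c < "$page"  # stand-in "further processing": print the byte count
EOF
chmod +x process.sh
```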
Parallelise to run on 32 cores
grep -l reddit * | xargs -n1 -P32 ./process.sh
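-P32 changes how many jobs run at once, not what runs, so output order can vary but the set of results is the same. The two flags in isolation, with echo standing in for process.sh:

```shell
# One argument per invocation (-n1), up to 32 invocations in parallel (-P32).
printf '1\n2\n3\n4\n' | xargs -n1 -P32 echo
```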
Turning it into a program
#!/bin/bash
set -e
# Download every URL in webpages.txt into pages/
cat webpages.txt | xargs wget -P pages
# Process each page mentioning reddit, 32 jobs at a time
grep -rl reddit pages | xargs -n1 -P32 ./process.sh