Screen scraping can be an effective way to get free data very quickly. When attempting to screen scrape large amounts of data, I often use Google Chrome’s “Developer Tools” to obtain the steps necessary to recreate a web request. Here is an example process I used to screen scrape data from pricescope.com, which hosts a database of diamonds for sale online. I will use this data in an upcoming post on how to build statistical models.
- Launch Google Chrome and navigate to pricescope.com
- Within pricescope.com, you will see a search option. Click the “Search” button with the default options provided:
- You will see thousands of loose diamonds available with data to grab – carat, color, clarity, cut, and price — plus a ton more data you can explore later
- You may notice that as you scroll down, the server keeps loading more pages of diamonds. My search returned over 15,000 diamonds. Since it would be very taxing on your computer and their servers to load all the diamonds at once, they wait until the user scrolls down before loading more. This type of behavior is known as an “asynchronous request” because your browser is performing an additional communication with the server, but not during the initial HTML page load (not in sync). Developers use JavaScript with XML to perform this capability, so many refer to this technique as “AJAX”: “asynchronous JavaScript and XML.” We need to keep this in mind because when we screen scrape, we want all the diamonds from a search, not just the first few that load on the initial request.
- At the top of the pricescope.com page, click on the “Show Filters” button and you can get an even more granular search menu:
Select the following options:
Shape = Round (Diamonds can be cut in many standard shapes!)
Price = $1 to $1,600,000
Carat = 1 to 1.05
HCA Cut Rank = Excellent to n/a
Color = D to Z (“D” means “the best white color” and “Z” is “very yellow”)
Clarity = FL to I3 (“FL” means “flawless” and “I3” is “very visible inclusions”)
Only the GIA labs box is checked
- Now that you know how to get data, we need to see how your browser is performing this work so that we can automate it. It would be very impractical for us to manually copy and paste tens of thousands of diamond records into a spreadsheet! The first step to automating a web scrape is seeing how our browser performs the task. Using Google Chrome’s Developer Tools, we can see this normally hidden process.
- You can find the Developer Tools by clicking the three dots in the upper right hand corner of the toolbar, selecting “More Tools” and then “Developer Tools”
- Click on the “Network” tab in Developer Tools, and then press the circle with a slash through it next to a bright red dot. This will “Clear” any existing data. You can also press the circle with a slash through it under the “Console” tab at the bottom to clear any previous page errors. Your Developer Tools should now look like this:
- Navigate to pricescope.com and you will see Developer Tools fill up with network requests similar to this:
- Go back to your search filters for the following diamonds:
Shape = Round
Price = $1 to $1,600,000
Carat = 1 to 1.05
HCA Cut Rank = Excellent to n/a
Color = D to Z
Clarity = FL to I3
Ensure only the GIA labs box is checked
- Clear your Developer Tools by pressing the circle-with-a-slash icons. Now, refresh the page with your search results by pressing the “Refresh” arrow icon next to the URL.
- Great! You have successfully recorded exactly how Google Chrome communicated with pricescope.com to retrieve all those diamonds! We need to find which of these communications, known as “requests”, contains the data about our diamonds.
- Within Developer Tools, click on the “XHR” filter (XMLHttpRequest) to only show requests that were issued using AJAX. We know that as we scroll, the browser will load diamond data from the server – so this filter helps us narrow down the requests to the ones that are likely to contain our data!
- Sure enough, one of the requests, “d_s”, has a JSON array containing diamond data in the “Response” tab, as shown below. If you inspect the “Headers” tab you can see that this request was submitted to pricescope.com as a POST, meaning that your browser “posted” some data to pricescope.com which constitutes the search query, and the server returned the results in the response.
Your results will be slightly different, but it will look similar to this:
{"d":[{"ID":"52ad4b18-7bd3-496e-87b3-afef1c651646","DiamondType":2,"VendorId":12,"Shape":"Round","Size":1.05,"Color":"I","FancyColor":null,"Clarity":"SI1","Price":4214,"Lab":"GIA","Depth":60.40,"DTable":59.00,"Girdle":"Medium","Culet":"None","Polish":"ID","Symmetry":"ID","FluorStrength":"M","FluorColor":1,"CrownAngle":34.50,"CrownDepth":14.00,"PavilionAngle":40.60,"PavilionDepth":42.50,"LengthMeasure":0.00,"WidthMeasure":0.00,"DepthMeasure":0.00,"Measurements":"6.52 x 6.58 x 3.96","Comment":"","Stones":-1,"CertNumber":"7218541836","StockNumber":"2478798","Pair":"","PairSeparable":false,"AGSCutGrade":"Ideal","GIACutGrade":"Excellent","HeartsArrows":false,"Webpage":"http://www.eternitybyyoni.com/diamond_detail.php?id=2478798&ref=pricescope","MobileWebpage":"","CertImage":"http://www.eternitybyyoni.com/cert/gia_pdf/7218541836.pdf","Picture":"","IdealScopeImage":"","HAImage":"","SarinReport":"","DDDSarinFile":"","GemAdvisorFile":"","AssetFile":"","HCACutRank":"EX","HCAValue":"1.1-EX ex-ex-ex-vg","Price_Carat":4013,"Vendor":"Eternity By ","AGSCutGradeTextComplex":"Excellent","IdInt":303510,"LinkToSmallLogo":"Content/images/VendorsLogo/sm.eternity_listing.jpg","LinkAtPricescope":"https://www.pricescope.com/dealer/eternity_diamonds","IsDeleted":false,"MainImageBase64":null,"DetailsUrl":"https://www.pricescope.com/diamonds/round/105-carat-i-color-si1-clarity-303510","IsActive":false},…
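A response like this is straightforward to pick apart once it is saved as text. The following is a minimal sketch in Python, using a shortened copy of the record above; the `FIELDS` selection and the `extract_rows` helper are illustrative names, not part of pricescope.com’s API, and it assumes that “Size” holds the carat weight (which matches Price divided by Price_Carat in the sample).

```python
import json

# Shortened copy of the response shown above: one diamond, a handful of fields.
sample_response = """
{"d": [{"ID": "52ad4b18-7bd3-496e-87b3-afef1c651646", "Shape": "Round",
        "Size": 1.05, "Color": "I", "Clarity": "SI1", "Price": 4214,
        "Lab": "GIA", "GIACutGrade": "Excellent"}]}
"""

# Columns to keep. Assumption: "Size" is the carat weight.
FIELDS = ("Size", "Color", "Clarity", "GIACutGrade", "Price")


def extract_rows(response_text):
    """Return one dict per diamond, keeping only the keys listed in FIELDS."""
    diamonds = json.loads(response_text)["d"]
    return [{field: diamond.get(field) for field in FIELDS} for diamond in diamonds]


rows = extract_rows(sample_response)
print(rows[0])
```

Each row is a plain dictionary, so the list can later be handed to the `csv` module or a data-frame library without further reshaping.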
- Try scrolling down the page to load more diamond data and you should see additional XHR requests happen, each with more data! If you right-click on “d_s” in the Developer Tools, you can select “Copy” and then “Copy as cURL.” Paste two of the “d_s” requests into a text editor and look very closely to see if you can spot the important difference:
curl 'https://www.pricescope.com/diamonds/api/ps_dl/d_s' -H 'origin: https://www.pricescope.com' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-US,en;q=0.8,zh;q=0.6' -H 'x-requested-with: XMLHttpRequest' -H 'cookie: __cfduid=df725ba647e38f4391488c6acdfac8cff1487208150; SESSf58986293761abff4873b1b3cc12d7de=a43b7c6db46cf59219a98475234fd6c6; phpbb3_6jk75_u=1; phpbb3_6jk75_k=; phpbb3_6jk75_sid=cce9f94e7ba1a5ef047823ec2a63dd9e; ASP.NET_SessionId=yov44h0huugyapchkhu1cfhk; uniqueuid=fdce2aaf-5268-4ed3-a8e0-ac9b67947cf7; _gat=1; is_posting=%2Ftools%2Fwhat-is-diamond-calc; _ga=GA1.2.395782978.1487208153; d_viewed=52ad4b18-7bd3-496e-87b3-afef1c651646; d_searchstring=%7B%22vendor%22%3A-1%2C%22inhouse%22%3A-1%2C%22shape%22%3A1%2C%22fancy_shape%22%3A1%2C%22minprice%22%3A1%2C%22maxprice%22%3A1600000%2C%22minfancy_price%22%3A0%2C%22maxfancy_price%22%3A1600000%2C%22mincarat%22%3A0.9%2C%22maxcarat%22%3A1.05%2C%22minfancy_carat%22%3A0.23%2C%22maxfancy_carat%22%3A21.7%2C%22mindepth%22%3A58%2C%22maxdepth%22%3A63%2C%22mintable%22%3A53%2C%22maxtable%22%3A62%2C%22mincut%22%3A1%2C%22maxcut%22%3A6%2C%22mincolor%22%3A4%2C%22maxcolor%22%3A6%2C%22fancy_color%22%3A1%2C%22minfancy_intensity%22%3A1%2C%22maxfancy_intensity%22%3A9%2C%22minfancy_overtone%22%3A1%2C%22maxfancy_overtone%22%3A9%2C%22minclarity%22%3A6%2C%22maxclarity%22%3A13%2C%22minsymmetry%22%3A1%2C%22maxsymmetry%22%3A8%2C%22minpolish%22%3A1%2C%22maxpolish%22%3A8%2C%22minflourescence%22%3A1%2C%22maxflourescence%22%3A6%2C%22checkbox_panel1%22%3A%2204%22%2C%22checkbox_panel2%22%3A%22%22%2C%22fancy_checkbox_panel2%22%3A%22%22%2C%22sort%22%3A%22%22%2C%22fancy_sort%22%3A%22%22%2C%22page%22%3A1%2C%22pageview%22%3A%2224%22%2C%22adv%22%3Afalse%2C%22fancy_adv%22%3Afalse%7D' -H 'pragma: no-cache' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36' -H 'content-type: text/json' -H 'accept: */*' -H 'cache-control: no-cache' -H 'authority: www.pricescope.com' -H 'referer: https://www.pricescope.com/diamonds/search?vendor=-1&inhouse=-1&shape=1&minprice=1&maxprice=1600000&mincarat=0.9&maxcarat=1.05&mindepth=58&maxdepth=63&mintable=53&maxtable=62&mincut=1&maxcut=6&mincolor=4&maxcolor=6&minclarity=6&maxclarity=13&minsymmetry=1&maxsymmetry=8&minpolish=1&maxpolish=8&minflourescence=1&maxflourescence=6&checkbox_panel1=04&checkbox_panel2=&sort=&page=1&pageview=24&adv=false' -H 'dnt: 1' --data-binary '{"df":{"vendor":-1,"inhouse":-1,"shape":1,"fancy_shape":1,"minprice":1,"maxprice":1600000,"minfancy_price":0,"maxfancy_price":1600000,"mincarat":0.9,"maxcarat":1.05,"minfancy_carat":0.23,"maxfancy_carat":21.7,"mindepth":58,"maxdepth":63,"mintable":53,"maxtable":62,"mincut":1,"maxcut":6,"mincolor":4,"maxcolor":6,"fancy_color":1,"minfancy_intensity":1,"maxfancy_intensity":9,"minfancy_overtone":1,"maxfancy_overtone":9,"minclarity":6,"maxclarity":13,"minsymmetry":1,"maxsymmetry":8,"minpolish":1,"maxpolish":8,"minflourescence":1,"maxflourescence":6,"checkbox_panel1":"04","checkbox_panel2":"","fancy_checkbox_panel2":"","sort":"","fancy_sort":"","page":1,"pageview":"24","adv":false,"fancy_adv":false},"dp":0}' --compressed
curl 'https://www.pricescope.com/diamonds/api/ps_dl/d_s' -H 'origin: https://www.pricescope.com' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-US,en;q=0.8,zh;q=0.6' -H 'x-requested-with: XMLHttpRequest' -H 'cookie: __cfduid=df725ba647e38f4391488c6acdfac8cff1487208150; SESSf58986293761abff4873b1b3cc12d7de=a43b7c6db46cf59219a98475234fd6c6; phpbb3_6jk75_u=1; phpbb3_6jk75_k=; phpbb3_6jk75_sid=cce9f94e7ba1a5ef047823ec2a63dd9e; ASP.NET_SessionId=yov44h0huugyapchkhu1cfhk; uniqueuid=fdce2aaf-5268-4ed3-a8e0-ac9b67947cf7; _gat=1; is_posting=%2Ftools%2Fwhat-is-diamond-calc; _ga=GA1.2.395782978.1487208153; d_viewed=52ad4b18-7bd3-496e-87b3-afef1c651646; d_searchstring=%7B%22vendor%22%3A-1%2C%22inhouse%22%3A-1%2C%22shape%22%3A1%2C%22fancy_shape%22%3A1%2C%22minprice%22%3A1%2C%22maxprice%22%3A1600000%2C%22minfancy_price%22%3A0%2C%22maxfancy_price%22%3A1600000%2C%22mincarat%22%3A0.9%2C%22maxcarat%22%3A1.05%2C%22minfancy_carat%22%3A0.23%2C%22maxfancy_carat%22%3A21.7%2C%22mindepth%22%3A58%2C%22maxdepth%22%3A63%2C%22mintable%22%3A53%2C%22maxtable%22%3A62%2C%22mincut%22%3A1%2C%22maxcut%22%3A6%2C%22mincolor%22%3A4%2C%22maxcolor%22%3A6%2C%22fancy_color%22%3A1%2C%22minfancy_intensity%22%3A1%2C%22maxfancy_intensity%22%3A9%2C%22minfancy_overtone%22%3A1%2C%22maxfancy_overtone%22%3A9%2C%22minclarity%22%3A6%2C%22maxclarity%22%3A13%2C%22minsymmetry%22%3A1%2C%22maxsymmetry%22%3A8%2C%22minpolish%22%3A1%2C%22maxpolish%22%3A8%2C%22minflourescence%22%3A1%2C%22maxflourescence%22%3A6%2C%22checkbox_panel1%22%3A%2204%22%2C%22checkbox_panel2%22%3A%22%22%2C%22fancy_checkbox_panel2%22%3A%22%22%2C%22sort%22%3A%22%22%2C%22fancy_sort%22%3A%22%22%2C%22page%22%3A2%2C%22pageview%22%3A%2224%22%2C%22adv%22%3Afalse%2C%22fancy_adv%22%3Afalse%7D' -H 'pragma: no-cache' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36' -H 'content-type: text/json' -H 'accept: */*' -H 'cache-control: no-cache' -H 'authority: www.pricescope.com' -H 'referer: https://www.pricescope.com/diamonds/search?vendor=-1&inhouse=-1&shape=1&minprice=1&maxprice=1600000&mincarat=0.9&maxcarat=1.05&mindepth=58&maxdepth=63&mintable=53&maxtable=62&mincut=1&maxcut=6&mincolor=4&maxcolor=6&minclarity=6&maxclarity=13&minsymmetry=1&maxsymmetry=8&minpolish=1&maxpolish=8&minflourescence=1&maxflourescence=6&checkbox_panel1=04&checkbox_panel2=&sort=&page=2&pageview=24&adv=false' -H 'dnt: 1' --data-binary '{"df":{"vendor":-1,"inhouse":-1,"shape":1,"fancy_shape":1,"minprice":1,"maxprice":1600000,"minfancy_price":0,"maxfancy_price":1600000,"mincarat":0.9,"maxcarat":1.05,"minfancy_carat":0.23,"maxfancy_carat":21.7,"mindepth":58,"maxdepth":63,"mintable":53,"maxtable":62,"mincut":1,"maxcut":6,"mincolor":4,"maxcolor":6,"fancy_color":1,"minfancy_intensity":1,"maxfancy_intensity":9,"minfancy_overtone":1,"maxfancy_overtone":9,"minclarity":6,"maxclarity":13,"minsymmetry":1,"maxsymmetry":8,"minpolish":1,"maxpolish":8,"minflourescence":1,"maxflourescence":6,"checkbox_panel1":"04","checkbox_panel2":"","fancy_checkbox_panel2":"","sort":"","fancy_sort":"","page":2,"pageview":"24","adv":false,"fancy_adv":false},"dp":0}' --compressed
- Correct: each request has a page number in it! It appears in the POST data (“page”:1 versus “page”:2) and is mirrored in the d_searchstring cookie and the referer URL. If we manually change “page”:2 into a 3, we will get the next page of diamond data. We have now automated gathering some data, and with some scripting in a future post, I’ll show you how to automate this variable.
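If spotting the difference by eye is tedious, a few lines of Python can do the comparison. This is only a sketch: the two payloads below are abbreviated copies of the POST data from the curl commands (most filter keys are left out), and `changed_keys` and `payload_for_page` are hypothetical helper names introduced for illustration.

```python
import json

# Abbreviated copies of the two POST bodies (most filter keys omitted).
first = json.loads(
    '{"df": {"shape": 1, "mincarat": 0.9, "maxcarat": 1.05,'
    ' "page": 1, "pageview": "24"}, "dp": 0}'
)
second = json.loads(
    '{"df": {"shape": 1, "mincarat": 0.9, "maxcarat": 1.05,'
    ' "page": 2, "pageview": "24"}, "dp": 0}'
)


def changed_keys(a, b):
    """Return {key: (a_value, b_value)} for every key whose value differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


def payload_for_page(template, page):
    """Copy a captured payload and set its page number."""
    payload = json.loads(json.dumps(template))  # simple deep copy via JSON
    payload["df"]["page"] = page
    return payload


print(changed_keys(first["df"], second["df"]))   # {'page': (1, 2)}
print(payload_for_page(first, 3)["df"]["page"])  # 3
```

The copy in `payload_for_page` leaves the captured payload untouched, so the same template can be reused for every page.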
- With a little more experience, you will learn that you can slim down your curl command. Chrome gives you exactly what the browser sent, but most of that detail is not required and makes the command confusing to script. I don’t have time to explain every step of how I slimmed this down, but it is an educated guess-and-check procedure: remove one option at a time and put it back if the data retrieval breaks! It turns out pricescope.com only allows known browsers (Mozilla Firefox, IE, Chrome, etc.) to access their website for security, so it is important that our automation identifies itself as a browser using the “user-agent” header. It also requires you to explicitly specify the content-type as text/json. All the other headers are just ignored.
Cleaner and simpler:
curl 'https://www.pricescope.com/diamonds/api/ps_dl/d_s' -H 'user-agent: Mozilla/5.0' -H 'content-type: text/json' --data-binary '{"df":{"vendor":-1,"inhouse":-1,"shape":1,"fancy_shape":1,"minprice":1,"maxprice":1600000,"minfancy_price":0,"maxfancy_price":1600000,"mincarat":1.8,"maxcarat":1.9,"minfancy_carat":0.23,"maxfancy_carat":21.7,"mindepth":58,"maxdepth":63,"mintable":53,"maxtable":62,"mincut":1,"maxcut":6,"mincolor":1,"maxcolor":23,"fancy_color":1,"minfancy_intensity":1,"maxfancy_intensity":9,"minfancy_overtone":1,"maxfancy_overtone":9,"minclarity":1,"maxclarity":13,"minsymmetry":1,"maxsymmetry":8,"minpolish":1,"maxpolish":8,"minflourescence":1,"maxflourescence":6,"checkbox_panel1":"04","checkbox_panel2":"","fancy_checkbox_panel2":"","sort":"","fancy_sort":"","page":1,"pageview":"96","adv":false,"fancy_adv":false},"dp":0}'
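The guess-and-check pruning used to arrive at this command can itself be sketched in code. The example below is an illustration, not the exact procedure used for this post: `prune_headers` is a hypothetical helper, and `fake_check` is a stand-in for actually sending the request, hard-coded to mimic a server that only cares about “user-agent” and “content-type”.

```python
def prune_headers(headers, still_works):
    """Greedily drop every header whose removal keeps the request working.

    headers: dict of header name -> value, as captured from the browser.
    still_works: callable that takes a headers dict and returns True when
                 the request still returns data (you supply this check).
    """
    kept = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # the header was not needed, so leave it out
    return kept


def fake_check(headers):
    """Stand-in for a real request: succeeds only if both headers are present."""
    return {"user-agent", "content-type"} <= set(headers)


# A subset of the headers from the full curl command above.
captured = {
    "origin": "https://www.pricescope.com",
    "x-requested-with": "XMLHttpRequest",
    "user-agent": "Mozilla/5.0",
    "content-type": "text/json",
    "accept": "*/*",
    "dnt": "1",
}

print(prune_headers(captured, fake_check))
```

In practice `still_works` would send the request and check that the response contains diamond data; one pass like this does not guarantee the smallest possible set when headers depend on each other, but it is usually enough.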
Now that we know how to load data automatically, in the next post I’ll show how we can pull this data using Python and begin to analyze it.
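As a rough preview, the slimmed-down curl command maps onto Python’s standard library fairly directly. This is a minimal sketch that assumes the endpoint still behaves as it did when the requests above were captured; `build_request` and `fetch_page` are illustrative names, the payload is abbreviated, and the server may well require the full set of filter keys. Running the block only builds the request; nothing is sent unless you call `fetch_page` yourself.

```python
import json
import urllib.request

URL = "https://www.pricescope.com/diamonds/api/ps_dl/d_s"
# The two headers the slimmed-down curl command keeps.
HEADERS = {"user-agent": "Mozilla/5.0", "content-type": "text/json"}


def build_request(payload):
    """Build (but do not send) the POST request for one page of results."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(URL, data=body, headers=HEADERS, method="POST")


def fetch_page(payload, timeout=30):
    """Send the request and return the list stored under the "d" key."""
    with urllib.request.urlopen(build_request(payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["d"]


# Abbreviated payload; a real one would carry every key from the curl command.
payload = {
    "df": {"shape": 1, "mincarat": 1.8, "maxcarat": 1.9, "page": 1, "pageview": "96"},
    "dp": 0,
}
request = build_request(payload)
print(request.get_method(), request.full_url)
```

Looping over page numbers then comes down to changing `payload["df"]["page"]` and calling `fetch_page` again, ideally with a pause between requests to go easy on their servers.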
Hello Ben,
I am just learning to screen scrape and really enjoyed your article to cut my teeth on. I followed it exactly and all worked well up to Step #18. I can’t seem to get your slimmed down curl command to work for me. I removed several of the Headers so something is breaking it.
Also, you mentioned you would have a follow-up post to this showing how to pull data using Python and then analyze it. Did that article ever come out? I cannot seem to locate it.
Looking forward to your help/insight.
cheers,
Damon Manni
BASH coder for decades
Glad the approach helped you. Sadly, it’s likely they adjusted their server — and the slimmed down curl command may no longer work.
I do have an article here on how to analyze this data: https://benchodroff.com/2017/02/18/comparing-diamonds-with-linear-regressions-using-python-r-in-jupyter-notebooks/