Wednesday, October 9, 2019

Show HN: Concept search engine based on most popular HN submissions https://ift.tt/33myodD

Thesis: Sites that do well on Hacker News tend to be sites with high-quality content.

Tools: Hacker News BigQuery dataset, Python, Google CSE

Steps:

1. Using the HN BigQuery dataset, get all unique domains with more than 3 stories with more than 50 points (query link [1]). Sort by the percentage of such stories out of the domain's total number of stories. At the top you get sites like blog.geoffralston.com, where 3 out of 3 submitted stories got more than 50 points (100%!), or lucumr.pocoo.org, where 46 out of 124 stories reached 50+ points. Talk about good writing. We cut the list at 2,500 sites, where the popular-to-submitted ratio is still an enviable 12%.

2. Add to this list all sites that had exactly one submission, where that single submission got 300+ points on HN. I call them unexplored one-hit wonders; the thesis is that there are probably other gems on the domain that just have not been submitted yet. [2]

3. Now we have about 3,000 sites in total. We will use Google CSE, which allows up to 2,000 sites through annotations [3]. We have to clean the data now.
- Check if the domain still resolves. Sadly, about 400 of these high-quality sites no longer do.
- Check for redirects. For example, this site is no longer at its old address:
  https://ift.tt/2AUmvzr ... 302 Found (0.153)
  http://www.david.blog/ ... 200 OK (0.0676)
- Check for all other sorts of weird errors.
This took most of the day :) I used a modified version of [4].

4. Manually remove the news sites that made it onto the list (nytimes, usatoday...).

5. If you want to check whether your site is on the list, check [5]. If you are on the list, congrats!

6. Finally, here is the end result: https://ift.tt/35inWFI

Search the cream of the crop of HN-submitted sites! Let me know if you find this useful!

October 10, 2019 at 05:03AM
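For readers who want to try the selection logic from steps 1 and 2 without BigQuery, here is a minimal Python sketch. It is not the query linked in [1]; it assumes the stories are already available locally as (url, points) pairs, and the function names and parameters (rank_domains, min_popular, one_hit_points) are made up for illustration. The post says "more than 3" popular stories, but its own example counts a site with 3 out of 3, so the sketch assumes an inclusive threshold of 3.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercase host part of a URL ('' if there is none)."""
    return urlparse(url).netloc.lower()


def rank_domains(stories, min_points=50, min_popular=3, one_hit_points=300):
    """Rank domains by their share of popular stories (hypothetical helper).

    stories: iterable of (url, points) pairs.
    Returns (ranked, one_hits) where
      ranked   is a list of (domain, popular, total, ratio), best ratio first
      one_hits is a sorted list of domains with exactly one submission
               that scored at least one_hit_points.
    """
    total = defaultdict(int)    # stories submitted per domain
    popular = defaultdict(int)  # stories with more than min_points
    best = defaultdict(int)     # highest score seen per domain

    for url, points in stories:
        domain = domain_of(url)
        if not domain:
            continue  # skip rows without a usable URL
        total[domain] += 1
        best[domain] = max(best[domain], points)
        if points > min_points:
            popular[domain] += 1

    # Step 1: domains with enough popular stories, sorted by popular/total.
    ranked = sorted(
        (
            (d, popular[d], total[d], popular[d] / total[d])
            for d in total
            if popular[d] >= min_popular  # assumed inclusive threshold
        ),
        key=lambda row: (-row[3], -row[1], row[0]),
    )

    # Step 2: "one-hit wonders", a single submission with a very high score.
    one_hits = sorted(
        d for d in total if total[d] == 1 and best[d] >= one_hit_points
    )
    return ranked, one_hits
```

Cutting the ranked list at the first 2,500 rows and appending the one-hit wonders would give the roughly 3,000 candidate sites that step 3 then cleans up.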
