CounterContent Scraper Group

Discussion in 'Tech Discussion' started by TamaSaga, Mar 17, 2018.

  1. TamaSaga

    TamaSaga Well-Known Member

    Joined:
    Oct 11, 2016
    Messages:
    1,726
    Likes Received:
    2,173
    Considering all of the complaints, I'm curious about why someone hasn't put out a "call to programmers" because the easiest way to counter content thieves is to just glut the market with unique content protection. Make it so that they need to do far more manual work just to clean up everything.

    Basically we should work together with these kinds of goals:
    - Make it easy to use so that even plebeian nonprogrammers can use it. I'm thinking like....Ctrl + Alt + F4 + Home + Numpad-0 kind of easy.
    - Bake in a bunch of encoding strategies so that content thieves can't just use a silver bullet to clear everything.
    - Make it so that the original text is as faithfully reproduced as possible to the eye. So we want to take the translated stuff and google translate it back to moon runes.
    - Make it so that we're not accountable if it doesn't work. If people complain, shrug them off since we would be doing it for free. Programming isn't that easy.
    - Convenience should not preempt security. I believe that the program shouldn't phone home at all. Without a central figure to guide its development, who's to stop someone malicious from making it so that an innocent "check for updates" sends the translator's hard work back instead?
    - Other groups have already created their own content protection. Let them keep it proprietary and roll our own using common protection schemes. The reason will be explained later below and it lets them add extra noise to our obfuscation efforts.

    Basically, my idea is this:
    - A program which you copy and paste text into / reads your clipboard. It'll encrypt the text somehow, and it'll spit out a usable encoded output that you just paste back onto your content form.
    - It has to have at least 10 encoding schemes. For instance: random insertion of site messages, div and p swapping, etc.
    - Here's the trick:

    Every time the translator goes to publish new content, they'll use this program and it will randomly choose 1 or 2 encoding schemes. The others will be held in reserve for future publications.
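
    A minimal sketch of the rotation idea above, in Python. The three scheme functions and the `protect` helper are hypothetical names made up for illustration, not an existing tool, and a real version would need at least ten schemes that don't conflict with each other.

```python
import random

# Hypothetical encoding schemes: each takes HTML text and returns obfuscated HTML.
def insert_site_message(html: str) -> str:
    """Append a hidden attribution line (illustrative only)."""
    return html + '<span style="display:none">Read this at the original site.</span>'

def swap_p_for_div(html: str) -> str:
    """Replace <p> wrappers with <div> wrappers."""
    return html.replace("<p>", "<div>").replace("</p>", "</div>")

def add_color_span(html: str) -> str:
    """Wrap the content in a span that restates the inherited text color."""
    return f'<span style="color:inherit">{html}</span>'

SCHEMES = [insert_site_message, swap_p_for_div, add_color_span]

def protect(html: str, rng: random.Random) -> tuple[str, list[str]]:
    """Apply 1 or 2 randomly chosen schemes and report which ones were used."""
    chosen = rng.sample(SCHEMES, k=rng.randint(1, 2))
    for scheme in chosen:
        html = scheme(html)
    return html, [scheme.__name__ for scheme in chosen]

if __name__ == "__main__":
    out, used = protect("<p>Chapter 1 text.</p>", random.Random())
    print(used)
    print(out)
```

    Everything runs locally with no network calls, which fits the "shouldn't phone home" goal.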

    Why is this clever? For those of us who have made content scrapers before, you generally mess with regexes and the like. That means we need to specifically tailor a scraper to a website and we generally don't count on its appearance changing much from post to post. If it changes, it's usually something minor.

    So if the website keeps changing up its encoding scheme with every post, you either need to figure out a pattern that works with everything or you need to readjust the scraper every time while also manually cleaning up the scraped content.
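
    To make that concrete, here is a small Python sketch of a scraper rule tuned to one post's markup and what happens when the next post has gone through a div/p swap. The regex, the `scrape` function, and the sample posts are all assumptions for illustration.

```python
import re

# Hypothetical scraper rule tuned to one site's markup: chapter text lives in <p> tags.
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)

def scrape(html: str) -> list[str]:
    """Return the text of every <p>...</p> block found in the page."""
    return PARAGRAPH.findall(html)

post_monday = "<p>He drew his sword.</p><p>The hall fell silent.</p>"
# Same chapter text, but this post went through a div/p swap.
post_tuesday = "<div>He drew his sword.</div><div>The hall fell silent.</div>"

print(scrape(post_monday))   # ['He drew his sword.', 'The hall fell silent.']
print(scrape(post_tuesday))  # []
```

    The scraper's owner now has to notice the empty result and write a second rule, which is the extra manual work the idea is counting on.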

    Now let 20 websites use this strategy and it can quickly become rough as there are umm...10 + 45 + 120 = 175 combinations you can come up with if you just randomly choose up to 3 encoding schemes from the 10. Let's make it so that if the thieves plan to content scrape everyone, it'll become a full-time job for them.
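
    The 175 figure is the number of ways to pick 1, 2, or 3 schemes out of 10, ignoring order. A quick check in Python:

```python
from math import comb

SCHEME_COUNT = 10

# Ways to choose exactly 1, 2, and 3 schemes out of 10, order ignored.
per_size = [comb(SCHEME_COUNT, k) for k in (1, 2, 3)]
print(per_size, sum(per_size))  # [10, 45, 120] 175
```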

    We're going to run into a bunch of problems:
    - We need to figure out what platforms people use to host their translations. Self-hosted would be great but not everyone can handle a web server and one of our goals is to flood the market.
    - Going with the above, ideally it should be able to use javascript and css to broaden our possible encryption schemes. It might even go so far as to create images.
    - Need a programming language that eventually spits out an application that anyone can use. It'd be great if it can integrate into wordpress or the like since the encryption would come free then.
    - Some encoding schemes will conflict with each other. Others are so similar that they can be dealt with together. We need to come up with a bunch of orthogonal schemes that can work together.
    - Hasn't someone come up with something like this already? I haven't checked, but it'd be great if we don't have to reinvent the wheel, we just need to enhance it then.


    tl;dr: Make a bunch of programmers work together to develop a content-protection program that's widely available, well-featured, and easy to use for translators.
     
    Last edited: Mar 17, 2018
    Sharudeis and Mordiggian like this.
  2. SoulZer0

    SoulZer0 Heaven Refining

    Joined:
    Oct 25, 2016
    Messages:
    12,478
    Likes Received:
    24,483
    That'd cost money, do you think they'll do it for free?
     
  3. chencking

    chencking [Daolord Grammar Nazi]

    Joined:
    Aug 1, 2016
    Messages:
    6,075
    Likes Received:
    4,160
    I've done very little with webservers and near 0 security (student), but the challenges that stand out to me are that:

    if it doesn't decrypt automatically (in which case it might as well not be there) then you're just driving away tons of readers. Anyone who doesn't use NU definitely wouldn't be in the know, which means you lose all readers potentially using aggregators anyway. Also, a lot of TLers tried adding stuff, but it usually just loses them loyal readers who now prefer the scraped clean aggregator.

    I will watch to see if anything comes out of this though.
     
  4. DocB

    DocB "I see you, little mouse! Run along"

    Joined:
    Nov 10, 2015
    Messages:
    3,573
    Likes Received:
    8,110
    Basically you are saying: encrypt the data and make the key public. In half the time it takes you to create the software to encrypt and the app for normal users to decrypt, an aggregator will write code to steal, decrypt, and post automatically. Now you will have aggregators that are easier to read than the main site.
    Best advice I can give is to spread the word about NU, because with the existence of this type of tracker, aggregators become futile.
     
  5. TamaSaga

    TamaSaga Well-Known Member

    Nope. I say encryption because that's what it essentially boils down to when you mess with the vanilla text. But it's really just adding color tags, switching font colors, inserting extra text. Nothing that requires mathematicians and rocket science. Just lots of text manipulation.

    You'd be surprised how many would volunteer if you just ask and you don't demand that they cure cancer and broker world peace while they're at it. Keep it reasonable, an hour at most...

    It's not encryption, prime number-game wise. It's just adding nonvisible noise that humans don't notice but bots will definitely get hit with.
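
    A rough Python sketch of the "nonvisible noise" idea. The helper names, the noise string, and the `display:none` trick are assumptions picked for illustration, and the scraper shown is the quick tag-stripping kind. A scraper that evaluates CSS would not be fooled by this particular scheme.

```python
import re

def add_hidden_noise(paragraph: str, noise: str) -> str:
    """Insert a span that CSS hides from readers but that stays in the markup."""
    words = paragraph.split(" ")
    middle = len(words) // 2
    hidden = f'<span style="display:none">{noise}</span>'
    return " ".join(words[:middle] + [hidden] + words[middle:])

def naive_scrape(html: str) -> str:
    """Strip tags the way a quick-and-dirty scraper might, keeping all text nodes."""
    return re.sub(r"<[^>]+>", "", html)

protected = add_hidden_noise(
    "The old man smiled and said nothing.", "stolen from example.invalid"
)
print(naive_scrape(protected))
# The old man stolen from example.invalid smiled and said nothing.
```

    A browser shows the sentence untouched, while the tag-stripped copy carries the noise and needs cleaning by hand.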
     
  6. wonderer

    wonderer Well-Known Member

    Joined:
    Apr 5, 2017
    Messages:
    1,356
    Likes Received:
    1,051
    I have no idea on this subject, but it sounds good? I don't know.

    But it got me wondering: if it was this easy for such sites to protect themselves, then why, for example, are there still pirate websites in China that pirate novels from other platforms, when the companies that run those platforms make millions and could pay to implement this? I mean, it's not like they want their novels stolen.
     
    TamaSaga likes this.
  7. UnGrave

    UnGrave ななひ~^^

    Joined:
    Jun 27, 2016
    Messages:
    4,072
    Likes Received:
    12,832
    Sounds like it would be too much work for little to no benefit in the long run. It probably wouldn't take too long for a workaround to come out, especially if we were to source it to community developers.

    Also, the text colour thing is really annoying for any user who tries to view the chapter in the "make site mobile friendly" feature that comes in most mobile browsers.
     
    TamaSaga likes this.
  8. chencking

    chencking [Daolord Grammar Nazi]

    In China, they have the great firewall. It's easier for the huge companies to shut down the pirate sites, but they pop up as fast as they go down. Or so hearsay told me, anyway.
     
    wonderer likes this.
  9. DocB

    DocB "I see you, little mouse! Run along"

    An "A" is ASCII code 65 no matter if the font is huge red or small pink. The fact that you think that would affect the data reveals your lack of knowledge.
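
    The point can be checked with Python's standard `html.parser`: a parser that keeps only text nodes sees the same character whether or not it is styled. The `TextOnly` class and `text_of` helper are illustrative names, not part of any scraper mentioned in this thread.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect text nodes and ignore every tag and attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)

def text_of(html: str) -> str:
    """Return the concatenated text nodes of an HTML fragment."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

plain = "A"
styled = '<span style="color:red;font-size:48px">A</span>'
print(ord(text_of(plain)), ord(text_of(styled)))  # 65 65
```

    Styling alone leaves the characters intact. Only schemes that add or rearrange text nodes change what this kind of parser returns.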
     
  10. TamaSaga

    TamaSaga Well-Known Member

    Because they're struggling to protect themselves. So they put in a ton of money to develop a bigger club.

    My solution is different in that I'll be giving a bunch of people sticks. It definitely isn't 100% effective. Maybe only 20%. But it also means that the content scrapers will be getting hit by something harder than a pillow so it might bruise.
     
    Last edited: Mar 17, 2018
    wonderer likes this.
  11. HnM_Pete

    HnM_Pete Well-Known Member

    Joined:
    Sep 13, 2016
    Messages:
    120
    Likes Received:
    212
    Have you heard about the thing called OCR? Because it exists, and all this will just make the bots switch to OCR.
     
  12. bob3002

    bob3002 Well-Known Member

    Joined:
    May 20, 2016
    Messages:
    962
    Likes Received:
    664
    It's been tried before. I remember a certain site posted pictures of text with crazy backgrounds, so entire chapters were like CAPTCHA puzzles. In the end it was basically unreadable (and didn't adapt to differently sized windows either.) Didn't work with screen reader software either, so if you were vision impaired you were out of luck. I understand making it hard for ripoff sites, but readers were probably suffering even more.

    By the way, the black text in your post is almost unreadable in night mode. Not sure if that was intentional or not.
     
    TamaSaga likes this.
  13. Way

    Way Crimeless Raubritter

    Joined:
    Nov 6, 2016
    Messages:
    232
    Likes Received:
    5,941
    Funnily enough, Qidian (China) itself is hit the most by pirates
    Far fewer pirates for many other sites xd
     
  14. TamaSaga

    TamaSaga Well-Known Member

    *shrug* And to combat OCR, you use watermarks and...surprisingly, text color gradients. Plus it's not perfect, I had to hand edit quite a bit when I used OCR software. Or are those thieves going to leave their readers holding the bag?

    I'm impressed. Next, you should learn how to count in binary. It goes 0, 1, 10, 11, 100, 101, ...you'll be a programmer yet.

    Anyhow, Javascript fixes that problem. Go ask GM_Rusaku if you don't believe me.
     
    Last edited: Mar 17, 2018
  15. SoulZer0

    SoulZer0 Heaven Refining

    That would limit it to those who are familiar with this community.
     
    TamaSaga likes this.
  16. bob3002

    bob3002 Well-Known Member

    The site with the most popular works is pirated most frequently. It's not surprising at all.
     
  17. DocB

    DocB "I see you, little mouse! Run along"

    The Great Firewall doesn't remove sites. It is the Chinese service providers that block access to certain foreign sites. If a site is hosted on a Chinese server, it is removed by the hosting provider after a DMCA notice, not stopped at the firewall.
     
  18. Way

    Way Crimeless Raubritter

    Aye, that's true. But I find it surprising because they pirate almost every novel there and I literally couldn't find the same treatment for any other site. On those, the novels that do get pirated usually stop being pirated once they become paid...

    Maybe I need to dig harder, but hmm.
     
  19. phoenom

    phoenom Well-Known Member

    Joined:
    Nov 22, 2015
    Messages:
    113
    Likes Received:
    72
    The thing is, even if we rule out the inconvenience for users,
    encrypting, or anything else you want to use to protect your content, is tied to the resources used on the web server. You want strongly encrypted content? Sure, but how many resources does that need? Resources are money. If they need more resources than a normal WordPress install, they need to spend more on the server. The question is: can the community afford that? Unless you have big financial backing like WW or QI, I don't think the benefit of encrypted content can outweigh the cost.
     
    TamaSaga likes this.
  20. TamaSaga

    TamaSaga Well-Known Member

    Pretty certain most translators don't even consider this. For those that do, they might want to figure out what's leeching their bandwidth because it sure as heck isn't the text. But yeah, I'll ballpark it at like 3 times more resources for the text. However, text content is tiny relative to pictures and stuff, so you might only see an increase of like 10% in bandwidth.
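
    A back-of-the-envelope version of that estimate in Python. The 5% text share is an assumption chosen so the numbers line up with the 3x and 10% ballpark above, not a measured figure, and `bandwidth_increase` is a made-up helper name.

```python
def bandwidth_increase(text_share: float, text_multiplier: float) -> float:
    """Fractional growth in page weight when only the text part grows."""
    return text_share * (text_multiplier - 1.0)

# Assumed: text is 5% of the bytes served per page, and protection triples it.
print(f"{bandwidth_increase(0.05, 3.0):.0%}")  # 10%
```

    If text makes up a bigger share of a page, the same tripling costs proportionally more: a 20% text share would mean 40% more bandwidth.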